**ffinkernagel** · 03-05-2012, 01:45 PM

a) Update your TopHat. Starting with 1.4.0 there's an 'align to transcriptome first' mode that should improve performance quite a bit.
b) Really consider whether you need to find novel splice junctions, though with the number of reads you have, I guess that's your aim in the first place.

**dvanic** · 03-05-2012, 03:37 PM

Offtopic question:
After you (finally) run Tophat, are you by any chance planning on running Cufflinks? I'd be curious to know how successful transcript assembly ends up being when you have a very high number of reads mapping to the same locus (we've had a few issues here, and tweaking the parameters doesn't seem to help)...

**Mark.hz** · 03-05-2012, 04:43 PM

Originally posted by ffinkernagel:

a) Update your TopHat. Starting with 1.4.0 there's an 'align to transcriptome first' mode that should improve performance quite a bit.
b) Really consider whether you need to find novel splice junctions, though with the number of reads you have, I guess that's your aim in the first place.

Thank you for the suggestion.

a. I've tried the new version of TopHat, but when you need to find novel junctions it still spends a lot of time in the "Searching for junctions via segment mapping" step. Please see the logs below:

For a dataset of around 150 million 50bp PE reads:

TopHat (v1.3.2) spent 77.4 h in this step:
[Thu Jan 26 23:26:53 2012] Searching for junctions via segment mapping
[Mon Jan 30 04:49:38 2012] Retrieving sequences for splices

TopHat (v1.4.0) spent 120 h in the same step:
[Wed Feb 1 09:10:51 2012] Searching for junctions via segment mapping
[Mon Feb 6 09:09:06 2012] Retrieving sequences for splices
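The durations quoted above can be checked directly from the log timestamps. The following is a minimal sketch, assuming the bracketed timestamp layout shown in the log lines and an English locale for the weekday and month names; the function and variable names are illustrative and not part of TopHat.

```python
from datetime import datetime

# Timestamp layout of the quoted log lines, e.g. "Thu Jan 26 23:26:53 2012".
# Assumes English weekday/month abbreviations (the default "C" locale).
LOG_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_log_time(line: str) -> datetime:
    """Extract and parse the bracketed timestamp at the start of a log line."""
    stamp = line[line.index("[") + 1 : line.index("]")]
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def hours_between(start_line: str, end_line: str) -> float:
    """Wall-clock hours elapsed between two log lines."""
    delta = parse_log_time(end_line) - parse_log_time(start_line)
    return delta.total_seconds() / 3600.0


# Log lines copied from the post above.
V132_START = "[Thu Jan 26 23:26:53 2012] Searching for junctions via segment mapping"
V132_END = "[Mon Jan 30 04:49:38 2012] Retrieving sequences for splices"
V140_START = "[Wed Feb 1 09:10:51 2012] Searching for junctions via segment mapping"
V140_END = "[Mon Feb 6 09:09:06 2012] Retrieving sequences for splices"

HOURS_V132 = hours_between(V132_START, V132_END)
HOURS_V140 = hours_between(V140_START, V140_END)

if __name__ == "__main__":
    print(f"v1.3.2 junction search: {HOURS_V132:.1f} h")
    print(f"v1.4.0 junction search: {HOURS_V140:.1f} h")
```

Run as a script, this reports 77.4 h for v1.3.2 and 120.0 h for v1.4.0, which agrees with the figures in the post.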

b. Finding novel exons/junctions is part of my aim. So I guess the only option is to run TopHat on each lane separately, though some information will be lost.

Mark

P.S. I attached a table showing that the running time increases dramatically as the number of input reads increases, especially for the junction-searching step.

Attached file: time breakdown.png

**ffinkernagel** · 03-05-2012, 10:27 PM

Hm. You did pass 1.4 the -G option (the annotation GTF; lowercase -g is the multi-hit limit), right?
As I understand it, that should dramatically lower the number of reads entering the junction detection step, and therefore the runtime. I must admit I haven't tried it, though.

**dietmar13** · 03-05-2012, 10:51 PM

You don't have to stick to TopHat; there are other mappers that can find novel junctions:

RUM
http://www.cbil.upenn.edu/RUM/userguide.php

STAR
http://gingeraslab.cshl.edu/STAR/

The former is very accurate, the latter very fast.

**arvid** · 03-05-2012, 10:56 PM

You might want to try GSNAP as well, I often find it faster and more sensitive than TopHat. Its output (when writing SAM) is compatible with Cufflinks.

**Mark.hz** · 03-06-2012, 10:19 AM

Originally posted by arvid:

You might want to try GSNAP as well, I often find it faster and more sensitive than TopHat. Its output (when writing SAM) is compatible with Cufflinks.

I also tried GSNAP + Cufflinks. But when I fed Cufflinks the SAM generated by GSNAP, a lot of genes were missed. The attached picture shows that the gene AR was missed entirely, although a massive number of reads mapped to it. (Sorry for the figure quality; it is due to the attachment size limit.)
Any ideas what's wrong? Thanks,

Mark

Attached file: GSNAP_Cufflinks_IGV_Gene_AR_.png

**arvid** · 03-07-2012, 12:49 AM

Originally posted by Mark.hz:

I also tried GSNAP + Cufflinks. But when I fed Cufflinks the SAM generated by GSNAP, a lot of genes were missed. The attached picture shows that the gene AR was missed entirely, although a massive number of reads mapped to it. (Sorry for the figure quality; it is due to the attachment size limit.)
Any ideas what's wrong? Thanks,

Mark

Interesting, no idea what is going on there. Did you look for some correlation between the transcripts missed by Cufflinks and mapping qualities, strand or some other property of the read alignments?
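One way to start looking for such a correlation is to tally mapping qualities and alignment strands for the reads at a locus that Cufflinks missed. The sketch below assumes plain-text SAM input and uses only the standard SAM columns (FLAG, RNAME, POS, MAPQ); the function name, the example records, and the coordinates are hypothetical, not taken from Mark's data.

```python
from collections import Counter
from typing import Iterable, Tuple

FLAG_UNMAPPED = 0x4   # SAM flag bit: read is unmapped
FLAG_REVERSE = 0x10   # SAM flag bit: read aligned to the reverse strand


def tally_alignments(
    sam_lines: Iterable[str], chrom: str, start: int, end: int
) -> Tuple[Counter, Counter]:
    """Count MAPQ values and read strands for alignments starting in a region.

    Coordinates are 1-based and inclusive, as in the SAM POS column.
    Note that the strand here is the alignment strand of the read, not
    the inferred transcript strand.
    """
    mapq_counts: Counter = Counter()
    strand_counts: Counter = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete SAM record
            continue
        flag, rname = int(fields[1]), fields[2]
        pos, mapq = int(fields[3]), int(fields[4])
        if flag & FLAG_UNMAPPED:
            continue
        if rname != chrom or not (start <= pos <= end):
            continue
        mapq_counts[mapq] += 1
        strand_counts["-" if flag & FLAG_REVERSE else "+"] += 1
    return mapq_counts, strand_counts


# Made-up records purely for illustration.
EXAMPLE_SAM = [
    "@HD\tVN:1.0",
    "r1\t0\tchrX\t100\t40\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchrX\t150\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tchr1\t120\t40\t4M\t*\t0\t0\tACGT\tIIII",
]

EXAMPLE_MAPQ, EXAMPLE_STRAND = tally_alignments(EXAMPLE_SAM, "chrX", 1, 1000)
```

Comparing these tallies between a missed locus and a correctly assembled one may show whether the missed reads differ systematically in mapping quality or strand.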

**billstevens** · 03-08-2012, 03:47 PM

Sorry to jump in here, but I am also concerned about the runtime for Tophat. I have literally just gotten started in RNA-Seq, and I have three samples that ran on one lane that are 136 M reads, PE, 100bp.

I need to align to hg19, and the only PC I have has a single processor and 4 GB of RAM. Is this even doable? What would be a respectable amount of computing power to use, and how long can I expect to wait? Is TopHat just too slow for large datasets?

**alexdobin** · 03-08-2012, 04:25 PM

STAR

Hi Mark,

if you are willing to try something new, I would recommend our RNA mapper called STAR. We developed it specifically for large datasets. We routinely run it on 100-200M PE reads for ENCODE transcriptome production. For 100 b PE reads the speed can be as high as ~20M pairs per CPU-hour; however, it requires a relatively large amount of RAM, ~27 GB for the human genome. In our assessment it is more accurate (has lower FPR/FNR) than TopHat.
The latest version is here:
ftp://ftp2.cshl.edu/gingeraslab/trac...release/2.0.2/

As to running the alignments through Cufflinks, our experience is mixed. For relatively simple samples, like cytoplasmic A+ RNA, it works quite well and is fast, but for more complex samples, especially A-, we had to manually remove some loci that were too complicated for Cufflinks.

Cheers
Alex
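The figures in this post allow a rough back-of-the-envelope check. The sketch below is only a best-case estimate: it treats the quoted "as high as ~20M pairs per CPU-hour" as a peak rate for 100 b PE reads and ~27 GB as the RAM needed for the human genome. The function names are illustrative, and reading the 136 M figure from the earlier question as read pairs is an assumption.

```python
# Figures quoted in the post; both are approximate.
STAR_PEAK_PAIRS_PER_CPU_HOUR = 20_000_000  # "as high as ~20M pairs per CPU-hour"
STAR_HUMAN_RAM_GB = 27                     # "~27GB for human genome"


def min_cpu_hours(read_pairs: int,
                  pairs_per_cpu_hour: int = STAR_PEAK_PAIRS_PER_CPU_HOUR) -> float:
    """Lower bound on CPU-hours, assuming the peak rate is sustained throughout."""
    return read_pairs / pairs_per_cpu_hour


def has_enough_ram(available_gb: float,
                   required_gb: float = STAR_HUMAN_RAM_GB) -> bool:
    """True if the machine has at least the quoted RAM requirement."""
    return available_gb >= required_gb


# Assumption: the 136 M figure from the earlier question counts read pairs.
EXAMPLE_CPU_HOURS = min_cpu_hours(136_000_000)
EXAMPLE_4GB_OK = has_enough_ram(4)
EXAMPLE_32GB_OK = has_enough_ram(32)
```

Under these assumptions the alignment itself is a matter of hours rather than days, but a 4 GB machine falls well short of the quoted memory requirement.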

**arvid** · 03-08-2012, 11:58 PM

Originally posted by alexdobin:

Hi Mark,

if you are willing to try something new, I would recommend our RNA mapper called STAR. We developed it specifically for large datasets. We routinely run it on 100-200M PE reads for ENCODE transcriptome production. For 100 b PE reads the speed can be as high as ~20M pairs per CPU-hour; however, it requires a relatively large amount of RAM, ~27 GB for the human genome. In our assessment it is more accurate (has lower FPR/FNR) than TopHat.
The latest version is here:
ftp://ftp2.cshl.edu/gingeraslab/trac...release/2.0.2/

As to running the alignments through Cufflinks, our experience is mixed. For relatively simple samples, like cytoplasmic A+ RNA, it works quite well and is fast, but for more complex samples, especially A-, we had to manually remove some loci that were too complicated for Cufflinks.

Cheers
Alex

Hi Alex,

A couple of questions on STAR:

1. Does it do indels as well?
2. I didn't find any parameters for min and max intron size in the manual; are there hidden defaults or is it not possible to set such parameters at all? I work on plant genomes, some with very short introns and intergenic regions, so such parameters are important for me...

**arvid** · 03-09-2012, 12:11 AM

Originally posted by billstevens:

Sorry to jump in here, but I am also concerned about the runtime for Tophat. I have literally just gotten started in RNA-Seq, and I have three samples that ran on one lane that are 136 M reads, PE, 100bp.

I need to align to hg19, and the only PC I have has a single processor and 4 GB of RAM. Is this even doable? What would be a respectable amount of computing power to use, and how long can I expect to wait? Is TopHat just too slow for large datasets?

RAM up to ~32 GB is quite cheap; this week I bought another 8 GB for my workstation, which cost us less than US$ 100 (computer-brand memory; I guess you can get it much cheaper). I'd invest a few hundred dollars in an 8-core, 16 GB machine; with that you can do expression analysis quite well, in my opinion (no de novo assembly, however)... Or buy cloud computing time.

**alexdobin** · 03-09-2012, 08:21 AM

Originally posted by arvid:

Hi Alex,

A couple of questions on STAR:

1. Does it do indels as well?
2. I didn't find any parameters for min and max intron size in the manual; are there hidden defaults or is it not possible to set such parameters at all? I work on plant genomes, some with very short introns and intergenic regions, so such parameters are important for me...

1. Yes, STAR detects insertions and deletions.
2. The minimum intron size is determined by --scoreDelLmax (=20 by default). If the genomic gap is below that value, it is considered a deletion; otherwise it is considered an intron. The maximum intron size is approximately determined by winAnchorDistNbins*2^winBinNbits = 9*2^16 ~ 600 kbases by default, which we think works well for mammalian genomes, but you can increase it at will by increasing --winAnchorDistNbins. There is yet another parameter, --winFlankNbins, that determines the maximum gap for lower-confidence introns: winFlankNbins*2^winBinNbits.
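The arithmetic in point 2 can be written out as a small sketch. It only restates the formulas and the defaults quoted in this post (9 bins, 16 bits, a threshold of 20); the function names are illustrative and not part of STAR, and since the post describes the maximum as approximate, the results should be read as estimates.

```python
import math


def max_gap(n_bins: int, win_bin_nbits: int) -> int:
    """Approximate maximum gap in bases: n_bins * 2^winBinNbits.

    Use winAnchorDistNbins for the maximum intron size, or winFlankNbins
    for the maximum gap of lower-confidence introns.
    """
    return n_bins * 2 ** win_bin_nbits


def classify_gap(gap_length: int, min_intron: int = 20) -> str:
    """Gaps below the threshold count as deletions, all others as introns."""
    return "deletion" if gap_length < min_intron else "intron"


def bins_needed(target_intron: int, win_bin_nbits: int = 16) -> int:
    """Smallest number of bins whose approximate maximum covers target_intron."""
    return math.ceil(target_intron / 2 ** win_bin_nbits)


# Defaults quoted in the post: 9 * 2^16 bases, i.e. roughly 600 kbases.
DEFAULT_MAX_INTRON = max_gap(9, 16)

if __name__ == "__main__":
    print(DEFAULT_MAX_INTRON)
    print(classify_gap(15), classify_gap(20))
    print(bins_needed(1_000_000))
```

With the defaults, the maximum works out to 589,824 bases. To allow introns of about 1 Mb with the same bin size, the formula suggests raising --winAnchorDistNbins to 16.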

**swaraj** · 03-09-2012, 09:41 AM

I would suggest Subread as a read aligner: http://sourceforge.net/projects/subread/

SOS, Tophat is too slow for a large dataset