Old 11-08-2010, 09:26 AM   #1
caddymob
Member
 
Location: USA

Join Date: Apr 2009
Posts: 36
tophat too slow for HiSeq

I am trying to use tophat to map HiSeq RNA-Seq reads ... The only problem is that I have a 72-hour walltime limit on our cluster, and my jobs get killed before completion. I have 8 lanes of data, and only a few of the lanes with fewer reads are just barely finishing. These are paired 105mers.

Code:
 Paired_READS	STATUS
 55,048,179 	walltime_limit
 52,548,024 	finished
 31,202,440 	finished
 38,586,234 	finished
 111,308,978 	walltime_limit
 62,615,443 	walltime_limit
 68,295,975 	walltime_limit
 54,115,329 	walltime_limit
Here is my command (rg-tag options omitted for clarity):
Code:
tophat -r 325 --output-dir MSC --num-threads 8 --coverage-search --microexon-search $ref LANE3_1.fastq LANE3_2.fastq
I submit each lane to its own node, and I can only give a single node 8 cores, so I use the --num-threads 8 option.
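For context, each lane goes in as its own job, something like this (a minimal sketch of the submission script, assuming a PBS-style scheduler; the resource syntax will differ on other queue systems):
Code:
#!/bin/bash
# One lane per node: 8 cores, up against the 72-hour walltime cap.
#PBS -l nodes=1:ppn=8,walltime=72:00:00

cd $PBS_O_WORKDIR
tophat -r 325 --output-dir MSC --num-threads 8 \
    --coverage-search --microexon-search \
    $ref LANE3_1.fastq LANE3_2.fastq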

Any suggestions on how to get this data mapped faster? I thought about splitting my reads up into more FASTQs, mapping those, and merging at the end; I just worry that I will lose junctions in rare transcripts.

I also wonder about the --microexon-search and --coverage-search options: do they slow this down considerably? They seem like a good thing to do, but are they hurting me?

I'm using x86_64 TopHat 1.1.2 and bowtie 0.12.7

Thanks~
Old 11-09-2010, 05:54 PM   #2
lry198010
Member
 
Location: Wuhan China

Join Date: Aug 2008
Posts: 13

I think more information about where TopHat aborts before completing is needed to diagnose the problem!
Old 11-09-2010, 06:57 PM   #3
caddymob
Member
 
Location: USA

Join Date: Apr 2009
Posts: 36

Good point... The jobs all die in segment_juncs (v1.1.2 (1643)).
Old 11-10-2010, 06:10 AM   #4
GKM
Member
 
Location: Pasadena, CA

Join Date: May 2009
Posts: 45

It is not a good idea to map each lane separately: mapping each lane individually is not the same as mapping them together, and you will get better results if you run them all at the same time.

That makes things even slower, but the only real solution is to not have the 72-hour limit.
Old 11-10-2010, 06:11 AM   #5
GKM
Member
 
Location: Pasadena, CA

Join Date: May 2009
Posts: 45

Quote:
Originally Posted by caddymob View Post
Good point.. The jobs all die in segment_juncs (v1.1.2 (1643)).
Segmenting junctions takes a long time in general; you probably just enter that phase around the time your run-time limit runs out. I don't think there is anything wrong with that.
Old 11-10-2010, 06:26 AM   #6
caddymob
Member
 
Location: USA

Join Date: Apr 2009
Posts: 36

Thanks GKM. This is HiSeq data, so I am getting upwards of 110 million paired reads (so 220+ million single-end reads) in a single lane, and each lane is a single sample... Kinda funny: now we are generating more data than we can handle!!

I agree with you, though, that a single sample should be run all at once, and that is why I am NOT splitting my FASTQs into smaller chunks and mapping those separately like we might if these were just genome alignments.

Indeed, segmenting juncs is where it is dying; it is just taking too long, and I cannot find a good way to parallelize the task... I am attempting to get access to a machine with 64 cores on a single node to see if that gets me there faster.

Any other bright ideas are welcome... If tophat/bowtie supported MPI then I wouldn't be having this problem -- but I understand that is a tall order
Old 11-10-2010, 04:26 PM   #7
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136

Have you looked at Myrna? It seems to split read files into small chunks, so I would think they must have tackled the problem of low-abundance splice sites getting lost (hopefully).
Old 11-11-2010, 01:14 AM   #8
caddymob
Member
 
Location: USA

Join Date: Apr 2009
Posts: 36

Thanks frozenlyse -- I looked at myrna when it first came out and didn't like the amazon cloud stuff, since this isn't an option for me. However, I admit that I did not look closely at it, which I have just done. I think this may be a solution, and something I will have to test, but it looks like a bit of a task... Also, from the FAQs, it doesn't do everything I need:
Quote:
There are many tools that handle different aspects of analyzing RNA-seq data, but each tool usually has a specialty. Myrna's cloud mode and statistical models make it especially appropriate for very large datasets and datasets consisting of many biological replicates. Myrna's biggest drawback is that it does not attempt to align reads across junctions, assemble isoforms, or otherwise analyze on the isoform or junction level.
Exclusion of junction mapping seems to be an issue for HiSeq paired 105mers. If a read fails to align across an exon junction, you can lose that expression signal, or that count. Myrna was benchmarked with 35mers, where the probability of a read crossing a junction is much lower than with 105mers. Sure, we could trim these reads, but that's not what we paid for!

You do bring up a good point, though: Myrna is parallelizing the task, so it must be possible -- and they say it is in the paper. Adding junction mapping is on their to-do list... I just can't wait that long...
Old 11-11-2010, 05:01 AM   #9
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

Seems like you have a tough nut. One possibility would be to first use bowtie against a transcript database to suck out all the stuff that looks like known messages. Of course, you'd have to integrate those back in & there is still the risk of losing some useful information. But, it might also be a useful diagnostic to run to see if there are some high abundance messages you could remove first before analyzing the rest.
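Something like this, perhaps (a rough sketch only; transcriptome_index and the file names are placeholders, and note that for paired input bowtie 0.12.x writes the --un output as a _1/_2 pair):
Code:
# Soak up reads that align to known transcripts; pairs that fail to
# align end up in novel_candidates_1/2.fastq for the TopHat run.
bowtie -p 8 -S --un novel_candidates.fastq transcriptome_index \
    -1 LANE3_1.fastq -2 LANE3_2.fastq known_hits.sam

# Run the (much slower) junction-aware mapping only on the leftovers.
tophat -r 325 --output-dir MSC --num-threads 8 $ref \
    novel_candidates_1.fastq novel_candidates_2.fastq
The known_hits.sam alignments would then have to be integrated back in, as I said.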

Perhaps turning off the microexon search would speed things up?

The other not entirely pleasant alternative would be to dig into the tophat code so you can divide the different stages into different jobs -- and then hope each one finishes in under the 72 hour time limit (which seems like a very Mordac-ian rule, if applied inflexibly)

Also, I believe Myrna can run on any Hadoop system, not just Amazon EC2.
Old 11-15-2010, 08:26 AM   #10
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212

Quote:
Originally Posted by caddymob View Post

I also wonder about the --microexon-search and --coverage-search options, do they slow this down considerably? They seem like a good thing to do, but are they hurting me?

Thanks~
Those options, particularly --coverage-search, are going to drastically slow down the computation, and probably aren't buying you much in terms of sensitivity with these reads. Coverage search is designed for reads shorter than 50bp, and is much slower (and less accurate) than the other methods. Microexon search will also slow things down. I'd try leaving both off until you can get a successful run going.
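In other words, the command from the first post, minus the two search options:
Code:
tophat -r 325 --output-dir MSC --num-threads 8 $ref LANE3_1.fastq LANE3_2.fastq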
Old 08-23-2012, 01:05 PM   #11
davider
Junior Member
 
Location: Berkeley, CA

Join Date: Aug 2010
Posts: 1

Quote:
Originally Posted by caddymob View Post
I agree with you though that a single sample should be run at once, and that is why I an NOT splitting my FASTQs into smaller chunks and mapping those separately like we might do if this was just genome alignments.
Hi all,

I'm sorry to bring back an old thread but there is something not totally clear to me.
If one runs TopHat with the -G and --no-novel-juncs options, is it OK to split a single sample into smaller FASTQs and align each FASTQ independently?

I understand that the junction discovery is influenced by the number of concordant reads, but if one only wants annotated junctions it should be equivalent to a straightforward transcriptome + genome alignment. Is this correct?
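Concretely, I have something like this in mind (a hypothetical sketch: genes.gtf, the chunk size, and $ref are placeholders; with --no-novel-juncs every chunk is aligned against the same fixed annotated junction set, so the chunks no longer depend on each other):
Code:
# Split each mate file into 4M-read chunks (4 FASTQ lines per read),
# keeping the two mate files in the same order.
split -l 16000000 -d LANE3_1.fastq chunk_1.
split -l 16000000 -d LANE3_2.fastq chunk_2.

# Align each chunk independently, restricted to annotated junctions.
for m1 in chunk_1.*; do
    i=${m1#chunk_1.}
    tophat -p 8 -r 325 -G genes.gtf --no-novel-juncs \
        --output-dir out_$i $ref chunk_1.$i chunk_2.$i
done

# Merge the per-chunk alignments back into a single BAM.
samtools merge merged_hits.bam out_*/accepted_hits.bam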