  • tophat too slow for HiSeq

    I am trying to use tophat to map HiSeq RNA-Seq reads ... Only problem is that I have a 72-hour walltime limit on our cluster computer and my jobs get killed before completion. I have 8 lanes of data and a few of the lanes with less reads are just barely finishing. These are paired 105mers.

    Code:
 Paired_READS	STATUS
     55,048,179 	walltime_limit
     52,548,024 	finished
     31,202,440 	finished
     38,586,234 	finished
     111,308,978 	walltime_limit
     62,615,443 	walltime_limit
     68,295,975 	walltime_limit
     54,115,329 	walltime_limit
    Here is my command (rg-tag options omitted for clarity):
    Code:
    tophat -r 325 --output-dir MSC --num-threads 8 --coverage-search --microexon-search $ref LANE3_1.fastq LANE3_2.fastq
    I submit each lane to its own node, and I can only give a single node 8 cores, so I use the --num-threads 8 option.

    Any suggestions on how to get this data mapped faster? I thought about splitting my reads up into more FASTQs, mapping those separately, and merging at the end, but I worry that I will lose junctions in rare transcripts.
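For what it's worth, the split-and-merge I have in mind would be something like this (a sketch only; chunk size must be a multiple of 4 lines so FASTQ records stay whole, and the file names are placeholders):

```shell
# Split each mate file into chunks of 10M reads (FASTQ = 4 lines per read,
# so 40M lines per chunk). File names here are placeholders.
split -l 40000000 LANE3_1.fastq LANE3_1.chunk.
split -l 40000000 LANE3_2.fastq LANE3_2.chunk.

# Map each chunk pair in its own cluster job, e.g.:
# tophat -r 325 --output-dir MSC_aa --num-threads 8 $ref LANE3_1.chunk.aa LANE3_2.chunk.aa

# Merge the per-chunk alignments afterwards.
samtools merge MSC_merged.bam MSC_*/accepted_hits.bam
```

The worry stated above still applies: junctions supported by reads that end up in different chunks may be missed.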

    I also wonder about the --microexon-search and --coverage-search options, do they slow this down considerably? They seem like a good thing to do, but are they hurting me?

    I'm using x86_64 TopHat 1.1.2 and bowtie 0.12.7

    Thanks~

  • #2
    I think more information about where TopHat aborts before completing is required to diagnose the problem!



    • #3
      Good point.. The jobs all die in segment_juncs (v1.1.2 (1643)).



      • #4
        It is not a good idea to map each lane separately; mapping each lane individually is not the same as mapping them together. You will get better results if you run them all at the same time.

        That makes things even slower, of course, but the only real solution to that is to not have the 72-hour limit.



        • #5
          Originally posted by caddymob View Post
          Good point.. The jobs all die in segment_juncs (v1.1.2 (1643)).
          Segmenting junctions takes a long time in general, you probably simply enter that phase at the same time that your run-time limit runs out. I don't think there is anything wrong with that.



          • #6
            Thanks GKM. This is HiSeq data so I am getting upwards of 110 million paired reads (so 220+ million single end reads) in a single lane, and each lane is a single sample... Kinda funny, now we are generating more data than we can handle!!

            I agree with you though that a single sample should be run at once, and that is why I am NOT splitting my FASTQs into smaller chunks and mapping those separately like we might do if this was just genome alignments.

            Indeed segmenting juncs is where it is dying, just taking too long and I cannot find a good way to parallelize the task... Attempting to get access to a machine with 64 cores on a single node to see if that gets me there faster.

            Any other bright ideas are welcome... If tophat/bowtie supported MPI then I wouldn't be having this problem -- but I understand that is a tall order



            • #7
              Have you looked at Myrna? It seems to split read files into small chunks, so I would think they must've tackled the problem of low-abundance splice sites getting lost (hopefully)



              • #8
                Thanks frozenlyse -- I looked at Myrna when it first came out and didn't like the Amazon cloud stuff, since that isn't an option for me. However, I admit that I did not look closely at it, which I have just done. I think this may be a solution, and something I will have to test, but it looks like a bit of a task... Also, from the FAQs, it doesn't do everything I need:
                There are many tools that handle different aspects of analyzing RNA-seq data, but each tool usually has a specialty. Myrna's cloud mode and statistical models make it especially appropriate for very large datasets and datasets consisting of many biological replicates. Myrna's biggest drawback is that it does not attempt to align reads across junctions, assemble isoforms, or otherwise analyze on the isoform or junction level.
                Exclusion of junction mapping seems to be an issue for HiSeq paired 105mers. If a read fails to align across an exon junction, you can lose that expression signal, or that count. Myrna was benchmarked with 35mers, where the probability of a read crossing a junction is much lower than with 105mers. Sure, we could trim these reads, but that's not what we paid for!

                You do bring up a good point though, myrna is parallelizing the task, so it must be possible -- and they say it is in the paper. On the to-do list is adding junction mapping... I just can't wait that long...



                • #9
                  Seems like you have a tough nut. One possibility would be to first use bowtie against a transcript database to suck out all the stuff that looks like known messages. Of course, you'd have to integrate those back in & there is still the risk of losing some useful information. But, it might also be a useful diagnostic to run to see if there are some high abundance messages you could remove first before analyzing the rest.
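                  Sketching that pre-filter idea with bowtie 0.12.7 (the transcript index name is a placeholder; --un collects reads that fail to align, which for paired input are written as leftover_1.fastq / leftover_2.fastq):

```shell
# Align against a transcript index first; pairs that fail to align are
# written out, and only those go through the full tophat run.
# 'transcripts_index' is a hypothetical bowtie index of known messages.
bowtie -p 8 --un leftover.fastq transcripts_index \
    -1 LANE3_1.fastq -2 LANE3_2.fastq aligned_to_transcripts.map

# leftover_1.fastq / leftover_2.fastq should be much smaller inputs for tophat.
```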

                  Perhaps turning off the microexon search would speed things up?

                  The other not entirely pleasant alternative would be to dig into the tophat code so you can divide the different stages into different jobs -- and then hope each one finishes in under the 72 hour time limit (which seems like a very Mordac-ian rule, if applied inflexibly)

                  Also, I believe Myrna can run on any Hadoop system, not just Amazon EC2.



                  • #10
                    Originally posted by caddymob View Post

                    I also wonder about the --microexon-search and --coverage-search options, do they slow this down considerably? They seem like a good thing to do, but are they hurting me?

                    Thanks~
                    Those options, particularly --coverage-search, are going to drastically slow down the computation, and probably aren't buying you much in terms of sensitivity with these reads. Coverage search is designed for reads shorter than 50bp, and is much slower (and less accurate) than the other methods. Microexon search will also slow things down. I'd try leaving both off until you can get a successful run going.
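                    Concretely, that would be the command from the first post with the two options dropped (TopHat 1.x also has an explicit --no-coverage-search flag, rather than relying on the default):

```shell
# Original invocation minus --coverage-search and --microexon-search;
# --no-coverage-search makes the choice explicit.
tophat -r 325 --output-dir MSC --num-threads 8 --no-coverage-search \
    $ref LANE3_1.fastq LANE3_2.fastq
```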



                    • #11
                      Originally posted by caddymob View Post
                      I agree with you though that a single sample should be run at once, and that is why I am NOT splitting my FASTQs into smaller chunks and mapping those separately like we might do if this was just genome alignments.
                      Hi all,

                      I'm sorry to bring back an old thread, but there is something not totally clear to me.
                      If one runs TopHat with the -G and --no-novel-juncs options, is it OK to split a single sample into smaller FASTQs and align each FASTQ independently?

                      I understand that junction discovery is influenced by the number of concordant reads, but if one only wants annotated junctions, it should be equivalent to a straightforward transcriptome + genome alignment. Is this correct?
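For concreteness, the kind of per-chunk run I have in mind (the annotation file and chunk names are just placeholders):

```shell
# With --no-novel-juncs, junctions come only from the supplied annotation,
# so the chunks cannot influence each other's junction set.
# genes.gtf and the chunk file names are hypothetical.
tophat -r 325 -G genes.gtf --no-novel-juncs --num-threads 8 \
    --output-dir MSC_chunk_aa $ref LANE3_1.chunk.aa LANE3_2.chunk.aa
```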

