  • SOS, Tophat is too slow for a large dataset

    Dear all,
    I am a fan of Tophat and have been using it forever. Now I have run into a problem: Tophat is too slow for a large dataset.
    I have 3 lanes of HiSeq data for each sample, about 670 million 100bp PE reads per sample, so I want to process the reads from all 3 lanes together with Tophat.
    Mapping to the genome with Bowtie was fast, but the "Searching for junctions via segment mapping" step has now been running for almost 2 weeks. The log "segment_juncs.log" shows that only chromosomes 4-9 have been processed, which means only about 1/5 of the whole genome has been done in those 2 weeks. This step uses 68G of memory but only one thread.

    The following are the options I used. I know that "--coverage-search --microexon-search" will slow it down, but I am not sure by how much:
    tophat -o tophat_${d}_PE -F 0.05 -i 50 -p 32 --library-type fr-unstranded --mate-std-dev 110 -g 30 --coverage-search --microexon-search --initial-read-mismatches 3

    Any suggestions for speeding up Tophat, especially the "Searching for junctions via segment mapping" step? It looks to me as if each chromosome is analyzed individually in this step. Is there any way to run it in multi-process mode?

    I have more than 20 samples in hand now, so this looks like mission impossible.

    Thanks,

    Mark

    p.s. log of tophat so far:
    [Sat Feb 18 23:36:38 2012] Beginning TopHat run (v1.3.2)
    -----------------------------------------------
    [Sat Feb 18 23:36:38 2012] Preparing output location tophat_702LP_PE/
    [Sat Feb 18 23:36:38 2012] Checking for Bowtie index files
    [Sat Feb 18 23:36:38 2012] Checking for reference FASTA file
    [Sat Feb 18 23:36:38 2012] Checking for Bowtie
    Bowtie version: 0.12.7.0
    [Sat Feb 18 23:36:38 2012] Checking for Samtools
    Samtools Version: 0.1.18
    [Sat Feb 18 23:36:38 2012] Generating SAM header for /mnt/enclosure/mofan/database/HG19/Homo_sapiens.GRCh37.62.dna.chromosome
    [Sat Feb 18 23:37:01 2012] Preparing reads
    format: fastq
    quality scale: phred33 (default)
    [Sat Feb 18 23:37:01 2012] Reading known junctions from GTF file
    Left reads: min. length=100, count=667886430
    Right reads: min. length=100, count=667737832
    [Sun Feb 19 07:37:30 2012] Mapping left_kept_reads against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie
    [Sun Feb 19 14:11:45 2012] Processing bowtie hits
    [Mon Feb 20 01:51:44 2012] Mapping left_kept_reads_seg1 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (1/4)
    [Mon Feb 20 04:47:11 2012] Mapping left_kept_reads_seg2 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (2/4)
    [Mon Feb 20 07:50:27 2012] Mapping left_kept_reads_seg3 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (3/4)
    [Mon Feb 20 11:21:12 2012] Mapping left_kept_reads_seg4 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (4/4)
    [Mon Feb 20 14:42:48 2012] Mapping right_kept_reads against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie
    [Mon Feb 20 20:52:41 2012] Processing bowtie hits
    [Tue Feb 21 09:09:31 2012] Mapping right_kept_reads_seg1 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (1/4)
    [Tue Feb 21 12:36:23 2012] Mapping right_kept_reads_seg2 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (2/4)
    [Tue Feb 21 15:54:08 2012] Mapping right_kept_reads_seg3 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (3/4)
    [Tue Feb 21 19:50:05 2012] Mapping right_kept_reads_seg4 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (4/4)
    [Tue Feb 21 23:22:35 2012] Searching for junctions via segment mapping
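
The time spent in each step can be read off a log like this by subtracting consecutive timestamps. Below is a minimal Python sketch, assuming the log format shown above ("[weekday month day time year] message") and an English locale for the weekday and month names. The function name `stage_durations` is illustrative and is not part of TopHat.

```python
import re
from datetime import datetime

# Matches TopHat-style log lines such as
# "[Sat Feb 18 23:36:38 2012] Preparing reads".
STAMPED = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")
TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # assumes English weekday/month names


def stage_durations(lines):
    """Return (message, hours) for each timestamped step that has a successor.

    The duration of a step is taken as the gap between its timestamp and the
    timestamp of the next logged step, so the last step has no duration.
    """
    steps = []
    for line in lines:
        match = STAMPED.match(line.strip())
        if match:  # lines without a timestamp (e.g. read counts) are skipped
            started = datetime.strptime(match.group("ts"), TS_FORMAT)
            steps.append((started, match.group("msg")))
    return [
        (msg, (next_start - start).total_seconds() / 3600.0)
        for (start, msg), (next_start, _) in zip(steps, steps[1:])
    ]


if __name__ == "__main__":
    # Three lines copied from the log above.
    log = """\
[Sun Feb 19 07:37:30 2012] Mapping left_kept_reads against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie
[Sun Feb 19 14:11:45 2012] Processing bowtie hits
[Mon Feb 20 01:51:44 2012] Mapping left_kept_reads_seg1 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (1/4)
""".splitlines()
    for msg, hours in stage_durations(log):
        print(f"{hours:6.1f} h  {msg}")
```

Because the final logged step is still running, it gets no entry; its elapsed time would have to be measured against the current time instead.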

  • #2
    a) Update your TopHat. Starting with 1.4.0 there's an 'align to transcriptome first' mode that should bring you quite a bit of improved performance.
    b) I guess, really consider whether you need to find novel splice junctions - though with the number of reads you have, I guess that's your aim in the first place.

    • #3
      Offtopic question:
      After you (finally) run Tophat, are you by any chance planning on running Cufflinks? I'd be curious to know how successful transcript assembly ends up being when you have a very high number of reads mapping to the same locus (we've had a few issues here, and tweaking the parameters doesn't seem to help)...

      • #4
        Originally posted by ffinkernagel
        a) Update your TopHat. Starting with 1.4.0 there's an 'align to transcriptome first' mode that should bring you quite a bit of improved performance.
        b) I guess, really consider whether you need to find novel splice junctions - though with the number of reads you have, I guess that's your aim in the first place.
        Thank you for the suggestion.

        a. I've tried the new version of Tophat, but when you need to find novel junctions, it still spends a lot of time in the "Searching for junctions via segment mapping" step. Please see the logs below:

        For a dataset of around 150 million 50bp PE reads:

        Tophat (v1.3.2) spent 77.4h in this step:
        [Thu Jan 26 23:26:53 2012] Searching for junctions via segment mapping
        [Mon Jan 30 04:49:38 2012] Retrieving sequences for splices

        TopHat run (v1.4.0) spent 120h:
        [Wed Feb 1 09:10:51 2012] Searching for junctions via segment mapping
        [Mon Feb 6 09:09:06 2012] Retrieving sequences for splices

        b. Novel exons/junctions are part of my aim. So I guess the only option is to run Tophat lane by lane separately, though some information will be lost.

        Mark

        P.S. I attached a table showing that the running time increases dramatically as the number of input reads increases, especially for the junction searching step.

        • #5
          Hm. You did pass the -g option (or was it -G?) to 1.4, right?
          As I understand it, that should dramatically lower the number of reads entering the junction detection step, and therefore the runtime. I must admit I haven't tried it, though.

          • #6
            You don't have to stick to Tophat.

            There are other mappers that can find novel junctions:

            RUM

            STAR

            The former is very accurate, the latter very fast.

            • #7
              You might want to try GSNAP as well, I often find it faster and more sensitive than TopHat. Its output (when writing SAM) is compatible with Cufflinks.

              • #8
                Originally posted by arvid
                You might want to try GSNAP as well, I often find it faster and more sensitive than TopHat. Its output (when writing SAM) is compatible with Cufflinks.
                I also tried GSNAP+Cufflinks. But when I fed Cufflinks the SAM generated by GSNAP, a lot of genes were missed. The attached picture shows that the gene AR was missed entirely, although a massive number of reads mapped to it. (Sorry for the figure quality; it is due to the attachment size limit.)
                Any ideas what's wrong? Thanks,

                Mark

                • #9
                  Originally posted by Mark.hz
                  I also tried GSNAP+Cufflinks. But when I fed Cufflinks the SAM generated by GSNAP, a lot of genes were missed. The attached picture shows that the gene AR was missed entirely, although a massive number of reads mapped to it. (Sorry for the figure quality; it is due to the attachment size limit.)
                  Any ideas what's wrong? Thanks,

                  Mark
                  Interesting, no idea what is going on there. Did you look for some correlation between the transcripts missed by Cufflinks and mapping qualities, strand or some other property of the read alignments?

                  • #10
                    Sorry to jump in here, but I am also concerned about the runtime for Tophat. I have literally just gotten started in RNA-Seq, and I have three samples that ran on one lane that are 136 M reads, PE, 100bp.

                    I need to align to hg19, and the only PC I have only has one processor with 4GB of RAM. Is this even doable? What would be a respectable computing power to use, and how long can I expect to wait? Is Tophat just too slow for large datasets?

                    • #11
                      Star

                      Hi Mark,

                      If you are willing to try something new, I would recommend our RNA mapper called STAR. We developed it specifically for large datasets. We routinely run it on 100-200M PE reads for ENCODE transcriptome production. For 100b PE reads the speed can be as high as ~20M pairs per CPU-hour; however, it requires a relatively large amount of RAM, ~27GB for the human genome. In our assessment it is more accurate (has lower FPR/FNR) than Tophat.
                      The latest version is here:
                      ftp://ftp2.cshl.edu/gingeraslab/trac...release/2.0.2/

                      As to running the alignments through Cufflinks, our experience is mixed. For relatively simple samples, like cytoplasm A+ RNA, it works quite well and is fast, but for more complex samples, especially A-, we had to manually remove some loci that were too complicated for Cufflinks.

                      Cheers
                      Alex
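
For a rough sense of scale, the quoted peak throughput can be turned into a best-case CPU-time estimate. The sketch below is plain arithmetic, not a benchmark: the ~20M pairs per CPU-hour figure is the upper bound stated in this post, the read-pair count is the left-read count from the TopHat log in the first post, and the function name `best_case_cpu_hours` is made up for illustration. Real runs will be slower, and wall-clock time depends on how well the run scales across threads.

```python
def best_case_cpu_hours(read_pairs, pairs_per_cpu_hour=20e6):
    """Lower bound on CPU-hours, assuming the quoted peak throughput holds."""
    return read_pairs / pairs_per_cpu_hour


if __name__ == "__main__":
    pairs = 667_886_430  # left-read count reported in the TopHat log above
    cpu_hours = best_case_cpu_hours(pairs)
    print(f"best case: about {cpu_hours:.1f} CPU-hours for {pairs:,} pairs")
```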

                      • #12
                        Originally posted by alexdobin
                        Hi Mark,

                        If you are willing to try something new, I would recommend our RNA mapper called STAR. We developed it specifically for large datasets. We routinely run it on 100-200M PE reads for ENCODE transcriptome production. For 100b PE reads the speed can be as high as ~20M pairs per CPU-hour; however, it requires a relatively large amount of RAM, ~27GB for the human genome. In our assessment it is more accurate (has lower FPR/FNR) than Tophat.
                        The latest version is here:
                        ftp://ftp2.cshl.edu/gingeraslab/trac...release/2.0.2/

                        As to running the alignments through Cufflinks, our experience is mixed. For relatively simple samples, like cytoplasm A+ RNA, it works quite well and is fast, but for more complex samples, especially A-, we had to manually remove some loci that were too complicated for Cufflinks.

                        Cheers
                        Alex
                        Hi Alex,

                        A couple of questions on STAR:

                        1. Does it do indels as well?
                        2. I didn't find any parameters for min and max intron size in the manual; are there hidden defaults or is it not possible to set such parameters at all? I work on plant genomes, some with very short introns and intergenic regions, so such parameters are important for me...

                        • #13
                          Originally posted by billstevens
                          Sorry to jump in here, but I am also concerned about the runtime for Tophat. I have literally just gotten started in RNA-Seq, and I have three samples that ran on one lane that are 136 M reads, PE, 100bp.

                          I need to align to hg19, and the only PC I have only has one processor with 4GB of RAM. Is this even doable? What would be a respectable computing power to use, and how long can I expect to wait? Is Tophat just too slow for large datasets?
                          RAM up to ~32 GB is quite cheap; this week I bought another 8 GB for my workstation, which cost us less than US$100 (that was computer-brand memory; I guess you can get it much cheaper). I'd invest a few hundred dollars in an 8-core, 16 GB machine; with that you can do expression analysis quite well, in my opinion (no de novo assembly, however). Or buy cloud computing time.

                          • #14
                            Originally posted by arvid
                            Hi Alex,

                            A couple of questions on STAR:

                            1. Does it do indels as well?
                            2. I didn't find any parameters for min and max intron size in the manual; are there hidden defaults or is it not possible to set such parameters at all? I work on plant genomes, some with very short introns and intergenic regions, so such parameters are important for me...
                            1. Yes, STAR detects insertions and deletions.
                            2. The minimum intron size is determined by --scoreDelLmax (=20 by default). If the genomic gap is below that value, it is considered a deletion; otherwise, an intron. The maximum intron size is approximately determined by winAnchorDistNbins*2^winBinNbits = 9*2^16 ~ 600 kbases by default, which we think works well for mammalian genomes, but you can increase it at will by increasing --winAnchorDistNbins. There is yet another parameter, --winFlankNbins, that determines the maximum gap for lower-confidence introns: winFlankNbins*2^winBinNbits.
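
The arithmetic in point 2 can be checked with a few lines of Python. This is only a sketch of the formulas as stated in this post: the defaults (9 bins, 16 bits, a threshold of 20) are the ones quoted above, and the helper names `max_gap_bases` and `classify_gap` are hypothetical, not part of STAR.

```python
def max_gap_bases(n_bins, win_bin_nbits=16):
    """Approximate maximum gap in bases: n_bins * 2**win_bin_nbits."""
    return n_bins * 2 ** win_bin_nbits


def classify_gap(gap_length, score_del_l_max=20):
    """Gaps below the threshold count as deletions, all others as introns."""
    return "deletion" if gap_length < score_del_l_max else "intron"


if __name__ == "__main__":
    # Default from the post: winAnchorDistNbins=9, winBinNbits=16.
    print(max_gap_bases(9))   # 589824 bases, i.e. roughly 600 kbases
    # Doubling winAnchorDistNbins doubles the approximate maximum intron size.
    print(max_gap_bases(18))  # 1179648 bases
    print(classify_gap(15), classify_gap(80))
```

For genomes with short introns, the same function shows how far the maximum shrinks when the bin count is lowered, since the relationship is linear in the number of bins.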

                            • #15
                              I would suggest Subread as a read aligner: http://sourceforge.net/projects/subread/
