Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GATK pipeline runtime

    I wonder if I'm doing the things right; it takes so long.

    I'm going to find SNPs for about 100 giga bps for the human genome.

    After alignment, I processed my reads with samtools, picard and GATK.


    1. Converting (divided for alignment efficiency) SAM to BAM format. [samtools]
    ~500 min.

    2. Sorting the BAM format. [samtools]
    ~12 h.

    3. Merging divided BAM formats to 1 BAM format per sample-run. [picard MergeSamFiles.jar]

    4. Adding read groups which I didn't include in the SAM files. [picard AddOrReplaceReadGroups.jar]

    5. Indexing sorted & merged BAM output. [samtools]

    6. Create suspicious intervals for realignment. [GATK RealignTargetCreator]

    7. Realign to remove false-positive SNPs and correct for false-negative indels - if this is the right explanation. [GATK IndelRealigner]

    8. Marking plausible PCR duplicates. [picard MarkDuplicates.jar]

    9. Re-indexing sorted-merged-realigned-deduplicated BAM. [samtools]

    10. Recalibrating base quality [GATK CountCovariates, TableRecalibration]

    11. Re-indexing sorted-merged-realigned-deduplicated-recalibrated BAM. [samtools]

    12. Calling SNPs and indels [-glm BOTH] with GATK Bayesian caller. [GATK UnifiedGenotyper.jar]
    ~3.2 hours for my toy; ~1/500 of my real sample size.

    13. Filter SNVs ## I haven't reached here yet.


    Is it normal to take this long for pre-processing map/alignment results?

    Before I go into troubleshooting I came here to ask for comments.

    Hope you guys give me some comments.

    Have a great day!!
    Last edited by alexbmp; 11-03-2011, 10:33 PM.

  • #2
    No it shoudln't take that long ... but this is proportional to your computer's capabilities ... What is your config (specially processors)? Are you using threads in GATK ? How many ?

    Comment


    • #3
      Originally posted by raonyguimaraes View Post
      No it shoudln't take that long ... but this is proportional to your computer's capabilities ... What is your config (specially processors)? Are you using threads in GATK ? How many ?
      model name : AMD Opteron(tm) Processor 6172
      cpu MHz : 2100.034

      I think samtools sort takes so long probably because I divided my .bam alignment output into 500 pieces... (before alignment and the output is still divided)

      There are 48 cores; I gave -nt value 24, but it seems to utilize about 6 cores.

      Comment


      • #4
        We use similar machines and we find BWA scales perfectly, in that 48 cores enable mapping nearly 48 times faster than 1 core. We load data directly to the computer's internal RAID because otherwise the I/O is unbearably slow. We regularly map human genomes and memory has not been a bottleneck for us. We don't divide our SAM files. Are you making SAM files and then converting to BAM? If so, that can be sped up by directly streaming the mapper to samtools for conversion to BAM...
        Last edited by adaptivegenome; 11-04-2011, 04:26 PM. Reason: typos

        Comment


        • #5
          @genericforms

          You mean 1) mappers like bwa, gsnap etc. can stream its output to samtools,

          or 2) we could add another line after bwa aln, samse/pe to stream its output to samtools?

          It really is a pain treating divided files...

          Simple merging by picard takes more than 6 hours.

          Comment


          • #6
            Yes, I meant #1. This saves on having to first write a SAM file and then write it again as a BAM file. I don't think you should split the files and again I would recommend running on a local drive.

            Comment


            • #7
              It would be something like this:
              bwa sampe [parameters] | samtools view -bt [reference] > out.bam

              Comment


              • #8
                Thank you all I'm trying it the other way.

                The splicing seems to have taken most of the time.

                Comment


                • #9
                  Good luck. Post an update on whether things work better for you.

                  Comment


                  • #10
                    Not using splitted files really make the speed.

                    Still, it seems too slow to follow the pipeline GATK suggests...

                    I've changed my pipeline a little bit.

                    I reckoned that almost all the time indices are required.

                    My process stopped when MarkDuplicates.jar make an error:
                    (Any ideas about this error?)

                    Code:
                    [Thu Nov 10 15:08:17 KST 2011] net.sf.picard.sam.MarkDuplicates done. Elapsed time: 328.99 minutes.
                    Runtime.totalMemory()=3538878464
                    Exception in thread "main" net.sf.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
                            at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:228)
                            at net.sf.samtools.util.SortingCollection.add(SortingCollection.java:150)
                            at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:290)
                            at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:117)
                            at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
                            at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:101)
                    Caused by: java.io.IOException: No space left on device
                            at java.io.FileOutputStream.write(Native Method)
                            at org.xerial.snappy.SnappyOutputStream.writeInt(SnappyOutputStream.java:105)
                            at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:126)
                            at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:100)
                            at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:137)
                            at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:219)
                            ... 5 more
                    1. Sorting the BAM format. [samtools]
                    ~12 h.

                    2. Adding read groups which I didn't include in the SAM files. [picard AddOrReplaceReadGroups.jar]
                    ~6 h.

                    3. Indexing BAM output. [samtools]
                    ~32 min.

                    4. Reordering BAM; my reads didn't seem to be in karyotypic order. [picard ReorderSam.jar]
                    ~6.5 h.

                    5. Indexing BAM output. [samtools]
                    ~24 min.

                    *STOPPED HERE*6. Marking plausible PCR duplicates. [picard MarkDuplicates.jar]
                    ~5.5 h.

                    7. Indexing BAM output. [samtools]

                    8. Create suspicious intervals for realignment. [GATK RealignTargetCreator]

                    9. Realign to remove false-positive SNPs and correct for false-negative indels - if this is the right explanation. [GATK IndelRealigner]

                    10. Indexing BAM output. [samtools]

                    11. Recalibrating base quality [GATK CountCovariates, TableRecalibration]

                    12. Indexing BAM output. [samtools]

                    13. Calling SNPs and indels [-glm BOTH] with GATK Bayesian caller. [GATK UnifiedGenotyper.jar]
                    ~3.2 hours for my toy; ~1/500 of my real sample size.

                    14. Filter SNVs [GATK VariantFiltration]


                    I'm looking for a way to solve the error given by picard.

                    Anyways, I wanted to update my work here, so others don't get in to the same trouble I did.

                    If you have any ideas about my plan or my question, I'd be most welcome.
                    Last edited by alexbmp; 11-10-2011, 04:38 AM.

                    Comment


                    • #11
                      Originally posted by alexbmp View Post
                      My process stopped when MarkDuplicates.jar make an error:
                      (Any ideas about this error?)
                      Have you set TMP_DIR to a place with enough disk space. You can set it using something lijke this: TMP_DIR=`pwd`/tmp or whatever you like.

                      Comment


                      • #12
                        @ vang42:
                        Actually, I haven't. That really is a good point! Thank you!

                        I'll try it on and update if it works

                        Comment


                        • #13
                          Well, it works! Thank you all!

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM
                          • seqadmin
                            Techniques and Challenges in Conservation Genomics
                            by seqadmin



                            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                            Avian Conservation
                            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                            03-08-2024, 10:41 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Yesterday, 06:37 PM
                          0 responses
                          12 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, Yesterday, 06:07 PM
                          0 responses
                          10 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-22-2024, 10:03 AM
                          0 responses
                          52 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-21-2024, 07:32 AM
                          0 responses
                          68 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X