Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Threaded Sam to Bam conversion with Samtools View

    I am attempting to speed up the sam to bam conversion of a whole genome (paired end) alignment with Samtools0-1.19. Alignment was done with bwa-0.7.7 mem algorithm
    Commands:
    samtools view -bS -o .bam pe.sam (default)
    samtools view -bS -@ 10 -m 2G -o .bam pe.sam (threaded)

    Comparing the output .bam files there is a 0.4G difference in file size.
    I ran samtools flagstat on both bam files.
    Differences:
    6,026,490 QC passed reads
    6,026,490 paired in sequencing
    779,134 read 1
    5,247,356 read 2
    all other metrics are identical

    Can anyone explain why threading would give less reads on the same input .sam file? I assume it may have to do with the merging of thread data?

    Is there a way to correct this issue without losing the speed increase provided by threading? (Currently without threading conversion takes 6 hours. With threading Conversion takes 1 hour 15 minutes)

  • #2
    That really shouldn't happen. It makes sense that you could get different sized files, but they should contain identical number of reads. Does the same thing happen if you use the new version?

    Comment


    • #3
      MAYBE ...

      The gzip compression used by samtools binary BAM version will be using different input streams depending on threading or non-threading.

      The non-threaded version will discover a different optimization that the threaded version.

      Can both BAMS versions be re-transformed back to identical SAM files?

      dpryan is right about you should have the same read counts.
      Last edited by Richard Finney; 08-21-2014, 02:06 PM.

      Comment


      • #4
        Thanks. I am currently testing a comparison run with the latest version of samtools. Hope to have an update soon.

        Comment


        • #5
          I completed the same test with the latest version of samtools. Again there is a 0.4G difference in file size:
          10 Threads 2G:
          1367501828 + 0 in total (QC-passed reads + QC-failed reads)
          0 + 0 duplicates
          1367501828 + 0 mapped (100.00%:nan%)
          1367501828 + 0 paired in sequencing
          685896751 + 0 read1
          681605077 + 0 read2
          1343716094 + 0 properly paired (98.26%:nan%)
          1362706162 + 0 with itself and mate mapped
          4795666 + 0 singletons (0.35%:nan%)
          11194510 + 0 with mate mapped to a different chr
          5019248 + 0 with mate mapped to a different chr (mapQ>=5)

          Single Thread, default settings:
          1373528318 + 0 in total (QC-passed reads + QC-failed reads)
          0 + 0 duplicates
          1367501828 + 0 mapped (99.56%:nan%)
          1373528318 + 0 paired in sequencing
          686675885 + 0 read1
          686852433 + 0 read2
          1343716094 + 0 properly paired (97.83%:nan%)
          1362706162 + 0 with itself and mate mapped
          4795666 + 0 singletons (0.35%:nan%)
          11194510 + 0 with mate mapped to a different chr
          5019248 + 0 with mate mapped to a different chr (mapQ>=5)

          Differences:
          6,026,490 QC passed reads
          6,026,490 paired in sequencing
          779,134 read 1
          5,247,356 read 2

          Any ideas? Is it just not possible to use multiple threads for this without losing data?

          Comment


          • #6
            This is troubling. Can you post the exact commands you used?

            Would it be possible for you to gzip the original SAM file and post is somewhere (google drive, dropbox, etc.) so we can track down exactly why this is happening?

            Comment


            • #7
              I should note that it's interesting that the differences are due to unmapped reads.

              Comment


              • #8
                Below are the commands that I used to run the sort. I am unable to provide the sam file as it is human data that is not involved in a public study.

                threaded
                Run on our cluster, all threads on one server
                samtools view -bS -@ 10 -m 2G -o {sample}.bam {sample}_pe.sam

                Run on our cluster, same server as above
                samtools view -bS -o {sample}.bam {sample}_pe.sam

                Comment


                • #9
                  Fair enough (can't argue with HIPAA), I'll have to see if I can reproduce this with a dataset locally (since I only keep BAM files around, I'll have to realign something). There's certainly no reason those commands should produce this behavior.

                  Comment


                  • #10
                    thank you for looking into this. Greatly appreciated.

                    Comment


                    • #11
                      If it is of any use the sam file was generated as follows:
                      bwa mem -M -t 8 -R ''$RG'' $bwaREF $read1 $read2 > ${sample}_pe.sam

                      Comment


                      • #12
                        You read my mind, thanks

                        Comment


                        • #13
                          I should have looked more carefully at the command you were using. You're mixing up the samtools view and samtools sort options. It looks like you're trying to allow each compression thread to have up to 2 gigs of memory (-m 2G), but that's not what the -m option does in samtools view. What that does is in samtools view is filter by query read length according to the CIGAR string. Since unmapped reads have no CIGAR string (they have an * there), you end up filtering them out. This is the cause of the difference you're seeing.

                          BTW, there's no way to tell the compressor threads how much memory to use when doing the conversion.

                          Comment


                          • #14
                            Ah, thank you!

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM
                            • seqadmin
                              The Impact of AI in Genomic Medicine
                              by seqadmin



                              Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                              02-26-2024, 02:07 PM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 03-14-2024, 06:13 AM
                            0 responses
                            34 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-08-2024, 08:03 AM
                            0 responses
                            72 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-07-2024, 08:13 AM
                            0 responses
                            81 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-06-2024, 09:51 AM
                            0 responses
                            68 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X