Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Threaded Sam to Bam conversion with Samtools View

    I am attempting to speed up the sam to bam conversion of a whole genome (paired end) alignment with Samtools0-1.19. Alignment was done with bwa-0.7.7 mem algorithm
    Commands:
    samtools view -bS -o .bam pe.sam (default)
    samtools view -bS -@ 10 -m 2G -o .bam pe.sam (threaded)

    Comparing the output .bam files there is a 0.4G difference in file size.
    I ran samtools flagstat on both bam files.
    Differences:
    6,026,490 QC passed reads
    6,026,490 paired in sequencing
    779,134 read 1
    5,247,356 read 2
    all other metrics are identical

    Can anyone explain why threading would give less reads on the same input .sam file? I assume it may have to do with the merging of thread data?

    Is there a way to correct this issue without losing the speed increase provided by threading? (Currently without threading conversion takes 6 hours. With threading Conversion takes 1 hour 15 minutes)

  • #2
    That really shouldn't happen. It makes sense that you could get different sized files, but they should contain identical number of reads. Does the same thing happen if you use the new version?

    Comment


    • #3
      MAYBE ...

      The gzip compression used by samtools binary BAM version will be using different input streams depending on threading or non-threading.

      The non-threaded version will discover a different optimization that the threaded version.

      Can both BAMS versions be re-transformed back to identical SAM files?

      dpryan is right about you should have the same read counts.
      Last edited by Richard Finney; 08-21-2014, 02:06 PM.

      Comment


      • #4
        Thanks. I am currently testing a comparison run with the latest version of samtools. Hope to have an update soon.

        Comment


        • #5
          I completed the same test with the latest version of samtools. Again there is a 0.4G difference in file size:
          10 Threads 2G:
          1367501828 + 0 in total (QC-passed reads + QC-failed reads)
          0 + 0 duplicates
          1367501828 + 0 mapped (100.00%:nan%)
          1367501828 + 0 paired in sequencing
          685896751 + 0 read1
          681605077 + 0 read2
          1343716094 + 0 properly paired (98.26%:nan%)
          1362706162 + 0 with itself and mate mapped
          4795666 + 0 singletons (0.35%:nan%)
          11194510 + 0 with mate mapped to a different chr
          5019248 + 0 with mate mapped to a different chr (mapQ>=5)

          Single Thread, default settings:
          1373528318 + 0 in total (QC-passed reads + QC-failed reads)
          0 + 0 duplicates
          1367501828 + 0 mapped (99.56%:nan%)
          1373528318 + 0 paired in sequencing
          686675885 + 0 read1
          686852433 + 0 read2
          1343716094 + 0 properly paired (97.83%:nan%)
          1362706162 + 0 with itself and mate mapped
          4795666 + 0 singletons (0.35%:nan%)
          11194510 + 0 with mate mapped to a different chr
          5019248 + 0 with mate mapped to a different chr (mapQ>=5)

          Differences:
          6,026,490 QC passed reads
          6,026,490 paired in sequencing
          779,134 read 1
          5,247,356 read 2

          Any ideas? Is it just not possible to use multiple threads for this without losing data?

          Comment


          • #6
            This is troubling. Can you post the exact commands you used?

            Would it be possible for you to gzip the original SAM file and post is somewhere (google drive, dropbox, etc.) so we can track down exactly why this is happening?

            Comment


            • #7
              I should note that it's interesting that the differences are due to unmapped reads.

              Comment


              • #8
                Below are the commands that I used to run the sort. I am unable to provide the sam file as it is human data that is not involved in a public study.

                threaded
                Run on our cluster, all threads on one server
                samtools view -bS -@ 10 -m 2G -o {sample}.bam {sample}_pe.sam

                Run on our cluster, same server as above
                samtools view -bS -o {sample}.bam {sample}_pe.sam

                Comment


                • #9
                  Fair enough (can't argue with HIPAA), I'll have to see if I can reproduce this with a dataset locally (since I only keep BAM files around, I'll have to realign something). There's certainly no reason those commands should produce this behavior.

                  Comment


                  • #10
                    thank you for looking into this. Greatly appreciated.

                    Comment


                    • #11
                      If it is of any use the sam file was generated as follows:
                      bwa mem -M -t 8 -R ''$RG'' $bwaREF $read1 $read2 > ${sample}_pe.sam

                      Comment


                      • #12
                        You read my mind, thanks

                        Comment


                        • #13
                          I should have looked more carefully at the command you were using. You're mixing up the samtools view and samtools sort options. It looks like you're trying to allow each compression thread to have up to 2 gigs of memory (-m 2G), but that's not what the -m option does in samtools view. What that does is in samtools view is filter by query read length according to the CIGAR string. Since unmapped reads have no CIGAR string (they have an * there), you end up filtering them out. This is the cause of the difference you're seeing.

                          BTW, there's no way to tell the compressor threads how much memory to use when doing the conversion.

                          Comment


                          • #14
                            Ah, thank you!

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Current Approaches to Protein Sequencing
                              by seqadmin


                              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                              04-04-2024, 04:25 PM
                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 04-11-2024, 12:08 PM
                            0 responses
                            25 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 10:19 PM
                            0 responses
                            29 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 09:21 AM
                            0 responses
                            25 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-04-2024, 09:00 AM
                            0 responses
                            52 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X