Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Samtools rmdup output / number of reads?

    Dear all,

    I am quite new to this community but have already found lots of helpful answers scanning through the threads here. But now I have come to a topic which I didn't found covered yet.

    I used samtools rmdup to remove duplicate reads from my bwa alignments. I also analysed the data including duplicates, i.e. with all reads, but want to see the difference upon using duplicate-free .bam files...

    Using rmdup, it gives me the following output:

    samtools rmdup -s S1_sorted.bam S1_unique.bam
    [bam_rmdupse_core] 6479447 / 18248499 = 0.3551 in library ' '

    Does this mean that there were initially 18,248,499 reads, and that 6,479,447 were suspected to be duplicate and thus removed?

    That would mean, the number of "uniquely" mapped reads should be 18,248,499 - (minus) 6,479,447 = 11,769,052 reads.

    Since I am not sure how to interpret the rmdupse shell comment, I thought of another possibility to get the number of uniquely mapped reads:

    1) samtools index S1_unique.bam
    2) samtools view S1_unique.bam | wc -l 11897943

    I.E. just counting the lines of the resulting "unique.bam". However, there is a slight difference between the two numbers (11,897,943 vs 11,769,05).

    Anyone able to explain it?

    Ideas are very appreciated!

    Thanks in advance.

  • #2
    Are there unmapped reads in your file? Are you counting the header lines?

    Comment


    • #3
      Hi nilshomer,

      I guess there are unmapped reads in my file, since no alignment is perfect?! I checked for unmapped reads:

      samtools view -f4 -c S1_unique.bam
      128891

      But still, I do not understand why unmapped reads should explain the different in read numbers by a) wc -l and b) output from rmdupse?

      And another question: Is there any need to remove unmapped reads before further processing, e.g. mpileup?

      Concerning header lines in my .bam files, I don't think there are any:

      samtools view S1_unique.bam | head

      gives the following output:


      61WG6AAXX_1:63:18124:15753 16 chr1 10023 0 76M * 0 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAeedbaeeecedbcc`b[abaLdacdd
      eeeeeeeecefffffeeeeedeeeedddddeeeeefffffffffffffcf XT:A:R NM:i:0 X0:i:368 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:54:8505:15971 16 chr1 10024 0 76M * 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACdbdbddb`\^b_b^`\`bb\b^ccc^
      b_ceddddddddddcffdffeeeeedfefefefffefdffeffffeffff XT:A:R NM:i:0 X0:i:372 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:3:3180:2050 0 chr1 10034 0 76M * 0 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACffffffffffefffffeeffffcfff
      fffffedfefddddedd_ded`d^c`cccdddddddbdddddYdaaaXaB XT:A:U NM:i:0 X0:i:1 X1:i:373 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:85:5647:17815 16 chr1 10044 18 76M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCY\b`b`a`\bb\b`dde`dbeeeceb
      eee_dcdffcedeeeaeceeececeefcfddffdfdffffeeeeeeecee XT:A:U NM:i:1 X0:i:1 X1:i:3 XM:i:1 XO:i:0 XG:i:0 MD:Z:73C2 XA:Z:chr4,+191044156,76M,2;chr15,+102521285,76M
      ,2;chr4,+191043973,76M,2;
      61WG6AAXX_1:9:11361:10428 16 chr1 10164 0 76M * 0 0 AACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTa`aaaS`^```\b`dde_bded_abb
      dd\dbddd\ddddd^eceeeaeceeedeceffdfceeedeceeeeddddd XT:A:R NM:i:2 X0:i:3 X1:i:2 XM:i:2 XO:i:0 XG:i:0 MD:Z:16T53C5
      61WG6AAXX_1:102:19247:8028 0 chr1 10364 23 76M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACcffeffffeffffcffffceffffef
      cfffddfdfcfdafcddadddfddddddfddddddddddcbc`YY_b`b\ XT:A:U NM:i:1 X0:i:1 X1:i:1 XM:i:1 XO:i:0 XG:i:0 MD:Z:74A1 XA:Z:chr20,-62918671,76M,2;
      61WG6AAXX_1:87:9895:13370 0 chr1 14063 0 76M * 0 0 GATGGCACCTCCCTCCCTCTCAACCACTTGAGCAAACTCCAAGACATCTTCTAffdcdfffefffffffffffffffff
      ffffefffffefffedfffffeffffefdffdfefffdfffddffdeddf XT:A:R NM:i:2 X0:i:6 X1:i:1 XM:i:2 XO:i:0 XG:i:0 MD:Z:74C0A0
      61WG6AAXX_1:94:13388:6107 0 chr1 15960 0 76M * 0 0 GGTGGGGAACAGGGCAAGGAGGAAAGGCTGCTCAGGCAGGGCTGGGGAAGCTTeededffffffffffffffedcffdd
      affffffadc`edeffddd`edTeddfccffdfdd`YdbdbY\bedadd^ XT:A:R NM:i:0 X0:i:7 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:16:13382:17255 16 chr1 19197 15 76M * 0 0 TAGCTCCCCTACCTCCAAGAGCCCAGCCCTTGCCCACAGGGCCACACTCCACGY`aaTaaaaY\^b`cbcedYdcff^f
      dafacefffdfcffafdefffefffffdfceffffffffffefffffdff XT:A:U NM:i:0 X0:i:1 X1:i:6 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:104:17203:5703 0 chr1 19556 0 76M * 0 0 CCCCAGAGAGCCAATTTCACAATCCAGAAGTCCCCGTGCCCTAAAGGGTCTGCffffffefffffffffffffefffff
      fffffffffffffffffffdeddffefffffdedffddfdddddeddddd XT:A:R NM:i:0 X0:i:3 X1:i:4 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      Last edited by Marie_Noir; 07-22-2011, 12:36 AM.

      Comment


      • #4
        Hi Marie,
        Things like that can be quite confusing at times, just check this, it might solve it.

        Read info on my original dataset:
        16562593 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        14493755 + 0 mapped (87.51%:nan%)

        After removing unmapped reads (samtools view -bq 1 file.bam)
        11965422 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        11965401 + 0 mapped (100.00%:nan%)

        Using rmdup on this unique data: (3457 / 11965401 = 0.0003)
        11961965 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        11961944 + 0 mapped (100.00%:nan%)


        So, 11965401-3457=11961944, which the original number of completely aligned reads (100%) from the previous file, but when you do 'wc -l', it tells you the uniquely+non-uniquely aligned reads, which equals to 11961965. There is a difference of 21 reads in there. My data set is small and little non-redundant, when the depth is increased, this difference becomes big, thats why you might have this diff. or for a dataset which is multiplexed and the starting material is small, there are more number of duplicates introduced during the amplification step.

        I will suggest to use 'samtools flagstat file.bam' to get info about the file.

        I hope that helps a bit.

        Sukhi

        Comment


        • #5
          Originally posted by Marie_Noir View Post
          Dear all,

          samtools rmdup -s S1_sorted.bam S1_unique.bam
          [bam_rmdupse_core] 6479447 / 18248499 = 0.3551 in library ' '
          6479447 represents the number of filtered reads.
          18248499 stands for the mapped reads.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Essential Discoveries and Tools in Epitranscriptomics
            by seqadmin




            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
            04-22-2024, 07:01 AM
          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Today, 11:49 AM
          0 responses
          13 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Yesterday, 08:47 AM
          0 responses
          16 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          61 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          60 views
          0 likes
          Last Post seqadmin  
          Working...
          X