Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • samtools flagstat reportin incorrect input count for Bowtie2 mapping

    I am using bowtie2 (default settings) to map ~1.95 million reads against a small reference. The input queries I fed to bowtie2 were all fastq. There were 4 total files:

    2 paired fastq (1523516 total reads ... 761758 pairs)
    2 'orphan' files (not paired ... file1: 232282 reads, file2: 193747 reads)

    In total my query files contain 1,949,545 reads. I ran all this in a single bowtie2 command using -1, -2 & -U with the 2 orphan files being comma-delimited list for -U.

    Bowtie2 finished without complaint, and it reports these mapping stats:
    +++++++++++++++++++++
    1187787 reads; of these:
    761758 (64.13%) were paired; of these:
    471690 (61.92%) aligned concordantly 0 times
    290068 (38.08%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    ----
    471690 pairs aligned concordantly 0 times; of these:
    1089 (0.23%) aligned discordantly 1 time
    ----
    470601 pairs aligned 0 times concordantly or discordantly; of these:
    941202 mates make up the pairs; of these:
    746446 (79.31%) aligned 0 times
    194756 (20.69%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
    426029 (35.87%) were unpaired; of these:
    290590 (68.21%) aligned 0 times
    135439 (31.79%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
    46.81% overall alignment rate
    +++++++++++++++++++++

    Those numbers are wrong. My input had exactly 1,949,545 reads. Bowtie2 is reporting 1187787 total reads.

    Then I took the output sam file and ran samtools flagstats, getting these numbers:
    1252499 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    215463 + 0 mapped (17.20%:-nan%)
    961790 + 0 paired in sequencing
    519797 + 0 read1
    441993 + 0 read2
    212926 + 0 properly paired (22.14%:-nan%)
    215046 + 0 with itself and mate mapped
    298 + 0 singletons (0.03%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    this is also wrong. Then I tried running samstats, and when I add up all the MAPQ category numbers (including the unmapped), it sums up to:

    1252499 reads


    So, then I tried a simple samtools view command to try and count the total # reads reported:

    samtools view <bam> | cut -f1 | sort | uniq | wc -l

    and I get:

    868833 reads



    None of these methods agree with any other method. And none come close to my actual input read number. My understanding is that bowtie2 should (by default) only be reporting 1 alignment per query, so the total numbers reported shouldn't be confused by issues of supplemental & non-primary alignments.

    Can anyone tell me what is going on? I see no evidence of truncation or failure in any of my outputs. And even if there had been, I would have expected these methods to report the same number of total reads. Help?

  • #2
    This is bowtie2's behavior. But to check:


    1) What version of bowtie2 are you using? Is it the most recent?
    1a) Ditto with samtools.

    2) Please copy and paste the *exact* command line you are using.

    3) My version 2.2.7 shows output like the following (for 50 pairs, 25 singletons, and another 15 singletons. Note that the total starting number is exactly like yours: #_of_pairs plus #_of_singletons.

    Your numbers are 761758 + 232282 + 193747 = 1187787

    Mine are: 50 + 25 + 15 = 85

    Code:
    bowtie2 -x coli.fasta -1 R1.fastq -2 R2.fastq -U single.fastq,single2.fastq  -S r.sam
    
    
    85 reads; of these:
      50 (58.82%) were paired; of these:
        39 (78.00%) aligned concordantly 0 times
        10 (20.00%) aligned concordantly exactly 1 time
        1 (2.00%) aligned concordantly >1 times
        ----
        39 pairs aligned concordantly 0 times; of these:
          3 (7.69%) aligned discordantly 1 time
        ----
        36 pairs aligned 0 times concordantly or discordantly; of these:
          72 mates make up the pairs; of these:
            71 (98.61%) aligned 0 times
            1 (1.39%) aligned exactly 1 time
            0 (0.00%) aligned >1 times
      35 (41.18%) were unpaired; of these:
        18 (51.43%) aligned 0 times
        17 (48.57%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
    34.07% overall alignment rate
    4) As for the samtools issue this does seem to be a bit odd but I'd like to get a version check first before diving deeper into the problem.

    Comment


    • #3
      I think by 1187787 reads, Bowtie2 means 1187787 "read pairs and singlets", i.e. the paired reads are only counted once. Note that 1187787+761758=1949545

      The rest of the figures do suggest a truncated file (or possibly a file mix-up). Have a look at the last line to check it is complete. Samtools flagstat should give the correct read count (1949545), and your samtools view command should give the same count as bowtie2 (assuming paired reads have the same id), so 1187787.

      Comment


      • #4
        Thanks for the replies & testing done on your end:

        @westerman:
        1) I am using bowtie2 v2.1.0. So it is an old version. I don't have permissions that allow me to do installations, but I'll see if I can somehow get the latest version and try that.

        1a) samtools is also old. v0.1.18. I'll look into getting a new version of that as well

        2) The exact command line (minus local paths) was this:

        bowtie22.1.0 -x <db> -q -1 <fastq_PE_end1> -2 <fastq_PE_end2 -U <fastq_fragmentreads_end1>,<fastq_fragmentreads_end2> -S <output_name>.sam --rg-id <RGID> --rg ID:<RGID> --rg PL:illumina --rg PU:<UNIT> --rg LB:<LIB> --rg SM:<SAMPLE>

        3) So bowtie2 output is reporting mapping % for PAIRS for the paired end input, and as FRAGMENTS for the fragment input. In hindsight I guess that makes sense. In the old version at least the reported numbers are not very clear. But I do see buried inside the report numbers that add up to my exact input.

        4) The other methods for looking at these numbers still are confusing to me too, but I do have an old version of samtools. If I can get an updated version maybe at least that would agree w/ my input


        @Roy:
        Now that you & westerman say that I do see it. The organization & format are confusing to me, but in fairness I am using an old version of bowtie2.

        I've been scrutinizing my files for truncation, and I don't see anything. But that was my first thought too. I am re-mapping the data and will confirm that I get the same results.

        I still don't have any idea why the other counting methods are also not working to me, nor seem to agree with my input # or the bowtie2 reported numbers. Maybe running fragments and pairs in the same alignment is beyond what those other methods can handle?

        Comment


        • #5
          I wanted to share my findings after reviewing & doing some small test runs using bowtie2's output for getting mapping %.

          First let me say that bowtie2 is awesome, now that I see what its doing, I like it! The other methods I tried seem confused by a mapping that includes both fragment reads and read pairs in the same starting query pool. But bowtie2's alignment report does a good job of sorting things out once I knew what I was looking at.

          Here is a typical report (using bowtie v2.2.5 ... not quite the newest, but I think the report format is the same):
          ++++++++++++++++++++++++++++
          1187787 reads; of these:
          761758 (64.13%) were paired; of these:
          471690 (61.92%) aligned concordantly 0 times
          290068 (38.08%) aligned concordantly exactly 1 time
          0 (0.00%) aligned concordantly >1 times
          ----
          471690 pairs aligned concordantly 0 times; of these:
          1089 (0.23%) aligned discordantly 1 time
          ----
          470601 pairs aligned 0 times concordantly or discordantly; of these:
          941202 mates make up the pairs; of these:
          746446 (79.31%) aligned 0 times
          194756 (20.69%) aligned exactly 1 time
          0 (0.00%) aligned >1 times
          426029 (35.87%) were unpaired; of these:
          290590 (68.21%) aligned 0 times
          135439 (31.79%) aligned exactly 1 time
          0 (0.00%) aligned >1 times
          46.81% overall alignment rate
          ++++++++++++++++++++++++++++

          By counting my input fastq query, I have exactly 1,949,545 reads being mapped. The report very diligently breaks down the component PE and FRAGMENT reads, and reports on them separately in these sections:

          PE:
          761758 (64.13%) were paired; of these:
          471690 (61.92%) aligned concordantly 0 times
          290068 (38.08%) aligned concordantly exactly 1 time
          0 (0.00%) aligned concordantly >1 times

          FRAGMENT:
          426029 (35.87%) were unpaired; of these:
          290590 (68.21%) aligned 0 times
          135439 (31.79%) aligned exactly 1 time
          0 (0.00%) aligned >1 times

          The number in the 'PE' block are PAIRS, while the numbers in the 'FRAGMENT' block are single reads. So 761758x2 + 426029 = 1,949,545 reads, which is exactly my input.

          What had confused me is that Bowtie2's report then breaks down and reports all everything that does NOT align as PAIRS in series under the 'PE' block:

          ----
          471690 pairs aligned concordantly 0 times; of these:
          1089 (0.23%) aligned discordantly 1 time
          ----
          470601 pairs aligned 0 times concordantly or discordantly; of these:
          941202 mates make up the pairs; of these:
          746446 (79.31%) aligned 0 times
          194756 (20.69%) aligned exactly 1 time
          0 (0.00%) aligned >1 times

          This section is basically reviewing all the PAIRS that didn't map concordantly. And its saying that of those pairs, it found 1089 PAIRS that mapped discordantly (which I believe could mean having too much space between them, or possibly in the wrong orientation). Then of the PAIRS that still don't map, it looks at them as single reads ('mates') and it reports on them mapping as single reads. The 470601 PAIRS that don't align (either concordantly or disconcordantly) make up 941202 READS (mates). And of those guys, 194756 reads map by themselves.

          So in total, we have this count of mapped reads:

          471690 PAIRS mapped concordantly
          1089 PAIRS mapped disconcordantly
          194756 READS (that had been memberes of a PAIR) map by themselves
          135439 READS (my FRAGMENT input) map

          290068x2 + 1089x2 + 194756 + 135439 = 912509 reads mapping

          So 912509 (mapped reads) / 1,949,545 (input reads) = 0.468062548

          Which is exactly what the bowtie2 report lists as the "overall alignment rate" listed at the bottom of the report (46.81%).

          Thanks again for the help!

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          30 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          32 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          28 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          53 views
          0 likes
          Last Post seqadmin  
          Working...
          X