Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BWA-MEM: output mapped reads larger than input reads

    I have some bam files generated from BWA-MEM using these command:
    Code:
    bwa mem -t 8 ref1 R1.fa.gz R2.fa.gz | samtools view -S -h -F 4 -b - | samtools sort - bwaaln-sorted
    So i should get only mapped reads in my output. I'm using bwa v0.7.5a.

    Using samtools flagstat, i retrieved the number of mapped reads from the sorted bam file, and in some cases, this was higher than the number of original input paired-end reads.

    Now this could be due to reads aligning to multiple places, but I don't have any "XA" tags, nor "XT" tags for that matter.

    I already saw this post about getting uniquely mapping reads from BWA.
    http://seqanswers.com/forums/showthread.php?t=30120

    I want to get stats on unique and multi-mapping reads.
    Is using MapQ=0 a reliable way of flagging multi-mapping reads?
    And why am I not getting XA and XT tags at all?

    Here are a couple of lines of my output - I am not seeing any XT/XA tags, nor X0 or X1 tags.
    Code:
    HWI-ST226:220:D0AU7ACXX:5:1306:2199:194196	83	ref1	1	58	10S78M	=	25	-54	CAGACGTGTGCTCTTCCGATCTCGTAAATTTAGTACAACCAAGGGGTAAAAATCCACACTACAACAAGCCGCTCACTATTAGACCCAA	*	NM:i:0	AS:i:78	XS:i:67
    HWI-ST226:220:D0AU7ACXX:5:2206:3059:40310	163	ref1	21	0	88M	=	42	109	TAGTACAACCAAGGGGTAAAAATCCAGACTACAACAAGCCGCTCACTATTAGACCCAAATGTCCAAAAACCCTTCACAAAATCGTCAA	*	NM:i:1	AS:i:83	XS:i:83
    HWI-ST226:220:D0AU7ACXX:5:1306:2199:194196	163	ref1	25	40	66M22S	=	1	54	ACAACCAAGGGGTAAAAATCCACACTACAACAAGCCGCTCACTATTAGACCCAAATGTCCAAAAACAGATCGGAAGAGCGTCGTGTAG	*	NM:i:0	AS:i:66	XS:i:66
    HWI-ST226:220:D0AU7ACXX:5:2206:3059:40310	83	ref1	42	0	88M	=	21	-109	ATCCACACTACAACAAGCCGCTCACTATTAGACCCAAATGTCCAAAAACCCTTCACAAAATCGTCAATCTTCTCACTTTCCTGGCGTC	*	NM:i:0	AS:i:88	XS:i:88

    thanks

  • #2
    Regarding the read number mismatch, are you counting both mates of a pair in one case and not the other? That and multimapping are the most likely causes (I haven't a clue on the XA/XT tag issue).

    Comment


    • #3
      Caused by chimeric alignments.

      Comment


      • #4
        Thanks.

        But why am I not getting any XT,XA,X0 and X1 tags?
        I do have some SA tags, maybe 1% of reads.
        Any recommendations to extract unique and multi mapping reads? Should I just use mapQ=0?

        Comment


        • #5
          Just check mapQ.

          For long reads and local alignment, X0/X1 do not make much sense any more. If you want multiple mappings, use "-a".

          Comment


          • #6
            Originally posted by lh3 View Post
            Just check mapQ.

            For long reads and local alignment, X0/X1 do not make much sense any more. If you want multiple mappings, use "-a".
            I tried getting all alignments with mapQ>0 and I got 214,694,901 reads (alignments), which should be unique mappings (?) since bwa only outputs the primary hit by default for mapQ > 0 (default options used here).

            Just to check, i coupled the read name with the flag and counted how many uniq ones there were;
            samtools view -q1 sorted.bam | awk ' { print $1"-"$2 } ' | sort | uniq | wc -l
            214694478

            and so there are around 420 reads which appear duplicated even with mapQ>0.
            Is it possible for a read with mapQ>0 to get reported twice for some reason?

            Thanks.

            Comment


            • #7
              Originally posted by Kennels View Post
              I tried getting all alignments with mapQ>0 and I got 214,694,901 reads (alignments), which should be unique mappings (?) since bwa only outputs the primary hit by default for mapQ > 0 (default options used here).

              Just to check, i coupled the read name with the flag and counted how many uniq ones there were;
              samtools view -q1 sorted.bam | awk ' { print $1"-"$2 } ' | sort | uniq | wc -l
              214694478

              and so there are around 420 reads which appear duplicated even with mapQ>0.
              Is it possible for a read with mapQ>0 to get reported twice for some reason?

              Thanks.
              I had a closer look at the output, and it would appear that these duplicated reads are due to chimeric reads as you mentioned Heng. I understand now what you meant. For example, the reads below were output by '-q1' option from samtools view.

              Code:
              HWI-ST226:220:D0AU7ACXX:5:1101:11456:146339	2193	Oa_Locus_2615_Transcript_18	2194	44	42M46H	=	2182	-54	AATTGAGCTACCAAAAACCCTAACCCAAAAATTTGTAGCGTC	*	NM:i:2	AS:i:35	XS:i:0	SA:Z:Oa_Locus_2615_Transcript_14,1923,+,21S43M24S,60,0;Oa_Locus_2615_Transcript_14,1976,-,50S38M,55,0;
              HWI-ST226:220:D0AU7ACXX:5:1101:11456:146339	2193	Oa_Locus_2615_Transcript_14	1976	55	50H38M	Oa_Locus_2615_Transcript_18	2182	0	GGGGGTTTTAGTAGCTCTCTCTCTGTGGCCAGAAATGG	*	NM:i:0	AS:i:38	XS:i:0	SA:Z:Oa_Locus_2615_Transcript_14,1923,+,21S43M24S,60,0;Oa_Locus_2615_Transcript_18,2194,-,42M46S,44,2;
              This the second read in a pair, but gets reported twice even though their mapQ is not 0.
              I suppose using the '-M' option will flag both of these reads as secondary.

              So to summarize, to get unique mappings:

              1. use '-q1' option to get mapQ>0 alignments. This will include reads which are chimeric, and so will include a small level of 'multi-mapping' reads
              2. from step 1, filter out reads which have "SA:" tag. This should give unique mappings in the context of bwa-mem

              Is this correct?
              Last edited by Kennels; 08-14-2013, 10:30 PM.

              Comment


              • #8
                That is about right. Just beware that sometimes chimeric reads are informative to break points/SVs.

                Comment


                • #9
                  Hi,

                  still question here:
                  for this pair reads, i have sort the bam by the query name, they were uniquely and properply mapped. However, with 0 mapQ. According to the above mentioned method, they will be filtered for unique reads calculation.

                  bwa-0.7.5a mem was used here.

                  Code:
                  HISEQ700708:142:C2624ACXX:1:1101:10002:122014   99      scaffold2651    547291  0       100M    =       547537  346     ATCAAGGCCGTCATTACGGCCAGACGTGATTGTTCTGGGTACTGATTTCTCTGGATCTATACACCTAAACTGCGTGAAAATGTAGTGTAGCAATTTTAAT  CCCFFFFFHHHHHJJIHIJJJJJJJJHIIJJJIIJIIJJ?DHIGGGIJJJJJJIIJJJIJIHGHHHFFFFFFEDD@DDDDDDDFEDDDEDDDDDDDDEDD  NM:i:0  AS:i:100        XS:i:100
                  HISEQ700708:142:C2624ACXX:1:1101:10002:122014   147     scaffold2651    547537  0       100M    =       547291  -346    ATACAGCTAGTTTCTCCTCAGTTCGAATCGGCTTTGACGCGTTGCATGCCTCACTCTCAGTTTGATTTTTCATGAGACCATGCACTTCTTGGAATTGATC  D@:DCDDDCDDBDDCDDDDDDDEDDDDDDFFFFHHGIIJJIJJJJJIIFGJHFIIJIIJJJJJJJJJJJJJIIJIHHJJJIIIHJJJGGHHHFFFFFCCC  NM:i:0  AS:i:100        XS:i:100

                  Comment


                  • #10
                    I understand this, there only output one record for multiple mapped reads.

                    Originally posted by pengchy View Post
                    Hi,

                    still question here:
                    for this pair reads, i have sort the bam by the query name, they were uniquely and properply mapped. However, with 0 mapQ. According to the above mentioned method, they will be filtered for unique reads calculation.

                    bwa-0.7.5a mem was used here.

                    Code:
                    HISEQ700708:142:C2624ACXX:1:1101:10002:122014   99      scaffold2651    547291  0       100M    =       547537  346     ATCAAGGCCGTCATTACGGCCAGACGTGATTGTTCTGGGTACTGATTTCTCTGGATCTATACACCTAAACTGCGTGAAAATGTAGTGTAGCAATTTTAAT  CCCFFFFFHHHHHJJIHIJJJJJJJJHIIJJJIIJIIJJ?DHIGGGIJJJJJJIIJJJIJIHGHHHFFFFFFEDD@DDDDDDDFEDDDEDDDDDDDDEDD  NM:i:0  AS:i:100        XS:i:100
                    HISEQ700708:142:C2624ACXX:1:1101:10002:122014   147     scaffold2651    547537  0       100M    =       547291  -346    ATACAGCTAGTTTCTCCTCAGTTCGAATCGGCTTTGACGCGTTGCATGCCTCACTCTCAGTTTGATTTTTCATGAGACCATGCACTTCTTGGAATTGATC  D@:DCDDDCDDBDDCDDDDDDDEDDDDDDFFFFHHGIIJJIJJJJJIIFGJHFIIJIIJJJJJJJJJJJJJIIJIHHJJJIIIHJJJGGHHHFFFFFCCC  NM:i:0  AS:i:100        XS:i:100

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM
                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    30 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    32 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    28 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    52 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X