Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to extract uniquely mapped reads from bam/sam produced by bwa-mem?

    There's no 'XT:A:U' tag for bwa-mem outputs. How can I get the uniquely mapped reads?

  • #2
    Code:
    samtools view -q 10 -b aln.bam > aln.uniq.bam

    Comment


    • #3
      MAPQ (mapping quality - describes the uniqueness of the alignment, 0=non-unique, >10 probably unique)

      I hope it's seriously uniquely mapped read.

      Comment


      • #4
        Keep in mind that "uniqeness" is not a clear cut variable. Rather is a probability that the position of the read is wrong. Specifically, p_wrong= 10^(-mapq/10).
        So for mapq 10 you have p_wrong= 0.1, for mapq 20 p_wrong= 0.01. In practice I see that most of the reads are either mapq 0 or mapq pretty high (>30) with little in between, so the exact threshold makes little difference. But this depends of course on your study system.

        Comment


        • #5
          Originally posted by dariober View Post
          Keep in mind that "uniqeness" is not a clear cut variable. Rather is a probability that the position of the read is wrong. Specifically, p_wrong= 10^(-mapq/10).
          So for mapq 10 you have p_wrong= 0.1, for mapq 20 p_wrong= 0.01. In practice I see that most of the reads are either mapq 0 or mapq pretty high (>30) with little in between, so the exact threshold makes little difference. But this depends of course on your study system.
          Thanks for your reply. But it seems that I can't say that the reads with high mapq are the reads only mapped to one location in the genome. In my study, I only need the reads that were just mapped to one location and my collaborators don't wanna see it elsewhere. My fastq files are pair-end ones.
          I have checked many resources and I know it's hard to definite the 'unique' reads. So is it secure to take the down-stream analysis on the reads with mapq > 30? Or can I make a more filter to choose reads from the mapq>30 reads?

          Comment


          • #6
            You just have to define what you mean by "just mapped to one location". All reads/pairs are multimappers if you allow a high enough edit distance...

            Comment


            • #7
              Originally posted by Yamol View Post
              So is it secure to take the down-stream analysis on the reads with mapq > 30? Or can I make a more filter to choose reads from the mapq>30 reads?
              No, that will eliminate huge numbers of reads that have variations, or errors, or are mapped to somewhat repetitive areas, and generally eliminate coverage from a lot of places and give a severe ref-bias everywhere else, so the results will not be trustworthy.

              If you filter to remove anything with mapq<=3 you will get only reads that the aligner considers as having one mapping location that is better than any other.

              Comment


              • #8
                Originally posted by Brian Bushnell View Post
                No, that will eliminate huge numbers of reads that have variations, or errors, or are mapped to somewhat repetitive areas, and generally eliminate coverage from a lot of places and give a severe ref-bias everywhere else, so the results will not be trustworthy.
                I agree that going above q30 is probably unnecessarily harsh. However there might be cases where you prefer to loose coverage in some regions rather than accepting uncertain read positions.

                If you filter to remove anything with mapq<=3 you will get only reads that the aligner considers as having one mapping location that is better than any other.
                Isn't <=3 a little too relaxed though? It means that a read with q4 has 10^(-4/10) ~ 0.4 chance to be wrong.

                Anyway, as I said above I find it makes little difference the exact threshold between say 5 and 20. These are the number of reads mapped to mouse chr18 (bowtie2), read length 75, it's a pull-down experiment:

                Code:
                samtools view -c -q0 grm056_i1_KO_carcass.bam 18
                8184447
                samtools view -c -q1 grm056_i1_KO_carcass.bam 18
                8039114
                samtools view -c -q3 grm056_i1_KO_carcass.bam 18
                7838063
                samtools view -c -q5 grm056_i1_KO_carcass.bam 18
                7816288
                samtools view -c -q10 grm056_i1_KO_carcass.bam 18
                7707885
                samtools view -c -q20 grm056_i1_KO_carcass.bam 18
                7635921
                samtools view -c -q30 grm056_i1_KO_carcass.bam 18
                7313124
                samtools view -c -q40 grm056_i1_KO_carcass.bam 18
                7106571

                Comment


                • #9
                  Originally posted by dariober View Post
                  I agree that going above q30 is probably unnecessarily harsh. However there might be cases where you prefer to loose coverage in some regions rather than accepting uncertain read positions.



                  Isn't <=3 a little too relaxed though? It means that a read with q4 has 10^(-4/10) ~ 0.4 chance to be wrong.

                  Anyway, as I said above I find it makes little difference the exact threshold between say 5 and 20. These are the number of reads mapped to mouse chr18 (bowtie2), read length 75, it's a pull-down experiment:

                  Code:
                  samtools view -c -q0 grm056_i1_KO_carcass.bam 18
                  8184447
                  samtools view -c -q1 grm056_i1_KO_carcass.bam 18
                  8039114
                  samtools view -c -q3 grm056_i1_KO_carcass.bam 18
                  7838063
                  samtools view -c -q5 grm056_i1_KO_carcass.bam 18
                  7816288
                  samtools view -c -q10 grm056_i1_KO_carcass.bam 18
                  7707885
                  samtools view -c -q20 grm056_i1_KO_carcass.bam 18
                  7635921
                  samtools view -c -q30 grm056_i1_KO_carcass.bam 18
                  7313124
                  samtools view -c -q40 grm056_i1_KO_carcass.bam 18
                  7106571
                  I think the reason why there's little difference is that your data quality and map quality are hyper good.

                  Comment


                  • #10
                    Originally posted by dariober View Post
                    Isn't <=3 a little too relaxed though? It means that a read with q4 has 10^(-4/10) ~ 0.4 chance to be wrong.
                    The question was specifically about multimapping, not mismapping. If a read has a QV>=3, the mapper thinks it has a >50% chance that the site is correct, therefore there cannot be any other sites to which the read maps equally well. It's true that you will eliminate more incorrectly mapped reads by setting the threshold higher, but there's no reason to do that if the goal is strictly to eliminate multimapped reads.

                    Thanks for sharing the empirical results of filtering with various thresholds! But, I'd like to point out that the scalar result of reads remaining does not necessarily reflect the damage potentially done to the analysis results. As an example, a typical human has ~3 million mutations compared to the reference, or around a 0.1% rate. With 100bp error-free reads, then, about 90% of them would be expected to map to the reference with 0 mismatches. If you set your quality threshold high enough, you can will eliminate all of the reads that indicate a variation... but still retain 90% of your reads, which looks pretty good!

                    Comment


                    • #11
                      Originally posted by Brian Bushnell View Post
                      The question was specifically about multimapping, not mismapping.
                      Actually I interpreted the question as asking for surely mapped reads but anyway, let's leave this to the OP, hopefully s/he got the message.

                      Thanks for sharing the empirical results of filtering with various thresholds! But, I'd like to point out that the scalar result of reads remaining does not necessarily reflect the damage potentially done to the analysis results. As an example, a typical human has ~3 million mutations compared to the reference, or around a 0.1% rate. With 100bp error-free reads, then, about 90% of them would be expected to map to the reference with 0 mismatches. If you set your quality threshold high enough, you can will eliminate all of the reads that indicate a variation... but still retain 90% of your reads, which looks pretty good!
                      I see your point and I agree that increasing mapq to much becomes detrimental, but I'm not sure that lowering the mapq threshold helps. As a thought experiment, imagine a region of 100bp with several SNP relative to the reference, say 20. In order to map a read there you need to lower the mapq score. This could mean that a read truly coming from that region could map even better somewhere else and reads truly mapping somewhere else could end up in this critical region. (I don't do SNP calling so I may get this wrong...)

                      Comment


                      • #12
                        Originally posted by dariober View Post
                        As a thought experiment, imagine a region of 100bp with several SNP relative to the reference, say 20. In order to map a read there you need to lower the mapq score. This could mean that a read truly coming from that region could map even better somewhere else and reads truly mapping somewhere else could end up in this critical region.
                        That's true, but it should only be a problem if you use an incomplete reference. If the reference is complete the read should map to the correct location.

                        It's always possible to take a read from some region, apply a bunch of mutations to it, and have the resulting read map perfectly to somewhere else, while having too many differences to map to the correct location at all. These events are very unlikely and tend to be ignored; increasing the mapq threshold will not protect you from them. Of course, there's always some gray area in between. But generally, the higher the mapq threshold, the more reference-bias you will get, which greatly interferes with variant calling.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:37 PM
                        0 responses
                        12 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, Yesterday, 06:07 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        52 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        68 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X