Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Why alignment of sequence reads to a reference can be inaccurate towards the ends of

    I often heard that "alignment of sequence reads to a reference can be inaccurate towards the ends of a read", does anybody know the exact reason of this?

  • #2
    As you approach the end of a read, it becomes increasingly difficult to distinguish between a substitution, deletion, insertion, and even match. Imagine the last base of a read is "A" and the reference has a "C" at that position. The "A" could be a substitution. Or, it could be a deletion of however many bases until the next "A". Or, it could be an insertion of a single "A". All events would appear identical. Whereas in the middle of a long read, it would be quite obvious which was the case. Furthermore, if the last read base was "A" and the ref base was "A", but the subsequent ref base was also "A", you would not be able to distinguish between a match and a 1bp deletion.

    I normally chop the last ~10bp off both ends of reads as being uninformative for this reason, and do not include them in any consensus.
    Last edited by Brian Bushnell; 12-11-2014, 09:05 PM.

    Comment


    • #3
      I take it you chop the first and last 10 bp of the alignment read segment rather than trimming before alignment, since otherwise you just shift where in the read the uninformative variants are, right? What do you use to do that?
      Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

      Comment


      • #4
        Yes, after alignment. I wrote a couple tools for calling variations and found they were more accurate if I ignored the outer ~10bp of reads. So, the coverage was calculated from only the middle (length-20bp), and variations were called using the rate of variations in the middle (length-20bp) versus the coverage of that region. I did not use any 3rd-party software and do not know of any that takes that approach; and unfortunately, the software I wrote for that is owned by UTSW, which doesn't use it.

        If you look at this image:


        This is a histogram of match, substitution, insertion, deletion, and so forth rates by position in a read. "Other" means "clipped", as in, it went off the end of a contig, so it's obvious why that would increase toward the ends (this was a draft assembly). But also notice how the insertion rate increases toward the ends. Illumina reads generally do not suffer from artificial indels, so that is probably mis-called substitutions - specifically, if there are a lot of mismatches near the end of a read, it is cheaper to classify them as a single long insertion event rather than a bunch of isolated substitutions interspersed with matches.

        To compensate for that, I banned indels at the very end of reads, which is why the insertion and deletions rates suddenly drop to zero. This greatly improved overall accuracy. But still, the closer you get to the ends, the less information you have to classify a base as match/substitution/insertion/deletion. So you'll always increase accuracy if, when you have multiple reads mapped over an area, you call variants based on the middles of the reads and ignore the outermost parts.

        Since I don't currently have a piece of software that does this and I don't know of any, I can at least suggest that if you want to do so, just map the reads and then post-process the sam file, trimming the outermost X bases on each end (adjusting the bases, qualities, cigar string, pos, and tossing the MD tag) to pretend that it was 10bp shorter on each end, before feeding it to a variant caller.

        Comment


        • #5
          Thanks, I was curious because I pretty much deal with RAD-like (or RAD-ish, as I like to say) libraries, so the reads stack up and a variant will always have the same nucleotide position in a read... and it is therefore pretty easy to deal with "end effects".

          Unfortunate about the other-owned software. I think your solution of the sam file would certainly work.
          Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

          Comment


          • #6
            Hi Brian,

            Thank you very much for your prompt, detailed and clear answer, now I understand why the accuracy drops at both ends of the read.

            However, I don't very understand one point in your answer that is when we approach to the middle of a long read, it is easier to make a accurate conclusion. Could you please explain more about that?

            Comment


            • #7
              Originally posted by Brian Bushnell View Post
              As you approach the end of a read, it becomes increasingly difficult to distinguish between a substitution, deletion, insertion, and even match. Imagine the last base of a read is "A" and the reference has a "C" at that position. The "A" could be a substitution. Or, it could be a deletion of however many bases until the next "A". Or, it could be an insertion of a single "A". All events would appear identical. Whereas in the middle of a long read, it would be quite obvious which was the case. Furthermore, if the last read base was "A" and the ref base was "A", but the subsequent ref base was also "A", you would not be able to distinguish between a match and a 1bp deletion.

              I normally chop the last ~10bp off both ends of reads as being uninformative for this reason, and do not include them in any consensus.
              To avoid AB question, I think I should make more clear where my question comes from. After I reading this paper which is about targeted amplicon sequencing.

              In this paper, their points are:

              1. Alignments of sequencing reads to a reference can be inaccurate towards the ends of a read.
              2. This (point 1) is especially true when there is a variant near an edge of the read. The mismatch due to the variant might cause all the bases from the variant to the edge of the read to be excluded from the alignment, otherwise, soft-clipped.

              For point 1, you have already gave an excellent answer, for point 2, do you have any idea what the difference is between variants near an edge of the read or in other positions of the read, why the variants near edge are more evil?

              bless~

              Comment


              • #8
                Originally posted by Brian Bushnell View Post
                As you approach the end of a read, it becomes increasingly difficult to distinguish between a substitution, deletion, insertion, and even match. Imagine the last base of a read is "A" and the reference has a "C" at that position. The "A" could be a substitution. Or, it could be a deletion of however many bases until the next "A". Or, it could be an insertion of a single "A". All events would appear identical. Whereas in the middle of a long read, it would be quite obvious which was the case. Furthermore, if the last read base was "A" and the ref base was "A", but the subsequent ref base was also "A", you would not be able to distinguish between a match and a 1bp deletion.

                I normally chop the last ~10bp off both ends of reads as being uninformative for this reason, and do not include them in any consensus.
                Sorry for annoying questions. As a newbie, an expert just like a god.

                If I have already do local indel-realigment on my data using GATK RealignTargetCreator and IndelRealigner, should I still chop my reads? The read in your answer means the one which has already been trimmed by remove any primers and adaptors, am I right?

                Comment


                • #9
                  Local aligners will soft-clip the ends of reads if there are too many variants there. Global aligners won't do that. So, point 2 regarding soft-clipping is irrelevant to global aligners, which are preferred for variant-calling.

                  But the most important point is that, still, at the very ends of reads you cannot distinguish between match, mismatch, substitution, or deletion - they can (mostly) look the same. In the middle of a read, a 1bp substitution is obviously a 1bp substitution. A 1bp deletion, in the middle of a 150bp read, will shift 75 bases inward; this will mean the aligner can call a 1bp deletion, or (0.75*75)=56.25 substitutions on average, which is (with most scoring functions) far more expensive. The nearer to the end, the fewer bases are moved, and thus the fewer substitutions are generated by a miscalled indel, therefore the harder it is to distinguish between an indel and a substitution, or any other event.

                  Comment


                  • #10
                    Originally posted by blaboon View Post
                    Sorry for annoying questions. As a newbie, an expert just like a god.

                    If I have already do local indel-realigment on my data using GATK RealignTargetCreator and IndelRealigner, should I still chop my reads? The read in your answer means the one which has already been trimmed by remove any primers and adaptors, am I right?
                    Trimming adapters/primers should be done before alignment. "Chopping" the reads should be done post-alignment. Realignment around indels can to some extent mitigate the problem, because it uses consensus of multiple reads rather than aligning based on individual reads. Still, if you are at all interested in accurately mapping indels, I suggest you use BBMap rather than relying on indel-realignment.

                    That said, I have not rigorously benchmarked BBMap with or without "chopping" versus GATK with indel-realignment, so to some extent this is my opinion rather than something I can state as fact.

                    I will state these two things:
                    1) Calling variations directly from the consensus of a pileup from BBMap is more accurate if you chop the ends off the aligned reads.
                    2) BBMap is more accurate with indels than any other aligner I have tested, regardless of whether the read ends are chopped off.

                    However, once you include additional variables like GATK with or without indel realignment, plus an arbitrary choice of aligner, there are too many variables to be certain anymore, and I have not tested all of those combinations. I am also not sure whether "chopping" is better or not when you do indel realignment or local reassembly around suspected indels. Probably not, if the indel realigner / local assembler is good. However, with a simple consensus and no realignment, chopping should generally be better.
                    Last edited by Brian Bushnell; 12-12-2014, 12:03 AM.

                    Comment


                    • #11
                      The end problem was first mentioned in a 2008 paper and then rediscovered multiple times after that. The problem subsequently motivated the development of GATK realignment, SRMA, samtools BAQ and stampy per-base alignment score. In 2009, someone already required variants to have sufficient support from the middle one third of a read (though in 2012 some others were still publishing a CNS paper mainly caused by this artifact).

                      In theory, perfect GATK-like realignment should satisfactorily solve the end problem for WGS. For all the pairwise NGS mappers I know of, none of them could replace such realignment because they don't have the powerful consensus information during the mapping phase. However, the implementations of SRMA and GATK realignment are not optimal. They are a bit slow and do not do the best job all the time.

                      The continuing indel calling problem leads to the development of new-generation variant callers, GATK-HC, platypus, freebayes and soapindel among the many. These modern callers call variants simultaneously with realignment or reassembly. Some of them even completely ignore the alignment. They are usually very robust to the end-indel issue. These new tools are on the right line.

                      The end problem is also alleviated by longer reads because the fraction of reads ends is reduced. For deep WGS, the misalignment of a few ends will not lead to a wrong call. GATK realignment is not much needed nowadays, especially when we use a modern caller. End chopping is not recommended, either, because it hurts sensitivity. Nonetheless, for RAD and amplicon sequencing where it is much harder to fix the end problem with consensus, end chopping still has its uses. In addition, GATK and samtools have long been computing summary statistics about where a variant is supported on reads. They serve a similar goal to end chopping.

                      Comment


                      • #12
                        Originally posted by Brian Bushnell View Post
                        Trimming adapters/primers should be done before alignment. "Chopping" the reads should be done post-alignment. Realignment around indels can to some extent mitigate the problem, because it uses consensus of multiple reads rather than aligning based on individual reads. Still, if you are at all interested in accurately mapping indels, I suggest you use BBMap rather than relying on indel-realignment.

                        That said, I have not rigorously benchmarked BBMap with or without "chopping" versus GATK with indel-realignment, so to some extent this is my opinion rather than something I can state as fact.

                        I will state these two things:
                        1) Calling variations directly from the consensus of a pileup from BBMap is more accurate if you chop the ends off the aligned reads.
                        2) BBMap is more accurate with indels than any other aligner I have tested, regardless of whether the read ends are chopped off.

                        However, once you include additional variables like GATK with or without indel realignment, plus an arbitrary choice of aligner, there are too many variables to be certain anymore, and I have not tested all of those combinations. I am also not sure whether "chopping" is better or not when you do indel realignment or local reassembly around suspected indels. Probably not, if the indel realigner / local assembler is good. However, with a simple consensus and no realignment, chopping should generally be better.
                        Hi Brian,

                        Thank you for your detailed response.

                        Now my question is why we should trim primers before alignment? Since the Qiagen paper said, the alignment quality is low to the ends of reads, so if we keep primers when we do alignment, I think we can move "low quality part" (just one end) into primer binding region, so our "real" reads will achieve higher alignment quality.

                        Comment


                        • #13
                          Nice idea, but unfortunately, you can't resolve the problem by sticking (or keeping) random bases on the end of the reads and then taking them back off after mapping. In fact, that will only make the problem worse as the random bases will have some coincidental matches with the genomic sequence, potentially altering the alignment. It's the ends of the genomic portions of the reads that are aligned inaccurately, regardless of whether there are nongenomic bases extending beyond them.

                          However, trimming low-quality (but genomic) bases from the read ends could indeed be done after alignment, which would help reduce the problem somewhat.

                          Comment


                          • #14
                            If you are working with haploid data, another approach to get better aligned ends is to run iterative mapping where you take the consensus from the previous mapping iteration and map the reads to that consensus on successive iterations. This also has the advantage of being able to map reads that would normally not get mapped and make variant calls in regions that would otherwise have no coverage because no reads directly map there.

                            I implemented that approach in Geneious (commerical software) and described it in a white paper at http://assets.geneious.com/documenta...ReadMapper.pdf

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 03-27-2024, 06:37 PM
                            0 responses
                            12 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-27-2024, 06:07 PM
                            0 responses
                            11 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            53 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            68 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X