SEQanswers

Old 12-11-2014, 08:57 PM   #1
blaboon
Bioinformatics Specialist
 
Location: Singapore

Join Date: Nov 2014
Posts: 11
Why alignment of sequence reads to a reference can be inaccurate towards the ends of a read

I have often heard that "alignment of sequence reads to a reference can be inaccurate towards the ends of a read". Does anybody know the exact reason for this?
Old 12-11-2014, 09:03 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

As you approach the end of a read, it becomes increasingly difficult to distinguish between a substitution, deletion, insertion, and even match. Imagine the last base of a read is "A" and the reference has a "C" at that position. The "A" could be a substitution. Or, it could be a deletion of however many bases until the next "A". Or, it could be an insertion of a single "A". All events would appear identical. Whereas in the middle of a long read, it would be quite obvious which was the case. Furthermore, if the last read base was "A" and the ref base was "A", but the subsequent ref base was also "A", you would not be able to distinguish between a match and a 1bp deletion.

I normally chop the last ~10bp off both ends of reads as being uninformative for this reason, and do not include them in any consensus.
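The ambiguity described above can be illustrated with a toy calculation (a sketch using a made-up sequence and unit costs, not any real aligner's scoring): count the mismatches produced when a read carrying a 1bp deletion is aligned without a gap. Far from the end, refusing the gap is very expensive; at the last base it costs nothing, so the deletion is invisible.

```python
def mismatches_if_no_gap(ref, del_pos):
    """Mismatch count when a read carrying a 1bp deletion at del_pos
    is aligned gaplessly against the reference, i.e. the price of
    explaining the deletion as substitutions instead of a gap."""
    read = ref[:del_pos] + ref[del_pos + 1:]        # simulate the deletion
    return sum(a != b for a, b in zip(read, ref))   # gapless comparison

ref = "ACGTACGTACGTACGTACGT"  # hypothetical 20bp reference, no adjacent repeats

print(mismatches_if_no_gap(ref, 2))    # far from the end: 17 mismatches
print(mismatches_if_no_gap(ref, 19))   # last base: 0 mismatches, looks like a match
```

With the deletion far from the end, every shifted downstream base mismatches, so the gap interpretation wins decisively; for the terminal base the gapless and gapped interpretations score identically, which is exactly why the event type cannot be determined there.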

Old 12-11-2014, 09:14 PM   #3
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521

I take it you chop the first and last 10 bp of the aligned portion of the read rather than trimming before alignment, since trimming beforehand would just shift where in the read the uninformative variants fall, right? What do you use to do that?
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 12-11-2014, 09:33 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Yes, after alignment. I wrote a couple tools for calling variations and found they were more accurate if I ignored the outer ~10bp of reads. So, the coverage was calculated from only the middle (length-20bp), and variations were called using the rate of variations in the middle (length-20bp) versus the coverage of that region. I did not use any 3rd-party software and do not know of any that takes that approach; and unfortunately, the software I wrote for that is owned by UTSW, which doesn't use it.

If you look at this image:

[image not reproduced]
This is a histogram of match, substitution, insertion, deletion, and so forth rates by position in a read. "Other" means "clipped", as in, it went off the end of a contig, so it's obvious why that would increase toward the ends (this was a draft assembly). But also notice how the insertion rate increases toward the ends. Illumina reads generally do not suffer from artificial indels, so that is probably mis-called substitutions - specifically, if there are a lot of mismatches near the end of a read, it is cheaper to classify them as a single long insertion event rather than a bunch of isolated substitutions interspersed with matches.

To compensate for that, I banned indels at the very ends of reads, which is why the insertion and deletion rates suddenly drop to zero. This greatly improved overall accuracy. But still, the closer you get to the ends, the less information you have to classify a base as match/substitution/insertion/deletion. So you'll always increase accuracy if, when you have multiple reads mapped over an area, you call variants based on the middles of the reads and ignore the outermost parts.

Since I don't currently have a piece of software that does this and I don't know of any, I can at least suggest that if you want to do so, just map the reads and then post-process the SAM file, trimming the outermost X bases on each end (adjusting the bases, qualities, CIGAR string, POS, and tossing the MD tag) to pretend the read was X bp shorter on each end, before feeding it to a variant caller.
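To make the bookkeeping concrete, here is a minimal sketch of that post-processing for the simplest possible record, one whose CIGAR is a single run of M with no indels or existing clips (real reads need a full CIGAR walk, and other alignment-dependent tags may also need updating). It soft-clips the outer bases rather than physically deleting them, which variant callers treat equivalently while keeping the SAM record valid:

```python
import re

def chop_sam_ends(fields, n=10):
    """Soft-clip the outer n aligned bases on each end of a SAM record
    (given as a list of tab-separated fields) whose CIGAR is a single
    '<len>M' run, advancing POS past the clipped prefix and dropping the
    now-stale MD tag. Records that are too short or have complex CIGARs
    are returned unchanged."""
    m = re.fullmatch(r"(\d+)M", fields[5])
    if not m or int(m.group(1)) <= 2 * n:
        return fields
    length = int(m.group(1))
    out = fields[:]
    out[3] = str(int(out[3]) + n)                    # POS: skip clipped prefix
    out[5] = f"{n}S{length - 2 * n}M{n}S"            # soft-clip both ends
    out[11:] = [t for t in out[11:] if not t.startswith("MD:")]
    return out

rec = ["r1", "0", "chr1", "100", "60", "50M", "*", "0", "0",
       "A" * 50, "I" * 50, "MD:Z:50"]
print("\t".join(chop_sam_ends(rec)))   # POS becomes 110, CIGAR becomes 10S30M10S
```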
Old 12-11-2014, 09:42 PM   #5
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521

Thanks, I was curious because I pretty much deal with RAD-like (or RAD-ish, as I like to say) libraries, so the reads stack up and a variant will always have the same nucleotide position in a read... and it is therefore pretty easy to deal with "end effects".

Unfortunate about the other-owned software. I think your SAM-file solution would certainly work.
Old 12-11-2014, 11:05 PM   #6
blaboon
Bioinformatics Specialist
 
Location: Singapore

Join Date: Nov 2014
Posts: 11

Hi Brian,

Thank you very much for your prompt, detailed, and clear answer; now I understand why accuracy drops at both ends of the read.

However, there is one point in your answer I don't quite understand: why is it easier to draw an accurate conclusion in the middle of a long read? Could you please explain more about that?
Old 12-11-2014, 11:22 PM   #7
blaboon
Bioinformatics Specialist
 
Location: Singapore

Join Date: Nov 2014
Posts: 11

Quote:
Originally Posted by Brian Bushnell View Post
As you approach the end of a read, it becomes increasingly difficult to distinguish between a substitution, deletion, insertion, and even match. Imagine the last base of a read is "A" and the reference has a "C" at that position. The "A" could be a substitution. Or, it could be a deletion of however many bases until the next "A". Or, it could be an insertion of a single "A". All events would appear identical. Whereas in the middle of a long read, it would be quite obvious which was the case. Furthermore, if the last read base was "A" and the ref base was "A", but the subsequent ref base was also "A", you would not be able to distinguish between a match and a 1bp deletion.

I normally chop the last ~10bp off both ends of reads as being uninformative for this reason, and do not include them in any consensus.
To avoid asking the wrong question, I think I should make clearer where my question comes from: it arose after I read a paper about targeted amplicon sequencing.

In this paper, their points are:

1. Alignments of sequencing reads to a reference can be inaccurate towards the ends of a read.
2. This (point 1) is especially true when there is a variant near an edge of the read. The mismatch due to the variant might cause all the bases from the variant to the edge of the read to be excluded from the alignment, i.e., soft-clipped.

For point 1, you have already given an excellent answer. For point 2, do you have any idea what the difference is between a variant near an edge of the read and one elsewhere in the read? Why are variants near the edge more troublesome?

bless~
Old 12-11-2014, 11:36 PM   #8
blaboon
Bioinformatics Specialist
 
Location: Singapore

Join Date: Nov 2014
Posts: 11

Quote:
Originally Posted by Brian Bushnell View Post
As you approach the end of a read, it becomes increasingly difficult to distinguish between a substitution, deletion, insertion, and even match. Imagine the last base of a read is "A" and the reference has a "C" at that position. The "A" could be a substitution. Or, it could be a deletion of however many bases until the next "A". Or, it could be an insertion of a single "A". All events would appear identical. Whereas in the middle of a long read, it would be quite obvious which was the case. Furthermore, if the last read base was "A" and the ref base was "A", but the subsequent ref base was also "A", you would not be able to distinguish between a match and a 1bp deletion.

I normally chop the last ~10bp off both ends of reads as being uninformative for this reason, and do not include them in any consensus.
Sorry for the annoying questions; to a newbie, an expert seems like a god.

If I have already done local indel realignment on my data using GATK RealignerTargetCreator and IndelRealigner, should I still chop my reads? Also, by "read" in your answer, do you mean a read that has already had any primers and adapters trimmed off?
Old 12-11-2014, 11:43 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Local aligners will soft-clip the ends of reads if there are too many variants there. Global aligners won't do that. So, point 2 regarding soft-clipping is irrelevant to global aligners, which are preferred for variant-calling.

But the most important point is that, still, at the very ends of reads you cannot distinguish between a match, substitution, insertion, or deletion - they can (mostly) look the same. In the middle of a read, a 1bp substitution is obviously a 1bp substitution. A 1bp deletion in the middle of a 150bp read will shift the outer 75 bases inward; this means the aligner can call either a 1bp deletion or, on average, (0.75*75)=56.25 substitutions, which is (with most scoring functions) far more expensive. The nearer the event is to the end, the fewer bases are shifted, and thus the fewer substitutions are generated by a miscalled indel, therefore the harder it is to distinguish between an indel and a substitution, or any other event.
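That arithmetic can be made explicit. Using hypothetical penalties for illustration (substitution = 1, a 1bp gap = 4; real aligners' scores differ), and assuming random sequence so a shifted base mismatches 3 times out of 4:

```python
SUB_COST, GAP_COST = 1.0, 4.0   # hypothetical penalties for illustration

def expected_sub_cost(shifted_bases):
    """Expected cost of explaining a 1bp deletion as substitutions:
    each shifted downstream base mismatches with probability 3/4."""
    return 0.75 * shifted_bases * SUB_COST

def aligner_prefers_gap(shifted_bases):
    """True when calling the gap is cheaper than miscalling substitutions."""
    return GAP_COST < expected_sub_cost(shifted_bases)

print(expected_sub_cost(75))     # mid-150bp read: 56.25, the gap obviously wins
print(aligner_prefers_gap(3))    # near the end: False, the event type is ambiguous
```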
Old 12-11-2014, 11:59 PM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by blaboon View Post
Sorry for the annoying questions; to a newbie, an expert seems like a god.

If I have already done local indel realignment on my data using GATK RealignerTargetCreator and IndelRealigner, should I still chop my reads? Also, by "read" in your answer, do you mean a read that has already had any primers and adapters trimmed off?
Trimming adapters/primers should be done before alignment. "Chopping" the reads should be done post-alignment. Realignment around indels can to some extent mitigate the problem, because it uses consensus of multiple reads rather than aligning based on individual reads. Still, if you are at all interested in accurately mapping indels, I suggest you use BBMap rather than relying on indel-realignment.

That said, I have not rigorously benchmarked BBMap with or without "chopping" versus GATK with indel-realignment, so to some extent this is my opinion rather than something I can state as fact.

I will state these two things:
1) Calling variations directly from the consensus of a pileup from BBMap is more accurate if you chop the ends off the aligned reads.
2) BBMap is more accurate with indels than any other aligner I have tested, regardless of whether the read ends are chopped off.

However, once you include additional variables like GATK with or without indel realignment, plus an arbitrary choice of aligner, there are too many variables to be certain anymore, and I have not tested all of those combinations. I am also not sure whether "chopping" is better or not when you do indel realignment or local reassembly around suspected indels. Probably not, if the indel realigner / local assembler is good. However, with a simple consensus and no realignment, chopping should generally be better.

Old 12-12-2014, 06:28 AM   #11
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

The end problem was first described in a 2008 paper and has been rediscovered multiple times since. It subsequently motivated the development of GATK realignment, SRMA, samtools BAQ, and the Stampy per-base alignment score. By 2009, some already required variants to have sufficient support from the middle one-third of a read (though in 2012 others were still publishing a CNS paper caused mainly by this artifact).

In theory, perfect GATK-like realignment should satisfactorily solve the end problem for WGS. Of all the pairwise NGS mappers I know, none can replace such realignment, because they lack the powerful consensus information during the mapping phase. However, the implementations of SRMA and GATK realignment are not optimal: they are a bit slow and do not always do the best job.

The continuing indel-calling problem led to the development of a new generation of variant callers: GATK-HC, Platypus, FreeBayes, and SOAPindel, among many others. These modern callers call variants simultaneously with realignment or reassembly; some even ignore the initial alignment completely. They are usually very robust to the end-indel issue. These new tools are on the right track.

The end problem is also alleviated by longer reads, because the fraction of bases near read ends is reduced. For deep WGS, the misalignment of a few ends will not lead to a wrong call. GATK realignment is not much needed nowadays, especially when we use a modern caller. End chopping is not recommended either, because it hurts sensitivity. Nonetheless, for RAD and amplicon sequencing, where it is much harder to fix the end problem with consensus, end chopping still has its uses. In addition, GATK and samtools have long computed summary statistics about where on the reads a variant is supported; these serve a similar goal to end chopping.
Old 12-14-2014, 06:45 PM   #12
blaboon
Bioinformatics Specialist
 
Location: Singapore

Join Date: Nov 2014
Posts: 11

Quote:
Originally Posted by Brian Bushnell View Post
Trimming adapters/primers should be done before alignment. "Chopping" the reads should be done post-alignment. Realignment around indels can to some extent mitigate the problem, because it uses consensus of multiple reads rather than aligning based on individual reads. Still, if you are at all interested in accurately mapping indels, I suggest you use BBMap rather than relying on indel-realignment.

That said, I have not rigorously benchmarked BBMap with or without "chopping" versus GATK with indel-realignment, so to some extent this is my opinion rather than something I can state as fact.

I will state these two things:
1) Calling variations directly from the consensus of a pileup from BBMap is more accurate if you chop the ends off the aligned reads.
2) BBMap is more accurate with indels than any other aligner I have tested, regardless of whether the read ends are chopped off.

However, once you include additional variables like GATK with or without indel realignment, plus an arbitrary choice of aligner, there are too many variables to be certain anymore, and I have not tested all of those combinations. I am also not sure whether "chopping" is better or not when you do indel realignment or local reassembly around suspected indels. Probably not, if the indel realigner / local assembler is good. However, with a simple consensus and no realignment, chopping should generally be better.
Hi Brian,

Thank you for your detailed response.

Now my question is: why should we trim primers before alignment? Since the Qiagen paper says alignment quality is low toward the ends of reads, if we keep the primers during alignment, I think we can shift the "low quality part" (at least on one end) into the primer-binding region, so our "real" read will achieve higher alignment quality.
Old 12-15-2014, 10:16 AM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Nice idea, but unfortunately, you can't resolve the problem by sticking (or keeping) random bases on the end of the reads and then taking them back off after mapping. In fact, that will only make the problem worse as the random bases will have some coincidental matches with the genomic sequence, potentially altering the alignment. It's the ends of the genomic portions of the reads that are aligned inaccurately, regardless of whether there are nongenomic bases extending beyond them.

However, trimming low-quality (but genomic) bases from the read ends could indeed be done after alignment, which would help reduce the problem somewhat.
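A minimal version of such a post-alignment quality trim could look like the following (a sketch with a hypothetical threshold; production trimmers typically use windowed or Mott-style algorithms rather than a hard per-base cutoff):

```python
def end_trim_span(quals, threshold=20):
    """Return the (start, end) slice of a read that survives stripping
    bases with Phred quality below `threshold` from both ends. Applied
    post-alignment, the removed bases would become soft clips rather
    than being deleted from the record."""
    start, end = 0, len(quals)
    while start < end and quals[start] < threshold:
        start += 1
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return start, end

print(end_trim_span([2, 10, 30, 35, 32, 8]))   # (2, 5): keeps the high-quality core
```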
Old 12-15-2014, 03:14 PM   #14
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20

If you are working with haploid data, another approach to getting better-aligned ends is iterative mapping, where you take the consensus from the previous mapping iteration and map the reads to that consensus on successive iterations. This also has the advantage of mapping reads that would normally not get mapped, and of making variant calls in regions that would otherwise have no coverage because no reads map there directly.

I implemented that approach in Geneious (commercial software) and described it in a white paper at http://assets.geneious.com/documenta...ReadMapper.pdf