SEQanswers

02-15-2010, 09:36 AM   #1
krawitz
Member
Location: Bonn
Join Date: Feb 2010
Posts: 30
Microindel Detection: defining the equivalent indel region

Hi folks,
we have written a small paper about microindel detection in short sequence reads. There is a little twist in it: the position of an indel is often not unambiguously defined by a single position (think of homopolymers). We used a simple algorithm to define an unambiguous equivalent indel region (eir). Using the eir, you can increase the sensitivity of indel detection. It's very simple to implement. Folks interested in microindel detection should think about analysing the eir in their sequence alignments:
http://www.ncbi.nlm.nih.gov/pubmed/2...m&ordinalpos=1
best,
peter
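
For readers who want to experiment, here is a minimal Python sketch of the equivalent indel region idea described above. It is not the paper's implementation: the function name `equivalent_indel_region`, the 0-based coordinates, and the `kind` argument are assumptions made for illustration. The sketch shifts the indel left and right for as long as the resulting sequence stays identical and reports the two extreme positions.

```python
def equivalent_indel_region(ref, pos, seq, kind):
    """Return (leftmost, rightmost) 0-based positions of an indel.

    ref  : reference sequence
    pos  : for kind="del", index of the first deleted base;
           for kind="ins", index of the reference base the insertion precedes
    seq  : the deleted or inserted bases
    kind : "del" or "ins" (hypothetical labels used only in this sketch)
    """
    if not seq:
        raise ValueError("seq must not be empty")
    # A deletion occupies len(seq) reference bases, an insertion none.
    span = len(seq) if kind == "del" else 0

    # Shift left while the base before the indel equals its last base.
    left, s = pos, seq
    while left > 0 and ref[left - 1] == s[-1]:
        s = s[-1] + s[:-1]          # rotate the indel sequence
        left -= 1

    # Shift right while the base after the indel equals its first base.
    right, s = pos, seq
    while right + span < len(ref) and ref[right + span] == s[0]:
        s = s[1:] + s[0]
        right += 1

    return left, right


# One T deleted from the T homopolymer: any start from 2 to 6 is equivalent.
print(equivalent_indel_region("GATTTTTCA", 4, "T", "del"))  # (2, 6)
```

Any two indel calls whose positions fall inside the same region (with correspondingly rotated sequences) describe the same variant, which is why comparing regions rather than single positions can help when pooling evidence across reads.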

02-15-2010, 05:41 PM   #2
nilshomer
Nils Homer
Location: Boston, MA, USA
Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by krawitz View Post
Hi folks,
we have written a small paper about microindel detection in short sequence reads. There is a little twist in it: the position of an indel is often not unambiguously defined by a single position (think of homopolymers). We used a simple algorithm to define an unambiguous equivalent indel region (eir). Using the eir, you can increase the sensitivity of indel detection. It's very simple to implement. Folks interested in microindel detection should think about analysing the eir in their sequence alignments:
http://www.ncbi.nlm.nih.gov/pubmed/2...m&ordinalpos=1
best,
peter
When positioning an indel, it is important for variant calling to have a "left-justify" or similar rule, so that the alignment is consistent across reads. Re-alignment or local re-assembly will certainly help.
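
A left-justify rule of this kind can be sketched in a few lines of Python. The function name `left_justify` and the 0-based `(pos, seq)` representation are assumptions for illustration; this is not taken from BFAST or any other aligner.

```python
def left_justify(ref, pos, seq):
    """Shift an indel as far left as possible without changing the result.

    ref : reference sequence
    pos : 0-based index of the first deleted base, or of the reference
          base that an insertion precedes
    seq : the deleted or inserted bases
    Returns the left-most equivalent (pos, seq).
    """
    while pos > 0 and ref[pos - 1] == seq[-1]:
        seq = seq[-1] + seq[:-1]   # rotate so the shifted indel stays valid
        pos -= 1
    return pos, seq


ref = "GATTTTTCA"
# Two reads report the same 1 bp deletion at different places in the run.
print(left_justify(ref, 5, "T"))
print(left_justify(ref, 3, "T"))
```

Both calls normalize to the same position, so reads that placed the gap differently end up supporting one variant instead of two.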

A few criticisms of the paper coming from an alignment author (BFAST):

- All the evaluated aligners do not report indels in ABI SOLiD data.
- Aligners such as BFAST and SHRiMP that perform gapped local alignment were not evaluated.
- Some aligners compared do inherently detect indels and are mysteriously included.

02-16-2010, 11:32 AM   #3
krawitz
Member
Location: Bonn
Join Date: Feb 2010
Posts: 30

Hi Nils,

You are right, a "left-justify" rule would work. However, isn't that somewhat unsatisfying if a crystal-clear definition is available? Think about biologists annotating indels: one group is a "left justifier" group, the other a "right justifier" group. It may take a while until they understand that they are talking about the same indel. This might sound ridiculous to you, but things like that happen all the time! So where possible one should use clear-cut definitions.

I would also like to comment on your criticism. The paper was not intended as a benchmarking study of alignment tools. It is just not possible to thoroughly analyze and compare all short read mapping tools. In fact BFAST may be as accurate as BWA or Novoalign (and I am looking forward to testing it on our next data sets).

So instead of comparing a plethora of mapping tools, we focused on a few widely used ones. We tested a fast gapped mapper, BWA, and a very accurate one, Novoalign. The bottom line for the biologist who is actually running the experiment is the following: it is possible to detect microindels in short read data (Harismendy et al. were apparently not aware of this), and your sensitivity and positive predictive value will mainly profit from longer reads and good coverage (the alignment tool can give you only a few extra percent).

Another important message of the paper is that the microindel frequency in human genomes is probably around 1/10,000. For this reason it makes a lot of sense to use gapped alignment tools even if you are only screening for SNPs, because gapped alignment reduces false positives in SNP calling.

I would like to emphasize again that it was not possible for us to benchmark all existing mapping tools. I hope the take-home message for the reader is the following: stop using ungapped alignment tools in resequencing projects for mutation screening. There are plenty of gapped aligners, and BFAST is one of them.

do you agree?

cheers,

Peter

02-16-2010, 02:18 PM   #4
lh3
Senior Member
Location: Boston
Join Date: Feb 2008
Posts: 693

Yes. I second this: gapped alignment is very important to SNP discovery (first emphasized in a 2008 paper and rediscovered several times). In addition, the human indel mutation rate accessible to 70bp reads is higher than 1/10,000, more likely to be ~1/7,000. Not doing gapped alignment throws away >10% of important mutations. I have also seen that ungapped alignment may lead to wrong alignments, which may deceive breakdancer into calling false translocations. I do not know how gapped alignment affects ChIP-seq/RNA-seq, but using a gapped aligner is recommended anyway.
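
As a rough back-of-the-envelope check of these figures, the following Python lines show one possible way to arrive at a number above 10%. The SNP rate of about 1 per 1,000 bp is an assumption added here for illustration and does not come from this thread; only the 1/7,000 indel rate and the 70bp read length are quoted from the post.

```python
indel_rate = 1 / 7000    # indels per base, as quoted in the post
snp_rate = 1 / 1000      # ASSUMPTION: commonly cited rough figure, not from the thread
read_length = 70         # bp, as quoted in the post

# Share of all small variants that are indels under these assumptions.
indel_share = indel_rate / (indel_rate + snp_rate)

# Expected number of indels overlapping a single read.
indels_per_read = read_length * indel_rate

print(f"indel share of variants: {indel_share:.1%}")
print(f"expected indels per read: {indels_per_read:.3f}")
```

Under these assumed rates, indels make up about one variant in eight, and roughly one read in a hundred overlaps an indel.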

02-17-2010, 01:25 PM   #5
Michael.James.Clark
Senior Member
Location: Palo Alto
Join Date: Apr 2009
Posts: 213

Quote:
Originally Posted by lh3 View Post
I have also seen that ungapped alignment may lead to wrong alignments, which may deceive breakdancer into calling false translocations.
Is this effect due to mismapping alone? That is, does ungapped alignment cause mismapping, which then leads to false positives in breakpoint analysis?

I ask because it is not obvious to me why sporadic mismapping of PE reads would deceive breakpoint detection.

02-17-2010, 01:43 PM   #6
lh3
Senior Member
Location: Boston
Join Date: Feb 2008
Posts: 693

Say the correct place of a read pair is on chr1. One read can be mapped correctly and uniquely, but its mate has a 3bp indel. Because of this 3bp indel, the best ungapped position of the mate is on chr2 instead of chr1. All reads containing the indel will be placed on chr2. This looks like a strong signal of a translocation, but is in fact due to wrong mapping. You will find no breakpoint. I have seen this on simulated data. It should occur much more often on real data.
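
The scenario above can be reproduced with a toy example in Python. Everything here is made up for illustration: the sequences, the names `chr1_locus` and `chr2_locus`, and the helper `best_ungapped` are hypothetical, and real mappers score alignments in more elaborate ways than a plain mismatch count.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_ungapped(read, locus):
    """Fewest mismatches over all ungapped placements of read within locus."""
    return min(
        hamming(read, locus[i:i + len(read)])
        for i in range(len(locus) - len(read) + 1)
    )


chr1_locus = "ACGTTGCAAGGCTTACGGATCCGTA"     # true origin of the read (toy sequence)
read = chr1_locus[:8] + chr1_locus[11:]      # the sample carries a 3 bp deletion

# Hypothetical paralog on chr2: the deleted form plus two substitutions.
paralog = list(read)
paralog[3], paralog[15] = "A", "C"
chr2_locus = "".join(paralog)

mm_chr1 = best_ungapped(read, chr1_locus)
mm_chr2 = best_ungapped(read, chr2_locus)
print("mismatches on chr1:", mm_chr1)
print("mismatches on chr2:", mm_chr2)
```

Without gaps the read fits the chr2 paralog with fewer mismatches than its true locus, so an ungapped mapper would place it on chr2. With a single 3 bp gap the read matches chr1 exactly, since it was built as `chr1_locus[:8] + chr1_locus[11:]`.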

02-18-2010, 01:15 AM   #7
sparks
Senior Member
Location: Kuala Lumpur, Malaysia
Join Date: Mar 2008
Posts: 126

On the subject of left- or right-justifying indels: the position of an insertion can be affected by base qualities (at least in Novoalign), as a better score can be had by inserting a low-quality base than a high-quality one.

02-18-2010, 08:54 AM   #8
drio
Senior Member
Location: 4117'49"N / 24'42"E
Join Date: Oct 2008
Posts: 323

Quote:
Originally Posted by lh3 View Post
Yes. I second this: gapped alignment is very important to SNP discovery (first emphasized in a 2008 paper and rediscovered several times). In addition, the human indel mutation rate accessible to 70bp reads is higher than 1/10,000, more likely to be ~1/7,000. Not doing gapped alignment throws away >10% of important mutations. I have also seen that ungapped alignment may lead to wrong alignments, which may deceive breakdancer into calling false translocations. I do not know how gapped alignment affects ChIP-seq/RNA-seq, but using a gapped aligner is recommended anyway.
Heng, can you please pass us the link to the paper?
__________________
-drd

02-18-2010, 08:56 AM   #9
drio
Senior Member
Location: 4117'49"N / 24'42"E
Join Date: Oct 2008
Posts: 323

Quote:
Originally Posted by nilshomer View Post
- All the evaluated aligners do not report indels in ABI SOLiD data.
Hi krawitz,

Any thoughts on that one? I would love to see some results against ABI SOLiD data.
__________________
-drd

02-18-2010, 09:07 AM   #10
lh3
Senior Member
Location: Boston
Join Date: Feb 2008
Posts: 693

Sequencing of natural strains of Arabidopsis thaliana with short reads, by Ossowski et al. (2008). It claims that maq performs poorly on their single-end data set because maq does not do gapped alignment for such data. It is a fair claim, but in the maq paper we had already noticed the importance of indels and applied the "indel filter". The NA18507 and YanHuang papers both applied the indel filter. The 1000 Genomes Project found that undetected indels are the leading cause of false SNPs.

SOLiD data makes no difference. If you do not do gapped alignment, you will get clustered false SNPs and clustered false alignments.

EDIT: the attached plot (on simulated data) is additional evidence that ungapped alignment leads to more false SNPs. It is true that most of these false SNPs are around a true indel, but we really want to see the indel rather than a row of false SNPs.
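
The "row of false SNPs" effect is easy to see on a toy sequence. The following Python sketch is only an illustration with a made-up reference; the helper `mismatch_positions` is a hypothetical name and not part of maq or any other tool.

```python
def mismatch_positions(read, ref, offset=0):
    """Positions in the read that disagree with the reference when the
    read is placed at `offset` without gaps."""
    return [i for i, base in enumerate(read) if base != ref[offset + i]]


ref = "ACGTTGCAAGGCTTACGGATCCGTA"   # toy reference
sample = ref[:10] + ref[11:]        # the sample has a 1 bp deletion
read = sample[:20]                  # a read spanning the deletion

# Forced ungapped alignment at the read's true start position.
false_snps = mismatch_positions(read, ref)
print(false_snps)
```

Everything upstream of the deletion matches, while the bases downstream are shifted by one and show up as a cluster of mismatches. A SNP caller looking only at this ungapped alignment would see several apparent SNPs instead of a single indel.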
Attached Files
File Type: pdf 108v-bw.pdf (9.4 KB, 36 views)

Last edited by lh3; 02-18-2010 at 09:19 AM.

02-18-2010, 10:18 AM   #11
nilshomer
Nils Homer
Location: Boston, MA, USA
Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by lh3 View Post
Sequencing of natural strains of Arabidopsis thaliana with short reads, by Ossowski et al. (2008). It claims that maq performs poorly on their single-end data set because maq does not do gapped alignment for such data. It is a fair claim, but in the maq paper we had already noticed the importance of indels and applied the "indel filter". The NA18507 and YanHuang papers both applied the indel filter. The 1000 Genomes Project found that undetected indels are the leading cause of false SNPs.

SOLiD data makes no difference. If you do not do gapped alignment, you will get clustered false SNPs and clustered false alignments.

EDIT: the attached plot (on simulated data) is additional evidence that ungapped alignment leads to more false SNPs. It is true that most of these false SNPs are around a true indel, but we really want to see the indel rather than a row of false SNPs.
Heng is right about indels causing errors with ungapped alignment.

Also, novoalign is correct to use qualities when aligning each read independently. However, what we really want is for all reads to "agree" where the indel occurs to create an accurate assembly. Having a "justification rule" will solve this, whereas aligning an indel based on quality will add noise (some reads will support the right or left positioning based on quality).

In any case, I think that some type of local re-alignment/re-assembly is required since all short read aligners align reads independently.

03-12-2010, 06:31 AM   #12
NGSfan
Senior Member
Location: Austria
Join Date: Apr 2009
Posts: 181

Quote:
Originally Posted by krawitz View Post
Hi Nils,

I would also like to comment on your criticism. The paper was not intended as a benchmarking study of alignment tools. It is just not possible to thoroughly analyze and compare all short read mapping tools. In fact BFAST may be as accurate as BWA or Novoalign (and I am looking forward to testing it on our next data sets).
I would also be interested in BFAST/SSAHA2 comparison, but as everyone points out - there are just too many aligners out there (last count I had was at 35 programs!!!). And nobody has the time to compare them all.

03-12-2010, 08:26 AM   #13
NGSfan
Senior Member
Location: Austria
Join Date: Apr 2009
Posts: 181

Quote:
Originally Posted by nilshomer View Post
Heng is right about indels causing errors with ungapped alignment.

Also, novoalign is correct to use qualities when aligning each read independently. However, what we really want is for all reads to "agree" where the indel occurs to create an accurate assembly. Having a "justification rule" will solve this, whereas aligning an indel based on quality will add noise (some reads will support the right or left positioning based on quality).

In any case, I think that some type of local re-alignment/re-assembly is required since all short read aligners align reads independently.
What is your opinion of the Broad's GATK approach? I think they use some multiple sequence alignment to follow up the pairwise alignments done with an aligner and then put the reads in agreement.

09-17-2010, 02:08 AM   #14
maricu
Junior Member
Location: denmark
Join Date: Apr 2010
Posts: 8

"The 1000 genomes projects found undetected indel is the leading cause of false SNPs."
I arrived quite late to this thread, but I'd like to know if there is a reference for this statement. Thanks.

09-17-2010, 10:30 AM   #15
lh3
Senior Member
Location: Boston
Join Date: Feb 2008
Posts: 693

krawitz has given that