SEQanswers

Old 05-21-2010, 04:24 AM   #1
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default What about mutations in the "twilight zone"?

Dear fellow NGSers,

From what I've seen so far, I am able to use available tools (samtools, gatk, etc) to find SNVs and INDELs. For example:


TGACTTGCTGA Reference
TGACTCGCTGA Read 1
TGACTCGCTGA Read 2 etc..



TGACTTGCTGA Reference
TGACT---TGA Read 1
TGACT---TGA Read 2 etc..


Regarding the detection of mutations, how does one handle mutations that are not Single Nucleotide Variants and that are not Insertion/Deletions (Indels) ?

What if, for example, you have two neighboring SNV mutations detected inside your reads?


AGACTAGATCA Reference
AGACACGATCA Read 1
AGACACGATCA Read 2 etc..


Are these recorded in samtools pileup as two separate SNVs, or can they be detected as belonging together?

Or what about a deletion of sequence and the insertion of new sequence?


AGACTAGA-TCA Reference
AGAGATAAGTCA Read 1
AGAGATAAGTCA Read 2 etc..


It seems to me that most of the tools out there can handle identifying the simple SNV/indel scenarios but do not take such cases into account. Does samtools pileup capture these kinds of mutations?

Perhaps my assumption is wrong and some of the available tools handle them?

Thanks for any input.
Old 05-21-2010, 07:32 AM   #2
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

New paper out (not yet in Medline! -- that's how new it is) addresses this issue but does not appear to contain specific code


Nucleic Acids Research, doi:10.1093/nar/gkq408

Novel multi-nucleotide polymorphisms in the human genome characterized by whole genome and exome sequencing

Jeffrey A. Rosenfeld, Anil K. Malhotra and Todd Lencz

Genomic sequence comparisons between individuals are usually restricted to the analysis of single nucleotide polymorphisms (SNPs). While the interrogation of SNPs is efficient, they are not the only form of divergence between genomes. In this report, we expand the scope of polymorphism detection by investigating the occurrence of double nucleotide polymorphisms (DNPs) and triple nucleotide polymorphisms (TNPs), in which two or three consecutive nucleotides are altered compared to the reference sequence. We have found such DNPs and TNPs throughout two complete genomes and eight exomes. Within exons, these novel polymorphisms are over-represented amongst protein-altering variants; nearly all DNPs and TNPs result in a change in amino acid sequence and, in some cases, two adjacent amino acids are changed. DNPs and TNPs represent a potentially important new source of genetic variation which may underlie human disease and they should be included in future medical genetics studies. As a confirmation of the damaging nature of xNPs, we have identified changes in the exome of a glioblastoma cell line that are important in glioblastoma pathogenesis. We have found a TNP causing a single amino acid change in LAMC2 and a TNP causing a truncation of HUWE1.
Old 05-21-2010, 10:08 AM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

I think there are two subquestions there: will an aligner align reads that are more disparate than a single base change, and what will a variant parser make of them?

DNPs will probably be fine, even TNPs if your aligner handles 3 mismatches in a read. And I don't see why a variant parser would have a hard time with that.
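To make the DNP/TNP idea concrete, here is a minimal Python sketch that groups consecutive mismatches between one read and the reference into a single event. It assumes an ungapped, equal-length alignment, and the names `mismatch_runs` and `classify` are made up for illustration; this is not how samtools or any particular caller works.

```python
def mismatch_runs(ref, read):
    """Group consecutive mismatches of an ungapped read against the reference.

    Returns a list of (1-based start, reference bases, read bases).
    """
    if len(ref) != len(read):
        raise ValueError("this sketch assumes an ungapped, equal-length alignment")
    runs = []
    start = None  # 0-based start of the mismatch run currently open, if any
    for i, (r, q) in enumerate(zip(ref, read)):
        if r != q:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start + 1, ref[start:i], read[start:i]))
            start = None
    if start is not None:  # a run that extends to the end of the read
        runs.append((start + 1, ref[start:], read[start:]))
    return runs


def classify(run):
    """Label a run by its length: SNV, DNP, TNP, or MNP for anything longer."""
    labels = {1: "SNV", 2: "DNP", 3: "TNP"}
    return labels.get(len(run[1]), "MNP")


# The two ungapped examples from the first post.
print(mismatch_runs("TGACTTGCTGA", "TGACTCGCTGA"))  # [(6, 'T', 'C')]
print(mismatch_runs("AGACTAGATCA", "AGACACGATCA"))  # [(5, 'TA', 'AC')]
```

Grouping per read like this only says what a single read shows; whether the site is called as one event across all reads is a separate question.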

Your last example is the hard one, as most aligners just wouldn't align reads with a 5 base discrepancy. What you'd see is a steep drop-off in coverage just over the change, possibly with the edges of the discrepancy called as SNPs, as some reads will land in exactly the right place that they just cover it, and will align with only 1 or 2 discrepancies at the end of the read. In theory, if you fixed your reference genome to match at those two letters, and then realigned, you'd get more reads aligning, and maybe you'd cover the whole region with reads after enough iterations. But aligning a second time probably isn't feasible for many genomes.

De novo would catch those kinds of things, if your sample was mostly clonal or homozygous for that large change. Compare your de novo to your reference, and you'd see the discrepancy fine.
Old 05-21-2010, 01:03 PM   #4
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

krobison:

wow! thanks a lot for sharing this paper with me - this is definitely hot off the press and on topic!

swbarnes2:

>I think there's two subquestions there; will an aligner align reads that are more disparate than a single base change, and what will a variant parser make of them?

Yes, this is true - there are two parts to detection: alignment and variant parsing. I would think that newer aligners such as BWA/BFAST/Novoalign can handle mismatches and indels >3bp. Bowtie maxes out at 3 mismatches.

>DNPs will probably be fine, even TNPs if your aligner handles 3 mismatches in a read. And I don't see why a variant parser would have a hard time with that.

The variant parser is really where I am concerned, because in the pileup output from samtools it looks like neighboring SNVs get treated separately rather than as belonging together. The point is that if your short reads capture two SNVs within one read length, then you can assign these two mutations to the same allele. In heterozygous situations, treating them separately leaves open the possibility that one allele has mutation 1 and the other allele has mutation 2.

Please correct me if my genetic vocabulary use is wrong.

Thanks for joining the discussion and sharing your input. It's great to bounce off ideas and hear back from others

Last edited by NGSfan; 05-25-2010 at 12:12 AM.
Old 05-24-2010, 05:43 PM   #5
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by NGSfan View Post
What if, for example, you have two neighboring SNV mutations detected inside your reads?

...

Or what about a deletion of sequence and the insertion of new sequence?

...

It seems to me that most of the tools out there can handle identifying the simple SNV/indel scenarios but do not take such cases into account. Does samtools pileup capture these kinds of mutations?

Perhaps my assumption is wrong and some of the available tools handle them?

Thanks for any input.
In pure sequence terms, I don't think there is a difference between two SNVs right next to each other and a complex indel where two neighboring bases are removed and replaced with two other bases. Those two events will look identical when two sequences are side by side.

I believe many aligners and at least samtools for variant calling are indeed robust to these types of events and that they will usually mark them as indels because they do not necessarily constrain indels to a particular size, but they likely do constrain SNVs to one base (since they are, after all, single nucleotide variants). I guess that if a variant caller sees a spot where two expected bases in a row are missing, it flags that spot as a deletion, and if it sees a spot where two unexpected bases are present, it flags that as an insertion.

Therefore, it seems reasonable that such an event will be flagged as a deletion and an insertion directly adjacent to each other. (In fact, in the back of my mind, I feel like I've seen that very type of thing before in our own whole genome alignments... maybe just in aberrant reads, though.)

As for the indel example, as long as your aligner is robust against that (gapped aligners should be), that spot will similarly be flagged as both a deletion and an insertion adjacent to each other.
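As a rough illustration of what "a deletion and an insertion adjacent to each other" looks like in an alignment record, the sketch below walks a SAM-style CIGAR string and lists the indel operations with their reference positions. The CIGAR `3M5D6I3M` is just one plausible way a gapped aligner might represent the delins example from the first post; the helper names are invented for this sketch.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_events(cigar, ref_start):
    """List (op, 1-based reference position, length) for each I or D.

    For an insertion, the position is the reference base that follows it.
    """
    events = []
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in ("I", "D"):
            events.append((op, pos, n))
        if op in "MDN=X":  # only these operations consume reference bases
            pos += n
    return events


def has_adjacent_del_ins(cigar):
    """True if a deletion and an insertion sit next to each other in the CIGAR."""
    ops = [op for _, op in CIGAR_OP.findall(cigar)]
    return any({a, b} == {"I", "D"} for a, b in zip(ops, ops[1:]))


print(indel_events("3M5D6I3M", ref_start=1))  # [('D', 4, 5), ('I', 9, 6)]
print(has_adjacent_del_ins("3M5D6I3M"))       # True
print(has_adjacent_del_ins("5M3D3M"))         # False
```

A caller working position by position would see these as two separate events unless it explicitly checks for adjacency like this.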

Also, for the case where there are repetitive elements that make the exact position of that sort of event ambiguous, I believe people generally either left-justify them or randomly position them.
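For the left-justification part, a minimal sketch of the usual shifting rule for a deletion inside a repeat might look like this. The reference string is a made-up example, coordinates are 0-based, and `left_align_deletion` is a hypothetical helper rather than any tool's actual routine.

```python
def apply_deletion(ref, start, length):
    """Return the sequence with ref[start:start + length] removed (0-based)."""
    return ref[:start] + ref[start + length:]


def left_align_deletion(ref, start, length):
    """Shift a deletion as far left as possible without changing the result.

    The deletion can move one base to the left whenever the base just before
    it equals the last deleted base, which is what happens inside a repeat.
    """
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start


ref = "GGCACACACTT"      # hypothetical reference containing a CA/AC repeat
original_start = 7       # deleting "AC" near the right end of the repeat
shifted = left_align_deletion(ref, original_start, 2)

print(shifted)  # 2
# Both placements describe exactly the same mutated sequence.
print(apply_deletion(ref, original_start, 2) == apply_deletion(ref, shifted, 2))  # True
```

Insertions can be handled with the same idea by comparing the inserted bases against the reference to the left of the insertion point.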
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Old 05-25-2010, 12:29 AM   #6
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by Michael.James.Clark View Post
In pure sequence terms, I don't think there is a difference between two SNVs right next to each other and a complex indel where two neighboring bases are removed and replaced with two other bases. Those two events will look identical when two sequences are side by side.

I believe many aligners and at least samtools for variant calling are indeed robust to these types of events and that they will usually mark them as indels because they do not necessarily constrain indels to a particular size, but they likely do constrain SNVs to one base (since they are, after all, single nucleotide variants). I guess that if a variant caller sees a spot where two expected bases in a row are missing, it flags that spot as a deletion, and if it sees a spot where two unexpected bases are present, it flags that as an insertion.
You're right that in pure sequence terms it will not make a difference, since you are just recording changes. But it may make a difference when you want to distinguish alleles:

The following would get reported as g.6T>C and g.7T>G:


TGACTTTGCTGA Reference
TGACTCTGCTGA Read 1
TGACTCTGCTGA Read 2
TGACTCTGCTGA Read 3
TGACTTGGCTGA Read 4
TGACTTGGCTGA Read 5
TGACTTGGCTGA Read 6 etc..



And if my understanding of samtools pileup is correct, so would this case:


TGACTTTGCTGA Reference
TGACTCGGCTGA Read 1
TGACTCGGCTGA Read 2
TGACTCGGCTGA Read 3
TGACTTTGCTGA Read 4
TGACTTTGCTGA Read 5
TGACTTTGCTGA Read 6 etc..
etc..



So while both cases end up recorded as g.6T>C and g.7T>G, they are really different kinds of mutation. The first alignment tells you there are two alleles, each carrying a different mutation, while the second tells you that one allele carries both. I think it is important to distinguish this, no?
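To show what I mean, here is a small Python sketch that looks at which bases each read carries at the two positions together, instead of tallying each column on its own. It assumes ungapped reads that all start at reference position 1 and contain no sequencing errors; `joint_alleles` and `phase_call` are names I made up, and real data would need count thresholds.

```python
from collections import Counter


def joint_alleles(reads, pos_a, pos_b):
    """Count (base at pos_a, base at pos_b) pairs per read; positions are 1-based."""
    return Counter((r[pos_a - 1], r[pos_b - 1]) for r in reads)


def phase_call(ref, reads, pos_a, pos_b):
    """Say whether the two variant positions co-occur on the same reads."""
    ref_a, ref_b = ref[pos_a - 1], ref[pos_b - 1]
    both = only_one = 0
    for (a, b), n in joint_alleles(reads, pos_a, pos_b).items():
        if a != ref_a and b != ref_b:
            both += n          # read carries both changes
        elif a != ref_a or b != ref_b:
            only_one += n      # read carries exactly one change
    if both and not only_one:
        return "same_haplotype"
    if only_one and not both:
        return "different_haplotypes"
    return "mixed" if both else "no_variant"


ref = "TGACTTTGCTGA"
case_1 = ["TGACTCTGCTGA"] * 3 + ["TGACTTGGCTGA"] * 3
case_2 = ["TGACTCGGCTGA"] * 3 + ["TGACTTTGCTGA"] * 3

print(phase_call(ref, case_1, 6, 7))  # different_haplotypes
print(phase_call(ref, case_2, 6, 7))  # same_haplotype
```

Both cases give the same per-column counts at positions 6 and 7, so only the per-read view separates them.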


Quote:
Originally Posted by Michael.James.Clark View Post
Therefore, it seems reasonable that such an event will be flagged as a deletion and an insertion directly adjacent to each other. (In fact, in the back of my mind, I feel like I've seen that very type of thing before in our own whole genome alignments... maybe just in aberrant reads, though.)

As for the indel example, as long as your aligner is robust against that (gapped aligners should be), that spot will similarly be flagged as both a deletion and an insertion adjacent to each other.

Also, for the case where there are repetitive elements that make the exact position of that sort of event ambiguous, I believe people generally either left-justify them or randomly position them.
These are definitely difficult alignment situations, because they involve two events: first a deletion, then an insertion. I am using BFAST, which for the most part handles indels pretty well. I am just thinking of scenarios where the change is not just a deletion *or* an insertion but where both happened.
Old 05-25-2010, 08:36 AM   #7
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by NGSfan View Post
You're right that in pure sequence terms it will not make a difference, since you are just recording changes. But it may make a difference when you want to distinguish alleles:

The following would get reported as g.6T>C and g.7T>G:


TGACTTTGCTGA Reference
TGACTCTGCTGA Read 1
TGACTCTGCTGA Read 2
TGACTCTGCTGA Read 3
TGACTTGGCTGA Read 4
TGACTTGGCTGA Read 5
TGACTTGGCTGA Read 6 etc..



And if my understanding of samtools pileup is correct, so would this case:


TGACTTTGCTGA Reference
TGACTCGGCTGA Read 1
TGACTCGGCTGA Read 2
TGACTCGGCTGA Read 3
TGACTTTGCTGA Read 4
TGACTTTGCTGA Read 5
TGACTTTGCTGA Read 6 etc..
etc..



So while both cases end up recorded as g.6T>C and g.7T>G, they are really different kinds of mutation. The first alignment tells you there are two alleles, each carrying a different mutation, while the second tells you that one allele carries both. I think it is important to distinguish this, no?
But those aren't the same by sequence because they aren't occurring on the same haplotype. I doubt they would be reported as the same type of event: the first case should be called as two adjacent SNVs, since they're happening on separate haplotypes, while the second is a deletion adjacent to an insertion, because it's happening on the same haplotype.

Quote:
These are definitely difficult alignment situations, because they involve two events: first a deletion, then an insertion. I am using BFAST, which for the most part handles indels pretty well. I am just thinking of scenarios where the change is not just a deletion *or* an insertion but where both happened.
Like I said, in the back of my mind I recall seeing reads like this without a problem, and that's using BFAST. I'm not actually sure about samtools calling a variant like this because I don't recall seeing a variant like this (I think the closest I've seen is a deletion adjacent to a SNV). I encourage you to test it with a simulation if you're concerned with it, though.
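A simulation along those lines could start from something like the sketch below: plant the delins from the first post into a reference, sample error-free reads from the mutated sequence, and write them as FASTQ for whatever aligner and caller you want to test. The flanking sequences, read length, and function names are all assumptions made for the example.

```python
import random


def apply_delins(ref, start, del_len, inserted):
    """Replace ref[start:start + del_len] (0-based) with the inserted bases."""
    return ref[:start] + inserted + ref[start + del_len:]


def simulate_reads(template, read_len, n_reads, seed=0):
    """Sample error-free reads uniformly from the template (forward strand only)."""
    last_start = len(template) - read_len
    if last_start < 0:
        raise ValueError("read length exceeds template length")
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [template[s:s + read_len] for s in starts]


def to_fastq(reads, qual_char="I"):
    """Format reads as FASTQ text with a constant, made-up quality string."""
    lines = []
    for i, seq in enumerate(reads, 1):
        lines += [f"@sim_{i}", seq, "+", qual_char * len(seq)]
    return "\n".join(lines) + "\n"


# Hypothetical flanks around the 11-base reference from the first post.
left = "CCGTTAGCATGGACTTACGATCCGTAGGCA"
right = "GTTCAGGCTAACGTCCATGATTGCAAGCTC"
reference = left + "AGACTAGATCA" + right

# CTAGA -> GATAAG, i.e. AGACTAGATCA becomes AGAGATAAGTCA.
mutant = apply_delins(reference, len(left) + 3, 5, "GATAAG")

reads = simulate_reads(mutant, read_len=36, n_reads=50, seed=1)
fastq_text = to_fastq(reads)
```

Mixing reads sampled from `reference` and `mutant` would mimic a heterozygous site, and the FASTQ text can be written to a file and run through the same pipeline as real data.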