SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
SNP positions in dbSNP and Cosmic databases kalinka23 Bioinformatics 4 03-14-2012 03:42 AM
PubMed: Enhanced mismatch mutation analysis: simultaneous detection of point mutation Newsbot! Literature Watch 0 12-14-2010 02:20 AM
Hunting in databases cascoamarillo Bioinformatics 0 11-05-2010 11:09 AM
SNP/Mutation Detection Using Illumina Paired-end Data qqcandy Bioinformatics 14 07-24-2009 07:43 AM
SNP/mutation Using Illumina Paired-end Data qqcandy General 0 10-01-2008 04:58 PM

Reply
 
Thread Tools
Old 02-09-2011, 07:54 AM   #1
fpepin
Member
 
Location: Berkeley

Join Date: Feb 2011
Posts: 30
Default very common mutation not in SNP databases

We've got DNA sequencing for exome-capture run for about 40 cancer cell lines. I'm finding some very common differences from the reference genome that isn't in dbSNP or 1000 Genomes and I was wondering if anyone could help explain what it is and if there is a reasonable way to filter off these features.

One example is A1BG with a T109G (ACC->CCC at position 325 [transcript ENST00000453054], reverse strand), chr19:58862835 T->G (hg19, chr19:63554647 for hg18). I have that mutation on 37/42 samples.

The various QC metrics are good and the reads are nicely clustering around exonic regions as expected. I checked with a known gene (p53) and the resulting mutations also match up.

The fact that it appears in so many samples and isn't in either dbSNP or 1000 genomes worries me a little bit. We have over 100 genes that have unknown mutations in over 30 samples, so I'd love to have a decent filter for them.
fpepin is offline   Reply With Quote
Old 02-09-2011, 08:51 AM   #2
csoong
Member
 
Location: Connecticut

Join Date: Jun 2009
Posts: 74
Default

Either you are very lucky to find a cancer (or at least cell line) specific mutation or perhaps due to the informatics. I'd Sanger sequence the mutation and see if it's indeed what you think it is. Another thing to consider would be pseudogenes and duplicated genomic regions. Both of which may have slight variation with the gene of interest but still pass informatic filters.
csoong is offline   Reply With Quote
Old 02-09-2011, 10:34 AM   #3
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

There is something going here. This does appear to be "common" in the many bamfiles I've looked at.

Are you seeing something like this in this region (image is centered on your "snp")? :


The SNP in question is in the middle of the di-sulfide bridge bond motif.
Notes for Summary of A1BG (from refseq)
The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins.
Richard Finney is offline   Reply With Quote
Old 02-09-2011, 10:59 AM   #4
fpepin
Member
 
Location: Berkeley

Join Date: Feb 2011
Posts: 30
Default

csoong: this seems improbable enough that I'm betting against it being a cancer gene (hence my posting here). It would be an incredible case of beginner's luck to find something that prevalent that no one has find before. As I said, this is one example of over a hundred. Resequencing seems like a decent suggestion. I'll check for pseudogenes or duplicated regions and post back.

Richard: It's more focused in my case, very little changes in the other bases. This sample 9/40 reads that have the mutation here. I'm glad to see that I'm not the only one seeing this.
Attached Images
File Type: jpg igv_snapshot.jpg (38.1 KB, 55 views)
fpepin is offline   Reply With Quote
Old 02-09-2011, 12:57 PM   #5
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

I don't think it's real. The noise I see is too consistent.

1) Is this Illumina ?
2) Is the SNP call all from one direction of aligned reads (all plus or all minus)?
3) Is the SNP being call only have evidence for it in reads with a sudden drop off in quality right near the called SNP?
Richard Finney is offline   Reply With Quote
Old 02-09-2011, 01:13 PM   #6
husamia
Member
 
Location: cinci

Join Date: Apr 2010
Posts: 66
Default

fpepin: What is it? I think its a false positive due to misalignment. How to filter it? depends on first question. I think its a false positive because I also found it in an exom sample with similar ~45 depth, ELAND alignment. 30% of reads the variant was at ends. 10% there was other mismatches withing 5bps from this position. I have it annotated differently from your annotation as (c.832T>G het p.T278TP not T109G can you check your annotation?) are you using illumina? what alignment algorithm did you use?

csoong: i have found handful of pseudogenes that cause false positive calls one of which I was able to confirm with sequencing. My experience been two and more variants within close proximity from each other <10bps usually in same read. I have some examples but I don't want to go into details. What platform are you using? what alignment algorithm did you use?

thanks for this post. This has been worrying me and I havn't found a good way to handle it. Is there other than BWT algorithms that have been used for exom alignment. I have used two different algorithms with same results both use BWT methodology mainly because of large reference one of them is ELAND but I like to try another method of alignment.

Last edited by husamia; 02-09-2011 at 01:17 PM.
husamia is offline   Reply With Quote
Old 02-09-2011, 02:20 PM   #7
fpepin
Member
 
Location: Berkeley

Join Date: Feb 2011
Posts: 30
Default

Quote:
Originally Posted by Richard Finney View Post
I don't think it's real. The noise I see is too consistent.

1) Is this Illumina ?
2) Is the SNP call all from one direction of aligned reads (all plus or all minus)?
3) Is the SNP being call only have evidence for it in reads with a sudden drop off in quality right near the called SNP?
1) Yes, it's a GAII.
2) Yes, that is correct, in this case. Checking with some of my other problem genes, this seems to be the case in all but 1 (which I'm willing to believe is actually interesting). What made you think that this might be the case?
3) Generally. There is a trend toward the end of the reads (last 5 bases) and other cases where the quality drops sharply and comes back up but there are also a fair number of cases where the quality remains strong for a while after (10+ bases).

It also seems that some of these reads are mis-aligned, 2 of my problematic genes are on chromosome Y. They account for the totality of the mutations called on chromosome Y on my breast cancer cell lines. These genes also have a lot of mutations close together, leading me to believe that those were mapping errors.

Looking more in depth, it looks like I've got 16 genes with the strange one-strand-only mutations and 193 that are more likely to be due to mapping errors.

Is there an explanation as to what would cause a such a consistent artifact like this only on reads from a given strand over dozens of samples?

--edit--
husamia: The alignment is done using BWA. You are correct, position 109 is wrong, it is indeed 278. Looks like the annotation code that I inherited is wrong.

Last edited by fpepin; 02-09-2011 at 02:29 PM. Reason: adding response to husamia
fpepin is offline   Reply With Quote
Old 02-09-2011, 02:33 PM   #8
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Try using BWA or Novoalign to realign this data. Call variants with samtools or GATK.

Edit: Okay. Which variant caller then?

Did not find the same thing in my exome data. Aligned with novoalign (hard clipping on) and used GATK for variant calling.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]

Last edited by Michael.James.Clark; 02-09-2011 at 03:01 PM.
Michael.James.Clark is offline   Reply With Quote
Old 02-09-2011, 02:35 PM   #9
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

I was getting some hints of a mutation there, in other Illumina samples I have.

Using bambino viewer tool, I see this :
http://img9.imageshack.us/i/bambino.png/

The little red nib at top between the T and Y is a measure of snpness.
All of the snpness is from reads in one direction and they are all on the border of some quality valley. The quality is likely bad the extra two reads in , just not showing up as bad. I asked the bambino author (tool picture shown) and he said :

I see some reads have runs of nucleotide quality 2, which smells like the Illumina pipeline masking issue ...

http://seqanswers.com/forums/showthread.php?t=4721
(comment from Torst, 04-24-2010, 09:21 PM)

I'm NOT sure if he's right or wrong,but that wall of quality drop off is very strange to look at. Maybe someone else knows what's up.
Richard Finney is offline   Reply With Quote
Old 02-09-2011, 03:55 PM   #10
fpepin
Member
 
Location: Berkeley

Join Date: Feb 2011
Posts: 30
Default

Quote:
Originally Posted by Michael.James.Clark View Post
Try using BWA or Novoalign to realign this data. Call variants with samtools or GATK.

Edit: Okay. Which variant caller then?

Did not find the same thing in my exome data. Aligned with novoalign (hard clipping on) and used GATK for variant calling.
This is also using GATK (java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -pl SOLEXA ...) for the variant calling. Is samtools likely to be different?

Could it be that slightly more stringent threshold for trimming reads might get rid of these artifacts, since they tend to be at the end of the reads?
fpepin is offline   Reply With Quote
Old 02-09-2011, 04:09 PM   #11
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by fpepin View Post
This is also using GATK (java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -pl SOLEXA ...) for the variant calling. Is samtools likely to be different?

Could it be that slightly more stringent threshold for trimming reads might get rid of these artifacts, since they tend to be at the end of the reads?
I think if you're seeing it with GATK, you'll probably see it with samtools, though you could try it.

Why not try clipping the ends off, realigning, and doing variant calling? Just trim five or ten bases off the end. This is something people often do.

Personally, I use Novoalign's hard clipping and adaptor stripping, and that seems to bump up specificity.

Also, for GATK, take a look at the -baq parameter (see their wiki). It's recommended to use that if you didn't.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Michael.James.Clark is offline   Reply With Quote
Old 02-10-2011, 12:26 AM   #12
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default

Were these data sequenced one sample per lane (which I expect), or multiple samples per lane?

Were other, non cancer cell line samples sequenced on the same run as cancer cell line samples? (I mean, in our facility there's more than one research group that uses the sequencer, there may be like 5 whole exomes for one group and 3 for another going on)

If that's the case, does that data show the same artefact?
Bruins is offline   Reply With Quote
Old 02-10-2011, 09:51 AM   #13
fpepin
Member
 
Location: Berkeley

Join Date: Feb 2011
Posts: 30
Default

Bruins:This is 2 lanes per sample. I haven't checked the results from the individual lanes in my haste to get results, but I'm running it now.

Also, the runs are basically all cancer-related. I could look for some normals were run with the same protocols, but the artificiality of the having all the mutations in reads on one strand strongly suggest that the problem istechnical in nature.

Michael: I have an old version of GATK in my pipeline (1.0.3185) that doesn't have -baq implemented. It looks like I would need to tweak things to get the newer version to work, as it complains about the ordering of my contigs (lexicographically: 10, 11 , ..., 1 ,20, 21, ..., instead of karyotypically: 1,2,3). It can't be hard to fix, though.

Having not followed the development of GATK, how important is it to be up to date?
fpepin is offline   Reply With Quote
Old 02-11-2011, 10:28 AM   #14
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by fpepin View Post
Michael: I have an old version of GATK in my pipeline (1.0.3185) that doesn't have -baq implemented. It looks like I would need to tweak things to get the newer version to work, as it complains about the ordering of my contigs (lexicographically: 10, 11 , ..., 1 ,20, 21, ..., instead of karyotypically: 1,2,3). It can't be hard to fix, though.

Having not followed the development of GATK, how important is it to be up to date?
It seems to be fairly important at this stage because of baq implementation and the -glm DINDEL feature of the UnifiedGenotyper (the UG is used to more sensitively detect indels than the IndelGenotyperV2!).
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Michael.James.Clark is offline   Reply With Quote
Old 02-11-2011, 12:53 PM   #15
fpepin
Member
 
Location: Berkeley

Join Date: Feb 2011
Posts: 30
Default

Thanks Micheal, I'll update my pipeline then.

I also talked to more people about this issue and the answer is that this is a well-known issue when you have a specific tetramer (CCGG, if I remember correctly) that ends up causing bases to be skipped.

This explains why the differences are strand specific and why it happens in very specific places in the genome (including in completely different labs). The reads generally end up being clipped soon after because the indel, but there are cases that are repetitive enough that the few other bases match by chance, leading for the trimming to occur later and the read to be kept.

I'm not quite sure where would be best place to recognize this corner case and deal with it, but in my case I'll take the lazy way out and just filter them out at the end.

Thanks for all who offered suggestions.
fpepin is offline   Reply With Quote
Old 02-11-2011, 01:01 PM   #16
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

Hey, thanks for figuring out that it was CCGG . I thought it might be something like that. It must be that a certain distance into the read if the CCGG happens it starts stuttering. I see that it is quality "masked" but that the "bad quality mask" is short by about 2-3 bps. The bad call is often in that "good quality" area, in the data I look at. The "snp call must have enough evidence in both directions" rule should save the day.
Richard Finney is offline   Reply With Quote
Old 02-11-2011, 01:53 PM   #17
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by fpepin View Post
I also talked to more people about this issue and the answer is that this is a well-known issue when you have a specific tetramer (CCGG, if I remember correctly) that ends up causing bases to be skipped.

This explains why the differences are strand specific and why it happens in very specific places in the genome (including in completely different labs). The reads generally end up being clipped soon after because the indel, but there are cases that are repetitive enough that the few other bases match by chance, leading for the trimming to occur later and the read to be kept.

I'm not quite sure where would be best place to recognize this corner case and deal with it, but in my case I'll take the lazy way out and just filter them out at the end.

Thanks for all who offered suggestions.
Ah yes! I recall a thread discussing that very issue here a while back now that you mention it. Like Richard suggested, insisting that there be evidence for the call on both strands will generally get rid of these, thankfully. (Edit: Maybe that thread wasn't here, as I can't seem to hunt it down. If anyone remember it, link us up!)
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]

Last edited by Michael.James.Clark; 02-11-2011 at 01:57 PM.
Michael.James.Clark is offline   Reply With Quote
Old 02-11-2011, 05:14 PM   #18
csoong
Member
 
Location: Connecticut

Join Date: Jun 2009
Posts: 74
Default

Thanks for all the detailed responses!
Could anyone further explain the difference of an early CCGG vs a late one? At the program level what is happening here?
csoong is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:20 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO