SEQanswers

Old 07-19-2009, 01:07 AM   #1
lukemn
Junior Member
 
Location: Australia

Join Date: Jul 2009
Posts: 3
high Q ambiguous SNPs from Maq

Hello,

I'm doing mutation detection by ~30x Illumina genome resequencing on a haploid eukaryote.

Maq seems to be working fine otherwise (not that I have a great deal of experience here), but the final SNP list includes MASSES of ambiguous calls (i.e. C>M, G>R etc.), many with the maximum phred score of 255. By masses I mean ~2/3 of ~1700 total filtered SNPs over the genome. From a haploid! And these are randomly distributed over the entire genome (8 chromosomes), so it's not partial duplications or restricted to repetitive sequence.

I should say I'm manually filtering to the advised thresholds (phred 40, depth 3; I'm also looking at neighbouring quality and number of hits, but these numbers look fine) rather than running SNPfilter, but AFAIK this shouldn't matter. I'm mostly using default Maq settings, except for the consensus assembly (-s -q 30).
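For readers who want to reproduce this kind of manual filter, here is a minimal Python sketch. It assumes a tab-delimited SNP table whose first six columns are chromosome, position, reference base, consensus base, consensus quality and read depth; that layout is an assumption to check against your own Maq output, and the function name `split_snps` and the example rows are made up for illustration. It applies the phred 40 / depth 3 thresholds and separates calls whose consensus is a two-base IUPAC ambiguity code (M, R, W, S, Y, K) from unambiguous ones.

```python
import csv
import io

# Standard IUPAC ambiguity codes for two-base mixtures.
IUPAC_TWO_BASE = {
    "M": "AC", "R": "AG", "W": "AT",
    "S": "CG", "Y": "CT", "K": "GT",
}


def split_snps(handle, min_qual=40, min_depth=3):
    """Return (unambiguous, ambiguous) SNP records passing the thresholds.

    Assumed column order (verify against your own file):
    chrom, pos, ref, consensus, consensus_quality, depth, ...
    """
    unambiguous, ambiguous = [], []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        chrom, pos = row[0], int(row[1])
        ref, cons = row[2].upper(), row[3].upper()
        qual, depth = int(row[4]), int(row[5])
        if qual < min_qual or depth < min_depth:
            continue  # fails the manual thresholds
        record = (chrom, pos, ref, cons, qual, depth)
        if cons in IUPAC_TWO_BASE:
            ambiguous.append(record)
        else:
            unambiguous.append(record)
    return unambiguous, ambiguous


# Made-up rows, for illustration only.
example = io.StringIO(
    "chr1\t100\tC\tM\t255\t30\n"
    "chr1\t250\tG\tA\t60\t28\n"
    "chr2\t75\tG\tR\t35\t31\n"
)
clean, mixed = split_snps(example)
print(len(clean), "unambiguous;", len(mixed), "ambiguous")
```

Keeping the ambiguous records in their own list, rather than discarding them, makes it easy to inspect them separately later.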

I'm moving to BWA/SAMtools to compare, but still, anyone know what could be going on here? I'm very happy to just throw these away if spurious, but not without knowing why they're getting through.

Thanks,
Luke
Old 07-19-2009, 07:21 AM   #2
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

You could convert the MAQ alignments to SAM format and use the SAMtools SNP caller, which itself uses the MAQ consensus caller (SAMtools was written by the same author as MAQ). In SAMtools I believe you can specify the ploidy, so that SNPs will never be called heterozygous. There are also a number of other parameters that are useful to tune.
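To illustrate what "never called heterozygous" means for a haploid sample, here is a rough Python sketch. It is not the MAQ or SAMtools consensus model, which weighs base and mapping qualities; it is a plain majority vote over the read bases at one site. The function name `haploid_call` and the 0.8 threshold are hypothetical choices for the example.

```python
from collections import Counter


def haploid_call(bases, min_major_fraction=0.8):
    """Call one base for a haploid site from the read bases covering it.

    Returns (base, major_fraction, confident). A simple majority vote,
    NOT the quality-aware model used by Maq or SAMtools.
    """
    counts = Counter(b.upper() for b in bases if b.upper() in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return ("N", 0.0, False)  # no usable bases at this site
    base, n = counts.most_common(1)[0]
    fraction = n / total
    return (base, fraction, fraction >= min_major_fraction)


# 26 reads support C and 4 support A: a haploid caller reports C, not M.
print(haploid_call("C" * 26 + "A" * 4))
```

Sites where `confident` is False are the ones worth inspecting by eye, since a near-even split is hard to explain by sequencing error alone in a haploid.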

Just curious, but what are you doing about indels?
Old 07-19-2009, 04:11 PM   #3
lukemn
Junior Member
 
Location: Australia

Join Date: Jul 2009
Posts: 3

Thanks, I'll try that.

And yes, another reason I'm going ahead with BWA/SAMtools is to use its handling of gapped alignments for single-end reads (I had thought we were doing paired ends, but it turns out not to be the case). This should reveal indels and rearrangements, I hope.
Old 07-19-2009, 06:31 PM   #4
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Both SHRiMP and BFAST are also able to search for indels in single-end data by using a full Smith-Waterman algorithm. Keep me updated on your progress; I would be interested in your assessment.
Old 07-20-2009, 08:43 AM   #5
sungdugkim
Junior Member
 
Location: IN, US

Join Date: Jul 2009
Posts: 1

I am new to this field and would like to learn from the basics.

Can you recommend any websites?

Thank you

SK
Old 07-20-2009, 10:27 AM   #6
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912

I've seen those too in bacteria, and the high-quality ones have been confirmed by Sanger sequencing. So what you are seeing is probably really in the original DNA, and not a false positive. You should Sanger-check a few, then ask the people who prepped the DNA why there appear to be two templates in their sample.
Old 07-20-2009, 07:35 PM   #7
lukemn
Junior Member
 
Location: Australia

Join Date: Jul 2009
Posts: 3

I agree... there could be some contamination, especially of closely related progeny. But I would only have myself to blame for that!

Doing what I should have done before posting, i.e. manually inspecting the alignment (SAMtools tview), I see that most of these are probably just conservative variant calling by Maq: a few more sequencing errors than usual (say 3-5 reads at an average 30x coverage) happen to fall on the same base, and they are not representative of the consensus. This is probably tunable, but it's good to inspect manually as well, I guess.
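As a rough sanity check on that explanation, the chance of seeing 3-5 identical erroneous bases at one site can be estimated with a binomial tail. The sketch below is only illustrative: the 1% per-base error rate is an assumed figure, not one from this data, and it further assumes that errors are independent between reads and equally likely to produce each of the three wrong bases.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


depth = 30                      # average coverage mentioned above
per_base_error = 0.01           # ASSUMED error rate, not measured here
p_same_wrong_base = per_base_error / 3  # assumes errors split evenly over 3 bases

for k in (3, 4, 5):
    # Probability that at least k of the reads show one specific wrong base
    print(k, prob_at_least(k, depth, p_same_wrong_base))
```

Multiplying such a per-site probability by the number of sites in the genome gives an expected count under these assumptions. Real sequencing errors are often context-dependent rather than independent, so this simple model likely underestimates how often errors pile up on the same base.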

Also picking up a few extra SNPs from BWA relative to Maq.
Old 08-04-2009, 07:55 PM   #8
Torst
Senior Member
 
Location: The University of Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 275

Quote:
Originally Posted by lukemn
I agree... there could be some contamination, especially of closely related progeny.
Some of the bacterial samples we've sequenced suggest the existence of a sub-population in the mix.
Tags
haploid, heterozygote, maq, resequencing, snp
