SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > SOLiD



Similar Threads
Thread Thread Starter Forum Replies Last Post
Mapping SOLiD Reads using BFast. kasthuri Bioinformatics 0 06-04-2011 06:30 AM
BFAST solid reads newbietonextgen Bioinformatics 1 01-26-2011 08:36 PM
How to map SOLiD paired end reads by Bfast beliefbio Bioinformatics 1 12-29-2010 01:55 AM
How to map NCBI SOLiD reads with BWA Chandana Bioinformatics 3 12-04-2010 06:57 AM
map solid reads to ncRNA database crh SOLiD 2 07-19-2010 12:22 PM

Reply
 
Thread Tools
Old 12-11-2009, 08:41 AM   #1
jsun529
Member
 
Location: US

Join Date: Apr 2009
Posts: 52
Default SOLiD reads map using BFAST and Samtools

I am using BFAST and samtools for snp and indel detection on solid reads, it looks like the program returns more snps than expected(high false positive rate), are there any parameters to reset to control FPR/FNR, in BFAST and Samtools, except maximum converage cutoff?

Thanks,
jsun529 is offline   Reply With Quote
Old 12-11-2009, 08:57 AM   #2
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

It may be good to tune parameters like "-r". For example, I use "-r 0.0000007" since it gave the desired value when I looked at our ROC curve for whole-genome human resequencing. Remember, the samtools SNP caller was not designed for SOLiD data.
nilshomer is offline   Reply With Quote
Old 12-11-2009, 09:41 AM   #3
jsun529
Member
 
Location: US

Join Date: Apr 2009
Posts: 52
Default

Nils,
What is the -r stands for and at which command line can you specify? Also for target resequencing what would be a good value as 0.0000007?

Thanks for bring this up, I was about to ask if there is alternative program you would use to call snp and indels for solid data after running BFAST?

Thanks
jsun529 is offline   Reply With Quote
Old 12-11-2009, 11:27 AM   #4
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Several people who work on SOLiD data primarily have told me that they get good results from the samtools SNP caller. As long as we can produce accurate nucleotide bases and quality, I believe samtools is able to achieve comparable to the color-aware callers. Of course, how to generate accurate base and quality is another issue. I would love to see a head-to-head comparison between the AB caller and samtools. I am not sure what "-r" nils is referring to. If it is the pileup -r, I would suggest leave it as the default. You can set a high threshold on SNP quality to achieve a similar call set on -r 0.0000007.

Last edited by lh3; 12-11-2009 at 11:30 AM.
lh3 is offline   Reply With Quote
Old 12-11-2009, 12:24 PM   #5
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by lh3 View Post
Several people who work on SOLiD data primarily have told me that they get good results from the samtools SNP caller. As long as we can produce accurate nucleotide bases and quality, I believe samtools is able to achieve comparable to the color-aware callers. Of course, how to generate accurate base and quality is another issue. I would love to see a head-to-head comparison between the AB caller and samtools. I am not sure what "-r" nils is referring to. If it is the pileup -r, I would suggest leave it as the default. You can set a high threshold on SNP quality to achieve a similar call set on -r 0.0000007.
I was referring to the "samtools pileup -r" parameter. I tested various values for the input parameters for "samtools pileup" and generated many, many ROC curves to find parameters that suited our data and our tolerable FP rate (1/100000<). I trained off of genotypes from an Illumina 1M SNP chip when SNP calling for whole-genome human resequencing.

I agree filtering on SNP quality is an absolute must.

I also agree that the SNP caller in samtools is sufficiently powerful. In fact, we use it for SNP calling (whole-genome human cancer resequencing) and it works extremely well (low fp, high tp) and that is thanks to lh3. However, there are artifacts that are produced (some false negatives) that are not due to coverage or error or quality, but as a result the SNP caller (it is also aligner independent in these case since we re-aligned). You only see these artifacts when surveying billions of positions. With a color-space-aware SNP caller, I think the SNP calling could be slightly improved (I've seen some comparisons between diBayes and "samtools pileup" using corona-lite alignments). I am hopeful that ABI's diBayes will support SAM soon so that it is aligner independent. In the meantime, I am a happy customer of "samtools pileup".
nilshomer is offline   Reply With Quote
Old 12-11-2009, 12:37 PM   #6
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Thanks, Nils. These all sound interesting. I never use samtools with SOLiD data by myself.
lh3 is offline   Reply With Quote
Old 12-11-2009, 05:33 PM   #7
jsun529
Member
 
Location: US

Join Date: Apr 2009
Posts: 52
Default

That is very helpful information. Thanks. I would have two more questions on that.
a) how to filtering on SNP quality?
b) What about indels with BFAST and samtools?
jsun529 is offline   Reply With Quote
Old 12-11-2009, 05:47 PM   #8
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by jsun529 View Post
That is very helpful information. Thanks. I would have two more questions on that.
a) how to filtering on SNP quality?
b) What about indels with BFAST and samtools?
a) Study http://samtools.sourceforge.net/cns0.shtml. Maybe Heng (lh3) can help you more.
b) Samtools will call indels when BFAST aligns an indel. I am not sure what you are asking.
nilshomer is offline   Reply With Quote
Old 12-14-2009, 07:50 AM   #9
jsun529
Member
 
Location: US

Join Date: Apr 2009
Posts: 52
Default

Thanks, from your link the snp quality(prob) are all 0s, what is the practical cutoff in general?

what I mean is if there is differences in parameters settings for calling SNP and Indels from both BFAST, and samtools?

Thanks.
jsun529 is offline   Reply With Quote
Old 12-14-2009, 08:29 AM   #10
jsun529
Member
 
Location: US

Join Date: Apr 2009
Posts: 52
Default

I need more help on this

a) When looking the samtools consensus link there seems one column explanation missing for example:

I got
chr1_1-10246 1832 A G 8 22 60 3 ,g^~, 3S,


As: chromosome (chr1_1-10246), chromosome coordinates (1832), reference base (A), consensus base (G), consensus quality (8), snp quality (22), maximum mapping quality of the reads covering the sites between the `reference base' and the `read bases' columns (60), then what is 3 stands for?

b) for the snp calling with default parameter setting for both BFAST and samtools, I did not get the known snps I am supposed to look for and when I look for the raw pileup result I noticed for each position that are two lines, can any one kindly explain what they are and why these two did not been picked up as surpossed to.

chr1_1-10246 8820 C M 72 228 60 2858 ,a$a$a$a$.$.$a$.$a$.$a$a$a$.$a$a$a$a$a$a$.$.$a$a$a$a$a$.$.$.$a$a$a$a$a$.$a$a$.$a$
chr1_1-10246 8820 * */* 8458 0 60 2858 * +T 2856 1 1 7 6

chr1_1-10246 8855 A R 228 228 60 4484 ,.,,$gg$.$.,.$.$g$,$g$,$.$.$g$.$g.$,$,$.$,$.$.$,.$,$g,$,$,$,$.$.$g$,$.$,$.$g$,$.$
chr1_1-10246 8855 * */* 11782 0 60 4484 * -C 4465 9 10 7 6

Any parameters can be tuned?

c) I have 50bp solid fragment reads, will BFAST and samtools detected 55bp deletion in this case? I known some other programs can pointed to the deletion site after several cycles of condensation?, if not what is the maximum deletion site the BFAST can pickup?

Thanks very much !
jsun529 is offline   Reply With Quote
Old 12-14-2009, 10:48 AM   #11
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by jsun529 View Post
I need more help on this

a) When looking the samtools consensus link there seems one column explanation missing for example:

I got
chr1_1-10246 1832 A G 8 22 60 3 ,g^~, 3S,


As: chromosome (chr1_1-10246), chromosome coordinates (1832), reference base (A), consensus base (G), consensus quality (8), snp quality (22), maximum mapping quality of the reads covering the sites between the `reference base' and the `read bases' columns (60), then what is 3 stands for?
The "3" stands for the coverage: the # of reads covering this base. See http://samtools.sourceforge.net/cns0.shtml for more details

Quote:
b) for the snp calling with default parameter setting for both BFAST and samtools, I did not get the known snps I am supposed to look for and when I look for the raw pileup result I noticed for each position that are two lines, can any one kindly explain what they are and why these two did not been picked up as surpossed to.

chr1_1-10246 8820 C M 72 228 60 2858 ,a$a$a$a$.$.$a$.$a$.$a$a$a$.$a$a$a$a$a$a$.$.$a$a$a$a$a$.$.$.$a$a$a$a$a$.$a$a$.$a$
chr1_1-10246 8820 * */* 8458 0 60 2858 * +T 2856 1 1 7 6

chr1_1-10246 8855 A R 228 228 60 4484 ,.,,$gg$.$.,.$.$g$,$g$,$.$.$g$.$g.$,$,$.$,$.$.$,.$,$g,$,$,$,$.$.$g$,$.$,$.$g$,$.$
chr1_1-10246 8855 * */* 11782 0 60 4484 * -C 4465 9 10 7 6

Any parameters can be tuned?
That is certainly odd. It looks like half the reads have the mutant allele. The dollar signs mean the reads end there. Having all of them ending at the same position is very unlikely unless... Have you removed PCR duplicates?

Quote:
c) I have 50bp solid fragment reads, will BFAST and samtools detected 55bp deletion in this case? I known some other programs can pointed to the deletion site after several cycles of condensation?, if not what is the maximum deletion site the BFAST can pickup?

Thanks very much !
A 55bp deletion in a 50bp reads with the default scoring matrix for the Smith Waterman most likely will not be detected. You could try adjusting the BFAST settings (scoring matrix) to make such indels more likely but this may increase the false mapping as well as the false indel discovery rate. I would suggest using paired end information to detect such indels.

These are all great questions, keep them coming!

Last edited by nilshomer; 12-14-2009 at 10:48 AM. Reason: Adding a link
nilshomer is offline   Reply With Quote
Old 12-14-2009, 11:22 AM   #12
jsun529
Member
 
Location: US

Join Date: Apr 2009
Posts: 52
Default

Thanks,
for b, you did not explain what each column/section is, and to me is not half are mutant allele, the coverage is 8458, trying to figure what other numbers stands for.

chr1_1-10246 8820 * */* 8458 0 60 2858 * +T 2856 1 1 7 6

In fact we did not remove the PCR duplicates, what would be the best approaches to do that?
jsun529 is offline   Reply With Quote
Old 12-14-2009, 11:33 AM   #13
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by jsun529 View Post
Thanks,
for b, you did not explain what each column/section is, and to me is not half are mutant allele, the coverage is 8458, trying to figure what other numbers stands for.

chr1_1-10246 8820 * */* 8458 0 60 2858 * +T 2856 1 1 7 6

In fact we did not remove the PCR duplicates, what would be the best approaches to do that?
For the output format, consult the link I provided previously. You can also ask samtools-help@lists.sourceforge.net

You can use the Picard to remove duplicates. I would guess a lot of the artifacts you are seeing are due to duplicates.
nilshomer is offline   Reply With Quote
Old 12-11-2010, 12:44 PM   #14
JohnK
Senior Member
 
Location: Los Angeles, China.

Join Date: Feb 2010
Posts: 106
Default

Quote:
Originally Posted by nilshomer View Post
For the output format, consult the link I provided previously. You can also ask samtools-help@lists.sourceforge.net

You can use the Picard to remove duplicates. I would guess a lot of the artifacts you are seeing are due to duplicates.

Hi, Nils.

I'm currently looking at BFAST and trying to determine the best way about aligning paired-end reads as a comparison to BioScope's progressive mapping software. I'm getting high fidelity ~ 80 percent or above in some situations using their recommended defaults. It'd be ideal to do a comparison between BFAST but I can't seem to figure out the paired-end step? Is there a manual for the BFAST-BWA hybrid for paired end data and it appears you're mapping each tag individually and then pairing? Is that the case? I notice BFAST is optimized for 50 bp reads? Is it correct for SOLiD paired end I map the F3 (5o bp) and then the F5 (35 bp) separately and with BFAST and BWA respectively?

Finally, I'm trying to find the most accurate SNP caller whether that be BioScope's internal diBayes software, SamTools or whatever. On high stringency across Chr21 I was able to tweak the params of diBayes to get ~96 percent concordance across baited-exon regions pertaining to Agilent's SureSelect 30 Mb and that was using only the min het cov, min hom cov parameters. I notice that diBayes calls homozygous SNPs with very high confidence but there is in fact a bias for the reference base in the SOLiD system and you tend to notice a range in which somewhere around 2/3s of your het-SNPs coverage and allele-counts end up being for the ref. Nonetheless, we still get a number of false positives and it's our hope we can get up to 99 percent concordance with some Complete Genomics data we have off the same run. Do you know of any better SNP callers out there or additional methods of tweaking our params for higher concordance? Additionally, we'd like to rescue some of our false negatives, but I'm unsure we can do this with a software tool and it might in fact be a coverage issue and not a software issue. I know this is a lot, but any help would be much appreciated. Thanks.

J
JohnK is offline   Reply With Quote
Old 12-11-2010, 01:05 PM   #15
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by JohnK View Post
Hi, Nils.

I'm currently looking at BFAST and trying to determine the best way about aligning paired-end reads as a comparison to BioScope's progressive mapping software. I'm getting high fidelity ~ 80 percent or above in some situations using their recommended defaults. It'd be ideal to do a comparison between BFAST but I can't seem to figure out the paired-end step? Is there a manual for the BFAST-BWA hybrid for paired end data and it appears you're mapping each tag individually and then pairing? Is that the case? I notice BFAST is optimized for 50 bp reads? Is it correct for SOLiD paired end I map the F3 (5o bp) and then the F5 (35 bp) separately and with BFAST and BWA respectively?

Finally, I'm trying to find the most accurate SNP caller whether that be BioScope's internal diBayes software, SamTools or whatever. On high stringency across Chr21 I was able to tweak the params of diBayes to get ~96 percent concordance across baited-exon regions pertaining to Agilent's SureSelect 30 Mb and that was using only the min het cov, min hom cov parameters. I notice that diBayes calls homozygous SNPs with very high confidence but there is in fact a bias for the reference base in the SOLiD system and you tend to notice a range in which somewhere around 2/3s of your het-SNPs coverage and allele-counts end up being for the ref. Nonetheless, we still get a number of false positives and it's our hope we can get up to 99 percent concordance with some Complete Genomics data we have off the same run. Do you know of any better SNP callers out there or additional methods of tweaking our params for higher concordance? Additionally, we'd like to rescue some of our false negatives, but I'm unsure we can do this with a software tool and it might in fact be a coverage issue and not a software issue. I know this is a lot, but any help would be much appreciated. Thanks.

J
For the BFAST+BWA mode, you align the 50bp end with BFAST and 35bp end with the BWA (within BFAST+BWA). This substitutes the "match" step. You then input the two BMF files into the "localalign" step, and proceed from there as before. There is not much documentation on this, and I would be happy to incorporate your feedback into a wiki page on the BFAST site going forward. There is always the option to post on bfast-help@lists.sourceforge.net, as there are many users that can help out there too.

For SNP calling, the implementation in SAMtools is sufficient, but is not color-aware like dibayes, though I have not seen enough evidence to convince me which is better. The mapper used by Bioscope is much improved, as such would be comparable to bfast. Make sure that you do appropriate filtering after SNP calling. A good guide to follow is what the 1000 Genomes project does. Please post your results, as they would benefit the community. I always appreciate your posts.
nilshomer is offline   Reply With Quote
Old 12-11-2010, 05:15 PM   #16
JohnK
Senior Member
 
Location: Los Angeles, China.

Join Date: Feb 2010
Posts: 106
Default

Quote:
Originally Posted by nilshomer View Post
For the BFAST+BWA mode, you align the 50bp end with BFAST and 35bp end with the BWA (within BFAST+BWA). This substitutes the "match" step. You then input the two BMF files into the "localalign" step, and proceed from there as before. There is not much documentation on this, and I would be happy to incorporate your feedback into a wiki page on the BFAST site going forward. There is always the option to post on bfast-help@lists.sourceforge.net, as there are many users that can help out there too.

For SNP calling, the implementation in SAMtools is sufficient, but is not color-aware like dibayes, though I have not seen enough evidence to convince me which is better. The mapper used by Bioscope is much improved, as such would be comparable to bfast. Make sure that you do appropriate filtering after SNP calling. A good guide to follow is what the 1000 Genomes project does. Please post your results, as they would benefit the community. I always appreciate your posts.

Hi, Nils.

I was perusing their website and wasn't sure if I was at the right place, but it was the analysis page. Looking there, I understand they use multiple SNP callers to get the overlap and this reduces up to 30% of false positives for low with coverage? Would you say in your own opinion that it's vital to have two different assemblies from different programs when comparing the SNP calls or would two separate SNP callers on one alignment dataset be, well, good? It's necessary I make these calls in the quickest amount of time.

I need to align my reads in a much quicker method as BioScope has a very poor runtime, but I do appreciate BioScope's high fidelity in percentage of mapped reads. What additional percentage of mapping would you expect with BFAST? I think outside of BioScope, your's is the best open-source alternative. As far as runs are concerned, I find it very difficult to gauge a run outside of error rate. Do you have any recommendations for gauging run quality other than heat maps and cycle scans? It seems there are so many variable factors that go into the prep that two runs may appear good quality and yet it doesn't map over > 80 % with the same parameters as a similar run. I'd love to hear your thoughts on all these topics, as you have much experience? Thank you.
JohnK is offline   Reply With Quote
Old 12-11-2010, 06:06 PM   #17
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Given your limited time, do what you think is best. Other SOLiD aligners include BWA and SHRiMP2. The latter is much faster than the previous version and is worth considering.
nilshomer is offline   Reply With Quote
Old 02-24-2011, 01:55 PM   #18
qtnguyen
Junior Member
 
Location: Bethesda, MD

Join Date: Feb 2011
Posts: 1
Default Filtering after BFAST and SNP calling

> Make sure that you do appropriate filtering after SNP calling. A good guide to follow is what the 1000 Genomes project does. [/QUOTE]

Could yo please elaborate, what would be an appropriate filtering procedure after SNP calling? How one can perform this?

Thanks,
qtnguyen is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:24 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO