Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SOLiD reads map using BFAST and Samtools

    I am using BFAST and samtools for snp and indel detection on solid reads, it looks like the program returns more snps than expected(high false positive rate), are there any parameters to reset to control FPR/FNR, in BFAST and Samtools, except maximum converage cutoff?

    Thanks,

  • #2
    It may be good to tune parameters like "-r". For example, I use "-r 0.0000007" since it gave the desired value when I looked at our ROC curve for whole-genome human resequencing. Remember, the samtools SNP caller was not designed for SOLiD data.

    Comment


    • #3
      Nils,
      What is the -r stands for and at which command line can you specify? Also for target resequencing what would be a good value as 0.0000007?

      Thanks for bring this up, I was about to ask if there is alternative program you would use to call snp and indels for solid data after running BFAST?

      Thanks

      Comment


      • #4
        Several people who work on SOLiD data primarily have told me that they get good results from the samtools SNP caller. As long as we can produce accurate nucleotide bases and quality, I believe samtools is able to achieve comparable to the color-aware callers. Of course, how to generate accurate base and quality is another issue. I would love to see a head-to-head comparison between the AB caller and samtools. I am not sure what "-r" nils is referring to. If it is the pileup -r, I would suggest leave it as the default. You can set a high threshold on SNP quality to achieve a similar call set on -r 0.0000007.
        Last edited by lh3; 12-11-2009, 11:30 AM.

        Comment


        • #5
          Originally posted by lh3 View Post
          Several people who work on SOLiD data primarily have told me that they get good results from the samtools SNP caller. As long as we can produce accurate nucleotide bases and quality, I believe samtools is able to achieve comparable to the color-aware callers. Of course, how to generate accurate base and quality is another issue. I would love to see a head-to-head comparison between the AB caller and samtools. I am not sure what "-r" nils is referring to. If it is the pileup -r, I would suggest leave it as the default. You can set a high threshold on SNP quality to achieve a similar call set on -r 0.0000007.
          I was referring to the "samtools pileup -r" parameter. I tested various values for the input parameters for "samtools pileup" and generated many, many ROC curves to find parameters that suited our data and our tolerable FP rate (1/100000<). I trained off of genotypes from an Illumina 1M SNP chip when SNP calling for whole-genome human resequencing.

          I agree filtering on SNP quality is an absolute must.

          I also agree that the SNP caller in samtools is sufficiently powerful. In fact, we use it for SNP calling (whole-genome human cancer resequencing) and it works extremely well (low fp, high tp) and that is thanks to lh3. However, there are artifacts that are produced (some false negatives) that are not due to coverage or error or quality, but as a result the SNP caller (it is also aligner independent in these case since we re-aligned). You only see these artifacts when surveying billions of positions. With a color-space-aware SNP caller, I think the SNP calling could be slightly improved (I've seen some comparisons between diBayes and "samtools pileup" using corona-lite alignments). I am hopeful that ABI's diBayes will support SAM soon so that it is aligner independent. In the meantime, I am a happy customer of "samtools pileup".

          Comment


          • #6
            Thanks, Nils. These all sound interesting. I never use samtools with SOLiD data by myself.

            Comment


            • #7
              That is very helpful information. Thanks. I would have two more questions on that.
              a) how to filtering on SNP quality?
              b) What about indels with BFAST and samtools?

              Comment


              • #8
                Originally posted by jsun529 View Post
                That is very helpful information. Thanks. I would have two more questions on that.
                a) how to filtering on SNP quality?
                b) What about indels with BFAST and samtools?
                a) Study http://samtools.sourceforge.net/cns0.shtml. Maybe Heng (lh3) can help you more.
                b) Samtools will call indels when BFAST aligns an indel. I am not sure what you are asking.

                Comment


                • #9
                  Thanks, from your link the snp quality(prob) are all 0s, what is the practical cutoff in general?

                  what I mean is if there is differences in parameters settings for calling SNP and Indels from both BFAST, and samtools?

                  Thanks.

                  Comment


                  • #10
                    I need more help on this

                    a) When looking the samtools consensus link there seems one column explanation missing for example:

                    I got
                    chr1_1-10246 1832 A G 8 22 60 3 ,g^~, 3S,


                    As: chromosome (chr1_1-10246), chromosome coordinates (1832), reference base (A), consensus base (G), consensus quality (8), snp quality (22), maximum mapping quality of the reads covering the sites between the `reference base' and the `read bases' columns (60), then what is 3 stands for?

                    b) for the snp calling with default parameter setting for both BFAST and samtools, I did not get the known snps I am supposed to look for and when I look for the raw pileup result I noticed for each position that are two lines, can any one kindly explain what they are and why these two did not been picked up as surpossed to.

                    chr1_1-10246 8820 C M 72 228 60 2858 ,a$a$a$a$.$.$a$.$a$.$a$a$a$.$a$a$a$a$a$a$.$.$a$a$a$a$a$.$.$.$a$a$a$a$a$.$a$a$.$a$
                    chr1_1-10246 8820 * */* 8458 0 60 2858 * +T 2856 1 1 7 6

                    chr1_1-10246 8855 A R 228 228 60 4484 ,.,,$gg$.$.,.$.$g$,$g$,$.$.$g$.$g.$,$,$.$,$.$.$,.$,$g,$,$,$,$.$.$g$,$.$,$.$g$,$.$
                    chr1_1-10246 8855 * */* 11782 0 60 4484 * -C 4465 9 10 7 6

                    Any parameters can be tuned?

                    c) I have 50bp solid fragment reads, will BFAST and samtools detected 55bp deletion in this case? I known some other programs can pointed to the deletion site after several cycles of condensation?, if not what is the maximum deletion site the BFAST can pickup?

                    Thanks very much !

                    Comment


                    • #11
                      Originally posted by jsun529 View Post
                      I need more help on this

                      a) When looking the samtools consensus link there seems one column explanation missing for example:

                      I got
                      chr1_1-10246 1832 A G 8 22 60 3 ,g^~, 3S,


                      As: chromosome (chr1_1-10246), chromosome coordinates (1832), reference base (A), consensus base (G), consensus quality (8), snp quality (22), maximum mapping quality of the reads covering the sites between the `reference base' and the `read bases' columns (60), then what is 3 stands for?
                      The "3" stands for the coverage: the # of reads covering this base. See http://samtools.sourceforge.net/cns0.shtml for more details

                      b) for the snp calling with default parameter setting for both BFAST and samtools, I did not get the known snps I am supposed to look for and when I look for the raw pileup result I noticed for each position that are two lines, can any one kindly explain what they are and why these two did not been picked up as surpossed to.

                      chr1_1-10246 8820 C M 72 228 60 2858 ,a$a$a$a$.$.$a$.$a$.$a$a$a$.$a$a$a$a$a$a$.$.$a$a$a$a$a$.$.$.$a$a$a$a$a$.$a$a$.$a$
                      chr1_1-10246 8820 * */* 8458 0 60 2858 * +T 2856 1 1 7 6

                      chr1_1-10246 8855 A R 228 228 60 4484 ,.,,$gg$.$.,.$.$g$,$g$,$.$.$g$.$g.$,$,$.$,$.$.$,.$,$g,$,$,$,$.$.$g$,$.$,$.$g$,$.$
                      chr1_1-10246 8855 * */* 11782 0 60 4484 * -C 4465 9 10 7 6

                      Any parameters can be tuned?
                      That is certainly odd. It looks like half the reads have the mutant allele. The dollar signs mean the reads end there. Having all of them ending at the same position is very unlikely unless... Have you removed PCR duplicates?

                      c) I have 50bp solid fragment reads, will BFAST and samtools detected 55bp deletion in this case? I known some other programs can pointed to the deletion site after several cycles of condensation?, if not what is the maximum deletion site the BFAST can pickup?

                      Thanks very much !
                      A 55bp deletion in a 50bp reads with the default scoring matrix for the Smith Waterman most likely will not be detected. You could try adjusting the BFAST settings (scoring matrix) to make such indels more likely but this may increase the false mapping as well as the false indel discovery rate. I would suggest using paired end information to detect such indels.

                      These are all great questions, keep them coming!
                      Last edited by nilshomer; 12-14-2009, 10:48 AM. Reason: Adding a link

                      Comment


                      • #12
                        Thanks,
                        for b, you did not explain what each column/section is, and to me is not half are mutant allele, the coverage is 8458, trying to figure what other numbers stands for.

                        chr1_1-10246 8820 * */* 8458 0 60 2858 * +T 2856 1 1 7 6

                        In fact we did not remove the PCR duplicates, what would be the best approaches to do that?

                        Comment


                        • #13
                          Originally posted by jsun529 View Post
                          Thanks,
                          for b, you did not explain what each column/section is, and to me is not half are mutant allele, the coverage is 8458, trying to figure what other numbers stands for.

                          chr1_1-10246 8820 * */* 8458 0 60 2858 * +T 2856 1 1 7 6

                          In fact we did not remove the PCR duplicates, what would be the best approaches to do that?
                          For the output format, consult the link I provided previously. You can also ask [email protected]

                          You can use the Picard to remove duplicates. I would guess a lot of the artifacts you are seeing are due to duplicates.

                          Comment


                          • #14
                            Originally posted by nilshomer View Post
                            For the output format, consult the link I provided previously. You can also ask [email protected]

                            You can use the Picard to remove duplicates. I would guess a lot of the artifacts you are seeing are due to duplicates.

                            Hi, Nils.

                            I'm currently looking at BFAST and trying to determine the best way about aligning paired-end reads as a comparison to BioScope's progressive mapping software. I'm getting high fidelity ~ 80 percent or above in some situations using their recommended defaults. It'd be ideal to do a comparison between BFAST but I can't seem to figure out the paired-end step? Is there a manual for the BFAST-BWA hybrid for paired end data and it appears you're mapping each tag individually and then pairing? Is that the case? I notice BFAST is optimized for 50 bp reads? Is it correct for SOLiD paired end I map the F3 (5o bp) and then the F5 (35 bp) separately and with BFAST and BWA respectively?

                            Finally, I'm trying to find the most accurate SNP caller whether that be BioScope's internal diBayes software, SamTools or whatever. On high stringency across Chr21 I was able to tweak the params of diBayes to get ~96 percent concordance across baited-exon regions pertaining to Agilent's SureSelect 30 Mb and that was using only the min het cov, min hom cov parameters. I notice that diBayes calls homozygous SNPs with very high confidence but there is in fact a bias for the reference base in the SOLiD system and you tend to notice a range in which somewhere around 2/3s of your het-SNPs coverage and allele-counts end up being for the ref. Nonetheless, we still get a number of false positives and it's our hope we can get up to 99 percent concordance with some Complete Genomics data we have off the same run. Do you know of any better SNP callers out there or additional methods of tweaking our params for higher concordance? Additionally, we'd like to rescue some of our false negatives, but I'm unsure we can do this with a software tool and it might in fact be a coverage issue and not a software issue. I know this is a lot, but any help would be much appreciated. Thanks.

                            J

                            Comment


                            • #15
                              Originally posted by JohnK View Post
                              Hi, Nils.

                              I'm currently looking at BFAST and trying to determine the best way about aligning paired-end reads as a comparison to BioScope's progressive mapping software. I'm getting high fidelity ~ 80 percent or above in some situations using their recommended defaults. It'd be ideal to do a comparison between BFAST but I can't seem to figure out the paired-end step? Is there a manual for the BFAST-BWA hybrid for paired end data and it appears you're mapping each tag individually and then pairing? Is that the case? I notice BFAST is optimized for 50 bp reads? Is it correct for SOLiD paired end I map the F3 (5o bp) and then the F5 (35 bp) separately and with BFAST and BWA respectively?

                              Finally, I'm trying to find the most accurate SNP caller whether that be BioScope's internal diBayes software, SamTools or whatever. On high stringency across Chr21 I was able to tweak the params of diBayes to get ~96 percent concordance across baited-exon regions pertaining to Agilent's SureSelect 30 Mb and that was using only the min het cov, min hom cov parameters. I notice that diBayes calls homozygous SNPs with very high confidence but there is in fact a bias for the reference base in the SOLiD system and you tend to notice a range in which somewhere around 2/3s of your het-SNPs coverage and allele-counts end up being for the ref. Nonetheless, we still get a number of false positives and it's our hope we can get up to 99 percent concordance with some Complete Genomics data we have off the same run. Do you know of any better SNP callers out there or additional methods of tweaking our params for higher concordance? Additionally, we'd like to rescue some of our false negatives, but I'm unsure we can do this with a software tool and it might in fact be a coverage issue and not a software issue. I know this is a lot, but any help would be much appreciated. Thanks.

                              J
                              For the BFAST+BWA mode, you align the 50bp end with BFAST and 35bp end with the BWA (within BFAST+BWA). This substitutes the "match" step. You then input the two BMF files into the "localalign" step, and proceed from there as before. There is not much documentation on this, and I would be happy to incorporate your feedback into a wiki page on the BFAST site going forward. There is always the option to post on [email protected], as there are many users that can help out there too.

                              For SNP calling, the implementation in SAMtools is sufficient, but is not color-aware like dibayes, though I have not seen enough evidence to convince me which is better. The mapper used by Bioscope is much improved, as such would be comparable to bfast. Make sure that you do appropriate filtering after SNP calling. A good guide to follow is what the 1000 Genomes project does. Please post your results, as they would benefit the community. I always appreciate your posts.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X