  • SHRiMP vs BFAST

    Hi all,

    I am working with 50-base SOLiD RNA-seq data and want to do both genotyping and RNA quantification. I am hesitating between SHRiMP and BFAST for the alignments; the two seem equivalent to me in terms of mapping strategy. Could someone who has experience with both aligners give me their opinion?
    best,
    Mathieu
    Last edited by mathieu; 10-15-2010, 05:53 AM.

  • #2
    Hi Mathieu,
    I can only contribute a little. My data is fly genomic reads (nucleosome mapping), and all I can say is that SHRiMP seemed slow in my hands compared to bowtie. I have also used novoalignCS, which can deal with small indels; to my knowledge, bowtie cannot. You could also try BWA, which I think handles colour-space reads.

    Have you tried Bioscope? I assume you have access to this software if you have a SOLiD sequencer. I have thought about trying BFAST too but we are currently comparing Bioscope read mapping with bowtie/novoalign. (I think you need a licence for novoalign). RNA-seq reads may benefit from TopHat (now handles colour space) as it can also map reads that span splice junctions/introns.

    Kind regards,

    John.


    • #3
      I had human RNA-Seq data 50 bp and tried BWA, BioScope, NovoalignCS, BFAST, and MOSAIK. I recommend BFAST for its high mapping rate and easy use (once you've created the indexes). BWA, NovoalignCS and MOSAIK have very low mapping rates. BioScope with the whole transcriptome pipeline can find splice junctions and gets rid of repeats but does not do gapped alignment (as Bowtie) and is a pain to install on a cluster.


      • #4
        Hi epigen & John,
        Thanks for your advice. I tried to install Bioscope on our cluster but gave up... BWA was the first aligner I tried, and given how highly it had been recommended I was quite disappointed by its low mapping rate (22.6%). My first results with BFAST and SHRiMP are almost the same in terms of mapping rate (57.5% and 51.2% respectively), although SHRiMP was a bit faster.
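For anyone wanting to compare mapping rates across aligners in the same way, here is a minimal sketch that counts mapped reads from SAM text. It assumes every aligner writes standard SAM with the unmapped bit (0x4) set correctly, and it skips secondary and supplementary records so each read is counted once. The function name and the example records are made up for illustration.

```python
def mapping_rate(sam_lines):
    """Return the fraction of reads that are mapped, given SAM text lines."""
    total = 0
    mapped = 0
    for line in sam_lines:
        # Skip empty lines and header lines (which start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip secondary (0x100) and supplementary (0x800) records
        # so that each read contributes one record.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        # Bit 0x4 set means the read is unmapped.
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Hypothetical records: two mapped reads, one unmapped, one secondary hit.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t10\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t16\tchr1\t100\t30\t50M\t*\t0\t0\t*\t*",
    "r3\t256\tchr1\t500\t0\t50M\t*\t0\t0\t*\t*",
]
rate = mapping_rate(example)
print(f"{rate:.1%}")
```

Note that a raw mapping rate says nothing about whether the placements are correct, so it is only one side of an aligner comparison.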

        @epigen: For SNP and indel calling I am using samtools so far, but I am not very satisfied: there are too many miscalls. What would you advise?


        • #5
          Yes, BFAST might give a lot of false positives, which is why the developer advises doing local realignment first. I didn't, because I was interested in SNPs that are already annotated in dbSNP, so I filtered for them. I also used samtools, but required SNPs to be present in at least 20 reads, have a score of at least 20, and not be at the end of a read. The most recent version of samtools has improved SNP calling compared to the previous one.
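The three criteria above (at least 20 reads, score of at least 20, not at the end of a read) could be sketched roughly as follows. This is one possible reading, not the exact procedure used: the `SnpCall` record, the `end_margin` of 3 bases, and the choice to count only supporting reads where the variant sits away from the read ends are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SnpCall:
    chrom: str
    pos: int
    quality: int
    # One (offset of the variant base in the read, read length) pair
    # per read supporting the variant. Hypothetical layout.
    support: List[Tuple[int, int]]


def passes_filter(call, min_reads=20, min_quality=20, end_margin=3):
    """Keep a call only if it is well scored and well supported
    by reads in which the variant is not near a read end."""
    if call.quality < min_quality:
        return False
    interior = sum(
        1
        for offset, length in call.support
        if end_margin <= offset < length - end_margin
    )
    return interior >= min_reads


# Hypothetical calls on 50-base reads.
good = SnpCall("chr1", 100, 30, [(25, 50)] * 20)
low_quality = SnpCall("chr1", 200, 10, [(25, 50)] * 20)
read_end_only = SnpCall("chr1", 300, 30, [(0, 50)] * 20)

kept = [c for c in (good, low_quality, read_end_only) if passes_filter(c)]
print([c.pos for c in kept])
```

The thresholds are arguments rather than constants because a depth of 20 that suits a highly expressed transcript would discard most sites in weakly expressed ones.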
          Now we want to find unknown, somatic SNPs for which we use SomaticCall from Broad, which of course only works if you have tumor-normal pairs. Otherwise, VarScan would be an option. For indels we use the indel genotyper from BROAD and Pindel.


          • #6
            I think it is important to consider mapping accuracy over the number of reads aligned. Consider looking at how well the aligner does in terms of concordance with dbSNP or any other set of known reference SNP/indel positions.
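As a rough illustration of the concordance check described above, the sketch below computes the fraction of called sites that also appear in a known set. It assumes sites can be compared as plain (chromosome, position) pairs and ignores alleles; the function name and example sites are hypothetical.

```python
def concordance(called, known):
    """Fraction of called sites that are also present in the known set."""
    called = set(called)
    if not called:
        return 0.0
    return len(called & set(known)) / len(called)


# Hypothetical (chromosome, position) sites.
called_sites = [("chr1", 100), ("chr1", 200), ("chr2", 50), ("chr2", 75)]
known_sites = {("chr1", 100), ("chr2", 50), ("chr3", 1)}

print(concordance(called_sites, known_sites))
```

Running this once per aligner on the same reads gives a number that can be compared alongside the mapping rate.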
            We have developed NovoalignCS for this purpose of trying to get the best alignment for a read and it does come with a cost to performance. That said if you have enough cores the slower aligners like MOSAIK and Novoalign can run in a very short time and still give you more reliable alignments that lower the false discovery rate. This should also be tested on a case-by-case basis as the read quality and repeat content of the reference genome can influence how the aligner performs.

            Originally posted by epigen:
            I had human RNA-Seq data 50 bp and tried BWA, BioScope, NovoalignCS, BFAST, and MOSAIK. I recommend BFAST for its high mapping rate and easy use (once you've created the indexes). BWA, NovoalignCS and MOSAIK have very low mapping rates. BioScope with the whole transcriptome pipeline can find splice junctions and gets rid of repeats but does not do gapped alignment (as Bowtie) and is a pain to install on a cluster.


            • #7
              Thanks for the advice. Unfortunately I am working with an organism for which no SNPs are known yet, so I have to rely only on the deep sequencing data. I am currently testing the GATK pipeline and... it is very demanding in terms of resources, but the first results seem far more realistic than the samtools ones. I will give VarScan a try. Epigen: did you ever compare GATK with VarScan?


              • #8
                I have used GATK and samtools. Samtools has a new base alignment quality (BAQ) feature which Heng Li claims will greatly improve your ability to call SNPs more reliably.
                Both tools are very good and sometimes do have a steep learning curve but I think it's worth it. I have not used Varscan but I have heard good things about it.
                Have you tried using NovoalignCS?


                • #9
                  @mathieu: Personally I have not compared GATK and VarScan, but my colleague has. She says GATK is much better, which is no wonder since it uses sophisticated algorithms whereas VarScan just filters the output of samtools pileup. GATK is indeed very demanding. We run it for each chromosome separately.

                  @zee: I tried NovoalignCS but it was by far the slowest and still had a very low mapping rate. Now I have PE data and I'm thinking about trying it again. BFAST also becomes very slow for PE due to the localalign step.
                  Last edited by epigen; 10-26-2010, 08:01 AM. Reason: making clear what I refer to


                  • #10
                    On Illumina data, the choice of mapper does not matter too much for SNP calling. A 1000X better mapper on simulated data may only lead to a few percent difference in SNP accuracy. On SOLiD, I do not know. But beware that bwa's defaults are not designed for SOLiD: one must increase the mismatch tolerance (-n) to get acceptable results.

                    As to samtools' SNP calling, are you following the steps listed here:

                    Download SAM tools for free. SAM (Sequence Alignment/Map) is a flexible generic format for storing nucleotide sequence alignment. SAMtools provide efficient utilities on manipulating alignments in the SAM format.


                    The SAMtools caller has been used in a few Nature/PLoS Genetics papers, and in many more if you count the papers using maq, from which samtools is derived. They cannot all be wrong.

                    So far as I know, VarScan is not a Bayesian model.

                    The BAQ computation is *strongly* recommended for SNP calling. Almost everyone I know (Umich, Broad/GATK, Sanger) who has tried it once immediately incorporates it into the production pipeline.
                    Last edited by lh3; 10-26-2010, 10:34 AM.

                    Comment


                    • #11
                      Originally posted by epigen:
                      @mathieu: Personally I have not compared GATK and VarScan, but my colleague has. She says GATK is much better, which is no wonder since it uses sophisticated algorithms whereas VarScan just filters the output of samtools pileup. GATK is indeed very demanding. We run it for each chromosome separately.
                      Indeed, I've used all three (GATK, samtools and VarScan) and VarScan is basically a filtering/annotation tool, not a variant caller. GATK and samtools are both good. I found GATK to give even better variant counts than samtools pileup, but samtools is still good.

                      @zee: I tried NovoalignCS but it was by far the slowest and still had a very low mapping rate. Now I have PE data and I'm thinking about trying it again. BFAST also becomes very slow for PE due to the localalign step.
                      If BFAST is slow for you and you have access to a strong distributed cluster, try the bfast.submit.pl script that comes with it to make it more parallel and save a lot of wallclock time.


                      • #12
                        My results and your recommendations are in favor of using a BFAST+GATK pipeline. I have to say that I really like the GATK UnifiedGenotyper. Moreover it seems that the integration of a robust indel genotyper within the UnifiedGenotyper is in preparation. That will make the tool even more valuable.
                        The trick now is to apply good filtering to the raw SNP calls. Do you have any advice?


                        • #13
                          GATK comes with the most sophisticated filtering. That is one of the reasons why it is good.


                          • #14
                            @lh3: I agree. My main difficulty is that I have no prior knowledge of SNPs in the organism I am working on, so I cannot use the VariantRecalibrator. After applying basic filtering and indel masking, it is tricky to work out what further filtering to do. Do you have any advice?


                            • #15
                              I see. Perhaps you could play around to get the expected ts/tv; I think all the recalibrator needs is an expected ts/tv. If you have to do manual filtering, strand bias is believed to be the most effective filter. Depth filtering is also necessary. Also, run BAQ. The GATK group also applies BAQ in their projects and is planning to reimplement it in GATK.
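To make the ts/tv, strand bias, and depth suggestions concrete, here is a minimal sketch of how they might be computed by hand. It is not how GATK or samtools implement them: the function names, the depth bounds (8 to 200), and the 10% minority-strand threshold are placeholder assumptions that would need tuning against your own data.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(snps):
    """Transition/transversion ratio over (ref, alt) base pairs.
    Returns None when there are no transversions to divide by."""
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Ignore non-SNP or ambiguous entries such as N.
        if ref == alt or ref not in BASES or alt not in BASES:
            continue
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else None


def passes_manual_filters(fwd_alt, rev_alt, depth,
                          min_depth=8, max_depth=200,
                          min_strand_fraction=0.1):
    """Depth filter plus a crude strand-bias check: the less common
    strand must carry at least min_strand_fraction of the variant reads."""
    if not min_depth <= depth <= max_depth:
        return False
    alt_total = fwd_alt + rev_alt
    if alt_total == 0:
        return False
    return min(fwd_alt, rev_alt) / alt_total >= min_strand_fraction


# Hypothetical calls: three transitions, two transversions, one ambiguous.
snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G"), ("A", "N")]
print(ts_tv_ratio(snps))
print(passes_manual_filters(fwd_alt=10, rev_alt=8, depth=40))
print(passes_manual_filters(fwd_alt=18, rev_alt=0, depth=40))
```

One way to use this without known SNPs is to tighten or loosen the filter thresholds and watch how the ts/tv of the surviving calls moves toward the value expected for the organism.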
