  • going from RNA seq TopHat output to variant calls

    I have RNA-seq data aligned to a reference genome using TopHat. I would now like to take these SAM/BAM files as input and get as output information on where the sequence variants are: chromosome, base-pair coordinates, location (gene names, introns, non-genic regions, etc.), mutation type (SNPs, substitutions, deletions), effect on the amino acid sequence (frame shift, nonsense, missense, silent, etc.), and ideally also whether the variant has already been reported as a SNP. DNAnexus will do these things, but it's pretty expensive and (I believe) not ideal for RNA-seq with splice junctions. Does anyone have suggestions for useful tools?

    Thank you.

    Eric

  • #2
    Any progress on this? I would love to see how other people do mRNA-seq variant calling with TopHat output.

    • #3
      I use GATK to call variants from TopHat-aligned BAM files. First, you'll need to add @RG (read group) information and sort with Picard tools so that the BAM files are configured for GATK; otherwise, it will fail. Depending on what species you are interrogating, you can then realign around indels and recalibrate the quality scores, or go straight to the Unified Genotyper. Because you would be expecting differentially expressed genes with very low and variable coverage across exons, set --stand_emit_conf and --stand_call_conf to something really low, like 2, and then use the variant annotator option (-A, I think) in the Unified Genotyper to add the ReadPosRankSumTest quality score. Take the VCF file generated by GATK and run it through SNPeff (if human, submit the GATK VCF file to SeattleSNP), and then take that VCF file as input to GATK's VariantAnnotator to annotate the raw GATK VCF file. Now filter for what you are interested in. Enjoy.
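The first two steps described above (read groups plus sorting, then the Unified Genotyper with low confidence thresholds) might be scripted roughly as follows. This is only a sketch under assumptions: it uses the single-jar Picard layout (`picard.jar`) and GATK 3.x-style arguments (`-T UnifiedGenotyper`, single-dash `-stand_call_conf` / `-stand_emit_conf`), which may not match your installed versions. The file names, the sample label, and the `RGPL=illumina` value are hypothetical, and `CREATE_INDEX=true` is an addition not mentioned in the post. The functions only build the command lines; they do not run anything.

```python
import shlex


def picard_add_read_groups(in_bam, out_bam, sample):
    """Build a Picard AddOrReplaceReadGroups command that also coordinate-sorts.

    All read-group values are placeholders derived from the sample label.
    """
    return [
        "java", "-jar", "picard.jar", "AddOrReplaceReadGroups",
        f"INPUT={in_bam}",
        f"OUTPUT={out_bam}",
        "SORT_ORDER=coordinate",   # GATK expects coordinate-sorted input
        "CREATE_INDEX=true",       # assumption: index the BAM in the same step
        f"RGID={sample}",
        f"RGLB={sample}",
        "RGPL=illumina",           # hypothetical platform value
        f"RGPU={sample}",
        f"RGSM={sample}",
    ]


def gatk_unified_genotyper(ref_fasta, in_bam, out_vcf, min_conf=2):
    """Build a GATK 3.x-style UnifiedGenotyper command with low thresholds."""
    return [
        "java", "-jar", "GenomeAnalysisTK.jar",
        "-T", "UnifiedGenotyper",
        "-R", ref_fasta,
        "-I", in_bam,
        "-o", out_vcf,
        "-stand_call_conf", str(min_conf),
        "-stand_emit_conf", str(min_conf),
        "-A", "ReadPosRankSumTest",  # extra annotation to filter on later
    ]


if __name__ == "__main__":
    # Print the commands for inspection instead of executing them.
    steps = [
        picard_add_read_groups("accepted_hits.bam", "accepted_hits.rg.bam", "sample1"),
        gatk_unified_genotyper("reference.fa", "accepted_hits.rg.bam", "raw.vcf"),
    ]
    for cmd in steps:
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the argument spelling against the documentation for your GATK and Picard versions before running anything.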

      • #4
        Hi Dameon,

        Thanks very much. I've run GATK with DNA but not RNA. Do you see any problem with using GATK with RNA seq? The Broad Institute people are kind of ambiguous about whether it works with RNA seq. Anyway, I'll give it a try. Thanks for the detailed instructions.

        Best,

        Eric

        • #5
          The only problem I foresee with using GATK to call variants from RNA-seq data is the filtering. You want to set the Unified Genotyper to be as sensitive as possible. Don't worry too much about this, as GATK is very aggressive in calling SNPs by default. Then use as many annotations as possible from VariantAnnotator to whittle the variants down to what you believe to be true SNP calls. It would probably help to use --glm SNP so that you only have to worry about filtering false-positive SNP calls for now. Let me know how everything turns out.
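As a rough illustration of the "call sensitively, then filter" idea, the sketch below keeps or drops single VCF records based on QUAL plus the `DP` and `ReadPosRankSum` INFO fields. The function names and all three thresholds are hypothetical and would need tuning for real data; this is not GATK's own filtering logic, just a plain-Python stand-in for it.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=12;ReadPosRankSum=-0.5' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True  # flag entries carry no value
    return info


def keep_snp(vcf_line, min_qual=30.0, min_depth=8, min_read_pos_rank_sum=-8.0):
    """Return True if a VCF data line looks like a SNP worth keeping.

    The default thresholds are placeholders, not recommendations.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, qual = fields[3], fields[4], fields[5]
    info = parse_info(fields[7])

    # SNPs only: the reference and every alternate allele are a single base.
    if len(ref) != 1 or any(len(allele) != 1 for allele in alt.split(",")):
        return False
    if qual == "." or float(qual) < min_qual:
        return False
    if int(info.get("DP", 0)) < min_depth:
        return False
    # ReadPosRankSum is often absent (for example at homozygous sites), so a
    # missing value is not treated as a failure here.
    rank_sum = info.get("ReadPosRankSum")
    if rank_sum is not None and float(rank_sum) < min_read_pos_rank_sum:
        return False
    return True


if __name__ == "__main__":
    record = "chr1\t100\t.\tA\tG\t50.0\t.\tDP=20;ReadPosRankSum=0.3"
    print(keep_snp(record))
```

Header lines (those starting with `#`) should be passed through unchanged rather than handed to `keep_snp`.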

          • #6
            normalization of the aligned data

            Don't we need to apply some normalization method before calling variants on mRNA-seq data?

            • #7
              This is a tricky problem, and simply using TopHat with GATK will give you an incredible number of false positives.
              Read the comments on this paper to get an idea of the issues, as well as some methods for dealing with them:


              Here are several other papers that deal with this issue:



              There are more out there too, but the basic idea is that if you want to call variants from RNA Seq data, you have to be very careful.

              • #8
                One question: how will the difference between single-end and paired-end sequencing affect SNP calling in mRNA-seq?

                • #9
                  You may want to have a look at this: http://allaboutbioinfo.blogspot.com/...53107057687822

                  • #10
                    That's fantastic!
                    Do you have any more good things to read like this one?

                    Thanks a lot!

                    • #12
                      It's too bad I'm so tired from courses today, because this was really inspiring! I will read it all tomorrow.

                      Thank you!!

                      • #13
                        Originally posted by Dameon
                        I use GATK to call variants from TopHat-aligned BAM files. First, you'll need to add @RG (read group) information and sort with Picard tools so that the BAM files are configured for GATK; otherwise, it will fail. Depending on what species you are interrogating, you can then realign around indels and recalibrate the quality scores, or go straight to the Unified Genotyper. Because you would be expecting differentially expressed genes with very low and variable coverage across exons, set --stand_emit_conf and --stand_call_conf to something really low, like 2, and then use the variant annotator option (-A, I think) in the Unified Genotyper to add the ReadPosRankSumTest quality score. Take the VCF file generated by GATK and run it through SNPeff (if human, submit the GATK VCF file to SeattleSNP), and then take that VCF file as input to GATK's VariantAnnotator to annotate the raw GATK VCF file. Now filter for what you are interested in. Enjoy.
                        Why SeattleSNP instead of SNPeff for humans? And which software from SeattleSNP are you referring to?
