Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Questions in calling SNPs from RNA-seq reads

    I'm trying to call SNPs from RNA-seq data.
    The workflow I'm trying to follow is: tophat2 --> remove duplicates --> local realignment and base quality recalibration by GATK --> call SNPs by samtools mpileup or GATK

    These are some specific questions:

    1. Should I filter the alignment from tophat2 to get only the uniquely mapped reads (reads that could only be mapped to single location on the genome)? If yes, how should I filter them?
    I know there is a --multihits (-g) parameter in tophat that can be set to 1 to report only one hit per mapped read. But this is not the strict definition of "unique mapping". I saw people saying that in the alignment sam file, if MAPQ=50, that means this read can only be mapped to a single position. Could I, for example, run tophat2 with g=20 (default), and then select only the aligned reads with MAPQ=50?

    2. GATK cannot accept N in cigar. How should I deal with the aligned reads with N in the cigar? GATK has two options: the first is to remove these reads; the second is to allow N in cigar. The former strategy may lose a lot of information, while the latter is at risk (I don't know what the risk is exactly).

    3. To use samtools mpileup, how should I set the -C (parameter for adjusting mapQ) parameter? I know it recommends 50 for bwa aligned reads, but what it is for tophat2 alignment?

    4. If you have experience in this kind of data analysis, can you see any other potential problems in the pipeline or the parameters of each program? Do you have any recommendation?

    By the way, I know there are a lot of caveats of using RNA-seq data to call SNPs. The major purpose of this project is to measure expression, but calling SNPs is to try to get more use of the data (or the money).

    Thanks a lot for any suggestions or advice!

  • #2
    I've done this a few times (call variants in RNAseq). From my experience it was 'easiest' to follow the GATK Best Practices (http://www.broadinstitute.org/gatk/guide/best-practices). I say 'easiest' because GATK is a bit of a hassle to run (the first time, anyway, second time was ok=).

    So I aligned with BWA (Tophat gives potential splice junctions; if you think they may be holding many important variants continue with Tophat, but the majority of SNPs in exon of interest (stops, non-synonymous) will not be in that region I think... and if they are you may be better trying the splice variant/differential splicing analysis...). I used UnifiedGenotyper and HaplotypeCaller to get my VCFs. I tried mpileup before, it was fine but everyone kept saying 'GATK is the gold standard' so I used it in the end: peer (review) pressure=D

    So I recommend: BWA, remove duplicates (Picard), realign, recalibrate, call variants, <other>.

    Hope that is of some help.

    Comment


    • #3
      Originally posted by bruce01 View Post
      I've done this a few times (call variants in RNAseq). From my experience it was 'easiest' to follow the GATK Best Practices (http://www.broadinstitute.org/gatk/guide/best-practices). I say 'easiest' because GATK is a bit of a hassle to run (the first time, anyway, second time was ok=).

      So I aligned with BWA (Tophat gives potential splice junctions; if you think they may be holding many important variants continue with Tophat, but the majority of SNPs in exon of interest (stops, non-synonymous) will not be in that region I think... and if they are you may be better trying the splice variant/differential splicing analysis...). I used UnifiedGenotyper and HaplotypeCaller to get my VCFs. I tried mpileup before, it was fine but everyone kept saying 'GATK is the gold standard' so I used it in the end: peer (review) pressure=D

      So I recommend: BWA, remove duplicates (Picard), realign, recalibrate, call variants, <other>.

      Hope that is of some help.
      Thanks for your information! It's helpful.

      Comment


      • #4
        Hopefully it is helpful, if you have ideas on how to use the variant data specifically in terms of DE genes or otherwise maybe post back?! Also have a look at SNPdat to annotate your VCF, works very well for me.

        Comment


        • #5
          Originally posted by bruce01 View Post
          Hopefully it is helpful, if you have ideas on how to use the variant data specifically in terms of DE genes or otherwise maybe post back?! Also have a look at SNPdat to annotate your VCF, works very well for me.
          Thanks, I'll take a look at SNPdat. We usually use SnipEff to annotate VCF.

          Comment


          • #6
            What are the specific settings (options, parameters) to use when mapping RNA seq data on human genome using BWA software?

            Comment


            • #7
              To quote Homer: "the two greatest words in the English language: de-fault!"

              That is, I use default settings. Also depends on your data, as per: http://bio-bwa.sourceforge.net/bwa.shtml

              Comment


              • #8
                OK thanks.

                But first I have to create the index. I'm using the human_g1k_v37 version. It has been 2 hours now that I've send the command and I have still this line:

                [bwa_index] Pack FASTA...

                Is everything okay?

                Comment


                • #9
                  Hi FrankiB, yes I would say everything is ok. As a general rule if there is no error given, and the rest of your computer is working ok, then you just need to wait for things to build. Have a look at 'screen' and 'top' if you don't know them, nice way to make sure your command is actually doing something!

                  Comment


                  • #10
                    Originally posted by bruce01 View Post
                    I've done this a few times (call variants in RNAseq). From my experience it was 'easiest' to follow the GATK Best Practices (http://www.broadinstitute.org/gatk/guide/best-practices). I say 'easiest' because GATK is a bit of a hassle to run (the first time, anyway, second time was ok=).

                    So I aligned with BWA (Tophat gives potential splice junctions; if you think they may be holding many important variants continue with Tophat, but the majority of SNPs in exon of interest (stops, non-synonymous) will not be in that region I think... and if they are you may be better trying the splice variant/differential splicing analysis...). I used UnifiedGenotyper and HaplotypeCaller to get my VCFs. I tried mpileup before, it was fine but everyone kept saying 'GATK is the gold standard' so I used it in the end: peer (review) pressure=D

                    So I recommend: BWA, remove duplicates (Picard), realign, recalibrate, call variants, <other>.

                    Hope that is of some help.
                    Hi bruce01, what did you use for your VCFs, UnifiedGenotyper or HaplotypeCaller? or did you use both?

                    Comment


                    • #11
                      I use(d) HaplotypeCaller which is now recommended by GATK support as the one to use in all circumstances.

                      Comment


                      • #12
                        Detecting mutations from RNA-Seq is not a typical approach to detect mutations, mainly due to the intrinsic complexity in the transcriptome (e.g., splicing, high/low gene expression level).

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:37 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, Yesterday, 06:07 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        51 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        67 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X