  • Calling SNPs following TopHat alignment of RNA-seq reads

    Hi all,

    We have been assessing different methods of calling SNPs from our RNA-seq runs, ranging from the GA pipeline itself to MAQ (using the MAQ aligner to align reads to the reference genome as well as to a set of splice junction sequences derived from all known genes). Both methods gave broadly similar results, which was great!

    However, I have been interested in using TopHat to align our sequences, as it would allow us to investigate novel transcripts or splice junctions not in the currently annotated gene set. The TopHat alignment appeared to work fine, except that the quality scores for each aligned read seem to be output in Solexa format rather than the Phred format I was expecting (using the --solexa1.3-quals option). Using Cufflinks to quantify transcript abundance gave results broadly similar to the GA analysis pipeline, which again was super.
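One way to check which quality encoding the aligned reads actually carry is to look at the range of ASCII codes in the quality strings. The snippet below is a minimal sketch of that common heuristic, not something taken from TopHat itself; the function names are hypothetical, and the guess is only as good as the sample of reads it is given (a sample containing only high-quality bases cannot distinguish the two +64 encodings, nor rule out +33). The conversion uses the standard relationship between Solexa-scaled and Phred-scaled scores.

```python
import math


def guess_quality_encoding(quality_strings):
    """Guess the quality encoding from the lowest ASCII code observed.

    Heuristic only: Phred+33 can use characters below ';' (code 59),
    Solexa+64 starts at ';' (score -5), Phred+64 starts at '@' (score 0).
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 59:
        return "phred+33"
    if lowest < 64:
        return "solexa+64"
    return "phred+64"


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled score to the Phred scale."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


if __name__ == "__main__":
    # Hypothetical quality strings, e.g. column 11 of a few SAM records.
    sample = ["hhhhhhhhgggfffeeB", "hhhhhhgggffeeddBB"]
    print(guess_quality_encoding(sample))
```

The two scales differ noticeably only at low quality values, so a mislabelled encoding mostly matters for the poorest bases, which are also the ones most likely to produce false SNP calls.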

    However, using SAMtools to call SNPs in the TopHat-aligned reads resulted in over four times as many SNPs being called as by the other two methods. Just eyeballing the alignments indicates that most of the new SAMtools-called SNPs are at the beginning and end of exons, where fewer reads have aligned or where reads have been split between two adjacent exons. Using SAMtools to call SNPs on the alignments previously generated by MAQ resulted in only marginally more SNPs than the MAQ SNP caller itself, leading us to believe that the problem is with the TopHat alignment rather than the SNP caller.

    All this may be the result of something stupid I am doing wrong, but I figured there is no harm in asking whether others have encountered similar problems and, if so, how you dealt with them. I don't simply want to discard SNPs called at the beginning or end of exons, for obvious reasons!

    Thanks for any help anyone can give me on this.
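Rather than discarding SNPs near exon boundaries, they can be flagged so that the excess calls can be counted and inspected separately. The following is a minimal sketch under stated assumptions: SNPs are given as (chromosome, position) tuples and exons as (chromosome, start, end) tuples with 1-based inclusive coordinates; the function name and the `window` threshold of 10 bases are hypothetical choices for illustration, not values from any of the tools mentioned.

```python
def flag_snps_near_exon_edges(snps, exons, window=10):
    """Annotate each SNP with its distance to the nearest edge of an exon
    that contains it, and whether that distance is below `window`.

    snps:  iterable of (chrom, pos), 1-based positions
    exons: iterable of (chrom, start, end), 1-based inclusive
    Returns a list of (chrom, pos, distance_or_None, near_edge).
    """
    exons = list(exons)
    flagged = []
    for chrom, pos in snps:
        # Distance to the closer boundary of every exon containing the SNP.
        distances = [
            min(abs(pos - start), abs(pos - end))
            for ex_chrom, start, end in exons
            if ex_chrom == chrom and start <= pos <= end
        ]
        distance = min(distances) if distances else None
        near_edge = distance is not None and distance < window
        flagged.append((chrom, pos, distance, near_edge))
    return flagged


if __name__ == "__main__":
    # Made-up coordinates purely for illustration.
    exons = [("chr1", 100, 200), ("chr1", 300, 400)]
    snps = [("chr1", 102), ("chr1", 150), ("chr1", 398), ("chr2", 50)]
    for row in flag_snps_near_exon_edges(snps, exons):
        print(row)
```

Comparing the fraction of flagged calls between the TopHat-based and MAQ-based SNP sets would show whether the fourfold excess really is concentrated at exon edges, without throwing any calls away.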

  • #2
    Perhaps it would help to increase the anchor length (-a/--min-anchor-length) for TopHat during the initial alignment? In my experience, Illumina RNA-seq reads have increased error rates for the first 10-12 bases and then again towards the end, so using a longer anchor (and being relatively strict about the number of mismatches in the -m/--splice-mismatches setting) may help to keep the splice junction alignments more "real" and result in fewer but more reliable SNP calls.
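As a rough illustration of the suggestion above, the sketch below assembles a TopHat command line with a longer anchor and a strict splice-mismatch setting. Only the option names come from this thread; the helper name, the index prefix, the read file names, and the values 12 and 0 are assumptions for illustration and should be checked against the manual for your TopHat version.

```python
import shlex


def build_tophat_command(index_prefix, reads, min_anchor_length=12,
                         splice_mismatches=0, solexa13_quals=True):
    """Return the argument list for a TopHat run with stricter junction settings.

    The default values here are illustrative, not recommendations.
    """
    cmd = [
        "tophat",
        "--min-anchor-length", str(min_anchor_length),
        "--splice-mismatches", str(splice_mismatches),
    ]
    if solexa13_quals:
        cmd.append("--solexa1.3-quals")
    cmd.append(index_prefix)   # Bowtie index prefix (hypothetical name below)
    cmd.extend(reads)          # one file for single-end, two for paired-end
    return cmd


if __name__ == "__main__":
    cmd = build_tophat_command("genome_index", ["reads_1.fq", "reads_2.fq"])
    # Print a shell-safe version of the command for inspection; it could be
    # run with subprocess.run(cmd, check=True) once the paths are real.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Re-running the SNP caller on alignments produced with a few different anchor lengths would show how sensitive the junction-adjacent calls are to this setting.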


    • #3
      Why do you think that RNA-seq reads have higher error rates for the first 10-12 bases?
      --
      bioinfosm


      • #4
        That is what it looks like in the error plots generated by our installation of the Illumina pipeline. I am attaching two pictures, one from an RNA lane and one from a DNA lane: the RNA graph is the one with the bump at the beginning and then again around cycle 50, at the beginning of the second read.

        • #5
          Hi cormicp,

          Did you figure out a solution to your problem?
          I am currently facing the same problem.
          I have Illumina paired-end RNA-seq reads and a reference transcriptome.
          However, I have no idea how to get SNP calls from my data set.
          Thanks for any advice.
          Last edited by edge; 05-30-2012, 09:19 AM. Reason: typo error
