  • why low mapping rates for RNA-seq with tophat2

    Hi everybody,

    As a beginner in RNA-seq analysis, I badly need your help and would appreciate it very much.
    I did single-end sequencing of the Arabidopsis thaliana transcriptome on a HiSeq 2000. The read length is 51 bp. The sequencing quality looked quite good when checked with FastQC. When I ran TopHat2, the resulting accepted_hits.bam file was about 38 MB in size, while unmapped.bam was about 280 MB. Although I haven't worked out the exact mapping rate, judging from the sizes of the mapped and unmapped files it seems that the majority of the reads are not mapped to the genome. When I randomly picked some reads from the unmapped file and BLASTed them against the Arabidopsis genome (-intron, +UTR), almost all of the reads I checked matched a particular mRNA perfectly. I used the genes.gtf and genome from TAIR10, downloaded from iGenomes. This low mapping rate occurred whether I used script 1 or script 2 below. Does anyone have any clue what the reason could be? Thanks for your suggestions.

    script1:
    tophat2 -p 8 -i 30 -g 5 --min-coverage-intron 30 --min-segment-intron 30 --b2-sensitive -G genes.gtf -o ./ genome 4_GCCAAT_L001_R1_001.fastq.gz 4_GCCAAT_L001_R1_002.fastq.gz 4_GCCAAT_L001_R1_003.fastq.gz 4_GCCAAT_L001_R1_004.fastq.gz 4_GCCAAT_L001_R1_005.fastq.gz 4_GCCAAT_L001_R1_006.fastq.gz 4_GCCAAT_L001_R1_007.fastq.gz 4_GCCAAT_L001_R1_008.fastq.gz 4_GCCAAT_L001_R1_009.fastq.gz 4_GCCAAT_L001_R1_010.fastq.gz 4_GCCAAT_L001_R1_011.fastq.gz 4_GCCAAT_L001_R1_012.fastq.gz

    script2:
    tophat2 -p 8 -G genes.gtf -o ./ genome 4_GCCAAT_L001_R1_001.fastq.gz 4_GCCAAT_L001_R1_002.fastq.gz 4_GCCAAT_L001_R1_003.fastq.gz 4_GCCAAT_L001_R1_004.fastq.gz 4_GCCAAT_L001_R1_005.fastq.gz 4_GCCAAT_L001_R1_006.fastq.gz 4_GCCAAT_L001_R1_007.fastq.gz 4_GCCAAT_L001_R1_008.fastq.gz 4_GCCAAT_L001_R1_009.fastq.gz 4_GCCAAT_L001_R1_010.fastq.gz 4_GCCAAT_L001_R1_011.fastq.gz 4_GCCAAT_L001_R1_012.fastq.gz
    Last edited by IceWater; 06-18-2012, 01:41 PM.
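
File sizes are only a rough proxy for the mapping rate mentioned in the post above. A minimal sketch of a more direct check follows, assuming samtools is installed and that accepted_hits.bam and unmapped.bam are in the current directory. The helper name mapping_rate is made up for illustration. The -F 256 filter skips records flagged as secondary; whether that counts each read exactly once depends on how the aligner version sets that flag, so treat the result as an estimate.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of mapped reads from two counts.
# $1 = mapped count, $2 = unmapped count.
mapping_rate() {
    awk -v m="$1" -v u="$2" 'BEGIN {
        if (m + u == 0) { print "NA"; exit }
        printf "%.1f\n", 100 * m / (m + u)
    }'
}

# Only query the BAM files if samtools and both files are present.
if command -v samtools >/dev/null 2>&1 && [ -f accepted_hits.bam ] && [ -f unmapped.bam ]; then
    # -c counts records; -F 256 excludes records flagged as secondary alignments.
    mapped=$(samtools view -c -F 256 accepted_hits.bam)
    unmapped=$(samtools view -c unmapped.bam)
    echo "mapping rate: $(mapping_rate "$mapped" "$unmapped")%"
fi
```

For example, `mapping_rate 30 70` prints 30.0.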

  • #2
    There are a lot of potential reasons. Poor quality sequence, contamination, ribosomal RNA, etc. I've had all of these affect my mapping at one time or another.
    With your short reads, have you tried just using Bowtie? You should be able to get a significant amount of them mapping. If not, it might indicate a sample problem rather than an alignment problem.
    You might also want to give STAR a try. I always get better mapping with it over Tophat.



    • #3
      Tophat 1

      I am not a Tophat user, but I have heard from others that Tophat 2.0 changed from Tophat 1 in the sense that it maps only to annotated references, which reduces mappability. Maybe try Tophat 1? Some of our bioinformaticians have switched back. Don't quote me on any of this, it's just hearsay.



      • #4
        Originally posted by JeremyDay
        I am not a Tophat user, but I have heard from others that Tophat 2.0 changed from Tophat 1 in the sense that it maps only to annotated references, which reduces mappability. Maybe try Tophat 1? Some of our bioinformaticians have switched back. Don't quote me on any of this, it's just hearsay.
        Tophat2 will do an initial mapping to annotated transcript sequences, but then it should map back to the genome regardless of annotation. This actually speeds up the mapping drastically.



        • #5
          Just trickling this back from our mailing list... this user was answered there by Daehwan. The main problem was that the read files were not passed in with a ',' between them; they were separated by spaces. TopHat interprets such a command entirely differently.
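
A sketch of the corrected call, assuming the same twelve file names and the options from script2 in the original post. The read files are joined into a single comma-separated argument, which is how the TopHat usage line expects multiple files of one read set to be given. The variable names are illustrative, and the command is printed rather than run so it can be checked first.

```shell
#!/usr/bin/env bash
# Build one comma-separated argument from the twelve FASTQ chunks.
reads=""
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    f=$(printf '4_GCCAAT_L001_R1_%03d.fastq.gz' "$i")
    # Prepend a comma only when the list already holds a file name.
    reads="${reads:+$reads,}$f"
done

# Same options as script2; only the way the read files are passed changes.
cmd="tophat2 -p 8 -G genes.gtf -o ./ genome $reads"
echo "$cmd"
```

Once the printed command looks right, it can be pasted into the shell or run with `eval "$cmd"`.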

          Originally posted by JeremyDay
          I am not a Tophat user, but I have heard from others that Tophat 2.0 changed from Tophat 1 in the sense that it maps only to annotated references, which reduces mappability. Maybe try Tophat 1? Some of our bioinformaticians have switched back. Don't quote me on any of this, it's just hearsay.
          I believe the change you are referring to is in TopHat 1.4, where we changed how transcriptome mapping works when a GTF is given with the '-G' argument. Internally this program is called 'map2gtf'. The new method maps directly to the transcriptome before anything else and then converts the coordinates back to genomic coordinates. This typically results in better alignments. One reason you might get fewer alignments in newer versions of TopHat (>1.3 or 1.4) is that the internal bowtie parameters have become more stringent (allowing fewer mismatches with -N, I believe).



          HTH,

          Harold



          • #6
            Thanks.

            Hi Everyone,

            I really appreciate your replies and suggestions. I have now found out why this happened. It is just as Harold said: I used spaces instead of "," to separate the read files passed in.
            Last edited by IceWater; 06-18-2012, 01:42 PM.
