  • why low mapping rates for RNA-seq with tophat2

    Hi everybody,

    As a beginner for RNA-seq analysis, I desperately need your help and will appreciate it very much.
    I did single-end sequencing of the Arabidopsis thaliana transcriptome on a HiSeq 2000. The read length is 51 bp. The sequencing quality seemed quite good when checked with FastQC. When I ran TopHat2, the resulting accepted_hits.bam file was about 38 MB in size, while unmapped.bam was about 280 MB. Although I haven't worked out the exact mapping rate, judging from the sizes of the mapped and unmapped files it seems that the majority of the reads are not mapped to the genome. When I randomly picked some reads from the unmapped file and blasted them against the Arabidopsis genome (-intron, +UTR), almost all the reads I checked matched a particular mRNA perfectly. I used the genes.gtf and genome files from the TAIR10 build downloaded from iGenomes. The low mapping rate occurred whether I used script1 or script2 below. Does anyone have a clue what the reason could be? Thanks for your suggestions.

    script1:
    tophat2 -p 8 -i 30 -g 5 --min-coverage-intron 30 --min-segment-intron 30 --b2-sensitive -G genes.gtf -o ./ genome 4_GCCAAT_L001_R1_001.fastq.gz 4_GCCAAT_L001_R1_002.fastq.gz 4_GCCAAT_L001_R1_003.fastq.gz 4_GCCAAT_L001_R1_004.fastq.gz 4_GCCAAT_L001_R1_005.fastq.gz 4_GCCAAT_L001_R1_006.fastq.gz 4_GCCAAT_L001_R1_007.fastq.gz 4_GCCAAT_L001_R1_008.fastq.gz 4_GCCAAT_L001_R1_009.fastq.gz 4_GCCAAT_L001_R1_010.fastq.gz 4_GCCAAT_L001_R1_011.fastq.gz 4_GCCAAT_L001_R1_012.fastq.gz

    script2:
    tophat2 -p 8 -G genes.gtf -o ./ genome 4_GCCAAT_L001_R1_001.fastq.gz 4_GCCAAT_L001_R1_002.fastq.gz 4_GCCAAT_L001_R1_003.fastq.gz 4_GCCAAT_L001_R1_004.fastq.gz 4_GCCAAT_L001_R1_005.fastq.gz 4_GCCAAT_L001_R1_006.fastq.gz 4_GCCAAT_L001_R1_007.fastq.gz 4_GCCAAT_L001_R1_008.fastq.gz 4_GCCAAT_L001_R1_009.fastq.gz 4_GCCAAT_L001_R1_010.fastq.gz 4_GCCAAT_L001_R1_011.fastq.gz 4_GCCAAT_L001_R1_012.fastq.gz
    Last edited by IceWater; 06-18-2012, 01:41 PM.
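
The post above estimates the mapping rate from file sizes only. A rough sketch of how the rate could be computed from read counts instead is shown here. The `mapping_rate` helper and the example counts are hypothetical, and the commented `samtools` lines assume samtools is installed and that secondary alignments carry the 0x100 flag, which should be checked for the TopHat version in use.

```shell
#!/bin/sh
# Hypothetical helper: percentage of mapped reads, given two counts.
mapping_rate() {
  awk -v m="$1" -v u="$2" 'BEGIN {
    if (m + u == 0) { print "NA"; exit }
    printf "%.1f\n", 100 * m / (m + u)
  }'
}

# Assumed way to obtain the counts (not run here):
#   mapped=$(samtools view -c -F 256 accepted_hits.bam)   # skip secondary records
#   unmapped=$(samtools view -c unmapped.bam)
#   mapping_rate "$mapped" "$unmapped"

# Illustration with made-up counts:
mapping_rate 1200 8800   # prints 12.0
```

File sizes are only a loose proxy, since compression and multi-mapped records both distort them, so counting records is the safer check.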

  • #2
    There are a lot of potential reasons. Poor quality sequence, contamination, ribosomal RNA, etc. I've had all of these affect my mapping at one time or another.
    With your short reads, have you tried just using Bowtie? You should be able to get a significant amount of them mapping. If not, it might indicate a sample problem rather than an alignment problem.
    You might also want to give STAR a try. I always get better mapping with it over Tophat.


    • #3
      Tophat 1

      I am not a TopHat user, but I have heard from others that TopHat 2.0 changed from TopHat 1 in the sense that it maps only to annotated references, which reduces mappability. Maybe try TopHat 1? Some of our bioinformaticians have switched back. Don't quote me on any of this, it's just hearsay.


      • #4
        Originally posted by JeremyDay
        I am not a TopHat user, but I have heard from others that TopHat 2.0 changed from TopHat 1 in the sense that it maps only to annotated references, which reduces mappability. Maybe try TopHat 1? Some of our bioinformaticians have switched back. Don't quote me on any of this, it's just hearsay.
        TopHat2 does an initial mapping to the annotated transcript sequences, but it should then map back to the genome regardless of annotation. This actually speeds up the mapping drastically.


        • #5
          Just trickling back from our mailing list... this user was answered on our mailing list by Daehwan. The main problem was that the read files were passed in separated by spaces rather than by ','. TopHat interprets such a command entirely differently.
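
As a minimal sketch of the fix described here, the twelve file names from the original post can be joined with commas before being handed to TopHat. The loop and the `reads` variable are illustrative assumptions, and the final `tophat2` line is left commented out because it simply reuses the options of script2 from the thread.

```shell
#!/bin/sh
# Build one comma-separated argument from the 12 lane files.
reads=""
i=1
while [ "$i" -le 12 ]; do
  f=$(printf '4_GCCAAT_L001_R1_%03d.fastq.gz' "$i")
  # Prepend a comma only when the list is already non-empty.
  reads="${reads:+$reads,}$f"
  i=$((i + 1))
done

echo "$reads"

# Assumed invocation, same options as script2 (not run here):
#   tophat2 -p 8 -G genes.gtf -o ./ genome "$reads"
```

Quoting `"$reads"` keeps the whole list as a single positional argument, which is the form the reply says TopHat expects for one set of single-end reads.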

          Originally posted by JeremyDay
          I am not a TopHat user, but I have heard from others that TopHat 2.0 changed from TopHat 1 in the sense that it maps only to annotated references, which reduces mappability. Maybe try TopHat 1? Some of our bioinformaticians have switched back. Don't quote me on any of this, it's just hearsay.
          I believe the change you are referring to is in TopHat 1.4, where we changed how transcriptome mapping works when a GTF is given with the '-G' argument. Internally this program is called 'map2gtf'. The new method maps directly to the transcriptome before anything else and then converts the coordinates back to genomic coordinates. This typically results in better alignments. One reason you might get fewer alignments in newer versions of TopHat (>1.3 or 1.4) is that the internal Bowtie parameters have become more stringent (allowing fewer mismatches with -N, I believe).



          HTH,

          Harold


          • #6
            Thanks.

            Hi Everyone,

            I really appreciate your replies and suggestions. I have now found out why this happened. It is just as Harold said: I used spaces instead of "," to separate the read files I passed in.
            Last edited by IceWater; 06-18-2012, 01:42 PM.
