  • Low percentage of mapped reads (TopHat)

    I ran into a problem processing RNA-seq data with TopHat.

    The data are 36 bp single-end reads from an Illumina Genome Analyzer. In TopHat's first step, the Bowtie alignment,
    the result is poor:

    # reads processed: 26732967
    # reads with at least one reported alignment: 134706 (0.50%)
    # reads that failed to align: 26586496 (99.45%)
    # reads with alignments suppressed due to -m: 11765 (0.04%)
    Reported 400680 alignments to 1 output stream(s)


    Having 26586496 reads (99.45%) fail to align is far too high, and I really don't know how to deal with it.
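As a quick sanity check on the Bowtie summary quoted above, the counts can be parsed and the percentages recomputed. This is only a minimal sketch: the function name `parse_bowtie_summary` is made up for illustration, and the summary text is pasted in as a string rather than read from a TopHat log file.

```python
import re

# Summary lines copied from the post above.
BOWTIE_SUMMARY = """\
# reads processed: 26732967
# reads with at least one reported alignment: 134706 (0.50%)
# reads that failed to align: 26586496 (99.45%)
# reads with alignments suppressed due to -m: 11765 (0.04%)
"""

def parse_bowtie_summary(text):
    """Return {label: read count} for each '# label: count' line."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"#\s*(.+?):\s*(\d+)", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts

counts = parse_bowtie_summary(BOWTIE_SUMMARY)
total = counts["reads processed"]

# Recompute each category as a percentage of all processed reads.
for label, n in counts.items():
    if label != "reads processed":
        print(f"{label}: {n} ({100 * n / total:.2f}%)")

# The three categories should add up to the number of reads processed.
accounted = sum(n for label, n in counts.items() if label != "reads processed")
print("all reads accounted for:", accounted == total)
```

The three categories do add up to the total, so the summary is internally consistent and the problem is with the alignments themselves, not with the report.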

    I tried many different TopHat parameters. The current command is:

    tophat --bowtie1 --library-type fr-unstranded --segment-length 18 -p 8 -o /sjn/rep1/try/tophat_821 /sjn/gencode/bowtie1_index/hg19 /rep1/RawDataRep1.fastq


    1. --bowtie1 makes TopHat use Bowtie 1 instead of Bowtie 2.
    I tried Bowtie 2 before, and the results were even worse. The Bowtie site (http://bowtie-bio.sourceforge.net/index.shtml) says:
    "For reads longer than about 50 bp Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1. For relatively short reads (e.g. less than 50 bp) Bowtie 1 is sometimes faster and/or more sensitive."

    2. --segment-length 18
    because the reads are 36 bp long (18 is half the read length).


    I really do not know what to do.

  • #2
    Have you taken a sample of reads and BLASTed them to see what comes back? That may give some clue as to what is happening (e.g. the data is not what you expect it to be). Have you done any other QC on this data?
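One way to pull a random handful of reads out of a large FASTQ file for BLAST is reservoir sampling, which avoids loading all ~26 million reads into memory. The sketch below is illustrative only: the function names are invented here, it assumes plain four-line FASTQ records, and the demo input is a tiny made-up string, not the poster's data.

```python
import io
import random

def read_fastq(handle):
    """Yield (name, sequence) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:            # end of file
            break
        if not header.strip():    # skip stray blank lines (e.g. at the end)
            continue
        seq = handle.readline().strip()
        handle.readline()         # '+' separator line
        handle.readline()         # quality line
        yield header.strip()[1:], seq

def sample_reads(handle, k, seed=0):
    """Reservoir-sample up to k records in a single pass over the file."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(read_fastq(handle)):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text for pasting into BLAST."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)

# Tiny made-up input standing in for a real FASTQ file.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n")
print(to_fasta(sample_reads(demo, k=2)), end="")
```

For a real run, replace the `io.StringIO` demo with `open(...)` on the FASTQ file and pick `k` small enough to paste into the BLAST web form.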


    • #3
      Not yet. I am trying to BLAST them now, as you suggested.


      • #4
        I ran BLAT (UCSC) on some of the sequences from the FASTQ file, and it returns hits with information about each sequence. So the sequences are good?
        As for the QC, I am working on it now.


        • #5
          Are the BLAT hits going to the right organism/genome, i.e. is there no unexpected contamination in the data? If so, it may simply be that you need to trim the sequences (to remove adapters etc.).

          Use FastQC as a simple option (if you have not done any QC on your data).


          • #6
            In the FastQC report, the "Per base sequence content" for the first 10 bp looks strange, so maybe those bases are adapter sequence.
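To look at the numbers behind that FastQC plot, the base fractions at each of the first few read positions can be tallied directly from the FASTQ file. This is a rough sketch, not a replacement for FastQC: the function name is invented, it assumes four-line FASTQ records, and the demo input is made up. If nearly every read carries the same base at a position, a fixed adapter or linker is a plausible explanation; a milder skew is less conclusive.

```python
import io
from collections import Counter

def per_base_composition(handle, max_pos=10):
    """Return a list of {base: fraction} dicts, one per read position."""
    counters = [Counter() for _ in range(max_pos)]
    for line_no, line in enumerate(handle):
        if line_no % 4 != 1:      # the sequence is the 2nd line of each record
            continue
        for pos, base in enumerate(line.strip()[:max_pos]):
            counters[pos][base] += 1
    composition = []
    for counter in counters:
        total = sum(counter.values())
        composition.append(
            {base: n / total for base, n in counter.items()} if total else {}
        )
    return composition

# Tiny made-up input standing in for a real FASTQ file.
demo = io.StringIO("@a\nACGT\n+\nIIII\n@b\nAGGT\n+\nIIII\n")
for pos, fractions in enumerate(per_base_composition(demo, max_pos=4), start=1):
    print(pos, dict(sorted(fractions.items())))
```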


            • #7
              RNA-seq data generally has that signature for the first few base pairs. This is a known bias and does not affect alignments or downstream analysis.

              See this thread and others within.


              • #8
                The technology is CAGE, not RNA-seq; I mixed them up.


                • #9
                  What did you find in the blat results? Expected genome matches or otherwise?


                  • #10
                    GenoMax, I am glad to see you in this forum. I'd like to get your opinion.

                    Two of my 15 libraries have low mapping rates (~60%) with TopHat; the rest are good (~94%). For these two samples, 60% translates to ~8 million reads mapped to the reference genome, which isn't that bad considering how many I lost. I looked at the 'unmapped.bam' file, and most of these reads map to anything but my model organism.
                    My question: do you think this will affect downstream analysis? I assume that with ~8 MR (million reads) mapped to the genome, the data is not a total waste?

                    Thank you kindly.
                    G


                    • #11
                      1) What are you mapping to what? What is the data source (platform, chemistry) and type (read length, etc), and what is the reference?
                      2) What do the unmapped reads map to? Human, for example?
                      3) And what percent of the unmapped reads map to other organisms?
                      4) By Tophat, do you mean Tophat1 or Tophat2?
                      5) What kind of QC are you doing? Removing chastity-filter-failed reads, reads that don't exactly match the right barcode (assuming you are multiplexing), adapter-trimming, etc.

                      Also, it's never a bad idea to post a FastQC report when you have low mapping rates.


                      • #12
                        Thanks, Brian. Here are the details. Same question: will this affect downstream analysis? (I ask because I have ~8 MR mapped to my reference genome.)

                        --

                        1) Mapping to TAIR10. Illumina HiSeq 2500, single-end sequencing.
                        2) Unmapped reads --> Platynereis dumerilii
                        3) ~40% mapping to Platynereis dumerilii
                        4) TopHat2
                        5) I cleaned the reads using fastq-mcf. After filtering, all the reads have Q values >30.


                        • #13
                          Did you make the libraries or did the sequence provider make them? If you made the libraries were they pooled by the provider? Were these samples multiplexed and if so do the contaminant reads have the barcode you expected for your own sample?


                          • #14
                            The facility made and sequenced them. They were multiplexed and pooled in a single lane.
                            I have not looked at the barcodes in the contaminating reads; I guess that is a good idea. (Those barcodes should be in unmapped.bam, right?)

                            THANKS !
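If the FASTQ headers follow the common Illumina layout `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`, the barcode and filter flag can be tallied straight from the FASTQ file. Read names in a BAM file may have lost the part after the space, so the FASTQ is likely the safer place to look. This is a hedged sketch: the function name and demo headers are made up, and whether the last field holds the index sequence depends on how the facility demultiplexed the run.

```python
import io
from collections import Counter

def tally_header_fields(handle):
    """Count index sequences and filter flags in Illumina-style FASTQ headers."""
    indexes, filter_flags = Counter(), Counter()
    for line_no, line in enumerate(handle):
        if line_no % 4 != 0:          # the header is the 1st line of each record
            continue
        parts = line.strip().split(" ")
        if len(parts) < 2:
            continue                  # no comment field, nothing to tally
        fields = parts[1].split(":")
        if len(fields) == 4:          # read : filtered (Y/N) : control : index
            filter_flags[fields[1]] += 1
            indexes[fields[3]] += 1
    return indexes, filter_flags

# Made-up headers standing in for a real FASTQ file.
demo = io.StringIO(
    "@HWI:1:FC:1:1101:1000:2000 1:N:0:ACAGTG\nACGT\n+\nIIII\n"
    "@HWI:1:FC:1:1101:1001:2001 1:N:0:ACAGTG\nTTGA\n+\nIIII\n"
    "@HWI:1:FC:1:1101:1002:2002 1:Y:0:GCCAAT\nGGCC\n+\nIIII\n"
)
indexes, filter_flags = tally_header_fields(demo)
print(indexes.most_common())
print(filter_flags.most_common())
```

In this layout a filter flag of `Y` marks a read that failed the chastity filter, so the same tally also shows whether such reads were left in the file.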


                            • #15
                              If the contaminating reads were in the same file as your real sample then they must have the same barcode. Since you did not make the libraries and the contamination is not in all lanes something must have gone wrong with those two libraries.

                              A marine annelid worm is about as far as one can be from Arabidopsis .. unless you have very diverse research interests!
