
  • Low percentage of mapped reads (TopHat)

    I ran into a problem analyzing RNA-seq data with TopHat.

    The data are 36 bp single-end reads from an Illumina Genome Analyzer. In the first stage of TopHat, the Bowtie alignment,
    the result is poor:

    # reads processed: 26732967
    # reads with at least one reported alignment: 134706 (0.50%)
    # reads that failed to align: 26586496 (99.45%)
    # reads with alignments suppressed due to -m: 11765 (0.04%)
    Reported 400680 alignments to 1 output stream(s)


    The number of reads that failed to align, 26586496 (99.45%), is far too high, and I really don't know how to deal with it.
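As a quick sanity check on a summary like the one above, the counts can be parsed and the percentages recomputed. This is only a minimal sketch: the function names are made up for illustration, and it assumes the summary lines look exactly like the Bowtie 1 output pasted in this post.

```python
import re

def parse_bowtie_summary(text):
    """Extract read counts from a Bowtie 1 style alignment summary."""
    patterns = {
        "processed": r"# reads processed: (\d+)",
        "aligned": r"# reads with at least one reported alignment: (\d+)",
        "failed": r"# reads that failed to align: (\d+)",
        "suppressed": r"# reads with alignments suppressed due to -m: (\d+)",
    }
    counts = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        # A missing line is treated as zero reads in that category.
        counts[key] = int(match.group(1)) if match else 0
    return counts

def fractions(counts):
    """Return each category as a fraction of all processed reads."""
    total = counts["processed"]
    return {k: v / total for k, v in counts.items() if k != "processed"}

summary = """\
# reads processed: 26732967
# reads with at least one reported alignment: 134706 (0.50%)
# reads that failed to align: 26586496 (99.45%)
# reads with alignments suppressed due to -m: 11765 (0.04%)
"""

counts = parse_bowtie_summary(summary)
# The three categories should account for every processed read.
accounted = counts["aligned"] + counts["failed"] + counts["suppressed"]
print("all reads accounted for:", accounted == counts["processed"])
for name, frac in fractions(counts).items():
    print(f"{name}: {frac:.2%}")
```

The three categories add up to the total, so the summary is internally consistent and the problem lies in the reads or the reference rather than in the reporting.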

    I tried many different TopHat parameters. The current command is:

    tophat --bowtie1 --library-type fr-unstranded --segment-length 18 -p 8 -o /sjn/rep1/try/tophat_821 /sjn/gencode/bowtie1_index/hg19 /rep1/RawDataRep1.fastq


    1. --bowtie1 makes TopHat use Bowtie 1 instead of Bowtie 2.
    I tried Bowtie 2 first, and the results were even worse.
    From the Bowtie site (http://bowtie-bio.sourceforge.net/index.shtml): "For reads longer than about 50 bp Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1. For relatively short reads (e.g. less than 50 bp) Bowtie 1 is sometimes faster and/or more sensitive."

    2. --segment-length 18
    because the reads are 36 bp long (18 is half the read length).


    I really do not know what to do.

  • #2
    Have you taken a sample of reads and BLASTed them to see what comes back? That may give some clue as to what is happening (e.g. the data are not what you expect). Have you done any other QC on this data?
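One way to pull a small random sample of reads for BLAST is a single-pass reservoir sample written out as FASTA. The sketch below is illustrative only: the helper names are hypothetical, and it assumes a plain, uncompressed FASTQ with exactly four lines per record.

```python
import io
import random

def read_fastq(handle):
    """Yield (name, sequence) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a trailing blank line)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality line, ignored
        yield header.strip()[1:], seq

def sample_reads(records, k, seed=0):
    """Reservoir-sample k records in one pass without loading the whole file."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir

def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)

# Tiny in-memory example; for real data, pass an open file handle instead.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n"
)
picked = sample_reads(read_fastq(demo), k=2)
print(to_fasta(picked), end="")
```

A few hundred sampled reads are usually enough to see whether the hits land on the expected organism.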



    • #3
      Not yet. I am trying to BLAST them now, as you suggested.



      • #4
        I ran some of the sequences from the FASTQ file through BLAT (UCSC), and it returns information about each sequence. Does that mean the sequences are good?
        As for the QC, I am working on it now.



        • #5
          Are the BLAT hits going to the right organism/genome, i.e. is there no unexpected contamination in the data? If so, it may just be that you need to trim the sequences (to remove adapters etc.).

          Use FastQC as a simple option (if you have not done any QC on your data).



          • #6
            In the FastQC report, the "Per base sequence content" for the first 10 bp looks strange, so maybe those bases are adapter sequence in the FASTQ file.
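For reference, the kind of per-position base composition that the "Per base sequence content" plot summarizes can be computed by hand for the first few positions. This is a rough sketch, not FastQC's own code: the function name and the example reads are invented for illustration.

```python
from collections import Counter

def per_base_content(sequences, n_positions=10):
    """Return, for each of the first n_positions, the fraction of each base."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in sequences:
        for pos, base in enumerate(seq[:n_positions].upper()):
            counts[pos][base] += 1
    table = []
    for counter in counts:
        total = sum(counter.values())
        # Positions with no data (reads shorter than n_positions) report 0.0.
        table.append({b: (counter[b] / total if total else 0.0) for b in "ACGTN"})
    return table

# Made-up reads that all start with the same 3 bases, as a shared prefix would.
reads = ["ACGTTGCA", "ACGAAGCT", "ACGCCTTA", "ACGGGATC"]
for pos, row in enumerate(per_base_content(reads, n_positions=5), start=1):
    print(pos, {base: round(frac, 2) for base, frac in row.items()})
```

A position where one base sits near 100% across all reads points to a fixed sequence such as an adapter or linker, whereas a mild skew over the first few bases is a different pattern.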



            • #7
              RNA-seq data generally has that signature for the first few base pairs. This is a known bias and does not affect alignments or downstream analysis.

              See this thread and others within.



              • #8
                The technology is CAGE, not RNA-seq. I just mixed them up.



                • #9
                  What did you find in the blat results? Expected genome matches or otherwise?



                  • #10
                    GenoMax, I am glad to see you in this Forum, I'd like to get your opinion.

                    2 out of 15 of my libraries have low mapping rates (~60%) with TopHat. The rest are good (~94%). For these 2 samples, 60% translates to ~8 million reads mapped to the reference genome, which isn't that bad considering how much I lost. I looked at the 'unmapped.bam' file, and most of these reads map to anything but my model organism.
                    My question: do you think this will affect downstream analysis? I am assuming that ~8 million reads mapped to the genome is not a total waste ....?

                    Thank you kindly.
                    G



                    • #11
                      1) What are you mapping to what? What is the data source (platform, chemistry) and type (read length, etc), and what is the reference?
                      2) What do the unmapped reads map to? Human, for example?
                      3) And what percent of the unmapped reads map to other organisms?
                      4) By Tophat, do you mean Tophat1 or Tophat2?
                      5) What kind of QC are you doing? Removing chastity-filter-failed reads, reads that don't exactly match the right barcode (assuming you are multiplexing), adapter-trimming, etc.

                      Also, it's never a bad idea to post a FastQC report when you have low mapping rates.
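Two of the checks in point 5 (chastity filter and barcode match) can often be read straight from the FASTQ headers. The sketch below assumes CASAVA 1.8+ style headers, where the part after the space is `read:is_filtered:control:index` and `Y` means the read failed the filter. The function names, the example headers, and the expected barcode are all hypothetical.

```python
from collections import Counter

def parse_casava_comment(header):
    """Return (failed_filter, index) from a CASAVA 1.8+ style header, else None."""
    parts = header.strip().split(" ", 1)
    if len(parts) != 2:
        return None
    fields = parts[1].split(":")
    if len(fields) != 4:
        return None
    _read, is_filtered, _control, index = fields
    return is_filtered == "Y", index

def tally(headers, expected_index):
    """Count reads that fail the filter, carry another barcode, or pass both."""
    stats = Counter()
    for header in headers:
        parsed = parse_casava_comment(header)
        if parsed is None:
            stats["unparsed"] += 1
            continue
        failed_filter, index = parsed
        if failed_filter:
            stats["failed_filter"] += 1
        elif index != expected_index:
            stats["barcode_mismatch"] += 1
        else:
            stats["kept"] += 1
    return stats

# Hypothetical headers and barcode, for illustration only.
headers = [
    "@INSTR:1:FLOWCELL:1:1101:1000:2000 1:N:0:ATCACG",
    "@INSTR:1:FLOWCELL:1:1101:1001:2001 1:Y:0:ATCACG",
    "@INSTR:1:FLOWCELL:1:1101:1002:2002 1:N:0:ATCACC",
]
print(dict(tally(headers, expected_index="ATCACG")))
```

If the headers were rewritten by an upstream tool, or the barcode was stripped during demultiplexing, this information is no longer in the file and the check has to go back to the original data.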



                      • #12
                        Thanks Brian. Here are the details. Same question, will this affect downstream analysis (I ask because I have ~8 MR mapped to my reference genome) ?

                        --

                        1) Mapping to TAIR10. Illumina HiSeq 2500, single-end sequencing
                        2) Unmapped reads --> Platynereis dumerilii
                        3) ~40% mapping to Platynereis dumerilii
                        4) TopHat2
                        5) I cleaned the reads using fastq-mcf. After filtering, all the reads have Q values >30.



                        • #13
                          Did you make the libraries, or did the sequencing provider make them? If you made the libraries, were they pooled by the provider? Were these samples multiplexed, and if so, do the contaminant reads have the barcode you expected for your own sample?



                          • #14
                            The facility made and sequenced them. They were multiplexed and pooled in a single lane.
                            I have not looked at the barcodes in the contaminating reads. I guess that is a good idea (those barcodes should be in unmapped.bam, right?)

                            THANKS !



                            • #15
                              If the contaminating reads were in the same file as your real sample then they must have the same barcode. Since you did not make the libraries and the contamination is not in all lanes something must have gone wrong with those two libraries.

                              A marine annelid worm is about as far as one can be from Arabidopsis .. unless you have very diverse research interests!
