  • Test for RNAseq data preprocessing step (with regards to adapter and hexamer)

    Hi, everyone. This is my first thread on this forum, so please forgive me if I ask some naive questions.

    My question is at the bottom of this thread.

    I am currently working on RNA-seq data, using the tophat + cufflinks pipeline from this paper. I ran FastQC on my RNA-seq data and pasted some pictures from the FastQC report here. The experiment is designed to compare differential gene expression and splicing isoforms.





    I ran some tests, preprocessing the data separately for each one. (I extracted 500,000 sequences from the forward fastq file and 500,000 from the reverse one, so 500,000*2 reads in total.)

    1. No preprocessing (1,000,000 sequences; 82.2% mapped)
    822406 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    822406 + 0 mapped (100.00%:-nan%)
    822406 + 0 paired in sequencing
    413414 + 0 read1
    408992 + 0 read2
    718716 + 0 properly paired (87.39%:-nan%)
    776272 + 0 with itself and mate mapped
    46134 + 0 singletons (5.61%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    2. Trim first 15 bases only (1,000,000 sequences; 82.6% mapped)
    826128 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    826128 + 0 mapped (100.00%:-nan%)
    826128 + 0 paired in sequencing
    414760 + 0 read1
    411368 + 0 read2
    721394 + 0 properly paired (87.32%:-nan%)
    777296 + 0 with itself and mate mapped
    48832 + 0 singletons (5.91%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    3. Adapter removal only (424,612 + 470,223 = 894,835 sequences; 29.6% mapped)
    264949 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    264949 + 0 mapped (100.00%:-nan%)
    264949 + 0 paired in sequencing
    140743 + 0 read1
    124206 + 0 read2
    42 + 0 properly paired (0.02%:-nan%)
    50210 + 0 with itself and mate mapped
    214739 + 0 singletons (81.05%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    4. Trim first 15 bases, then remove adapter (454,648 + 468,550 = 923,198 sequences; 29.4% mapped)
    271326 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    271326 + 0 mapped (100.00%:-nan%)
    271326 + 0 paired in sequencing
    131398 + 0 read1
    139928 + 0 read2
    98 + 0 properly paired (0.04%:-nan%)
    48556 + 0 with itself and mate mapped
    222770 + 0 singletons (82.10%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
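
The percentages quoted in the four headings can be re-derived from the counts above. The following is a minimal sketch, assuming that "input" is the FastQC sequence count given in each heading and that the percentage in each heading means mapped reads divided by input reads; the dictionary keys and function names are made up for illustration.

```python
# Counts copied from the four flagstat summaries above.
# "input" is the sequence count FastQC reported for each test (assumption:
# this is the denominator behind the percentage in each heading).
tests = {
    "no preprocessing": {"input": 1_000_000, "mapped": 822_406,
                         "proper": 718_716, "singletons": 46_134},
    "trim 15": {"input": 1_000_000, "mapped": 826_128,
                "proper": 721_394, "singletons": 48_832},
    "clip adapter": {"input": 894_835, "mapped": 264_949,
                     "proper": 42, "singletons": 214_739},
    "trim 15 + clip": {"input": 923_198, "mapped": 271_326,
                       "proper": 98, "singletons": 222_770},
}


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


def summarize(t):
    """Compute the three ratios discussed in the thread for one test."""
    return {
        "mapped_of_input": pct(t["mapped"], t["input"]),
        "proper_of_mapped": pct(t["proper"], t["mapped"]),
        "singletons_of_mapped": pct(t["singletons"], t["mapped"]),
    }


summary = {name: summarize(t) for name, t in tests.items()}
for name, s in summary.items():
    print(name, s)
```

Laid out this way, the two adapter-clipped tests stand out on two counts: a much lower fraction of input reads mapped, and almost no properly paired reads among those that did map.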

    My question is: with good overall quality scores at each position, is it really necessary to remove adapters or to trim the first 15 bases (the bias caused by random hexamer priming)? The samtools flagstat results show that if I remove adapters, I lose a large amount of data. If I trim the first 15 bases, I get a slightly better mapping result (only a 0.4% improvement), but I lose 15 bases from every read!

    A few things to mention here:
    1. Adapter removal was performed with fastx_clipper. The adapter sequences were taken from the corresponding entries in FastQC's contaminants.txt. I discarded clipped sequences shorter than 20 bp.
    2. I used fastx_trimmer to trim the first 15 bases of each read.
    3. For the last test, "Trim first 15 bases, then remove adapter", trimming was the first step and adapter removal was the second.
    4. The number of sequences was taken from the FastQC "basic statistics" table.
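
The preprocessing steps listed above can be sketched in a few lines of Python. This is only an illustration of the logic (trim a fixed prefix, cut at the adapter, drop short reads), not a reimplementation of fastx_trimmer or fastx_clipper: the adapter match here is a plain exact substring search, and the function names, the example adapter string and the toy reads are all assumptions made for the example.

```python
def trim_start(seq, qual, n=15):
    """Drop the first n bases and their quality values."""
    return seq[n:], qual[n:]


def clip_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any.

    Assumption: exact matching only; real clippers tolerate mismatches
    and partial adapters at the read end.
    """
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def preprocess(records, adapter, n_trim=15, min_len=20):
    """Yield (name, seq, qual) for reads still at least min_len long."""
    for name, seq, qual in records:
        seq, qual = trim_start(seq, qual, n_trim)
        seq, qual = clip_adapter(seq, qual, adapter)
        if len(seq) >= min_len:
            yield name, seq, qual


# Toy input: placeholder adapter and two made-up reads.
ADAPTER = "AGATCGGAAGAGC"
reads = [
    ("r1", "N" * 15 + "A" * 30 + ADAPTER + "TTTT", "I" * 62),
    ("r2", "N" * 15 + "C" * 10 + ADAPTER, "I" * 38),
]
kept = list(preprocess(reads, ADAPTER))
print(len(kept), "of", len(reads), "reads kept")
```

Because a length filter of this kind is applied read by read, running it on the forward and reverse files separately can leave the two files with different numbers of reads, as in the 424612 + 470223 counts reported for test 3.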

    Regards

    Lynn

  • #2
    I should have changed the title icon from ^_^ to ?.



    • #3
      I'm new here. Can anyone help?



      • #4
        I'm not sure I know the right answers and I'd like to hear other folks' ideas.

        For the question about removing the first 15 bases, my impression is that the bias introduced by non-random hexamer priming is not changed by trimming. Trimming the sequences makes the fastqc report look better but the sequences that are in the sample are still the product of the non-random hexamer priming.
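
To make the point above concrete, here is a minimal sketch of how a per-position base composition table (the kind of summary behind a "per base sequence content" plot) could be computed. It is an assumption-laden illustration, not FastQC's actual code; the function name and the toy reads are invented for the example.

```python
from collections import Counter


def per_base_content(reads, n_positions=20):
    """Return, for each of the first n_positions, the fraction of A/C/G/T."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions]):
            counts[i][base] += 1
    table = []
    for c in counts:
        total = sum(c.values())
        # Positions with no data get 0.0 for every base.
        table.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return table


# Toy reads (made up): skewed composition at the first two positions.
reads = ["ACGTACGT", "AAGTACGT", "ACGTTCGT", "AAGTGCGT"]
before = per_base_content(reads, 4)
after = per_base_content([r[2:] for r in reads], 4)  # same reads, 2 bases trimmed
print(before[0], after[0])
```

Trimming changes which positions are tabulated, so the skewed leading columns disappear from the table, but the set of fragments in the library is exactly the same as before.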

        Note that, if this is true, we have been incorporating biased read counts into the tuxedo analysis - though maybe the biases cancel out.

        Any other observations on this?

        With respect to removing adapter sequences, we have been ignoring them in the fastq files. With rare exceptions, they don't align to the reference transcriptome so they do not show up in the accepted_hits.bam files output by tophat. This effectively filters them out without requiring an extra step in the analysis. In case you are not aligning to a reference transcriptome, you may still have to remove them separately.



        • #5
          Originally posted by mceachin:
          "With respect to removing adapter sequences, we have been ignoring them in the fastq files. With rare exceptions, they don't align to the reference transcriptome so they do not show up in the accepted_hits.bam files output by tophat. This effectively filters them out without requiring an extra step in the analysis."
          That is my understanding and experience as well.

          "In case you are not aligning to a reference transcriptome, you may still have to remove them separately."
          I would replace the word "may" with the word "must". But I suppose it does depend on which program you will be using for de novo analysis.



          • #6
            Hello all, this is my first post. Can anybody tell me how the random hexamers work in priming the first-strand cDNA synthesis reaction in RNA-Seq (Illumina TruSeq kit)?
            In the FastQC report, the per base sequence content looks much better after the first 13 bp. I wonder whether the bias is caused by priming with random hexamers.
            Our thinking was that, since the hexamers are short random sequences supplied as a mix, they bind randomly to the fragment, maybe with some mismatches, at different positions. The ones that bind at the beginning of the sequence produce the desired strand; the others generate short sequences that are lost in the subsequent purification steps. It would not be unexpected to see biases in per-base nucleotide content in the first 6 bases of the read, but what about the next 7 bases? The bias in the first 13 bp is probably generated by hexamer dimers! These are all hypotheses that need to be validated.
            Maybe part of the problem is due to the sequencer, which may amplify the signal too much when it meets several bases one after another.
            I read something about this here "http://ethanomics.wordpress.com/2012/03/12/more-thoughts-on-the-truseq-rna-sample-prep-kit/" and there "http://nar.oxfordjournals.org/content/38/12/e131.short"
            Can you tell me something more?
            Thanks for any help anyone can provide!
