  • Preprocessing needed for RNA-Seq data

    Hi,

    I am new to the RNA-Seq data analysis and I have a very basic question.

    I need to analyze some RNA-Seq data (from Illumina).
    Are there specific steps to take before aligning the reads and proceeding with the analysis? In other words, are there adapters to remove, or is any other kind of trimming/filtering necessary?

    Thanks in advance!

  • #2
    Usually, no. Unless you do small RNA or microRNA, your fragments will be longer than the reads so that there is no risk that you have sequenced into the adapter at the opposite end, hence no need for trimming.

    Many people trim off low-quality bases at the ends of reads. However, if you use an aligner that is aware of base-call qualities, this is not necessary, as the aligner will know to disregard or down-weight bad-quality base calls. The aligner will flag alignments which are dubious due to bad base-call quality by indicating a low alignment quality. I would hence filter after alignment, based on alignment quality.

    If your aligner is not aware of quality scores, you should filter beforehand, of course.
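
As an illustration of filtering after alignment, here is a minimal Python sketch that keeps only SAM records whose mapping quality (MAPQ, the fifth SAM column) reaches a threshold. The function name `filter_by_mapq` and the threshold of 20 are assumptions made for this example; aligners use different MAPQ scales, so the cutoff is a placeholder rather than a recommendation.

```python
def filter_by_mapq(sam_lines, min_mapq=20):
    """Yield header lines plus alignments whose MAPQ (SAM column 5) is >= min_mapq.

    min_mapq=20 is an illustrative placeholder, not a recommended value.
    """
    for line in sam_lines:
        if line.startswith("@"):      # header lines are passed through untouched
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:
            yield line


# Toy input: one header line, one confident alignment, one dubious alignment.
sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr1\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\t##########",
]
kept = list(filter_by_mapq(sam))
```

In practice the same filter is usually applied with an existing SAM/BAM tool rather than hand-written code; the sketch only shows which field the decision is based on.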

    Simon

    • #3
      You might want to run FastQC to double check the quality of your reads. Just a thought.

      EDIT: FastQC Link. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
      Last edited by Lee Sam; 08-18-2010, 04:52 PM.

      • #4
        Hello everybody,

        I am coming back to this topic to discuss several questions about preprocessing RNA-Seq reads. I have not found much information, so sorry if some of these questions have already been asked.

        I will analyze 14 paired-end RNA-Seq samples with 100 bp reads. The aim is to perform differential gene expression analysis and to detect fusion genes and novel transcripts. For the alignment, I will provide a reference transcriptome. Tophat2 first aligns the reads to this reference transcriptome, then aligns the unmapped reads to the genome; finally, the remaining reads are segmented. However, I will use the option --transcriptome-only, which only aligns the reads to the transcriptome.

        I have several questions about the preprocessing steps I have to perform before the alignment.
        1. Do you preprocess systematically, or do you check with FastQC to decide whether or not to preprocess?
          I am wondering whether I can preprocess only some of my 14 samples (those with lower quality), or whether I should do it for all of them.
        2. Which tools do you use to preprocess the reads?
          I plan to use Trimmomatic, but there are several tools: cutadapt, PRINSEQ, ...
        3. I read in this forum that when aligning to a reference transcriptome, it is useless to remove adapters because the adapters won't align to the transcriptome.
          Do you agree with that?
        4. The "Per base sequence content" graph of FastQC shows how many bases could/should be removed from the start of each read. Do you perform this step? I read that it is controversial.
        5. Is there a consensus, or can you please tell me which thresholds you use for the following points:
          - to trim reads using a sliding window? (I use a 6 bp window with a minimum mean quality of 20)
          - to cut bases off the start or end of a read if below a given quality? (I use 20)
          - minimum length? (I use 36)
          - average quality of the remaining read? (I use 20)
        6. Last question: do you remove duplicates (with Picard, for example) after the alignment step?
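
To make the sliding-window threshold in question 5 concrete, here is a minimal Python sketch that scans a read from its 5' end and cuts it at the first window whose mean quality drops below the cutoff. The function name `sliding_window_trim` is hypothetical, and this is a simplified illustration of the general idea, not Trimmomatic's exact behavior.

```python
def sliding_window_trim(seq, quals, window=6, min_mean=20):
    """Cut the read at the start of the first window whose mean quality < min_mean.

    Simplified illustration only; real trimmers may differ in where exactly
    they cut. window=6 and min_mean=20 mirror the values quoted in the post.
    """
    if not quals:
        return seq, quals
    # Reads shorter than the window are evaluated as a single (short) window.
    last_start = max(len(quals) - window, 0)
    for start in range(last_start + 1):
        win = quals[start:start + window]
        if sum(win) / len(win) < min_mean:
            return seq[:start], quals[:start]
    return seq, quals


# Six good bases followed by six poor ones.
trimmed_seq, trimmed_quals = sliding_window_trim(
    "ACGTACGTACGT", [30] * 6 + [10] * 6
)
```

Note that with these settings the cut lands at position 4, before the last two good bases, because the window already averages below 20 once it overlaps enough poor bases. A minimum-length filter (36 in the post) would then be applied to the trimmed read.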


        It would be great if you could advise me on some of these points; there is a lot to decide and I have little experience with RNA-Seq data for the moment.

        Thank you in advance,
        Jane

        • #5
          1. Yes, I adapter and quality trim everything prior to alignment.
          2. I ended up writing my own that does only what I need (makes things faster), but otherwise trimmomatic and trim_galore are quite good.
          3. No, that's wrong. You can have the aligner try to soft-clip the adapter off, but that won't happen with the default settings in tophat2. If you leave much adapter on a read, it'll likely just tank its alignment score incorrectly.
          4. By this I assume that you're referring to the "random hexamer priming" effect. There's no need to trim those off. The priming isn't really random, but those bases are still correct.
          5. I usually trim bases off both ends with qualities <20. A minimum length of somewhere between 20 and 36 is fine (I have computational resources to throw at the alignment, so having that take slightly longer isn't a problem).
          6. Not for normal differential expression analysis (it'd be incorrect to do so). If you're going to be calling SNPs or something like that only then will you need to remove/mark duplicates.
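
The end-trimming rule from point 5 can be sketched in a few lines of Python. The function name `trim_ends` is hypothetical; the quality cutoff of 20 and the minimum length of 20 are taken from the values mentioned above. This is only an illustration of the rule, not the code of any particular trimming tool.

```python
def trim_ends(seq, quals, min_q=20, min_len=20):
    """Strip bases with quality < min_q from both ends of a read.

    Returns (seq, quals) for the kept part, or None if fewer than min_len
    bases remain. Thresholds are the illustrative values from the thread.
    """
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1                      # drop low-quality bases at the 5' end
    while end > start and quals[end - 1] < min_q:
        end -= 1                        # drop low-quality bases at the 3' end
    if end - start < min_len:
        return None                     # read too short after trimming
    return seq[start:end], quals[start:end]


# Two poor bases at each end of a 28-base read.
read_seq = "NN" + "ACGT" * 6 + "TT"
read_quals = [5, 5] + [35] * 24 + [10, 10]
result = trim_ends(read_seq, read_quals)
```

With a minimum length of 36 instead of 20, the same read would be discarded, which shows how the two thresholds interact.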

          • #6
            Thanks a lot for your answers, dpryan!

            Originally posted by dpryan
            3. No, that's wrong. You can have the aligner try to soft-clip the adapter off, but that won't happen with the default settings in tophat2. If you leave much adapter on a read, it'll likely just tank its alignment score incorrectly.
            Ok, I will remove the adapters then.

            4. By this I assume that you're referring to the "random hexamer priming" effect. There's no need to trim those off. The priming isn't really random, but those bases are still correct.
            Yes, that is what I meant.

            6. Not for normal differential expression analysis (it'd be incorrect to do so). If you're going to be calling SNPs or something like that only then will you need to remove/mark duplicates.
            Ok, that is what I heard for DE.
            I won't do SNP detection, but detection of novel transcripts (with Cufflinks/Cuffcompare) and detection of fusion genes (with tophat --fusion-search and tophat-fusion-post). I don't know whether I should remove duplicates for these purposes...

            Jane

            • #7
              Sorry, I am confused about this topic.
              Hi All,
              I am a rookie in RNA-seq and I will get some human RNA-Seq FASTQ files from Ion Proton, and cow RNA-Seq FASTQ files from Illumina.

              1. What should I do first? It is controversial: some say preprocessing is needed, others say it is not.

              2. Do I need to remove adapters first? And what about trimming/filtering: which parameters do I need to know, and which software do you recommend?

              3. After Tophat, I found that the mapping rate is always ~60%. Is that too low? Do I need to re-align the unmapped reads from the Tophat output before going on to the downstream analysis (e.g. Cufflinks, edgeR, DESeq or Cuffdiff)? Or is there a better choice for alignment?
              Thank you!
              Last edited by super0925; 03-06-2014, 05:47 AM.

              • #8
                This topic may give you an example of the effect of adapter trimming on RNA-Seq downstream analysis. http://seqanswers.com/forums/showthread.php?t=40926


                • #9
                  Originally posted by relipmoc
                  This topic may give you an example of the effect of adapter trimming on RNA-Seq downstream analysis. http://seqanswers.com/forums/showthread.php?t=40926
                  So do you mean that preprocessing is essential?
                  Besides FastQC to visualize the quality of the reads, what software do you recommend? Also, I asked 3 questions above; could you please help me answer them?
                  Thank you!

                  • #10
                    Originally posted by super0925
                    Sorry, I am confused about this topic.
                    Hi All,
                    I am a rookie in RNA-seq and I will get some human RNA-Seq FASTQ files from Ion Proton, and cow RNA-Seq FASTQ files from Illumina.
                    I'm not familiar with Ion Proton RNA-Seq data. But for Illumina data, you can decide whether to do adapter trimming based on the Kmer Content plot of FastQC.

                    Originally posted by super0925
                    1. What should I do first? It is controversial: some say preprocessing is needed, others say it is not.
                    I suggest running FastQC or a similar QC tool first.

                    Originally posted by super0925
                    2. Do I need to remove adapters first? And what about trimming/filtering: which parameters do I need to know, and which software do you recommend?
                    I recommend skewer, which is a new trimming tool. Other widely used tools include trimmomatic, cutadapt, flexbar, Trim Galore!, AdapterRemoval and Btrim.
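
For readers wondering what these adapter trimmers do at their core, here is a minimal Python sketch of 3' adapter removal with exact matching. It is not how skewer, cutadapt or any other listed tool is implemented (real tools tolerate mismatches and use quality information). The function name `trim_adapter` and the `min_overlap` parameter are hypothetical, and the adapter string is the commonly used Illumina adapter prefix, given as an example only; check the sequence for your own library kit.

```python
# Example adapter only (a commonly used Illumina adapter prefix); verify for your kit.
ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(read, adapter=ADAPTER, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching.

    Handles a full adapter occurrence anywhere in the read, or a prefix of
    the adapter (at least min_overlap bases) at the very end of the read.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]               # full adapter found: cut from its start
    # Otherwise look for a partial adapter at the read end, longest first.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read                         # no adapter detected


full = trim_adapter("ACGTACGT" + ADAPTER + "TTTT")   # complete adapter inside the read
partial = trim_adapter("ACGTACGT" + "AGATC")         # read ends inside the adapter
clean = trim_adapter("ACGTACGT")                     # no adapter present
```

The partial case is the one that matters when a fragment is only slightly shorter than the read length, so that only the first few adapter bases were sequenced.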

                    Originally posted by super0925
                    3. After Tophat, I found that the mapping rate is always ~60%. Is that too low? Do I need to re-align the unmapped reads from the Tophat output before going on to the downstream analysis (e.g. Cufflinks, edgeR, DESeq or Cuffdiff)? Or is there a better choice for alignment?
                    In the thread linked in my reply above, you can see that different trimming strategies can lead to different Tophat mapping rates (81.3% vs 64.3% in that example).
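
When comparing mapping rates such as these, it helps to be explicit about how the rate is computed. The Python sketch below derives it from the FLAG field of SAM records: a record is unmapped if bit 0x4 is set, and secondary (0x100) and supplementary (0x800) records are skipped so that each read is counted once. The function name `mapping_rate` is hypothetical, and the sketch assumes that mapped and unmapped reads are in the same SAM stream, which may not match how a given aligner writes its output.

```python
def mapping_rate(sam_lines):
    """Fraction of primary SAM records that are mapped (FLAG bit 0x4 unset).

    Assumes unmapped reads are present in the same input as mapped ones.
    """
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue                          # skip header lines
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue                          # skip secondary/supplementary records
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Toy records: mapped, unmapped, mapped on reverse strand, secondary alignment.
records = [
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t300\t50\t4M\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tchr2\t500\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
rate = mapping_rate(records)
```

Computing the rate the same way before and after trimming makes the two numbers directly comparable.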

                    • #11
                      Originally posted by super0925
                      3. After Tophat, I found that the mapping rate is always ~60%. Is that too low? Do I need to re-align the unmapped reads from the Tophat output before going on to the downstream analysis (e.g. Cufflinks, edgeR, DESeq or Cuffdiff)? Or is there a better choice for alignment?
                      Thank you!
                      BBMap is a splice-aware aligner for DNA/RNA-seq with much higher sensitivity than TopHat/Bowtie2; it will align substantially more reads.

                      For RNA-seq, the command would be something like this:

                      (to index)
                      bbmap.sh ref=genome.fasta

                      (to map)
                      bbmap.sh in=reads.fq out=mapped.sam maxindel=100000 xstag=fs intronlen=10
                      (for paired reads in 2 files, use "in1=" and "in2=")
                      This will generate XS tags, used by Cufflinks, according to the first strand protocol; the alternatives are 'ss' for second strand and 'us' for unstranded. If you don't know the library protocol then use 'us'.

                      You can also add additional flags to the mapping stage, such as:
                      qtrim=rl trimq=10

                      ...which will quality-trim the left and right ends of a read to Q10 before mapping. This is helpful for low-quality libraries.
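
To put the Q10 threshold in perspective, the Python sketch below decodes a FASTQ quality string and converts Phred scores into error probabilities using P = 10^(-Q/10). The function names are hypothetical, and the sketch assumes Phred+33 encoding (the Sanger / Illumina 1.8+ convention); older Illumina files may use a different offset.

```python
def phred33_to_scores(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 (ASCII offset 33)."""
    return [ord(c) - 33 for c in qual_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


scores = phred33_to_scores("I+5")     # 'I' -> 40, '+' -> 10, '5' -> 20
p_q10 = error_probability(10)         # 1 in 10 calls expected to be wrong
p_q20 = error_probability(20)         # 1 in 100 calls expected to be wrong
```

So trimming to Q10 removes only bases with an expected error rate above 10%, which is a much more lenient cutoff than the Q20 discussed earlier in the thread.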
