SEQanswers

08-17-2010, 08:31 AM   #1
PFS
Member
Location: USA
Join Date: Mar 2010
Posts: 55
Preprocessing needed for RNA-Seq data

Hi,

I am new to RNA-Seq data analysis and have a very basic question.

I need to analyze some RNA-Seq data (from Illumina).
Are there specific steps to take before aligning the reads and proceeding with the analysis? In other words, are there adapters to remove, or is any other type of trimming/filtering necessary?

Thanks in advance!

08-18-2010, 02:12 AM   #2
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 994

Usually, no. Unless you are sequencing small RNA or microRNA, your fragments will be longer than the reads, so there is no risk of having sequenced into the adapter at the opposite end, and hence no need for trimming.

Many people trim off bad-quality bases at the ends of reads. However, if you use an aligner that is aware of base-call qualities, this is not necessary, as the aligner will know to disregard or down-weight bad-quality base calls. The aligner will flag alignments that are dubious due to bad base-call quality by reporting a low alignment quality. I would hence filter after alignment, based on alignment quality.

If your aligner is not aware of quality scores, you should of course filter beforehand.

Simon
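
To illustrate the post-alignment filtering suggested above, here is a minimal Python sketch that keeps only alignments with a sufficiently high mapping quality (MAPQ, column 5 of a SAM record). The function name and the threshold of 10 are assumptions made for illustration, not values from this thread; MAPQ scales differ between aligners, so a sensible cutoff depends on the tool used.

```python
def filter_sam_by_mapq(lines, min_mapq=10):
    """Yield SAM header lines and mapped alignments with MAPQ >= min_mapq.

    min_mapq=10 is an illustrative assumption, not a recommended value.
    """
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            # Not a complete SAM record; skip it.
            continue
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x4:
            # Bit 0x4 marks an unmapped read.
            continue
        if mapq >= min_mapq:
            yield line


example_sam = (
    "@HD\tVN:1.0\tSO:unsorted\n"
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\t####\n"
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)
kept = list(filter_sam_by_mapq(example_sam.splitlines(keepends=True)))
```

In this toy input, r2 is dropped for its low MAPQ and r3 because it is unmapped. For real BAM files, a dedicated tool or library would normally be used rather than parsing text by hand.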

08-18-2010, 01:56 PM   #3
Lee Sam
Member
Location: Ann Arbor, MI
Join Date: Oct 2008
Posts: 57

You might want to run FastQC to double check the quality of your reads. Just a thought.

EDIT: FastQC Link. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

Last edited by Lee Sam; 08-18-2010 at 04:52 PM.

10-30-2013, 02:00 AM   #4
Jane M
Senior Member
Location: Paris
Join Date: Aug 2011
Posts: 239

Hello everybody,

I am coming back to this topic to discuss several questions about the preprocessing of RNA-Seq reads. I have not found much information, so apologies if I am repeating questions that have already been asked.

I will analyze 14 paired-end RNA-Seq samples with 100 bp reads. The aim is to perform differential gene expression analysis and to detect fusion genes and novel transcripts. For the alignment, I will provide a reference transcriptome. Tophat2 first aligns the reads to this reference transcriptome, then aligns the unmapped reads to the genome, and finally segments the remaining reads. However, I will use the option --transcriptome-only, which only aligns the reads to the transcriptome.

I have several questions about the preprocessing steps I have to perform before the alignment.
  1. Do you perform preprocessing systematically, or do you check with FastQC first to decide whether preprocessing is needed?
    I am wondering whether I can preprocess only some of my 14 samples (those with lower quality), or whether I should do it for all of them.
  2. Which tools do you use to preprocess the reads?
    I plan to use Trimmomatic, but there are several others: cutadapt, PRINSEQ, ...
  3. I read in this forum that when aligning to a reference transcriptome, removing adapters is pointless because the adapters won't align to the transcriptome.
    Do you agree with that?
  4. The "Per base sequence content" graph of FastQC shows how many bases could/should be removed from the start of each read. Do you perform this step? I read that it is controversial.
  5. Is there a consensus, or can you please tell me which thresholds you use for the following points:
    - trimming reads with a sliding window? (I use a 6 bp window with a minimum mean quality of 20)
    - cutting bases off the start or end of a read if below a given quality? (I use 20)
    - minimum length? (I use 36)
    - average quality of the remaining read? (I use 20)
  6. Last question: do you remove duplicates (with Picard, for example) after the alignment step?
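
For readers unfamiliar with the sliding-window and average-quality thresholds in point 5, the following Python sketch shows roughly what such a filter does. It is a simplified illustration, not a reimplementation of Trimmomatic or any other tool: the function names are hypothetical, Phred+33 quality encoding is assumed, and the read is simply cut at the start of the first window whose mean quality falls below the threshold.

```python
def sliding_window_trim(seq, qual, window=6, min_mean_q=20, offset=33):
    """Cut the read at the first window whose mean quality is too low.

    Assumes Phred+33 encoded qualities. window=6 and min_mean_q=20 mirror
    the values quoted in the post above.
    """
    scores = [ord(c) - offset for c in qual]
    n = len(scores)
    if n == 0:
        return seq, qual
    w = min(window, n)  # reads shorter than the window form one window
    for start in range(n - w + 1):
        if sum(scores[start:start + w]) / w < min_mean_q:
            # Simplification: everything from the window start is dropped.
            return seq[:start], qual[:start]
    return seq, qual


def mean_quality(qual, offset=33):
    """Average Phred score of a quality string (0.0 for an empty read)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
trimmed = sliding_window_trim("ACGTACGTACGT", "IIIIII######")
```

Note that cutting at the window start also discards a few good bases just before the quality drop; real trimmers handle this boundary in their own ways.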

It would be great if you could advise me on some of these points. There are a lot of parameters to define, and I have little experience with RNA-Seq data for the moment.

Thank you in advance,
Jane

10-30-2013, 02:16 AM   #5
dpryan
Devon Ryan
Location: Freiburg, Germany
Join Date: Jul 2011
Posts: 3,480

  1. Yes, I adapter and quality trim everything prior to alignment.
  2. I ended up writing my own tool that does only what I need (which makes things faster), but otherwise Trimmomatic and trim_galore are quite good.
  3. No, that's wrong. You can have the aligner try to soft-clip the adapter off, but that won't happen with the default settings in tophat2. If you leave much adapter on a read, it'll likely just tank its alignment score incorrectly.
  4. By this I assume that you're referring to the "random hexamer priming" effect. There's no need to trim those off. The priming isn't really random, but those bases are still correct.
  5. I usually trim bases off both ends with qualities <20. A minimum length of somewhere between 20 and 36 is fine (I have computational resources to throw at the alignment, so having that take slightly longer isn't a problem).
  6. Not for normal differential expression analysis (it'd be incorrect to do so). Only if you're going to be calling SNPs or something similar will you need to remove/mark duplicates.
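
The end trimming and minimum-length filter from point 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, qualities are assumed to be Phred+33 encoded, and reads are handled as single-end (name, sequence, quality) tuples.

```python
def trim_ends(seq, qual, min_q=20, offset=33):
    """Remove bases with quality < min_q from both ends of a read.

    Low-quality bases in the interior of the read are left untouched.
    """
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


def trim_and_filter(records, min_q=20, min_len=36):
    """Trim each (name, seq, qual) record and drop reads shorter than min_len."""
    for name, seq, qual in records:
        new_seq, new_qual = trim_ends(seq, qual, min_q)
        if len(new_seq) >= min_len:
            yield name, new_seq, new_qual


reads = [
    ("r1", "A" * 40, "I" * 40),             # all Q40, kept as is
    ("r2", "A" * 40, "#" * 10 + "I" * 30),  # 10 bases of Q2 at the 5' end
]
kept_names = [name for name, _, _ in trim_and_filter(reads)]
```

With paired-end data, a read dropped by the length filter leaves its mate unpaired, so a real pipeline has to keep the two files in sync or write orphaned mates to a separate file.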

10-30-2013, 04:57 AM   #6
Jane M
Senior Member
Location: Paris
Join Date: Aug 2011
Posts: 239

Thank you very much for your answers, dpryan!

Quote:
Originally Posted by dpryan
3. No, that's wrong. You can have the aligner try to soft-clip the adapter off, but that won't happen with the default settings in tophat2. If you leave much adapter on a read, it'll likely just tank its alignment score incorrectly.

OK, I will remove the adapters then.

Quote:
4. By this I assume that you're referring to the "random hexamer priming" effect. There's no need to trim those off. The priming isn't really random, but those bases are still correct.

Yes, that is what I meant.

Quote:
6. Not for normal differential expression analysis (it'd be incorrect to do so). If you're going to be calling SNPs or something like that only then will you need to remove/mark duplicates.

OK, that is what I had heard for DE.
I won't do SNP detection, but rather detection of novel transcripts (with Cufflinks/Cuffcompare) and detection of fusion genes (with tophat --fusion-search and tophat-fusion-post). I don't know whether I should remove duplicates for these purposes...

Jane

03-06-2014, 02:34 AM   #7
super0925
Senior Member
Location: UK
Join Date: Feb 2014
Posts: 206

Sorry, I am confused about this topic.
Hi all,
I am a rookie in RNA-Seq. I will be getting some human RNA-Seq FASTQ files from Ion Proton and some cow RNA-Seq FASTQ files from Illumina.

1. What should I do first? This seems controversial: some people say preprocessing is needed, and others say it is not.

2. Do I need to remove adapters first? And what about trimming/filtering: which parameters do I need to know, and which software do you recommend?

3. After Tophat, I found that the mapping rate is always ~60%. Is that too low? Do I need to realign the unmapped reads from the Tophat output before going on to the downstream analysis (e.g. Cufflinks, edgeR, DESeq or Cuffdiff)? Or is there a better choice of aligner?
Thank you!

Last edited by super0925; 03-06-2014 at 04:47 AM.

03-06-2014, 04:34 AM   #8
relipmoc
Member
Location: Los Angeles, CA
Join Date: Jul 2011
Posts: 58

This topic may give you an example of the effect of adapter trimming on RNA-Seq downstream analysis. http://seqanswers.com/forums/showthread.php?t=40926


03-06-2014, 04:58 AM   #9
super0925
Senior Member
Location: UK
Join Date: Feb 2014
Posts: 206

Quote:
Originally Posted by relipmoc
This topic may give you an example of the effect of adapter trimming on RNA-Seq downstream analysis. http://seqanswers.com/forums/showthread.php?t=40926

So do you mean that preprocessing is essential?
Apart from FastQC for visualizing read quality, what software do you recommend? Also, I asked three questions above; could you please help me answer them?
Thank you!

03-06-2014, 05:30 AM   #10
relipmoc
Member
Location: Los Angeles, CA
Join Date: Jul 2011
Posts: 58

Quote:
Originally Posted by super0925
Sorry, I am confused about this topic.
Hi all,
I am a rookie in RNA-Seq. I will be getting some human RNA-Seq FASTQ files from Ion Proton and some cow RNA-Seq FASTQ files from Illumina.

I'm not familiar with Ion Proton RNA-Seq data. For Illumina data, though, you can decide whether to do adapter trimming based on the Kmer Content plot of FastQC.

Quote:
Originally Posted by super0925
1. What should I do first? This seems controversial: some people say preprocessing is needed, and others say it is not.

I suggest running FastQC etc. first.

Quote:
Originally Posted by super0925
2. Do I need to remove adapters first? And what about trimming/filtering: which parameters do I need to know, and which software do you recommend?

I recommend skewer, which is a new trimming tool. Other widely accepted tools are Trimmomatic, cutadapt, Flexbar, Trim Galore!, AdapterRemoval, Btrim, etc.

Quote:
3. After Tophat, I found that the mapping rate is always ~60%. Is that too low? Do I need to realign the unmapped reads from the Tophat output before going on to the downstream analysis (e.g. Cufflinks, edgeR, DESeq or Cuffdiff)? Or is there a better choice of aligner?

In the thread linked in my reply above, you can see that different trimming strategies can lead to different Tophat mapping rates (81.3% vs 64.3% in that example).
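
Since mapping rates are being compared here, the following Python sketch shows one way such a rate could be computed from SAM records. It is an illustration under stated assumptions: the function name is hypothetical, and the input is assumed to be a single SAM stream containing both mapped and unmapped reads. Aligners that write unmapped reads to a separate file need the two files combined first, and each aligner's own summary report may count reads differently.

```python
def mapping_rate(sam_lines):
    """Fraction of primary SAM records that are mapped.

    Secondary (0x100) and supplementary (0x800) records are skipped so
    that multi-mapped reads are not counted more than once. For paired-end
    data, each mate is counted as its own record.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line or line.startswith("@"):
            continue  # skip blank and header lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:  # 0x4 marks an unmapped read
            mapped += 1
    return mapped / total if total else 0.0


toy_records = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1",
    "r2\t16\tchr1",
    "r3\t4\t*",
    "r1\t256\tchr2",  # secondary alignment of r1, ignored
    "r4\t0\tchr1",
]
rate = mapping_rate(toy_records)
```

Computing the rate the same way before and after trimming makes the comparison between trimming strategies more meaningful than mixing numbers from different tools.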

03-06-2014, 08:36 AM   #11
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by super0925
3. After Tophat, I found that the mapping rate is always ~60%. Is that too low? Do I need to realign the unmapped reads from the Tophat output before going on to the downstream analysis (e.g. Cufflinks, edgeR, DESeq or Cuffdiff)? Or is there a better choice of aligner?
Thank you!

BBMap is a splice-aware aligner for DNA/RNA-seq with much higher sensitivity than TopHat/Bowtie2; it will align substantially more reads.

For RNA-seq, the command would be something like this:

(to index)
bbmap.sh ref=genome.fasta

(to map)
bbmap.sh in=reads.fq out=mapped.sam maxindel=100000 xstag=fs intronlen=10
(for paired reads in 2 files, use "in1=" and "in2=")
This will generate XS tags, used by Cufflinks, according to the first strand protocol; the alternatives are 'ss' for second strand and 'us' for unstranded. If you don't know the library protocol then use 'us'.

You can also add additional flags to the mapping stage, such as:
qtrim=rl trimq=10

...which will quality-trim the left and right ends of a read to Q10 before mapping. This is helpful for low-quality libraries.
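
When the same mapping command has to be run for many samples, the options above can be assembled programmatically. The Python sketch below is a hypothetical helper (the function name and its defaults are assumptions); it only uses the bbmap.sh parameters already mentioned in this post and builds the argument list without running anything.

```python
def bbmap_map_command(out, in1, in2=None, xstag="us",
                      maxindel=100000, intronlen=10,
                      qtrim=None, trimq=None):
    """Build the bbmap.sh mapping command as a list of arguments.

    xstag must be 'fs' (first strand), 'ss' (second strand) or
    'us' (unstranded), as described in the post above.
    """
    if xstag not in ("fs", "ss", "us"):
        raise ValueError("xstag must be 'fs', 'ss' or 'us'")
    args = ["bbmap.sh"]
    if in2 is None:
        args.append(f"in={in1}")
    else:
        # Paired reads in two files use in1= and in2=.
        args += [f"in1={in1}", f"in2={in2}"]
    args += [
        f"out={out}",
        f"maxindel={maxindel}",
        f"xstag={xstag}",
        f"intronlen={intronlen}",
    ]
    # Optional quality trimming before mapping, e.g. qtrim='rl', trimq=10.
    if qtrim is not None:
        args.append(f"qtrim={qtrim}")
    if trimq is not None:
        args.append(f"trimq={trimq}")
    return args


single_end = bbmap_map_command("mapped.sam", "reads.fq", xstag="fs")
paired_trimmed = bbmap_map_command("mapped.sam", "r1.fq", in2="r2.fq",
                                   qtrim="rl", trimq=10)
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where bbmap.sh is on the PATH; the file names here are placeholders.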