SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Trim FastQ nxtgenkid10 Bioinformatics 7 05-27-2014 05:40 PM
Trim adapter using CLCBio azleen Bioinformatics 1 04-02-2013 05:37 AM
DGE - filter or not filter masterpiece Bioinformatics 0 07-11-2011 08:55 PM
Newbler Trim Status blindtiger454 De novo discovery 2 05-18-2011 04:46 AM
Do I need to trim the sequences like this? days369 Bioinformatics 4 08-16-2010 08:19 PM

Reply
 
Thread Tools
Old 02-12-2014, 08:40 AM   #1
zzhao2
Member
 
Location: Dallas

Join Date: Aug 2012
Posts: 21
Default Trim/filter or not?

Hello,
I have some paired-end RNA-Seq reads (length=100) sequenced using Illumina's HiSeq platform and was wondering whether I should trim/ilter my reads. I've run FastQC with some of my samples and the immages are attached. Here are my questions:
1) The per_base_n_content plots show many Ns in the last (about 10 or so) positions. Is this normal? Do I need to trim the end of reads if I use these reads for differential expression analysis? What if I use them for variant calling? If trimming is necessary, what is a good strategy? To trim a fixed number of bases, to trim by quality score, or to trim if there's an N and the quality is low?
2) The duplication levels of my samples are all over 50%. I can use Picard to mark those caused by PCR amplification or something else. But how reliable is that? Is it safe to remove duplicates marked by that software?
3) There are some adapter sequences in my reads but the percentage is low (from 0.1% to <1%). Is it necessary to remove them?
I actually have two independent sets of data sequenced by our NGS core facility. The samples were prepared by different people for different purposes at different times, but both data sets have similar FastQC results. If they both look unusual, could it be that there was something worng with the core facility?

Thanks,
Sylvia
Attached Images
File Type: png duplication_levels.png (17.2 KB, 17 views)
File Type: png per_base_gc_content.png (12.1 KB, 16 views)
File Type: png per_base_n_content.png (11.2 KB, 19 views)
File Type: png per_base_quality.png (10.0 KB, 23 views)
File Type: png per_base_sequence_content.png (23.5 KB, 20 views)
zzhao2 is offline   Reply With Quote
Old 02-12-2014, 09:10 AM   #2
jwfoley
Senior Member
 
Location: Stanford

Join Date: Jun 2009
Posts: 181
Default

Quote:
The per_base_n_content plots show many Ns in the last (about 10 or so) positions. Is this normal?
On Illumina machines, base quality decreases as you get farther into the read, so yes.

Quote:
Do I need to trim the end of reads if I use these reads for differential expression analysis?
Any sane aligner should handle these properly, but mild trimming is always a good idea.

Quote:
If trimming is necessary, what is a good strategy? To trim a fixed number of bases, to trim by quality score, or to trim if there's an N and the quality is low?
The only fixed number of bases it makes sense to trim are the six that came from your random hexamer primer, if that's what you did, since those are prone to mismatches. Otherwise set your trim threshold to Q5 or so (see http://genomebio.org/is-trimming-is-...al-in-rna-seq/). This should also remove the N's, which are low-quality by definition.

Quote:
The duplication levels of my samples are all over 50%. I can use Picard to mark those caused by PCR amplification or something else. But how reliable is that? Is it safe to remove duplicates marked by that software?
Do you have some reason to think your library is overamplified? E.g. did you do more than 15 cycles of PCR? Or, did you sequence to high depth? (That would cause "duplicate" reads by chance.)

Inspect the read distribution in a browser and see if it looks spikey. Or, since you have paired-end reads (unnecessary for differential expression, FYI), confirm that both R1s and R2s are identical, which is very unlikely to happen by chance, before you remove any reads.

Quote:
There are some adapter sequences in my reads but the percentage is low (from 0.1% to <1%). Is it necessary to remove them?
Only if you don't feel like wasting <1% of your data. It's easy to fix with e.g. Trimmomatic so why wouldn't you?

Quote:
If they both look unusual, could it be that there was something worng with the core facility?
Core facilities are consistent but sometimes that means consistently wrong. E.g. mine very consistently selects ~175 bp inserts for 2x100 sequencing, which is at best a creative use of paired-end sequencing that requires special bioinformatics to take advantage of, but more likely a simple oversight.
jwfoley is offline   Reply With Quote
Old 02-12-2014, 09:42 AM   #3
zzhao2
Member
 
Location: Dallas

Join Date: Aug 2012
Posts: 21
Default

HI jwfoley,
Thank you so much for the informative reply. So it sounds like I can do a Q5 trimming as well as removing adapters. I don't know the details about library preparation, so I'll ask my colleagues how they did it.
Is the "random hexamer primer" the barcode? If so, the core facility should have removed them.
As for the duplication, since my reads are paired-end, if both the left and right duplicated reads are exactly the same, they are unlikely to be real. Is this understanding correct? Also, will Picard or whatever software mark only this kind of duplicates?
My reads' map rate by TopHat is only around 60%. I don't know why and whether this needs to be improved. I understand that simply imcreasing the map rate doesn't necessarily give a better differential expression result.

Thank you,
Sylvia
zzhao2 is offline   Reply With Quote
Old 02-12-2014, 09:49 AM   #4
jwfoley
Senior Member
 
Location: Stanford

Join Date: Jun 2009
Posts: 181
Default

Quote:
Is the "random hexamer primer" the barcode? If so, the core facility should have removed them.
No, most likely your barcodes are in separate reads, per the current Illumina system. The random hexamer I'm talking about is the reverse-transcription primer, in the most common RNA-seq library protocols. If you used one of those, it is probably found in the first six bases of both reads if your library is nondirectional ("fr-unstranded" in TopHat) or the first six bases of R1 if it's from the common directional protocol ("fr-firststrand"). http://onetipperday.blogspot.ca/2012...pe-to-use.html

Quote:
As for the duplication, since my reads are paired-end, if both the left and right duplicated reads are exactly the same, they are unlikely to be real. Is this understanding correct?
Yes, that's what I'm saying. True PCR clones will be identical at both ends.

Quote:
Also, will Picard or whatever software mark only this kind of duplicates?
I don't know.

Quote:
My reads' map rate by TopHat is only around 60%. I don't know why and whether this needs to be improved.
It's a bit low but see what happens when you trim low-quality bases and clip off adapter sequence.
jwfoley is offline   Reply With Quote
Old 02-12-2014, 11:55 AM   #5
Bukowski
Senior Member
 
Location: UK

Join Date: Jan 2010
Posts: 390
Default

Quote:
Originally Posted by jwfoley View Post
On Illumina machines, base quality decreases as you get farther into the read, so yes.
Any sane aligner should handle these properly, but mild trimming is always a good idea.
True, but I've never had data come back with that profile at the end of the reads, RNA-Seq or otherwise, not in thousands of samples - so it is not as far as I am concerned 'normal'.
Bukowski is offline   Reply With Quote
Old 02-12-2014, 12:07 PM   #6
jwfoley
Senior Member
 
Location: Stanford

Join Date: Jun 2009
Posts: 181
Default

Fair enough. It's not normal but it's not unheard of.

It certainly wouldn't be unreasonable to toss all those positions out of an abundance of caution, but I suspect trimming by quality would have the same effect. The question is whether you believe the non-N bases in those positions.

Last edited by jwfoley; 02-12-2014 at 12:15 PM.
jwfoley is offline   Reply With Quote
Old 02-13-2014, 07:54 AM   #7
zzhao2
Member
 
Location: Dallas

Join Date: Aug 2012
Posts: 21
Default

I tried the Q5 trimming using FastX with a typical sample, but no reads were trimmed. I think it's because all bases in my reads have a score higher than 5. So why did I get those Ns at the end of reads, and scores of those Ns are not low?

Last edited by zzhao2; 02-13-2014 at 08:20 AM.
zzhao2 is offline   Reply With Quote
Old 02-13-2014, 10:05 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

You could CROP the sequences using trimmomatic to get rid of the N's.

That said it is odd that you have so many N's at the end of the reads. That almost certainly indicates that there was some technical problem with the flowcell this sample ran on. You may want to check with your sequence provider to see if something happened.
GenoMax is offline   Reply With Quote
Old 02-13-2014, 01:19 PM   #9
zzhao2
Member
 
Location: Dallas

Join Date: Aug 2012
Posts: 21
Default

Well, I've used trimmomatic to get rid of adapters, N's and low quality bases for the sample shown in those plots in my initial post. I used the example parameter settings in the trimmomatic's manual for paired-end data i.e.

ILLUMINACLIP:adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Then I've rerun tophat and got an increased map rate at about 70% (it used to be about 60%). I checked the trimmed reads again using FastQC and now the per_base_n_content plot looks normal. However, the per_base_gc_content is still like before i.e. the last (3~5? can't tell exactly from the plot) bases' gc% are very different. Please see attached images. Do I need to crop the last a few bases off?

Also, my first 10-15 bases' gc% are fluctuating as well, but it seems that people think this is quite normal. so I don't need to crop them, right?
Attached Images
File Type: png per_base_gc_content.png (12.1 KB, 9 views)
File Type: png per_base_n_content.png (7.8 KB, 6 views)
File Type: png duplication_levels.png (16.1 KB, 6 views)
File Type: png kmer_profiles.png (66.1 KB, 8 views)
File Type: png per_base_sequence_content.png (23.4 KB, 13 views)
zzhao2 is offline   Reply With Quote
Old 02-13-2014, 01:35 PM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Quote:
Originally Posted by zzhao2 View Post
Also, my first 10-15 bases' gc% are fluctuating as well, but it seems that people think this is quite normal. so I don't need to crop them, right?
That is correct, specially for RNA-seq data.

I still feel that there was something technically wrong with this run (see how the A signal just drops to zero). If you are able to, then ask your sequence provider about any problems during the run, before saying anything about what you are observing.
GenoMax is offline   Reply With Quote
Old 02-14-2014, 06:56 AM   #11
zzhao2
Member
 
Location: Dallas

Join Date: Aug 2012
Posts: 21
Default

Quote:
Originally Posted by GenoMax View Post
I still feel that there was something technically wrong with this run (see how the A signal just drops to zero).
Agree. And as I have mentioned, I have two independent sets of data and both have similar FastQC plots. Maybe our core facility consistently has some problem. I'll check with them.

Thank you all so much for your help!
zzhao2 is offline   Reply With Quote
Old 02-14-2014, 06:57 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Quote:
Originally Posted by zzhao2 View Post
Agree. And as I have mentioned, I have two independent sets of data and both have similar FastQC plots. Maybe our core facility consistently has some problem. I'll check with them.

Thank you all so much for your help!
The question is are these independent samples that were at different times on separate flowcells? If that was the case and you are still seeing that pattern then it may indicate that the problem may be with your libraries.
GenoMax is offline   Reply With Quote
Old 02-14-2014, 07:42 AM   #13
zzhao2
Member
 
Location: Dallas

Join Date: Aug 2012
Posts: 21
Default

Quote:
Originally Posted by GenoMax View Post
The question is are these independent samples that were at different times on separate flowcells? If that was the case and you are still seeing that pattern then it may indicate that the problem may be with your libraries.
My two colleagues did their RNA-Seq experiments at different times, and they study different problems. One had 6 samples sequenced maybe 6 months ago (let me call it exp1), and the other had 18 samples sequenced 2 months ago (let me call it exp2). Honestly I know nothing about the details of their experiments. They sent their samples to our core facility and I just downloaded the data for analysis. Then I randomly checked 3 samples in exp2 and 1 sample in exp1, and their FastQC plots all look similar to those attached in my initial post. Based on this description, can you help identify where the problem could be? Or if more information is needed, please just let me know.

Another question is the duplication level.The duplication levels of all the random samples that I've checked are over 50%. Is >50% normal for RNA-Seq? If not, how to avoid it in the future experiment design?
zzhao2 is offline   Reply With Quote
Old 02-14-2014, 07:57 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

I was only referring to the drop in A signal based on the plots you had posted. It would be odd to have that pattern repeat on samples done months apart.

Just to confirm: These samples have already been demultiplexed, correct? The only time I remember seeing that sort of pattern at the end of the read is for the "tag".

See Simon's note about the duplication plots in FastQC: http://proteo.me.uk/2013/09/a-new-wa...-fastqc-v0-11/ and the original interpretation: http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/
GenoMax is offline   Reply With Quote
Old 02-14-2014, 07:58 AM   #15
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

I have only just looked at the FastQC plots of your data, and I don't think it looks all that abnormal.

Try running FastQC using the --nogroup parameter, it will let you see how many bases at the end of the read are affected, it could be only the last base, which often has a much lower quality than the rest of the read.

As for the trimmomatic trimming, how long are the adapter sequences you are using for palindrome trimming?
With a threshold score of 30 for palindrome trimming, if each matching base adds 0.6 to the score (see the trimmomatic web page), unless the sequences in your adapter.fasta file are quite long, trimmomatic will not recognise and trim the adapter sequences.

You can use grep to see how many adapter sequences are in your reads before and after trimmomatic trimming.
mastal is offline   Reply With Quote
Old 02-14-2014, 08:00 AM   #16
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

Depending on the type of libraries that were made high duplication levels can be normal. In total RNA preps, rRNAs will be so abundant that they will show lots of duplicates. Many highly abundant mRNA transcripts will cause the same thing in other types of preps. Often times if you take overrepressented sequences from fastqc and blast them, you’re going to find rRNAs or even other transcripts at times. Remember fastqc only checks the first 50bp of 200K reads for its duplication check and its only taking into account one side of the pair. Picardtools markduplicates will give you the kind of duplication checks you really need.
Wallysb01 is offline   Reply With Quote
Old 02-14-2014, 08:05 AM   #17
zzhao2
Member
 
Location: Dallas

Join Date: Aug 2012
Posts: 21
Default

Quote:
Originally Posted by GenoMax View Post
Just to confirm: These samples have already been demultiplexed, correct? The only time I remember seeing that sort of pattern at the end of the read is for the "tag".
What I got from the core facility is the separate fastq files, a R1 and a R2 file per sample. So I assume that the samples have been demultiplexed. If in case the tags are left in the sequences, is there anyway to check it?

I'll read Simon's article about duplication level. Thank you!
zzhao2 is offline   Reply With Quote
Old 02-14-2014, 08:15 AM   #18
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Quote:
Originally Posted by zzhao2 View Post
What I got from the core facility is the separate fastq files, a R1 and a R2 file per sample. So I assume that the samples have been demultiplexed. If in case the tags are left in the sequences, is there anyway to check it?
Tags won't be there is you received separate R1/R2 files for each sample. You may have some adapter left (if the inserts were short) but you have already done trimming, so most should be gone. Consider @mastal's advice a couple of posts up about trimmomatic, in case some adapters are left.
GenoMax is offline   Reply With Quote
Old 02-14-2014, 08:52 AM   #19
zzhao2
Member
 
Location: Dallas

Join Date: Aug 2012
Posts: 21
Default

Quote:
Originally Posted by mastal View Post
Try running FastQC using the --nogroup parameter, it will let you see how many bases at the end of the read are affected, it could be only the last base, which often has a much lower quality than the rest of the read.
I've run FastQC with --nogroup. This is very helpful. I see 5 or 6 bases with abnormal GC. Please see attached plots.
Quote:
As for the trimmomatic trimming, how long are the adapter sequences you are using for palindrome trimming?
With a threshold score of 30 for palindrome trimming, if each matching base adds 0.6 to the score (see the trimmomatic web page), unless the sequences in your adapter.fasta file are quite long, trimmomatic will not recognise and trim the adapter sequences.
I used the Illumina's TruSeq3-PE-2.fa file provided by trimmomatic. The sequences there are 34 bases long.

Quote:
You can use grep to see how many adapter sequences are in your reads before and after trimmomatic trimming.
Just by searching the 34-base sequences in the trimmomatic's adapter file, I found only <10 exact matches in both my raw reads and trimmed reads. But trimmomatic reported that thousands of reads were trimmed during a test run which trimmed adapters only.
zzhao2 is offline   Reply With Quote
Old 02-14-2014, 08:56 AM   #20
zzhao2
Member
 
Location: Dallas

Join Date: Aug 2012
Posts: 21
Default

Quote:
Originally Posted by Wallysb01 View Post
Picardtools markduplicates will give you the kind of duplication checks you really need.
Do you mean that it's safe to remove duplicates marked by Picard? I think it's easier to check PCR duplicates for PE reads. What about SE reads? Is Picard equally reliable or not?
zzhao2 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:07 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO