SEQanswers




05-22-2014, 08:05 PM   #1
ege
Junior Member
 
Location: USA

Join Date: Apr 2014
Posts: 3
Duplication levels

Hello all,
I'm currently analyzing single-end 50bp RNA-seq data that was sequenced at an outside facility. I've got a very naive question, since I'm relatively new to all this.

The facility provided me with what they call raw reads, which still contain sequencing adapters etc. In addition to those, I also have pre-processed "clean" reads. The details of the "cleaning", as they described them, are as follows:
Quote:
1. Remove reads with adaptor sequences.
2. Remove reads in which the percentage of unknown bases (N) is greater than 10%.
3. Remove low-quality reads. If the percentage of low-quality bases (bases with quality value ≤ 5) is greater than 50% in a read, we define this read as low quality.
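If I've understood these rules correctly, they amount to something like this Python sketch (the adapter sequence and the Phred+33 quality encoding are assumptions on my part; the facility's actual pipeline may differ):
Code:
ADAPTER = "AGATCGGAAGAGC"  # assumed Illumina adapter; substitute the real one

def passes_filters(seq, qual):
    """Return True if a read survives all three cleaning rules."""
    # Rule 1: drop reads containing the adapter sequence.
    if ADAPTER in seq:
        return False
    # Rule 2: drop reads with more than 10% unknown bases (N).
    if seq.upper().count("N") / len(seq) > 0.10:
        return False
    # Rule 3: drop reads in which more than 50% of bases have Q <= 5
    # (assuming Phred+33 encoded quality strings).
    low_q = sum(1 for c in qual if ord(c) - 33 <= 5)
    if low_q / len(seq) > 0.50:
        return False
    return True

def filter_fastq(path):
    """Yield (header, seq, qual) for reads that pass all three filters."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            fh.readline()                 # '+' separator line
            qual = fh.readline().rstrip()
            if passes_filters(seq, qual):
                yield header, seq, qual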
I've already used these for alignment and other downstream analyses, but just to make sure, I went ahead and quality-checked the "clean" fastq files with FastQC, which flags the sequence duplication levels as high (roughly >66% on average for each sample I have).

I think this is because the "cleaning" process enriches the fastq files for higher-quality data, but could this be due to some error during the library preparation step, or anything else? And does it even make sense to QC these processed fastq files?

Ege
05-22-2014, 08:53 PM   #2
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367

I don't see how the trimming would affect the duplicate levels.
High duplicate levels are due either to PCR overamplification or to a low-complexity library.

Without more information, it is not possible to tell whether the high duplicate levels are due to PCR overamplification, and therefore a problem, or to a low-complexity library, and therefore representative of the sample.

If the amount of starting RNA was low and/or the number of PCR cycles was high, one would suspect PCR overamplification.
If, when examining the aligned reads, one sees isolated sequences duplicated many times, one would also suspect PCR overamplification.
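One quick way to check this on a coordinate-sorted BAM is to count how many reads share an identical 5' start position; isolated positions with very deep stacks point toward PCR overamplification. A rough sketch using pysam (the file name and the stack threshold are placeholders, not recommendations):
Code:
import pysam
from collections import Counter

def start_position_stacks(bam_path, min_stack=100):
    """Count reads per (chrom, strand, 5' start) and report the deep stacks."""
    counts = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            # The 5' end of a reverse-strand read is its reference_end.
            pos = read.reference_end if read.is_reverse else read.reference_start
            counts[(read.reference_name, read.is_reverse, pos)] += 1
    return {key: n for key, n in counts.items() if n >= min_stack}

stacks = start_position_stacks("sample.sorted.bam")  # placeholder path
print(sorted(stacks.items(), key=lambda kv: -kv[1])[:20])  # 20 deepest stacks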

It can be tricky to distinguish whether high duplicate levels are due to PCR overamplification or to a low-complexity starting library, and the researcher may not always be expecting a low-complexity library. For example, I had an RNA-Seq sample of a cytoplasmic fraction with a high duplication rate: the library had been prepared using ribosomal depletion, and an RNA signalling molecule present in very high numbers in the cytoplasm had not been removed.

Sometimes, you need to really understand your samples to identify the cause of the high duplicate levels.
05-22-2014, 09:10 PM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

If coverage is high enough, you will have duplicates. Consider a 5 MB genome. At most you could have 100,000 unique 50bp reads; any more must be duplicates.

RNA-seq data often has some genes with super-high expression levels; if a gene has 1000x coverage with 50bp reads, then at least 95% of those reads must be duplicates, because unique reads can only reach 50x coverage. I think FastQC's warning assumes you have DNA data; I would ignore it.
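To spell out that arithmetic: if unique reads can contribute at most read-length-fold coverage, then at C-fold coverage at least a 1 - L/C fraction of the reads must be duplicates. A quick sanity check (ignoring strandedness):
Code:
def min_duplicate_fraction(coverage, read_len):
    """Lower bound on the duplicate fraction, assuming unique reads
    can contribute at most read_len-fold coverage."""
    return max(0.0, 1.0 - read_len / coverage)

print(min_duplicate_fraction(1000, 50))  # 0.95 -> at least 95% duplicates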

Duplicates often come from over-amplification with PCR too, but it's generally possible to determine the cause of the duplicates if you know what to look for. Mapping the reads and looking at them in IGV can help: high levels of PCR duplicates produce a distinctive patchy coverage pattern. Normally people don't remove duplicates from RNA-seq data, because that interferes with quantification; so if the duplicates are indeed from amplification, either ignore them or, if they are actually a problem, redo the experiment with more RNA and less amplification.

The cleaning process sounds OK to me, but normally I recommend adapter trimming rather than adapter filtering, because you lose less data. The cleaning would tend to increase the percentage of duplicate reads by removing reads with errors, but it doesn't add any new duplicates, so that doesn't really matter.
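To illustrate the difference, trimming keeps the bases 5' of the adapter hit instead of discarding the whole read. An exact-match-only sketch (real trimmers such as BBDuk or cutadapt also handle mismatches and partial adapter hits at the 3' end):
Code:
ADAPTER = "AGATCGGAAGAGC"  # assumed adapter; use the one from your kit

def trim_adapter(seq, qual, min_len=20):
    """Trim at the first adapter occurrence instead of dropping the read."""
    i = seq.find(ADAPTER)
    if i == -1:
        return seq, qual      # no adapter found: keep the read as-is
    if i < min_len:
        return None           # too little sequence left; drop the read
    return seq[:i], qual[:i]  # keep the insert, discard the adapter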
05-23-2014, 04:23 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,169

Quote:
Originally Posted by Brian Bushnell
If coverage is high enough, you will have duplicates. Consider a 5 MB genome. At most you could have 100,000 unique 50bp reads; any more must be duplicates.
For a 5 Mbp genome you can have 5,000,000 unique 50bp reads (or 100bp reads, or 123bp reads, etc.). A read starting at base n is distinct from a read starting at base n+1 (e.g. 1-50 vs. 2-51). This assumes the genome is circular; if it is linear, the number of potential unique 50bp reads is 4,999,951 (5,000,000 - 50 + 1).
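In code, the counting argument looks like this (one strand only; counting reverse-strand reads as distinct would double these numbers):
Code:
def unique_read_count(genome_len, read_len, circular=True):
    """Number of distinct read start positions on one strand."""
    return genome_len if circular else genome_len - read_len + 1

print(unique_read_count(5_000_000, 50, circular=True))   # 5000000
print(unique_read_count(5_000_000, 50, circular=False))  # 4999951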

05-23-2014, 08:16 AM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by kmcarr
For a 5 Mbp genome you can have 5,000,000 unique 50bp reads (or 100bp reads, or 123bp reads, etc.). A read starting at base n is distinct from a read starting at base n+1 (e.g. 1-50 vs. 2-51). This assumes the genome is circular; if it is linear, the number of potential unique 50bp reads is 4,999,951 (5,000,000 - 50 + 1).
Whoops, my math was totally wrong; you're correct =) For Xbp reads you can have at most X-fold unique coverage.
Tags
duplication, fastqc, quality check
