SEQanswers

Old 10-11-2012, 07:13 PM   #1
gary
Member
 
Location: Shanghai

Join Date: Dec 2009
Posts: 16
Default Duplication level of RNA-seq data

Hello everyone!

I got my 100bp paired-end RNA-seq data today, and FastQC reports a duplication rate above 60%. I searched around and found that high duplication levels are common with RNA-seq. Is that normal?

Should I remove the duplicates? I have read in discussions on this forum that if duplicates are removed I cannot compare highly expressed genes, since the maximum depth of coverage at any one position is 200 with 100bp reads.

thanks!
Old 10-11-2012, 11:11 PM   #2
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

200 is the maximum coverage per base if you have unstranded single-end reads. With paired ends you can have many fragments starting at the same point, as long as the other end differs.
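The 200x figure can be checked with a little counting. This is a rough sketch, assuming single-end duplicates are defined by start position and strand, and paired-end duplicates by start position, strand and fragment length. The function names and the figure of 50 distinct fragment lengths are illustrative assumptions, not taken from any tool:

```python
def max_dedup_coverage_single_end(read_length: int, strands: int = 2) -> int:
    """Upper bound on per-base coverage once single-end duplicates are removed.

    A base can be covered by reads starting at any of `read_length` positions,
    and only one read per (start, strand) survives deduplication.
    """
    return read_length * strands


def max_dedup_coverage_paired_end(read_length: int,
                                  distinct_fragment_lengths: int,
                                  strands: int = 2) -> int:
    """Same bound for one mate when pairs are keyed on
    (start, strand, fragment length): each fragment length is a distinct key."""
    return read_length * strands * distinct_fragment_lengths


print(max_dedup_coverage_single_end(100))      # 200
print(max_dedup_coverage_paired_end(100, 50))  # 10000
```

So the ceiling that worries people about highly expressed genes is much less of a constraint when both ends of the fragment are used to define a duplicate.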

FastQC only looks at the read sequences. What you should do is calculate the library complexity after alignment (e.g. with Picard). Looking at the alignments at lowly expressed genes also helps to determine whether the library is over-sequenced.
Old 10-11-2012, 11:50 PM   #3
gary
Member
 
Location: Shanghai

Join Date: Dec 2009
Posts: 16
Default

Thank you Chipper! I will try your suggestions.

In your experience, is the >=60% duplication level reported by FastQC too high? Or will I need to look at the alignment results to tell?
Old 10-12-2012, 07:00 AM   #4
NextGenSeq
Senior Member
 
Location: USA

Join Date: Apr 2009
Posts: 482
Default

What was the yield? We usually aim for less than 100 ng/ul but some RNA samples amplify better than others.
Old 10-15-2012, 03:29 AM   #5
arkal
advancing one byte at a time!
 
Location: Bangalore, India

Join Date: Jun 2011
Posts: 56
Default

In my experience (from what I've seen and read), you can expect to see 60-90% duplication when you run FastQC on your data! I don't really know how that correlates with the real picture, though!
Old 10-15-2012, 04:17 AM   #6
TonyBrooks
Senior Member
 
Location: London

Join Date: Jun 2009
Posts: 298
Default

FastQC bases its duplication estimates on the first 50bp of sequence only. It also makes no allowance for paired-end data. High FastQC duplication rates for RNA-Seq are normal.
To get a better idea you need to look at the mapping coordinates.
As Chipper says, use the Picard library complexity estimator. For example, I just ran an RNA-Seq sample with 64.59% (Read 1) and 58.13% (Read 2) FastQC duplication, but only 0.32% duplication using the Picard library complexity estimator.
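A toy illustration of why the two numbers can differ so much. This is a minimal sketch on made-up data, not how FastQC or Picard are actually implemented: it only contrasts counting duplicates by read 1 sequence alone with counting them by the positions of both mates.

```python
from collections import Counter

# Hypothetical toy fragments: (first 50 bp of read 1, read 1 start, read 2 start).
# The first four share a read 1 sequence and start, but most end at different places.
pairs = [
    ("ACGT" * 12 + "AC", 1000, 1180),
    ("ACGT" * 12 + "AC", 1000, 1215),
    ("ACGT" * 12 + "AC", 1000, 1302),
    ("ACGT" * 12 + "AC", 1000, 1302),  # a true duplicate of the previous pair
    ("TTGCA" * 10, 5000, 5200),
]


def duplication_fraction(keys):
    """Fraction of items that are extra copies of an already-seen key."""
    keys = list(keys)
    counts = Counter(keys)
    return (len(keys) - len(counts)) / len(keys)


seq_only = duplication_fraction(seq for seq, _, _ in pairs)
pair_aware = duplication_fraction((r1, r2) for _, r1, r2 in pairs)
print(f"sequence-only: {seq_only:.0%}, pair-aware: {pair_aware:.0%}")
# sequence-only: 60%, pair-aware: 20%
```

In a highly expressed transcript many distinct fragments start at the same base, so the sequence-only count climbs quickly while the pair-aware count does not.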
Old 06-16-2013, 08:45 AM   #7
Baoqing
Member
 
Location: Texas

Join Date: Jan 2013
Posts: 24
Default Interpretation of the PICARD results

Hi guys,
I am also trying to estimate library complexity with Picard on my paired-end data, using the TopHat-aligned reads as input. The Picard documentation describes INPUT as "One or more files to combine and estimate library complexity from." What exactly does that mean? Do multiple inputs mean the BAM files from each replicate of a biological sample?

If so, how do I add them? Should I just add extra files to the INPUT argument?

java -Xmx2g -jar ~/Desktop/apps/picard-tools-1.92/EstimateLibraryComplexity.jar INPUT=accepted_hits025.bam <more bam files here?> OUTPUT=picard_file MIN_IDENTICAL_BASES=6 MAX_DIFF_RATE=0.02

The command also produced a table, but I am not sure how to interpret it. I have attached part of the table (it is too long to attach in full).

Thank you in advance!
## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown 0 6501528 0 0 2054652 1644248 0.316026 27100851

## HISTOGRAM java.lang.Integer
duplication_group_count Unknown
1 3808007
2 376471
3 78976
4 57845
5 19068
6 26550
7 8180
8 17601
9 4826
10 10475
11 3229
12 5810
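For reference, the PERCENT_DUPLICATION column can be recomputed from the other columns. This is a sketch, assuming Picard's definition of duplicate reads over examined reads with each pair counted as two reads. Note that the value is a fraction, so 0.316026 corresponds to about 31.6%:

```python
# Header and data row copied from the metrics table above.
header = ("LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS "
          "UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES "
          "READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE")
row = "Unknown 0 6501528 0 0 2054652 1644248 0.316026 27100851"

metrics = dict(zip(header.split(), row.split()))

# Each read pair contributes two reads to both numerator and denominator.
dup_reads = (int(metrics["UNPAIRED_READ_DUPLICATES"])
             + 2 * int(metrics["READ_PAIR_DUPLICATES"]))
examined = (int(metrics["UNPAIRED_READS_EXAMINED"])
            + 2 * int(metrics["READ_PAIRS_EXAMINED"]))

fraction = dup_reads / examined
print(f"{fraction:.6f} ({fraction:.1%})")  # 0.316026 (31.6%)
```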
Old 06-16-2013, 11:00 AM   #8
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Picard estimates that your library contains 27,100,851 unique molecules (the ESTIMATED_LIBRARY_SIZE column).
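To see where a number like that can come from, here is a minimal sketch. It assumes, as I understand Picard's implementation, that the estimate solves the Lander-Waterman relation C/X = 1 - exp(-N/X), with N the non-optical read pairs examined and C the unique read pairs. The function name and the bisection solver are my own, not Picard's code:

```python
import math


def estimate_library_size(read_pairs: int, unique_read_pairs: int) -> float:
    """Solve unique/X = 1 - exp(-read_pairs/X) for X by bisection."""
    if unique_read_pairs >= read_pairs:
        raise ValueError("no duplicates observed; library size cannot be estimated")

    def f(x: float) -> float:
        # Expected unique pairs for library size x, minus the observed count.
        return x * (1.0 - math.exp(-read_pairs / x)) - unique_read_pairs

    lo, hi = float(unique_read_pairs), 2.0 * unique_read_pairs
    while f(hi) < 0:  # expand the upper bound until the root is bracketed
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Values from the metrics table in post #7.
read_pairs_examined = 6501528
read_pair_duplicates = 2054652
optical_duplicates = 1644248

estimate = estimate_library_size(
    read_pairs_examined - optical_duplicates,    # N
    read_pairs_examined - read_pair_duplicates,  # C
)
print(round(estimate))  # should land close to Picard's 27100851
```

The unit is distinct fragments (read pairs) in the sequenced library, which is why the estimate can be larger than the number of pairs actually examined.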
Old 06-16-2013, 12:24 PM   #9
Baoqing
Member
 
Location: Texas

Join Date: Jan 2013
Posts: 24
Default

Thanks, but it is still not quite clear to me. Does that mean 27,100,851 mRNAs, exons or something else? It seems unlikely to be mRNAs, since only 6,501,528 read pairs were examined in total. Also, the duplication_group_count values look like some kind of ID; how can I go back and check those duplicated reads? I would really appreciate your help in clarifying this.
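On the histogram, one possible reading, assuming Picard's convention that the first column is the size of a duplicate set and the second column is the number of sets of that size (so not an ID). The check below uses only the partial rows posted in #7, so the totals are necessarily incomplete:

```python
# Rows copied from the partial histogram in post #7: {set size: number of sets}.
histogram = {
    1: 3808007, 2: 376471, 3: 78976, 4: 57845, 5: 19068, 6: 26550,
    7: 8180, 8: 17601, 9: 4826, 10: 10475, 11: 3229, 12: 5810,
}

# Every set, whatever its size, corresponds to one unique fragment.
groups_listed = sum(histogram.values())

# Unique pairs implied by the metrics row: examined minus duplicates.
unique_expected = 6501528 - 2054652

print(groups_listed, unique_expected)
print(f"{groups_listed / unique_expected:.1%} of unique pairs are in the rows shown")
```

Under that reading the rows shown already account for roughly 99% of the unique pairs, with the remainder in the truncated tail. The histogram does not identify individual reads; to inspect them, a tool that marks duplicates in the BAM itself (for example Picard MarkDuplicates, which sets SAM flag 0x400) would be needed.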