SEQanswers

Old 01-30-2012, 08:39 PM   #1
amango
Member
 
Location: New York

Join Date: Dec 2009
Posts: 17
Source of duplication in Illumina HiSeq paired-end reads?

I recently received my first short-read data set, one lane of 2x100 bp Illumina HiSeq reads. I'm hoping the community can help me identify the source of the duplicate sequences indicated in the FastQC reports on the data.

FastQC showed high duplication (>60%) for both forward and reverse reads. The report for the reverse reads did not turn up any specific over-represented sequences, while the report for the forward reads identified a PCR primer and an adapter sequence. However, a Bowtie alignment against the Illumina paired-end adapters and primers showed 0% alignment. And when I tried to use Picard to mark and remove duplicate reads, no reads were removed (Picard command below).

My reads are from a single lane of 12 individually barcoded cDNA sub-libraries from a non-model organism (no reference genome). Six of these libraries were normalized (via DSN digestion); six were not. Has anyone seen similar FastQC curves for RNA-Seq data?

Is there a way to search the file for the actual sequences that are highly duplicated?

Full fastQC reports are attached.

[command run as below, though I have omitted the path for each file]
nohup java -jar MarkDuplicates.jar INPUT=sequence_file.bam OUTPUT=deduplicated_reads.bam METRICS_FILE=deduplicated_reads_metrics.txt REMOVE_DUPLICATES=true &



Attached Files
File Type: pdf C08DRACXX.8_1.fastq FastQC Report.pdf (407.4 KB, 35 views)
File Type: pdf C08DRACXX.8_2.fastq FastQC Report.pdf (337.6 KB, 2 views)
Old 01-30-2012, 10:57 PM   #2
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156

I usually see 60-80 % duplication levels in non-normalized RNA-Seq samples. Did you de-multiplex the sequences to see whether there are differences in duplication levels between the normalized and non-normalized libraries? I'd suspect that the normalization didn't work out very well.

FastQC looks at the first 50 bases of each read for overrepresentation, but as you pointed out yourself, only some adapters were found in the forward reads. You can remove those with e.g. Trimmomatic.
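To list the actual sequences that are duplicated, a rough sketch along these lines might help. It counts identical read prefixes in a FASTQ file and reports the most frequent ones. The function names and the 50-base default are illustrative assumptions, and this is not FastQC's own algorithm (it holds every distinct prefix in memory and assumes plain 4-line FASTQ records), so the numbers will not match the report exactly.

```python
from collections import Counter
from itertools import islice


def count_read_prefixes(fastq_lines, prefix_len=50):
    """Count how often each read prefix occurs in a 4-line-per-record FASTQ stream.

    `fastq_lines` can be an open file handle or any iterable of lines.
    """
    counts = Counter()
    # The sequence is the 2nd line of every 4-line record (indices 1, 5, 9, ...).
    for seq in islice(fastq_lines, 1, None, 4):
        counts[seq.strip()[:prefix_len]] += 1
    return counts


def duplication_fraction(counts):
    """Fraction of reads whose prefix was already seen in an earlier read."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total


# Hypothetical usage (the file name is a placeholder):
#   with open("reads_1.fastq") as handle:
#       counts = count_read_prefixes(handle)
#   for seq, n in counts.most_common(10):
#       print(n, seq, sep="\t")
```

The top entries from `most_common` can then be compared against adapter, primer, or rRNA sequences.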

I'm not sure about Picard, but samtools rmdup only works on mapped reads...

Did you check the rRNA contamination levels already?
Old 01-30-2012, 11:00 PM   #3
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156

You could run a k-mer counter on the data to check for overrepresentation, e.g. Meryl or Jellyfish...
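For a feel of what a k-mer counter reports, here is a toy sketch in Python. It is only a stand-in for illustration, not how Meryl or Jellyfish work internally: it keeps everything in memory, does not merge reverse complements, and the function name and default `k` are assumptions. For a full HiSeq lane the dedicated tools are the practical choice.

```python
from collections import Counter


def count_kmers(sequences, k=25):
    """Count every k-mer across an iterable of read sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.strip().upper()
        # Slide a window of width k along the read.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows that contain an ambiguous base call.
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


# Hypothetical usage: `reads` is any iterable of sequence strings.
#   for kmer, n in count_kmers(reads, k=25).most_common(20):
#       print(n, kmer, sep="\t")
```

K-mers with counts far above the bulk of the distribution are the candidates to check against adapters, primers, and rRNA.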
Old 01-30-2012, 11:34 PM   #4
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

High duplication levels in RNA-Seq are not necessarily a problem. Duplication simply means that you're getting very high fold coverage. For RNA-Seq it's quite normal to oversequence highly expressed transcripts in order to be able to see lowly expressed transcripts. Duplication warnings are more of a concern when they occur in libraries where you're expecting more equal coverage. 60% also isn't very high - a badly PCR duplicated library might have duplication levels above 90% (our personal record is 98%!). For more details of how to interpret this plot you can look at this blog post.
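As a back-of-the-envelope illustration of why deep coverage alone produces duplicates, the sketch below estimates the expected duplicate fraction when reads are drawn at random from a fixed number of possible start positions. It rests on simplifying assumptions that are not from this thread: start positions are uniformly likely, reads are treated as single-end, and two reads count as duplicates whenever they share a start position. The function name is illustrative.

```python
def expected_duplicate_fraction(n_reads, n_start_sites):
    """Expected fraction of reads that repeat an already-used start site.

    With n_reads drawn uniformly from n_start_sites positions, the expected
    number of distinct positions hit is S * (1 - (1 - 1/S) ** N).
    """
    if n_reads <= 0:
        return 0.0
    p_site_unused = (1.0 - 1.0 / n_start_sites) ** n_reads
    expected_distinct = n_start_sites * (1.0 - p_site_unused)
    return 1.0 - expected_distinct / n_reads


# Hypothetical example: a highly expressed transcript with about 1,900
# possible start sites that attracts 100,000 reads.
print(round(expected_duplicate_fraction(100_000, 1_900), 3))
```

Under these assumptions almost every read from such a transcript is a duplicate without any PCR artefact being involved, while a transcript that attracts only a handful of reads contributes almost none.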
Old 01-31-2012, 02:55 AM   #5
harryzs
Member
 
Location: Germany

Join Date: Dec 2010
Posts: 29

I agree.

See this thread: http://seqanswers.com/forums/showthr...ght=duplicates
