SEQanswers

Old 01-05-2012, 08:34 PM   #1
elrohir610
Junior Member
 
Location: Taipei

Join Date: Dec 2011
Posts: 2
What might cause the "Sequence Duplication Levels" failures in FastQC report?

Hello Everyone,
We are dealing with exome sequencing data of tumor samples.
After using FastQC to assess the quality of data,
we found that those data failed at the "Sequence Duplication Levels" part
and got a warning at the "Kmer Content" part.

Moreover, according to the report of CASAVA,
read1 and read2 had a very different per base mapping score pattern.
(attachment: R1_vs_R2.jpg).

The FastQC report also gave similar per base quality score patterns.

So..... what might cause these failures?

We have realigned the fasta files with BWA,
but the result was similar.
Therefore, the problem seems to result from sample preparation
or library preparation.........but, which one??

Our guess is that the DNA extracted from those samples had degraded.
Or is this actually a common phenomenon
when dealing with DNA sequencing data of tumor samples?

Any response is welcome, thx.
Attached Images
File Type: jpg R1_vs_R2.jpg (17.1 KB, 122 views)
File Type: png per_base_quality.png (10.6 KB, 89 views)
File Type: png duplication_levels.png (17.8 KB, 199 views)
File Type: png kmer_profiles.png (33.6 KB, 130 views)
Old 01-05-2012, 09:23 PM   #2
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535

How many PCR cycles were done during the protocol? Also, how many reads were there in total?
Old 01-05-2012, 11:39 PM   #3
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

High duplication is either going to be the result of technical duplication (too many PCR cycles, as Heisman suggested), or over-sequencing (very high fold coverage). In your case you'll be able to tell between the two by aligning your sequences back to your reference and then seeing if you have high, even, coverage over your exome, or if you are seeing biased amplification of some parts. At the risk of repeating myself (I posted this yesterday in a different thread), it might be worth looking at this blog post, which goes over the different types of problems the duplicate plot can spot.
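To make the idea of a duplication level concrete, here is a minimal Python sketch that counts how often each read sequence occurs. It is only an illustration and not FastQC's actual implementation; the function name `duplication_summary` and the `prefix_len` truncation value are assumptions made for this example.

```python
from collections import Counter


def duplication_summary(sequences, prefix_len=50):
    """Summarise exact-sequence duplication in an iterable of reads.

    Reads are compared on their first `prefix_len` bases (an assumed
    value) so that errors near the 3' end do not hide duplicates.
    Returns (percent_duplicated, histogram), where histogram maps a
    duplication level to the number of distinct sequences at that level.
    """
    counts = Counter(seq[:prefix_len] for seq in sequences)
    total = sum(counts.values())
    distinct = len(counts)
    # Reads beyond the first copy of each distinct sequence are duplicates.
    percent_dup = 100.0 * (total - distinct) / total if total else 0.0
    histogram = dict(Counter(counts.values()))
    return percent_dup, histogram


# Toy input: one sequence seen 3 times, one seen twice, one seen once.
reads = ["ACGT", "ACGT", "ACGT", "TTTT", "GGGG", "GGGG"]
pct, hist = duplication_summary(reads)
print(pct, hist)
```

A count like this cannot by itself separate PCR duplicates from over-sequencing; that still needs the coverage check against the reference described above.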

For the Kmer result, the most common reason to see a progressive pattern through a long read is that you're sequencing through your insert into the 3' adapter. In your case it's somewhat unusual that most of the overrepresented patterns are single nucleotide repeats, so there may be something else going on here. In any case it's worth looking at running an adapter trimmer on any dataset over 50bp.
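One rough way to check for read-through into the 3' adapter is to look for the adapter sequence inside the reads and see where it starts. The Python sketch below assumes the commonly quoted Illumina adapter prefix `AGATCGGAAGAGC`; check that against your own kit. It only finds exact, full-length matches, so it will undercount compared with a real adapter trimmer, and the function names are hypothetical.

```python
ASSUMED_ADAPTER = "AGATCGGAAGAGC"  # assumption: verify against your library kit


def adapter_start_positions(sequences, adapter=ASSUMED_ADAPTER):
    """Count, per read position, how many reads have the adapter starting there."""
    positions = {}
    for seq in sequences:
        idx = seq.find(adapter)  # exact match only, first occurrence
        if idx != -1:
            positions[idx] = positions.get(idx, 0) + 1
    return positions


def trim_adapter(seq, adapter=ASSUMED_ADAPTER):
    """Return the read with the adapter and everything after it removed."""
    idx = seq.find(adapter)
    return seq if idx == -1 else seq[:idx]


reads = [
    "ACGTACGTAGATCGGAAGAGCAAAA",  # 8 bp insert followed by adapter
    "ACGTACGTACGT",               # no adapter
    "AGATCGGAAGAGCTTTT",          # adapter dimer: adapter at position 0
]
positions = adapter_start_positions(reads)
trimmed = [trim_adapter(r) for r in reads]
print(positions, trimmed)
```

Many reads with the adapter at position 0 would point to adapter dimers, while adapter starts spread across later positions would be consistent with short inserts.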
Old 01-07-2012, 06:07 PM   #4
elrohir610
Junior Member
 
Location: Taipei

Join Date: Dec 2011
Posts: 2

Thank you both for your replies.
The total number of reads is 415517084 and the length of reads is 100 bp.

The protocol we used is the standard protocol provided by Illumina.
Cancer DNA (100ng)-> TruSeq sample prep-> 10 cycle PCR
-> 500 ng DNA -> Exome capture ->10 cycle PCR -> Sequencing

However, we also sequenced other normal samples with the same protocol:
Normal DNA (1ug) -> TruSeq sample prep-> 10 cycle PCR
-> 500 ng -> Exome capture ->10 cycle PCR -> Sequencing

Even though the initial amount of tumor DNA was significantly lower,
the PCR still produced enough DNA to be sequenced.
The FastQC report of the normal sample also showed similar failure and warning.
(Attachment files)

Another interesting phenomenon is that
some of the read2 reads from the tumor sample are fusions:
the first half of those reads aligned to chrA,
but the second half aligned to chrB.
Both over-sequencing and adapter contamination can cause the
sequence duplication failure, but would either of them result in fusion reads?

We will try the methods suggested by simonandrews and hope they work.
Thanks.
Attached Images
File Type: png duplication_levels.png (16.2 KB, 110 views)
File Type: png kmer_profiles.png (19.3 KB, 64 views)

Last edited by elrohir610; 01-07-2012 at 06:10 PM.
Old 01-07-2012, 06:28 PM   #5
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535

I've only done Agilent exomes but over 400 million reads seems like a very large amount. We get good coverage from under 100 million reads. It does not surprise me that you have a very high duplication rate with that many reads.
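A back-of-the-envelope calculation shows why that many reads is a lot for an exome. The read count and read length below come from the thread; the target size is a placeholder assumption (use the size of your own capture design). The result also ignores off-target reads and duplicates, so it is an upper bound on the useful depth.

```python
def mean_raw_coverage(n_reads, read_len_bp, target_size_bp):
    """Mean depth if every sequenced base landed evenly on the target."""
    return n_reads * read_len_bp / target_size_bp


TOTAL_READS = 415_517_084        # reported in the thread
READ_LEN_BP = 100                # reported in the thread
ASSUMED_TARGET_BP = 50_000_000   # assumption: substitute your capture kit's target size

depth = mean_raw_coverage(TOTAL_READS, READ_LEN_BP, ASSUMED_TARGET_BP)
print(f"about {depth:.0f}x raw coverage")
```

Under these assumptions the raw depth comes out in the hundreds, so many fragments are bound to be sampled more than once even without excess PCR.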
Old 05-06-2012, 03:21 AM   #6
sehrrot
Member
 
Location: USA

Join Date: Jul 2010
Posts: 58

hi elrohir610

Could you upload your Bioanalyzer profiles for both samples? I've also experienced this kind of problem, but I don't think either PCR cycles or over-sequencing is the sole reason, although of course they may contribute. Illumina sequencers tend to sequence adapter dimers disproportionately, even if only a few ng of them are present in the sample. Please check your Bioanalyzer profile (HS chip preferred) or the base composition by cycle in SAV (if you see a very jagged graph rather than an even distribution, it is most likely dimer contamination).
Old 05-07-2012, 09:38 PM   #7
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

Given that you have 400 million reads, that profile doesn't look too bad. You only have fairly low-level duplication, and you may just be hitting the diversity limit of either your sample or your library. I'd not spend too much time worrying about this sample; if it was one of ours we'd let it through for downstream analysis.
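The "diversity limit" idea can be sketched with a simple sampling model: if reads are drawn uniformly at random, with replacement, from a library of a fixed number of distinct molecules, some duplication is expected purely from depth. The Python sketch below uses that idealised model; the library sizes are hypothetical and real libraries are not sampled uniformly, so treat the numbers as illustrative only.

```python
import math


def expected_duplicate_fraction(n_reads, library_size):
    """Expected duplicate fraction under uniform sampling with replacement.

    Expected distinct molecules seen = C * (1 - exp(-N / C)),
    where N is the number of reads and C the number of distinct molecules.
    """
    if n_reads <= 0:
        return 0.0
    # -expm1(-x) equals 1 - exp(-x) but is more accurate for small x.
    expected_distinct = library_size * -math.expm1(-n_reads / library_size)
    return 1.0 - expected_distinct / n_reads


# Hypothetical library sizes, with the read count rounded from the thread.
for library_size in (1e8, 1e9, 1e10):
    frac = expected_duplicate_fraction(4e8, library_size)
    print(f"library of {library_size:.0e} molecules: {frac:.1%} duplicates expected")
```

The smaller the library relative to the number of reads, the higher the expected duplication, which is the sense in which deep sequencing can run into the diversity of the sample or library.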
Tags
casava, fastqc, illumina
