SEQanswers

Old 01-24-2011, 08:40 AM   #1
gconcepcion
Member
 
Location: Menlo Park

Join Date: Dec 2010
Posts: 68
Default FastQC - strange 'per base sequence content' graph

Hi All,

I've been lurking for a while here trying to get a feel for some Illumina transcriptomic data I recently acquired (1 Lane 2 x 76bp).

After running FastQC, the following checks pass with flying colors:
Per base sequence quality
Per sequence quality scores
Per base N content

but I get red flags for the following, with what appear to me to be strange staggered graphs:
Per base sequence content --- Strange staggered graph; peaks are staggered every 3 bases
Per base GC content --- Same behavior as previous
Per sequence GC content --- GC distribution appears to indicate contamination, or possibly DNA sequenced from an organelle with a GC bias different from the host
Sequence duplication levels --- uh-oh, 63% duplication? Could this just indicate that we have more than full coverage of the organism's transcriptome?
Kmer Content --- No idea what this indicates, but it must be related to the strange staggered peaks in the Per Base Sequence Content graph somehow?


Anyway, if anyone could shed some light on what the staggered graphs mean in terms of my data quality, I would appreciate any insight.

BTW: the graphs look similar before AND after primer/adapter clipping on both sides.

thanks & aloha
Attached Images
File Type: jpg Per Base Sequence Content.jpg (18.7 KB, 1201 views)
File Type: jpg Per Base GC Content.jpg (18.9 KB, 764 views)
File Type: jpg GC Distribution.jpg (17.4 KB, 690 views)
File Type: jpg SEQ Duplication Levels.jpg (18.6 KB, 706 views)
File Type: jpg Kmer Content.jpg (18.4 KB, 1042 views)
Old 01-24-2011, 11:45 PM   #2
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Looking at your results it seems that you have a significant proportion of your library which is composed of CAA multimeric repeats. These would explain the slight peaks running through the per-base composition and per-base GC graphs, as well as the secondary peak on the per sequence GC plot. The Kmer enrichment also shows a few cyclic variants of these - the reason for the sharp peaks is that the repeats seem to be aligned to the start of your sequence, so that certain positions favour starting a new repeat. It seems that there is a bias towards your sequences starting with C (which we've seen in a lot of transcriptome data), so this would make sense.
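To illustrate why repeats aligned to the read start produce peaks every third base, here is a minimal Python sketch. The toy library, the read length of 12 and the helper names `caa_repeat` and `per_base_fraction` are assumptions made up for the example; this is not how FastQC itself computes the plot.

```python
from collections import Counter


def caa_repeat(length, offset=0):
    """Build a CAA repeat of the given length, starting `offset` bases into the unit."""
    unit = "CAA"
    return "".join(unit[(offset + i) % 3] for i in range(length))


def per_base_fraction(reads, base):
    """Fraction of reads carrying `base` at each position (reads of equal length)."""
    length = len(reads[0])
    columns = [Counter(read[pos] for read in reads) for pos in range(length)]
    return [col[base] / len(reads) for col in columns]


# Toy library (made up): most repeats start in phase with C, a few are shifted.
reads = [caa_repeat(12, 0)] * 6 + [caa_repeat(12, 1)] * 2 + [caa_repeat(12, 2)] * 2

c_fraction = per_base_fraction(reads, "C")
print([round(f, 2) for f in c_fraction])
```

Because 6 of the 10 toy reads start with C, the C fraction is 0.6 at positions 0, 3, 6 and 9 and 0.2 elsewhere, which is the kind of period-3 stagger seen in the per-base content graph. If the repeats started at random phases, the three offsets would be equally common and the peaks would flatten out.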

The duplication graph shows that you have quite high levels of duplication, but that this is spread over the majority of sequences in your library (so it's not just a few outliers which are being heavily duplicated). This could be simple saturation of your library if you're working with a very small transcriptome, or it could be a more subtle PCR bias. You'd need to look at the coverage you get over transcribing genes to be able to tell these apart.
Old 01-25-2011, 05:09 AM   #3
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

I'll add some observations from our experience with mRNA-Seq data to Simon's excellent explanation.

First, with regard to the sequence duplication level your results are very much in line with what we observe for most of the mRNA-Seq samples run in our core. I agree with both you and Simon that this is simply a result of saturating the diversity of the library. Getting a 'Failed' for Sequence Duplication on an mRNA-Seq sample usually doesn't concern me. It is my impression that the pass/fail cut offs for FastQC are based on sequencing genomic libraries so they may not be appropriate thresholds for assessing mRNA-Seq data.

We have one submitter for whom we find a low level of 'CA' repeat sequence in every RNA sample they submit for Illumina sequencing. I don't have an explanation for this; I relate it just to let you know that you are not the only person finding simple repeat reads in their mRNA-Seq data.
Old 01-25-2011, 07:10 AM   #4
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I suppose the best way of thinking about the duplication level plot is that it's a measure of the amount of sequencing you might have 'wasted' in your library - that is to say that a high duplication level means that you could have got much the same diversity in your library by doing a whole load less sequencing.

For RNA-Seq data you may have to accept that to see the most lowly expressed transcripts you need to oversequence the highly expressed transcripts, so you might consider this to be an expected fail.

The pass/warn/fail categories in FastQC are really just an indication of where you should focus your attention, not an absolute call that things are wrong. I've got a collection of perfectly good datasets which between them fail every test which FastQC does :-)

The repeat stuff is interesting though - we've seen datasets with CA repeats, but where we think this might be functionally relevant. We've not seen CAA repeats before though. I'm still never entirely sure whether to think of these as real effects or whether there's an artefact in the library prep or the sequencing which causes these.
Old 01-26-2011, 07:02 AM   #5
gconcepcion
Member
 
Location: Menlo Park

Join Date: Dec 2010
Posts: 68
Default

Thank you both for the insights. It's good to get confirmation that high levels of duplication in mRNA-seq data may be par for the course.

The STRs are interesting. As you could tell by the double peak on the GC content distribution graph they make up a significant portion of the library. They are not unique to the Illumina data, I also have a 454 dataset (350bp x 94000 reads = 32.9 Mb) with similar repeats comprising a large subset. Also, 'back in the day' when high-throughput sanger sequencing was all the rage, I helped sequence ~10,000 clones from EST libraries of two similar organisms and found a lot of similar repetitive elements. Definitely pushes me more towards being biologically significant rather than artefact as confirmed by 3 sequencing technologies & 3 different library prep protocols on 3 different taxa in the same group.

At any rate, I'm running into the inevitable hurdle of our lab computer with the most memory (32 GB) being woefully inadequate for de novo assembly at this point.

While I await super computer access, would it be an OK strategy for me to circumvent the memory issues by mapping the Illumina reads to the 454 transcripts? Any special considerations?

Last edited by gconcepcion; 01-26-2011 at 09:47 AM.
Old 05-25-2011, 11:56 AM   #6
FWOS
Epigenomics NGS Beast
 
Location: New Jersey

Join Date: Oct 2010
Posts: 17
Default Help Interpreting mRNA Seq Duplicate Sequence Plot FastQC

Hi All,

I recently noticed some strange trends in the duplicate sequence plots generated from a 2x50bp RNA sequencing experiment performed on an Illumina HiSeq. I understand that the libraries will most likely contain some duplicates that might have resulted from oligo(dT) and/or random hexamer priming methods and/or PCR. It also makes sense that the FastQC thresholds are based on libraries created via DNA fragmentation etc.
What I am trying to figure out is how the duplicate sequence plot calculates the total percentage of non-unique sequences. Specifically, I have a data set with non-unique sequences calculated by FastQC to be > 53% of all sequences, but it seems like only two sequences are listed as "over represented" (>0.1%). I am not sure how it would be possible for such a small percentage of non-unique sequences to have such a large impact on the total number of non-unique sequences. Considering that only the following two over represented sequences are listed in the FastQC report:

1.) 0.673779203474544 TruSeq Adapter, Index 2 (100% over 50bp)
2.) 0.1471451982022855 TruSeq Adapter, Index 2 (100% over 49bp)

... Does anyone know how is it that the total percentage of duplicate sequences is 53% when only ~0.8% can be attributed to the primer contaminants?

Is there a calculation that relates specific contributions of overly expressed duplicate sequences to the total percentage of non-unique sequences, or something similar?

Please see attached, the Duplicate Sequence Plot that I am referring to:
Attached Images
File Type: png duplication_levels_1.png (7.4 KB, 238 views)
Old 05-25-2011, 12:52 PM   #7
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

Hi FWOS,

The section "overrepresented sequences" shows only sequences that are present above a certain threshold. Not quite sure about the exact value but 0.1% of the total sequences in the file seems reasonable. So if your input file was say 50 million reads, then any sequence present more than 50,000 times would show up. In your case there are only 2 minor adapter contaminations, so the library seems to be reasonably clean.

The Duplication Plot shows how many sequences were seen once, twice... up to more than 10 times (exact matches over the entire length). The duplication level is counted as duplicated sequences (present more than once) / (unique sequences (present only once) + duplicated sequences) * 100, in %. Even though the figure you linked is too small to read anything, it seems that a fair amount of sequences is present more than once, which is normally due to PCR amplification, but you are right, adapter contamination will also contribute to the overall duplication level.
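As a rough illustration of this kind of calculation, here is a minimal Python sketch. It assumes exact-match counting over whole reads and reports two possible readings of "duplication level": the share of distinct sequences seen more than once, and the share of reads that would be dropped by deduplication. The function name `duplication_summary` and the toy reads are made up for the example, and FastQC's real implementation may differ in detail.

```python
from collections import Counter


def duplication_summary(reads):
    """Summarise exact-match duplication in a list of read sequences."""
    counts = Counter(reads)  # distinct sequence -> number of times seen
    # Histogram of duplication levels; 10 stands for "10 or more copies".
    levels = Counter(min(n, 10) for n in counts.values())
    distinct = len(counts)
    duplicated = sum(1 for n in counts.values() if n > 1)
    # Share of distinct sequences that occur more than once.
    pct_duplicated = 100.0 * duplicated / distinct
    # Share of reads that would be removed if every sequence were kept once.
    pct_redundant_reads = 100.0 * (len(reads) - distinct) / len(reads)
    return levels, pct_duplicated, pct_redundant_reads


# Toy data (made up): two singletons, one pair, one heavily duplicated read.
reads = ["ACGT"] * 1 + ["CATG"] * 1 + ["TTGA"] * 2 + ["GGCC"] * 12
levels, pct_duplicated, pct_redundant_reads = duplication_summary(reads)
print(dict(levels), pct_duplicated, pct_redundant_reads)
```

On the toy data, half of the distinct sequences are duplicated, yet 75% of the reads are redundant, so the two readings can give quite different numbers for the same library.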

Even though 50% is not great, there are still plenty of reads which are unique or present in low abundance (and we have seen much worse levels than this). Hope this helps.
Old 05-25-2011, 12:55 PM   #8
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I've actually just written a blog post which tries to explain the duplicate sequences plot in a bit more detail because sometimes it's not obvious what it's saying.

Looking at the graph you attached I'm surprised to see the overall duplication level as low as 53% as it looks higher than that. Basically you can get really high duplication levels by having a very small number of sequences which dominate an otherwise diverse library (in which case they'll show up in the overrepresented sequences list), or by having a larger number of sequences with moderately high duplication. Only sequences which individually represent more than 0.1% of the library (so 20,000 duplicates in a 20 million read library) are shown in the overrepresented sequence list which is a pretty high barrier. It's therefore easy to get high duplication levels by having sequences duplicated a few hundred times each which won't put anything in the list of overrepresented sequences.
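The point above can be checked with a small Python sketch. Everything in it is an assumption for illustration: the library is scaled down to about 20,000 reads (so the 0.1% cutoff is about 20 copies), the sequence labels are placeholders, and duplication is measured as the share of reads belonging to sequences seen more than once.

```python
def summarise(counts, threshold_fraction=0.001):
    """Return total reads, the overrepresentation cutoff, the sequences
    above it, and the percentage of reads in non-unique sequences."""
    total = sum(counts.values())
    cutoff = threshold_fraction * total
    over = {seq: n for seq, n in counts.items() if n > cutoff}
    non_unique_reads = sum(n for n in counts.values() if n > 1)
    return total, cutoff, over, 100.0 * non_unique_reads / total


# Hypothetical library: one adapter read, many moderately duplicated
# sequences, and a large number of singletons.
counts = {"adapter": 160}
counts.update({f"dup{i}": 10 for i in range(1000)})
counts.update({f"uniq{i}": 1 for i in range(10000)})

total, cutoff, over, pct_non_unique = summarise(counts)
pct_over = 100.0 * sum(over.values()) / total
print(total, cutoff, list(over), round(pct_over, 2), round(pct_non_unique, 1))
```

Only the adapter clears the cutoff and it accounts for under 1% of the reads, while roughly half of all reads are non-unique. The bulk of the duplication comes from the thousand sequences with 10 copies each, none of which is common enough to be listed as overrepresented.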

In your case you have a high number of sequences with >10 duplicates (and there's no way to tell how much greater than 10 they are from the plot), but these are going to contribute the majority of the duplication in your particular library.
Old 07-04-2011, 05:56 AM   #9
Celli
Junior Member
 
Location: Germany, US

Join Date: Feb 2011
Posts: 2
Default

Hello All,

I have some Illumina data (single-end RNA-Seq) that has a 'funny' bias in Kmer distribution (FastQC plots) even after trimming. I have attached a number of FastQC plots, explained below, of both the raw reads and the reads after adaptor trimming plus trimming of 32 bp from the 3' end (due to low quality scores and adaptor sequencing at read ends). If anyone has thoughts on what may be causing these patterns and how to avoid similar data in future Illumina runs, they would be greatly appreciated! I can trim these off altogether by reducing my reads to ~30bp in length, but without knowing what is causing this pattern I can't assess whether the short reads would be uncontaminated by whatever this problem is (I would like to do differential expression analysis).

Thanks so much!
Celli

1. PerBaseQualityUntrimmed.pdf: quality seems okay until around 80-85 bp, so reads were trimmed to this length.
2. PerBaseContentUntrimmed.pdf: 'flared' ends in lanes 5-8, mentioned in previous posts as adaptor sequencing. Uncertain what would cause the 'bridges' from ~60 bp to 110 bp in lanes 2 & 3.
3. PerBaseContentTrimmed.pdf: trimming removes all evidence of adapter sequencing from this diagnostic plot.
4. KmerUntrimmed.pdf: large 'hills' at read ends in lanes 2 & 3 seem to reflect whatever is showing up in PerBaseContentUntrimmed.pdf. Uncertain if I should be concerned about lane 5 as well?
5. KmerTrimmed.pdf: even after trimming, 'hills' in lanes 2 and 3 are apparent from about 35 bp to the read end.
Attached Files
File Type: pdf PerBaseQualityUntrimmed.pdf (370.4 KB, 308 views)
File Type: pdf PerBaseContentUntrimmed.pdf (311.5 KB, 306 views)
File Type: pdf PerBaseContentTrimmed.pdf (301.8 KB, 315 views)
File Type: pdf KmerUntrimmed.pdf (374.0 KB, 322 views)
File Type: pdf KmerTrimmed.pdf (405.1 KB, 333 views)
Old 07-04-2011, 11:56 PM   #10
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

As you are already aware you have adapter contamination in your various libraries, but with quite a lot of variability as to where in the library it starts. You're also seeing some bias at the start of your reads, but this happens in all RNA-Seq libraries, so don't worry about that too much.

It looks like your adapter trimming has mostly fixed the biases you were seeing. Although there are some Kmers still enriched in your trimmed data I'd suspect that these show only low level enrichment (you'd need to look at the table under the graph to see how enriched they are - the graph only shows the pattern of enrichment). No adapter trimmers manage to remove every trace of adapter so you might just be seeing the ones which snuck through your original screen. As long as these are a fairly small proportion of your library you should be OK.

The easiest way to test how good your trimmed library is would be to try to map it. If you get good mapping efficiency then you've probably done OK in removing whatever contaminants were present.
Old 10-28-2011, 12:21 PM   #11
rpauly
Member
 
Location: Atlanta

Join Date: Apr 2011
Posts: 32
Default

Hi...
I have a very similar problem, but I am not sure if the data is of good quality. Also, my overrepresented sequences are almost 15% of the reads in some cases. Should I be concerned?
Old 10-31-2011, 12:39 AM   #12
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by rpauly View Post
Hi...
I have a very similar problem, but I am not sure if the data is of good quality. Also, my overrepresented sequences are almost 15% of the reads in some cases. Should I be concerned?
It's very difficult to comment specifically without knowing the details of your experiment. In some cases you might expect a few sequences to be hugely overrepresented in your library, but mostly this is a bad thing. The important thing is to try to understand where those sequences come from if they're not automatically identified by FastQC so you can try to avoid them in future. Having said that, 15% is a very high level of contamination by a small number of sequences and probably does indicate a problem in your library preparation - this doesn't mean the rest of the library isn't useful, but it's something you want to look at more closely.