#1 |
Member
Location: Berlin Join Date: Jul 2013
Posts: 20
Hi guys,
I am very new to NGS data processing. First, a short introduction to my current situation:
- Samples were the small RNA fraction (<200 nt, mirVana miRNA Isolation Kit).
- Library and template preparation, as well as the sequencing protocol, were performed by an external company. The company used the Ion Torrent software to trim adapters and filter reads by quality.
- RNA sample quality was RIN >8 (assessed by analysing the remaining RNA fraction, >200 nt).
- Original output quality was assessed with FastQC.
- This was followed by another FastQC analysis after filtering sequences by size (16-27 nt) and QV=17.
OK, my problem is the following: the analyses showed good quality scores, but they also indicated an apparent contamination of the library. The FastQC summary reads:
[PASS] Basic Statistics
[PASS] Per base sequence quality
[PASS] Per sequence quality scores
[FAIL] Per base sequence content
[FAIL] Per base GC content
[WARNING] Per sequence GC content
[PASS] Per base N content
[WARNING] Sequence Length Distribution
[FAIL] Sequence Duplication Levels
[FAIL] Overrepresented sequences
[FAIL] Kmer Content
I have attached the files.
Should I do a new run of my sample? Is it possible that the library/template preparation was performed incorrectly? Is there a chance that the overrepresented reads are actually biologically significant?
If I could get some feedback on this, it would help a lot.
Thanks
Last edited by Sergio.pv; 11-19-2013 at 01:22 AM. Reason: attaching files
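For readers who want to reproduce the size and quality filtering step described in this post, here is a minimal Python sketch. It is not the tool the poster used (that is not stated). It assumes an uncompressed four-line FASTQ with Phred+33 qualities, and it interprets "QV=17" as a minimum mean read quality of 17, which is a guess. The function names `read_fastq`, `mean_phred` and `filter_fastq` are made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (header line, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield records from a simple four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_fastq(handle: TextIO, min_len: int = 16, max_len: int = 27,
                 min_mean_q: float = 17.0) -> Iterator[Record]:
    """Keep reads of 16-27 nt whose mean quality is at least min_mean_q.

    The mean-quality interpretation of "QV=17" is an assumption.
    """
    for header, seq, qual in read_fastq(handle):
        if min_len <= len(seq) <= max_len and mean_phred(qual) >= min_mean_q:
            yield header, seq, qual


if __name__ == "__main__":
    demo = "@read1\n" + "A" * 22 + "\n+\n" + "I" * 22 + "\n"
    for header, seq, qual in filter_fastq(io.StringIO(demo)):
        print(header, len(seq), mean_phred(qual))
```

Running FastQC again on the filtered file, as the poster did, then shows how the length distribution and duplication modules respond to the narrower size window.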
#2 |
Senior Member
Location: London Join Date: Jun 2009
Posts: 298
You need to take the output of FastQC in context. Many of the tests are only applicable if you are sequencing DNA from a diverse genome with ~50% GC content. Last time I checked, there were only around 2700 known human miRNAs, so you would expect over-representation with 484,000 reads - some miRNAs are just highly expressed.
A quick blastn of your top over-represented sequences shows matches to known miRNAs, so I suspect your data is good. Assuming you have designed your experiment correctly and have enough biological replicates, you should be able to identify any differential expression.
P.S. You also need to include the image files in your FastQC report zip.
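To repeat this kind of check, the flagged sequences can be pulled out of the report and written as FASTA for a blastn search. The sketch below is one way to do that in Python. It assumes the `fastqc_data.txt` layout in which the module starts with a `>>Overrepresented sequences` line, ends with `>>END_MODULE`, and lists tab-separated sequence, count and percentage columns. The function names and the FASTA header format are invented for illustration.

```python
from typing import Iterable, List, Tuple

Hit = Tuple[str, int, float]  # (sequence, count, percentage)


def parse_overrepresented(lines: Iterable[str]) -> List[Hit]:
    """Collect rows of the 'Overrepresented sequences' module.

    Assumes tab-separated columns: sequence, count, percentage, source.
    """
    hits: List[Hit] = []
    in_module = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line and not line.startswith("#"):
            seq, count, pct = line.split("\t")[:3]
            hits.append((seq, int(count), float(pct)))
    return hits


def to_fasta(hits: Iterable[Hit]) -> str:
    """Format hits as FASTA text; the header naming is arbitrary."""
    return "".join(
        f">overrep_{i}_count={count}\n{seq}\n"
        for i, (seq, count, _pct) in enumerate(hits, start=1)
    )
```

The resulting FASTA text can be saved to a file and searched with blastn against a miRNA or nucleotide database of your choice.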
#3 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
As people move more towards sequencing non-genomic DNA samples, perhaps it is time for Simon to consider adding a disclaimer that some of these metrics are only relevant for "normal" genomic DNA.
Seeing a [FAIL] tag appear in the FastQC report is scary for new users.
#4 |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
Setting these cutoffs is the worst bit about writing QC modules. We did consider not putting them in at all, but they've actually proved to be a net positive despite false positives such as this. We do try to add disclaimers to say that a warning or a fail doesn't mean your library is necessarily bad, just that it doesn't look like a normal diverse library and that you should try to understand why the flag was raised (as in this case). If anyone has a good suggestion for words to use other than warn / fail then I'd happily consider these (although the actual reports only use icons).
In the next release we'll actually have user-tunable warn / error thresholds for all modules so sites can choose to change the stringencies of the different tests to match what they want to highlight.
#5 |
Member
Location: Berlin Join Date: Jul 2013
Posts: 20
Thank you all very much for the quick responses; it's clearer now.
If I understood correctly, the analyses performed by FastQC are modelled on genomic DNA, where the sequenced fragments belong to a very large population, so reads are likely to be unique. In the case of the small RNA fraction, where the population of potential reads is much smaller, the chances of finding repeated fragments are much higher, especially for those that are highly expressed in the sample. Is that correct?
Best!
P.S. I have resubmitted the files, this time as .pdf
#6 |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
The FastQC modules aren't really aimed at any specific library type, but I guess that a randomly fragmented genomic library would be the sort of thing which is most likely not to trigger any warnings.
The basic assumptions are that the library you are sequencing comes from a single source, has no intrinsic positional sequence bias, and that you don't want to saturate the diversity in the library. Anything which violates these assumptions will be flagged for you to look at and understand. I've said before that, for pretty much all of the modules, we've seen perfectly good libraries which fail the module, but that doesn't mean that either the data or the module is wrong.
Most of the common failures you'll see in the report come from library types where you don't have an even representation of the different sequences in the library, and where you deliberately over-sequence some parts of the library to be able to look at the parts with lower representation. An example would be over-sequencing highly expressed genes in an RNA-Seq library to be able to see lowly expressed genes. RNA-Seq libraries will usually trigger the duplication module, since you are saturating at least part of your library. Small RNA libraries are even worse, since some sequences will comprise a very high proportion of the library and will be hugely over-sequenced, to the point where they trigger the over-represented sequences module (once they make up 0.1% of the whole library). The program is trying to warn you that you're wasting your sequencing capacity by continuing to sequence these overrepresented sequences, but then you probably knew that anyway.
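To make the 0.1% figure concrete, the following Python sketch counts identical reads and reports any sequence that exceeds a given share of the library. It is only an illustration of the idea, not FastQC's actual code, which may differ (for instance in how many reads it examines). The function name `overrepresented` and its `threshold_pct` parameter are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def overrepresented(reads: Iterable[str],
                    threshold_pct: float = 0.1) -> List[Tuple[str, int, float]]:
    """Return (sequence, count, percentage) for sequences above the threshold.

    threshold_pct defaults to the 0.1% mentioned in the post; results are
    ordered from most to least frequent.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return []
    hits = []
    for seq, n in counts.most_common():
        pct = 100.0 * n / total
        if pct > threshold_pct:
            hits.append((seq, n, pct))
    return hits
```

In a small RNA library with a few thousand distinct miRNAs and hundreds of thousands of reads, many abundant species will pass this threshold simply because they are highly expressed.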
#7 |
Member
Location: Berlin Join Date: Jul 2013
Posts: 20
Thanks Simon!
#8 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
This would be great. Make it a settings file that can be customized for various scenarios. That way a few conservative options (that you know work) can be included in FastQC (and others can be shared by people). This can potentially allow exclusion of modules (e.g. duplication module) altogether in cases where they may not be useful.
Last edited by GenoMax; 11-19-2013 at 04:10 AM.
#9 | |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871