SEQanswers

Old 11-18-2013, 07:38 AM   #1
Sergio.pv
Member
 
Location: Berlin

Join Date: Jul 2013
Posts: 20
FastQC Overrepresented Sequences problem

Hi guys,
I am very new to NGS data processing.
First, a short introduction to my current situation:
- Samples were the small RNA fraction (<200 nt, mirVana miRNA Isolation Kit).
- Library and template preparation, as well as the sequencing protocol, were performed by an external company. The company used the Ion Torrent software to trim adapters and filter reads by quality.
- RNA sample quality was RIN >8 (assessed by analysing the remaining RNA fraction, >200 nt).
- Original output quality was assessed with FastQC.
- This was followed by another FastQC analysis after filtering sequences by size (16-27 nt) and QV=17.
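
For readers who want to reproduce a size and quality filter like the one in the list above, here is a minimal Python sketch. It assumes four-line FASTQ records with Phred+33 quality encoding, and it reads "QV=17" as a minimum mean quality of 17 per read; the thread does not say how the filter was actually applied, so both are assumptions. The function names (parse_fastq, mean_phred, keep_read) and the toy reads are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from well-formed four-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        yield records[i], records[i + 1], records[i + 3]


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def keep_read(seq: str, qual: str, min_len: int = 16, max_len: int = 27,
              min_mean_q: float = 17.0) -> bool:
    """Keep reads of 16-27 nt whose mean quality is at least 17."""
    return min_len <= len(seq) <= max_len and mean_phred(qual) >= min_mean_q


# Toy input: one good read, one too short, one of low quality.
seq = "TGAGGTAGTAGGTTGTATAGTT"  # 22 nt
fastq = [
    "@r1", seq, "+", "I" * len(seq),   # 'I' encodes Phred 40
    "@r2", "ACGT", "+", "IIII",        # too short
    "@r3", seq, "+", "#" * len(seq),   # '#' encodes Phred 2
]
kept = [header for header, s, q in parse_fastq(fastq) if keep_read(s, q)]
print(kept)
```

Only the first toy read passes both the length and the quality check. A per-base minimum instead of a per-read mean would need a different test in keep_read.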

OK, my problem is the following:
- The analyses showed good quality scores; however, they indicated an apparent contamination of the library. The FastQC report shows:

[PASS] Basic Statistics
[PASS] Per base sequence quality
[PASS] Per sequence quality scores
[FAIL] Per base sequence content
[FAIL] Per base GC content
[WARNING] Per sequence GC content
[PASS] Per base N content
[WARNING] Sequence Length Distribution
[FAIL] Sequence Duplication Levels
[FAIL] Overrepresented sequences
[FAIL] Kmer Content

I have attached the files.

Should I do a new run of my sample?
Is it possible that the library/template preparation was performed incorrectly?
Is there a chance that the overrepresented reads are actually biologically significant?

If I could get some feedback on this, it would help a lot.
Thanks
Attached Files
File Type: pdf 1.RNA_Barcode_miRNA_-_cattle_fastqc.pdf (459.8 KB, 42 views)
File Type: pdf 4.FASTQ_16-27 length_17 min QV_17_min length.pdf (397.1 KB, 42 views)

Last edited by Sergio.pv; 11-19-2013 at 01:22 AM. Reason: attaching files
Old 11-18-2013, 09:40 AM   #2
TonyBrooks
Senior Member
 
Location: London

Join Date: Jun 2009
Posts: 298

You need to take the output of FastQC in context. Many of the tests are only applicable if you are sequencing DNA from a diverse genome with ~50% GC content. Last time I checked, there were only around 2700 known human miRNAs, so you would expect overrepresentation with 484,000 reads: some miRNAs are just highly expressed.
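
To put rough numbers on this, the sketch below works through the arithmetic. It borrows the 0.1% overrepresentation threshold that Simon mentions later in the thread, and it makes the simplifying assumptions that every read is a miRNA and that each miRNA yields a single distinct read sequence; neither holds exactly for real data.

```python
total_reads = 484_000      # reads in the library, from the post above
known_mirnas = 2_700       # approximate number of known human miRNAs
threshold = 0.001          # 0.1% of the library (figure quoted later in the thread)

# Average coverage if all miRNAs were equally expressed.
mean_reads_per_mirna = total_reads / known_mirnas

# Share of the library taken by one miRNA under even expression.
even_share = 1 / known_mirnas

# How far above the mean a miRNA must be to reach the threshold.
fold_over_mean_to_flag = threshold / even_share

print(f"mean reads per miRNA: {mean_reads_per_mirna:.0f}")
print(f"even share of library: {even_share:.4%}")
print(f"fold over mean needed to reach 0.1%: {fold_over_mean_to_flag:.1f}")
```

Under these assumptions, any miRNA expressed at a little under three times the average would already reach the 0.1% mark, which is unremarkable for small RNA data.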

A quick blastn of your top overrepresented sequences shows matches with known miRNAs, so I suspect your data is good. Assuming you have designed your experiment correctly and have enough biological replicates, you should be able to identify any differential expression.

p.s. you also need to include the image files in your fastqc report zip.
Old 11-18-2013, 10:04 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,080

As people move more towards sequencing non-genomic DNA samples, perhaps it is time for Simon to consider adding a disclaimer that some of these metrics are only relevant for "normal" genomic DNA.

Seeing a [FAIL] tag appear in the FastQC report is scary for new users.
Old 11-19-2013, 01:02 AM   #4
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

Setting these cutoffs is the worst bit about writing QC modules. We did consider not putting them in at all, but they've actually proved to be a net positive despite false positives such as this. We do try to add disclaimers to say that a warning or a fail doesn't mean your library is necessarily bad, just that it doesn't look like a normal diverse library and that you should try to understand why the flag was raised (as in this case). If anyone has a good suggestion for words to use other than warn / fail then I'd happily consider these (although the actual reports only use icons).

In the next release we'll actually have user-tunable warn / error thresholds for all modules so sites can choose to change the stringencies of the different tests to match what they want to highlight.
Old 11-19-2013, 01:36 AM   #5
Sergio.pv
Member
 
Location: Berlin

Join Date: Jul 2013
Posts: 20

Thank you very much to you all for the quick responses, it's now clearer.

If I understood correctly, the analyses performed by FastQC are modeled for genomic DNA, where the sequenced fragments belong to a very large population, hence reads are likely to be unique.

In the case of the small RNA fraction, where the population of potential reads is much smaller, the chances of finding repeated fragments are much higher, especially for those which are highly expressed in the sample.
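
This intuition can be checked with a toy simulation: draw the same number of reads from a small pool and from a very large pool of possible fragments, then compare how many reads are duplicates. The pool sizes, read count and uniform sampling are all assumptions chosen for illustration, and duplicate_fraction is a made-up helper, not anything FastQC provides.

```python
import random


def duplicate_fraction(population_size: int, n_reads: int, seed: int = 0) -> float:
    """Fraction of reads that repeat an earlier read when sampling uniformly."""
    rng = random.Random(seed)
    reads = [rng.randrange(population_size) for _ in range(n_reads)]
    return 1 - len(set(reads)) / n_reads


n_reads = 50_000
small_pool = duplicate_fraction(2_700, n_reads)          # small-RNA-like pool
large_pool = duplicate_fraction(100_000_000, n_reads)    # genomic-like pool

print(f"small pool duplicate fraction: {small_pool:.3f}")
print(f"large pool duplicate fraction: {large_pool:.5f}")
```

With only 2,700 possible fragments, nearly every read has to be a duplicate, whereas with 100 million possible fragments duplicates are rare. Real expression levels are skewed rather than uniform, which would push the small-pool case even further.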

Is that correct?

Best!

PS: I have resubmitted the files, this time as .pdf
Old 11-19-2013, 01:46 AM   #6
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

The FastQC modules aren't really aimed at any specific library type, but I guess that a randomly fragmented genomic library would be the sort of thing which is most likely not to trigger any warnings.

The basic assumptions are that you expect that the library you are sequencing comes from a single source, has no intrinsic positional sequence bias, and that you don't want to saturate the diversity in the library. Anything which violates these assumptions will be flagged up for you to look at and understand. I've said before that for pretty much all of the modules there are perfectly good libraries which we've seen which fail the module but that doesn't mean that either the data or the module are wrong.

Most of the common failures you'll see in the report come from library types where you don't have an even representation of the different sequences in the library and where you deliberately over-sequence some parts of the library to be able to look at the parts with lower representation.

Examples of this would be over-sequencing highly expressed genes in an RNA-Seq library to be able to see lowly expressed genes. RNA-Seq libraries will usually trigger the duplication module since you are saturating at least part of your library. Small RNA libraries are even worse since some sequences will comprise a very high proportion of the library and will be hugely over-sequenced to the point where they trigger the over-represented sequences module (once they make up 0.1% of the whole library). The program is trying to warn you that you're wasting your sequencing capacity by continuing to sequence these overrepresented sequences, but then you probably knew that anyway.
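
As a rough illustration of the 0.1% rule, the sketch below counts identical sequences and reports those that make up at least 0.1% of all reads. It is a simplified stand-in, not FastQC's own implementation; the function names, the toy library and the use of "at least" for the cutoff are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def overrepresented(sequences: Iterable[str], threshold: float = 0.001) -> Dict[str, float]:
    """Return sequences whose share of all reads is at least `threshold`."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items() if n / total >= threshold}


def int_to_dna(i: int, length: int = 20) -> str:
    """Encode an integer as a DNA string so that each toy read is distinct."""
    bases = "ACGT"
    out = []
    for _ in range(length):
        out.append(bases[i % 4])
        i //= 4
    return "".join(out)


# Toy library: one abundant 22 nt sequence plus 9,950 distinct 20 nt reads.
abundant = "TGAGGTAGTAGGTTGTATAGTT"
library = [abundant] * 50 + [int_to_dna(i) for i in range(9_950)]

flagged = overrepresented(library)
for seq, share in flagged.items():
    print(f"{seq}\t{share:.2%}")
```

Here the abundant sequence accounts for 50 of 10,000 reads, so it is reported, while each distinct read stays far below the cutoff. In a small RNA library, a hit like this is more likely a highly expressed miRNA than a contaminant, which is why checking the flagged sequences against known miRNAs is the useful next step.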
Old 11-19-2013, 02:11 AM   #7
Sergio.pv
Member
 
Location: Berlin

Join Date: Jul 2013
Posts: 20

Thanks Simon!
Old 11-19-2013, 04:07 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,080

Quote:
Originally Posted by simonandrews

In the next release we'll actually have user-tunable warn / error thresholds for all modules so sites can choose to change the stringencies of the different tests to match what they want to highlight.
This would be great. Make it a settings file that can be customized for various scenarios. That way a few conservative options (that you know work) can be included in FastQC (and others can be shared by people). This can potentially allow exclusion of modules (e.g. duplication module) altogether in cases where they may not be useful.

Last edited by GenoMax; 11-19-2013 at 04:10 AM.
Old 11-19-2013, 04:14 AM   #9
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

Quote:
Originally Posted by GenoMax
This would be great. Make it a settings file that can be customized for various scenarios. That way a few conservative options (that you know work) can be included in FastQC (and others can be shared by people). This can potentially allow exclusion of modules (e.g. duplication module) altogether in cases where they may not be useful.
It is a configuration file at the moment. I don't think we had the option to specify which file to use, but that should be easy enough to add. We also didn't have the option to exclude certain modules altogether, but that could also be added (we now have the concept of optional modules, so that could work easily). I might put these off for another version though, as I really need to get the current development version out of the door!
All times are GMT -8.