
**TonyBrooks** · 11-18-2013, 09:40 AM

You need to take the output of FastQC in context. Many of the tests are only applicable if you are sequencing DNA from a diverse genome with ~50% GC content. Last time I checked, there were only around 2700 known human miRNAs, so you would expect over-representation with 484,000 reads; some miRNAs are just highly expressed.
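As a rough illustration of that arithmetic, here is a minimal Python sketch. It assumes, purely for simplicity, that every read is an exact copy of one of roughly 2700 distinct miRNA sequences (real libraries also contain isomiRs, adapters and other small RNAs), and the function names are made up for this example.

```python
def mean_copies_per_sequence(total_reads, distinct_sequences):
    """Average number of reads per distinct sequence if reads were spread evenly."""
    return total_reads / distinct_sequences


def min_duplicate_fraction(total_reads, distinct_sequences):
    """Smallest possible fraction of reads that repeat an earlier read.

    At most `distinct_sequences` reads can be the first copy of a sequence,
    so every other read must be a duplicate.
    """
    first_copies = min(total_reads, distinct_sequences)
    return (total_reads - first_copies) / total_reads


if __name__ == "__main__":
    reads = 484_000    # read count quoted in this thread
    mirnas = 2_700     # approximate number of known human miRNAs quoted above
    print(f"mean copies per miRNA: {mean_copies_per_sequence(reads, mirnas):.0f}")
    print(f"minimum duplicate fraction: {min_duplicate_fraction(reads, mirnas):.3f}")
```

Even with perfectly even expression, each sequence would be seen well over a hundred times under these assumptions, so duplication and over-representation flags are to be expected.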

A quick blastn of your top over-represented sequences shows matches to known miRNAs, so I suspect your data is good. Assuming you have designed your experiment correctly and have enough biological replicates, you should be able to identify any differential expression.

P.S. You also need to include the image files in your FastQC report zip.

**GenoMax** · 11-18-2013, 10:04 AM

As people move more to sequencing non-genomic DNA samples perhaps it is time for Simon to consider adding a disclaimer that some of these metrics are only relevant for "normal" genomic DNA.

Seeing a [FAIL] tag appear in the FastQC report is scary for new users.

**simonandrews** · 11-19-2013, 01:02 AM

Setting these cutoffs is the worst bit about writing QC modules. We did consider not putting them in at all, but they've actually proved to be a net positive despite false positives such as this. We do try to add disclaimers to say that a warning or a fail doesn't mean your library is necessarily bad, just that it doesn't look like a normal diverse library and that you should try to understand why the flag was raised (as in this case). If anyone has a good suggestion for words to use other than warn / fail then I'd happily consider these (although the actual reports only use icons).

In the next release we'll actually have user-tunable warn / error thresholds for all modules so sites can choose to change the stringencies of the different tests to match what they want to highlight.

**Sergio.pv** · 11-19-2013, 01:36 AM

Thank you very much to you all for the quick responses, it's now clearer.

If I understood correctly, the analyses performed by FastQC are modeled for genomic DNA, where the sequenced fragments belong to a very large population, hence reads are likely to be unique.

In the case of the small RNA fraction, where the population of potential reads is much smaller, the chances of finding repeated fragments are much higher, especially for those that are highly expressed in the sample.

Is that correct?

Best!

P.S. I have resubmitted the files, this time as .pdf

**simonandrews** · 11-19-2013, 01:46 AM

The FastQC modules aren't really aimed at any specific library type, but I guess that a randomly fragmented genomic library would be the sort of thing which is most likely not to trigger any warnings.

The basic assumptions are that the library you are sequencing comes from a single source, has no intrinsic positional sequence bias, and that you don't want to saturate the diversity in the library. Anything which violates these assumptions will be flagged up for you to look at and understand. I've said before that for pretty much every module we've seen perfectly good libraries which fail it, but that doesn't mean that either the data or the module is wrong.

Most of the common failures you'll see in the report come from library types where you don't have an even representation of the different sequences in the library and where you deliberately over-sequence some parts of the library to be able to look at the parts with lower representation.

Examples of this would be over-sequencing highly expressed genes in an RNA-Seq library to be able to see lowly expressed genes. RNA-Seq libraries will usually trigger the duplication module since you are saturating at least part of your library. Small RNA libraries are even worse since some sequences will comprise a very high proportion of the library and will be hugely over-sequenced to the point where they trigger the over-represented sequences module (once they make up 0.1% of the whole library). The program is trying to warn you that you're wasting your sequencing capacity by continuing to sequence these over-represented sequences, but then you probably knew that anyway.
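To make the 0.1% rule concrete, here is a simplified Python sketch of that kind of check. It is not FastQC's actual implementation: it counts every read exactly, whereas the real module may sample or truncate reads, and the function name and the strict "greater than" comparison are assumptions made for this example.

```python
from collections import Counter


def overrepresented(reads, threshold_pct=0.1):
    """Return (sequence, count, percent) tuples for sequences that make up
    more than `threshold_pct` percent of all reads, most frequent first."""
    total = len(reads)
    if total == 0:
        return []
    counts = Counter(reads)
    hits = []
    for seq, n in counts.items():
        pct = 100.0 * n / total
        if pct > threshold_pct:  # assumed comparison: strictly above the cutoff
            hits.append((seq, n, pct))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


if __name__ == "__main__":
    # Toy library: one sequence seen 5 times among 2000 reads (0.25%),
    # every other read unique (0.05% each).
    toy = ["TGAGGTAGTAGGTTGTATAGTT"] * 5 + [f"unique_{i}" for i in range(1995)]
    for seq, n, pct in overrepresented(toy):
        print(seq, n, f"{pct:.2f}%")
```

In a small RNA library a handful of highly expressed sequences will clear this cutoff easily, which is why the flag is expected there.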

**Sergio.pv** · 11-19-2013, 02:11 AM

Thanks Simon!

**GenoMax** · 11-19-2013, 04:07 AM

> Originally posted by simonandrews
>
> In the next release we'll actually have user-tunable warn / error thresholds for all modules so sites can choose to change the stringencies of the different tests to match what they want to highlight.

This would be great. Make it a settings file that can be customized for various scenarios. That way a few conservative options (that you know work) can be included in FastQC (and others can be shared by people). This can potentially allow exclusion of modules (e.g. duplication module) altogether in cases where they may not be useful.

**simonandrews** · 11-19-2013, 04:14 AM

> Originally posted by GenoMax
>
> This would be great. Make it a settings file that can be customized for various scenarios. That way a few conservative options (that you know work) can be included in FastQC (and others can be shared by people). This can potentially allow exclusion of modules (e.g. duplication module) altogether in cases where they may not be useful.

It is a configuration file at the moment, but I don't think we had the option to specify which file to use; that should be easy enough to add. We also didn't have the option to exclude certain modules altogether, but that could also be added (we now have the concept of optional modules so that could work easily). I might put these off for another version though as I really need to get the current development version out of the door!
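For readers wondering what such a per-scenario settings file might look like, here is a hypothetical Python sketch. The "module level value" layout, the module names and the threshold numbers are all assumptions for illustration; nothing in this thread specifies the real file format, so check the FastQC documentation for your version before relying on any of it.

```python
# Hypothetical small-RNA profile: skip the duplication module and
# relax the over-represented sequence thresholds (percent of library).
EXAMPLE_PROFILE = """
# module          level   value
duplication       ignore  1
overrepresented   warn    5
overrepresented   error   20
"""


def parse_limits(text):
    """Parse 'module level value' lines into {module: {level: value}}."""
    limits = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        module, level, value = line.split()
        limits.setdefault(module, {})[level] = float(value)
    return limits


def is_ignored(limits, module):
    """True if the profile asks for this module to be skipped."""
    return limits.get(module, {}).get("ignore", 0) == 1


def status(limits, module, value):
    """Return 'pass', 'warn' or 'fail' for a metric where higher is worse."""
    levels = limits.get(module, {})
    if "error" in levels and value > levels["error"]:
        return "fail"
    if "warn" in levels and value > levels["warn"]:
        return "warn"
    return "pass"


if __name__ == "__main__":
    limits = parse_limits(EXAMPLE_PROFILE)
    print(is_ignored(limits, "duplication"))
    print(status(limits, "overrepresented", 10.0))
```

Keeping the thresholds in a plain text file like this is what would let sites share profiles for different library types, as suggested above.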


FastQC Overrepresented Sequences problem
