  • FastQC Overrepresented Sequences problem

    Hi guys,
    I am very new to NGS data processing.
    First, a short introduction to my current situation:
    - The samples were the small RNA fraction (<200 nt, mirVana miRNA Isolation Kit).
    - Library and template preparation, as well as the sequencing protocol, were performed by an external company. The company used the Ion Torrent software to trim adapters and filter reads by quality.
    - RNA sample quality was RIN >8 (assessed by analysing the remaining RNA fraction, >200 nt).
    - The original output quality was assessed with FastQC.
    - This was followed by another FastQC analysis after filtering sequences by size (16-27 nt) and QV=17.
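
For readers who want to reproduce a size and quality filter like the one described in the list above, here is a minimal sketch. The thread does not say which tool was used or whether QV=17 was applied as a mean or a per-base cutoff, so the mean-quality rule, the Phred+33 encoding, and all function names below are assumptions.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def keep_read(seq, quality, min_len=16, max_len=27, min_mean_q=17):
    """True if the read passes the length window and the mean-quality cutoff."""
    # Length is checked first, so empty reads never reach mean_phred().
    if not (min_len <= len(seq) <= max_len):
        return False
    return mean_phred(quality) >= min_mean_q


def filter_fastq(records, **kwargs):
    """Yield (header, seq, quality) tuples that pass keep_read()."""
    for header, seq, quality in records:
        if keep_read(seq, quality, **kwargs):
            yield header, seq, quality


# Hypothetical toy reads, not taken from the attached files.
reads = [
    ("r1", "TGAGGTAGTAGGTTGTATAGTT", "I" * 22),  # 22 nt, Q40: kept
    ("r2", "ACGT", "IIII"),                       # 4 nt: too short
    ("r3", "TGAGGTAGTAGGTTGTATAGTT", "+" * 22),  # 22 nt, Q10: low quality
]
kept = [header for header, _, _ in filter_fastq(reads)]
print(kept)
```

A per-base cutoff (dropping any read with a single base below 17) would be stricter than this mean-based rule, so the two can give quite different read counts.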

    OK, my problem is the following:
    - The analyses showed good quality scores; however, they indicated an apparent contamination of the library. The FastQC report shows:

    [PASS] Basic Statistics
    [PASS] Per base sequence quality
    [PASS] Per sequence quality scores
    [FAIL] Per base sequence content
    [FAIL] Per base GC content
    [WARNING] Per sequence GC content
    [PASS] Per base N content
    [WARNING] Sequence Length Distribution
    [FAIL] Sequence Duplication Levels
    [FAIL] Overrepresented sequences
    [FAIL] Kmer Content

    I have attached the files.

    Should I do a new run of my sample?
    Is it possible that the library/template preparation was performed incorrectly?
    Is there a chance that the overrepresented reads are actually biologically significant?

    If I could get some feedback on this, it would help a lot.
    Thanks
    Last edited by Sergio.pv; 11-19-2013, 01:22 AM. Reason: attaching files

  • #2
    You need to take the FastQC output in context. Many of the tests are only applicable when sequencing DNA from a diverse genome with ~50% GC content. Last time I checked, there were only around 2700 known human miRNAs, so you would expect over-representation with 484,000 reads; some miRNAs are just highly expressed.

    A quick blastn of your top over-represented sequences shows matches with known miRNAs, so I suspect your data is good. Assuming you have designed your experiment correctly and have enough biological replicates, you should be able to identify any differential expression.

    P.S. You also need to include the image files in your FastQC report zip.



    • #3
      As people move more to sequencing non-genomic DNA samples perhaps it is time for Simon to consider adding a disclaimer that some of these metrics are only relevant for "normal" genomic DNA.

      Seeing a [FAIL] tag appear in the FastQC report is scary for new users.



      • #4
        Setting these cutoffs is the worst bit about writing QC modules. We did consider not putting them in at all, but they've actually proved to be a net positive despite false positives such as this. We do try to add disclaimers to say that a warning or a fail doesn't mean your library is necessarily bad, just that it doesn't look like a normal diverse library and that you should try to understand why the flag was raised (as in this case). If anyone has a good suggestion for words to use other than warn / fail then I'd happily consider these (although the actual reports only use icons).

        In the next release we'll actually have user-tunable warn / error thresholds for all modules so sites can choose to change the stringencies of the different tests to match what they want to highlight.



        • #5
          Thank you very much to you all for the quick responses, it's now clearer.

          If I understood correctly, the analyses performed by FastQC are modeled for genomic DNA, where the sequenced fragments belong to a very large population, hence reads are likely to be unique.

          In the case of the small RNA fraction, where the population of potential reads is much smaller, the chances of finding repeated fragments are much higher, especially for those that are highly expressed in the sample.
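
The reasoning in the two paragraphs above can be checked with a small simulation. This is only a sketch under simplifying assumptions: reads are drawn uniformly at random from a fixed pool of distinct sequences (real expression levels are far from uniform), the pool of 3 billion "genomic" fragments is a hypothetical stand-in for a human-sized genome, and `duplicate_fraction` is an illustrative name. The read count and the miRNA count are the figures quoted in post #2.

```python
import random


def duplicate_fraction(n_reads, population_size, seed=0):
    """Fraction of reads that repeat a sequence already drawn.

    Reads are sampled uniformly, with replacement, from
    `population_size` distinct sequences.
    """
    rng = random.Random(seed)
    seen = set()
    duplicates = 0
    for _ in range(n_reads):
        read = rng.randrange(population_size)
        if read in seen:
            duplicates += 1
        else:
            seen.add(read)
    return duplicates / n_reads


n_reads = 484_000  # read count quoted in post #2

# ~2700 known human miRNAs (post #2) versus a hypothetical genomic pool.
small_rna = duplicate_fraction(n_reads, 2_700)
genomic = duplicate_fraction(n_reads, 3_000_000_000)

print(f"small RNA pool: {small_rna:.2%} duplicated reads")
print(f"genomic pool:   {genomic:.2%} duplicated reads")
```

With only 2,700 possible sequences, at most 2,700 of the 484,000 reads can be first occurrences, so more than 99% of reads are duplicates by construction, while the genomic pool stays close to zero. Non-uniform expression would skew the small RNA case further towards a few dominant sequences.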

          Is that correct?

          Best!

          PS: I have resubmitted the files, this time as .pdf



          • #6
            The FastQC modules aren't really aimed at any specific library type, but I guess that a randomly fragmented genomic library would be the sort of thing which is most likely not to trigger any warnings.

            The basic assumptions are that you expect that the library you are sequencing comes from a single source, has no intrinsic positional sequence bias, and that you don't want to saturate the diversity in the library. Anything which violates these assumptions will be flagged up for you to look at and understand. I've said before that for pretty much all of the modules there are perfectly good libraries which we've seen which fail the module but that doesn't mean that either the data or the module are wrong.

            Most of the common failures you'll see in the report come from library types where you don't have an even representation of the different sequences in the library and where you deliberately over-sequence some parts of the library to be able to look at the parts with lower representation.

            Examples of this would be over-sequencing highly expressed genes in an RNA-Seq library to be able to see lowly expressed genes. RNA-Seq libraries will usually trigger the duplication module since you are saturating at least part of your library. Small RNA libraries are even worse since some sequences will comprise a very high proportion of the library and will be hugely over-sequenced to the point where they trigger the over-represented sequences module (once they make up 0.1% of the whole library). The program is trying to warn you that you're wasting your sequencing capacity by continuing to sequence these overrepresented sequences, but then you probably knew that anyway.
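
To make the 0.1% figure above concrete, here is a rough sketch that counts identical sequences and reports those above that fraction of the library. It is not FastQC's own implementation, which may differ in how it samples and compares reads; the function name and the toy library are hypothetical.

```python
from collections import Counter


def overrepresented(sequences, threshold=0.001):
    """Return (sequence, count, fraction) for sequences above `threshold`.

    `threshold` is a fraction of all reads; 0.001 is the 0.1% level
    mentioned in the post above.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total > threshold
    ]


# Hypothetical toy library: one dominant sequence plus 9,950 singletons
# (the "SEQ..." strings are placeholders, not real reads).
library = ["TGAGGTAGTAGGTTGTATAGTT"] * 50 + [f"SEQ{i}" for i in range(9950)]

hits = overrepresented(library)
for seq, n, frac in hits:
    print(seq, n, f"{frac:.2%}")
```

In this toy library the dominant sequence makes up 50 of 10,000 reads (0.5%), so it is the only one reported; each singleton sits at 0.01% and stays below the cutoff.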



            • #7
              Thanks Simon!



              • #8
                Originally posted by simonandrews

                In the next release we'll actually have user-tunable warn / error thresholds for all modules so sites can choose to change the stringencies of the different tests to match what they want to highlight.
                This would be great. Make it a settings file that can be customized for various scenarios. That way a few conservative options (that you know work) can be included in FastQC (and others can be shared by people). This can potentially allow exclusion of modules (e.g. duplication module) altogether in cases where they may not be useful.
                Last edited by GenoMax; 11-19-2013, 04:10 AM.



                • #9
                  Originally posted by GenoMax
                  This would be great. Make it a settings file that can be customized for various scenarios. That way a few conservative options (that you know work) can be included in FastQC (and others can be shared by people). This can potentially allow exclusion of modules (e.g. duplication module) altogether in cases where they may not be useful.
                  It is a configuration file at the moment. I don't think we had the option to specify which file to use, but that should be easy enough to add. We also didn't have the option to exclude certain modules altogether, but that could also be added (we now have the concept of optional modules, so that could work easily). I might put these off for another version though, as I really need to get the current development version out of the door!
