  • mgg
    Member
    • Nov 2011
    • 12

    FastQC; overrepresented sequences versus a grep

    Hi,

    I have an odd observation with fastQC's figure for over-represented sequences versus the number I get out when I do a simple egrep for adapter sequences in the .fastq file.

    FastQC tells me I have adapter contamination. And how much. Excellent tool!

    When I take this info and do a simple egrep for the Universal adapter sequence & for the library-appropriate indexed TruSeq adapter sequence I get waaaaaaay more 'hits' than FastQC reports

    for example, FastQC says 5.31% adapter
    egrep says 21%


    There must be a simple explanation?! Suggestions welcome.

    M
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    What are your percentages? Proportion of reads containing at least one adapter? Proportion of total bases matching adapters?

    How are you counting the grep matches? One per line (i.e. one per sequence), or might it count multiple matches per line?

    Comment

    • mgg
      Member
      • Nov 2011
      • 12

      #3
      Originally posted by maubp View Post
      What are your percentages?
      Proportion of reads containing at least one adapter?
      The numbers from egrep and FastQC are (each column is a library):

      Adapters as %, from egrep    0.48  2.93  2.31  2.45  0.93  3.90  4.43  21.20  8.60  5.77
      Adapters as %, from FastQC   0.27  2.07  1.32  1.37  0.47  1.62  2.39   4.15  3.45  3.25


      Originally posted by maubp View Post
      Proportion of total bases matching adapters?
      FastQC over represented sequences tool generally reports matches of >97% over the length


      Originally posted by maubp View Post
      How are you counting the grep matches? One per line (i.e. one per sequence), or might it count multiple matches per line?

      I'm using egrep in a bash script and counting with the -c option. I also count with
      the pattern anchored to the start of the line (^) to see where the adapter is.
      total=`egrep ${indexseq[${libindex[${sample}]}]} $pathFastq$sn_1 -c`

      atstart=`egrep ^${indexseq[${libindex[${sample}]}]} $pathFastq$sn_1 -c`

      (that hideous expression in the middle ${indexseq[${libindex[${sample}]}]}
      pulls from an array the indexed adapter sequence appropriate to the library )


      I'm new to this, so quite possibly this can/could count multiple matches/line.
      But I don't think that's the source of the observation; the ^start-anchored
      egrep returns figures which with but one exception show the vast majority of
      adapters are at the start of the reads.

      still baffled ...
      m
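On the counting question: grep's -c option reports the number of matching lines, not the number of matches, so each read is counted at most once. To compare with FastQC, that count still has to be divided by the number of reads, which is a quarter of the FASTQ line count. Below is a minimal Python sketch of the same tally, assuming plain four-line FASTQ records. The function name and the toy reads are illustrative and not taken from the script above.

```python
def count_adapter_reads(fastq_lines, adapter):
    """Tally reads containing `adapter` anywhere and at the read start.

    Assumes unwrapped FASTQ: every record is exactly four lines and the
    sequence is the second line of each record.
    """
    total = anywhere = at_start = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # skip header, '+' and quality lines
            continue
        seq = line.strip()
        total += 1
        if adapter in seq:
            anywhere += 1  # counted once per read, like grep -c
        if seq.startswith(adapter):
            at_start += 1  # equivalent to the ^-anchored pattern
    return total, anywhere, at_start


# Toy data (hypothetical reads); the adapter prefix is for illustration only.
ADAPTER = "GATCGGAAGAGCACACGTCTG"
toy_fastq = [
    "@r1", ADAPTER + "AACT", "+", "I" * 25,
    "@r2", "ACGT" + ADAPTER, "+", "I" * 25,
    "@r3", "ACGTACGTACGTACGTACGTACGTA", "+", "I" * 25,
    "@r4", "TTTTTTTTTTTTTTTTTTTTTTTTT", "+", "I" * 25,
]
total, anywhere, at_start = count_adapter_reads(toy_fastq, ADAPTER)
print(f"{100 * anywhere / total:.1f}% of reads contain the adapter")
print(f"{100 * at_start / total:.1f}% of reads start with it")
```

Restricting the search to sequence lines also rules out accidental hits in header or quality lines, which a plain grep over the whole file cannot exclude.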

      Comment

      • mgogol
        Senior Member
        • Mar 2008
        • 197

        #4
        Here's a little from the documentation...



        Don't know if that really helps, though. You might want to contact the author.

        From my email with him, I was asking him "what does the "(96% over 25bp)" mean?"

        "the program does a simple ungapped matching to find the best region of match to a known contaminant. The hit description simply means that the match found covered only 25bp of the original sequence, but that this had 96% identity to the sequence in the contaminants file."
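To make a hit description such as "96% over 25bp" concrete, here is a rough Python sketch of ungapped matching. It is not FastQC's code: the scoring rule (most matching bases over the overlap), the `min_overlap` parameter and the function name are assumptions made for illustration. The 50-mer is a sequence that FastQC attributed to TruSeq Adapter, Index 6 elsewhere in this thread.

```python
def best_ungapped_match(read, contaminant, min_overlap=20):
    """Slide `read` along `contaminant` without gaps and return
    (matches, overlap) for the alignment with the most matching bases,
    or None if no alignment overlaps by at least `min_overlap` bases."""
    best = None
    # offset = contaminant index aligned with read[0]; negative offsets let
    # the read hang off the left end of the contaminant.
    for offset in range(min_overlap - len(read),
                        len(contaminant) - min_overlap + 1):
        r_start = max(0, -offset)
        c_start = max(0, offset)
        overlap = min(len(read) - r_start, len(contaminant) - c_start)
        if overlap < min_overlap:
            continue
        matches = sum(
            a == b
            for a, b in zip(read[r_start:r_start + overlap],
                            contaminant[c_start:c_start + overlap])
        )
        if best is None or matches > best[0]:
            best = (matches, overlap)
    return best


CONTAMINANT = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC"
# Hypothetical 25 bp read: a slice of the contaminant with one base changed.
read = CONTAMINANT[10:22] + "T" + CONTAMINANT[23:35]
matches, overlap = best_ungapped_match(read, CONTAMINANT)
print(f"{100 * matches / overlap:.0f}% over {overlap}bp")
```

One mismatch in a 25-base overlap gives 24/25, which is where a figure like 96% comes from.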

        Comment

        • simonandrews
          Simon Andrews
          • May 2009
          • 870

          #5
          When you are grepping with your adapter sequence are you putting in a pattern which runs the whole length of your read? The most obvious reason for the discrepancy is that there are more reads which start with adapter than have adapter over their whole length.

          The overrepresented sequences report in FastQC requires an exact match over either the whole read length or the first 50bp (whichever is shorter). If you have only partial adapter sequences in some reads, or if you have a high level of base miscalls then the value reported by FastQC would be less than the true amount of adapter.
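The effect described above can be reproduced on toy data. The sketch below counts exact 50 bp prefixes, as a stand-in for the rule just described, and compares the result with a grep-style substring search for a short adapter core. The reads, the 20-base core and the function name are made up for illustration; this is not FastQC's implementation.

```python
from collections import Counter


def prefix_percentages(reads, prefix_len=50):
    """Percentage of reads sharing each exact prefix (first `prefix_len` bases)."""
    counts = Counter(read[:prefix_len] for read in reads)
    return {seq: 100 * n / len(reads) for seq, n in counts.items()}


ADAPTER_READ = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG"  # 50 bases
CORE = ADAPTER_READ[:20]  # short pattern of the kind one might grep for

reads = (
    [ADAPTER_READ + "ACGTACGTAC"] * 3                               # clean adapter reads
    + [ADAPTER_READ[:45] + "N" + ADAPTER_READ[46:] + "ACGTACGTAC"]  # one miscall
    + ["TTTTTTTTTT" + ADAPTER_READ]                                 # adapter after 10 bases
    + ["A" * 60, "C" * 60, "G" * 60, "T" * 60, "AC" * 30]           # unrelated reads
)
exact = prefix_percentages(reads)[ADAPTER_READ]
grep_like = 100 * sum(CORE in read for read in reads) / len(reads)
print(f"exact 50 bp prefix: {exact:.0f}%, substring search: {grep_like:.0f}%")
```

The miscalled read and the read with the adapter shifted by 10 bases both contain the core, but neither shares the exact 50 bp prefix, so the exact-match figure comes out lower.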

          Comment

          • analyst
            Member
            • Jan 2011
            • 18

            #6
            Originally posted by mgg View Post
            Hi,

            I have an odd observation with fastQC's figure for over-represented sequences versus the number I get out when I do a simple egrep for adapter sequences in the .fastq file.

            There must be a simple explanation?! Suggestions welcome.

            M
            Simple it is.

            From FastQC's manual:

            To conserve memory only sequences which appear in the first 200,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.

            Comment

            • simonandrews
              Simon Andrews
              • May 2009
              • 870

              #7
              Originally posted by analyst View Post
              Simple it is.

              From FastQC's manual:

              To conserve memory only sequences which appear in the first 200,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.
              Except that that wouldn't explain getting different numbers. If a sequence is seen in the first 200,000 then it will be tracked right through the file and the final count should be accurate. This might explain a sequence being absent altogether, but if it's there the numbers should match up.
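The behaviour described in the manual excerpt can be sketched in a few lines of Python. This follows the quoted wording only (new sequences are admitted during the first N reads, then only those are counted). The function name and the tiny `track_limit` are assumptions for the demo, and FastQC's real bookkeeping may differ.

```python
def tracked_counts(reads, track_limit=200_000):
    """Count only sequences first seen within the first `track_limit` reads."""
    counts = {}
    for i, seq in enumerate(reads):
        if seq in counts:
            counts[seq] += 1      # already tracked: counted to the end of the file
        elif i < track_limit:
            counts[seq] = 1       # still inside the window: start tracking
        # otherwise the sequence first appeared too late and is ignored
    return counts


# Toy stream: "A" and "B" appear within the first 3 reads, "C" only later.
stream = ["A", "B", "A", "C", "A", "C"]
print(tracked_counts(stream, track_limit=3))
```

In the toy stream the count for "A" is complete even though one copy arrives after the window closes, while "C" is missing entirely, which is the distinction drawn above.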

              Comment

              • analyst
                Member
                • Jan 2011
                • 18

                #8
                Originally posted by simonandrews View Post
                Except that that wouldn't explain getting different numbers. If a sequence is seen in the first 200,000 then it will be tracked right through the file and the final count should be accurate. This might explain a sequence being absent altogether, but if it's there the numbers should match up.
                True, and I agree with your earlier explanation as well.
                Since the grepped pattern probably would not correspond to the whole read, which is what FastQC reports, the counts won't match. However, if it could run beyond 200,000, a number of other reads could turn up containing the same adapter, so he would come close to the grep count. That's what I saw in one of my datasets.

                btw, will appreciate if anyone has any comment re: this

                Comment

                • arrchi
                  Member
                  • Mar 2011
                  • 46

                  #9
                  Hi mgg,

                  How did you find this information from fastQC report "FastQC says 5.31% adapter"? Thanks.

                  In the meantime, I have a question for anybody. We have human RNA-seq data generated on a HiSeq, and a FastQC report showing that one sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG) accounts for more than 50% of reads. The possible source is "TruSeq Adapter, Index 6". I wonder if this means that this sequence contains adapter sequence. Should I filter out the redundant sequences or should I trim the adapter from these sequences?

                  Thanks again.

                  Comment

                  • mgg
                    Member
                    • Nov 2011
                    • 12

                    #10
                    Originally posted by arrchi View Post
                    Hi mgg,

                    How did you find this information from fastQC report "FastQC says 5.31% adapter"? Thanks.
                    It's on the report web page, under over-represented sequences, column 3, and also available from the fastqc_data text file output.

                    Rgds

                    m

                    Comment

                    • mgg
                      Member
                      • Nov 2011
                      • 12

                      #11
                      Originally posted by arrchi View Post
                      @ arrchi

                      In the meantime, I have a question for anybody. We have human RNA-seq data generated on a HiSeq, and a FastQC report showing that one sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG) accounts for more than 50% of reads. The possible source is "TruSeq Adapter, Index 6". I wonder if this means that this sequence contains adapter sequence. Should I filter out the redundant sequences or should I trim the adapter from these sequences?

                      Thanks again.
                      The sequences attached to the Index6 TruSeq Adapter may not be redundant; it's more likely that only the TruSeq adapter itself is over-represented. I'd be inclined to trim these adapter sequences off, rather than using them as a handle to filter the entire reads out (which would lose you 50% of your reads).
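As a rough illustration of trimming rather than filtering, the Python sketch below cuts a read at the first exact occurrence of the start of the adapter. It is deliberately naive: dedicated trimmers such as cutadapt or Trimmomatic also tolerate mismatches and partial adapters at the read end. The function name, the `seed_len` parameter and the insert sequence are hypothetical.

```python
def trim_adapter(read, adapter, seed_len=12):
    """Cut `read` at the first exact occurrence of the adapter's first
    `seed_len` bases; return the read unchanged if the seed is absent."""
    pos = read.find(adapter[:seed_len])
    return read if pos == -1 else read[:pos]


ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
insert = "TTGACCGTAGGCTAAC"  # hypothetical fragment sequence

print(trim_adapter(insert + ADAPTER[:20], ADAPTER))  # adapter removed, insert kept
print(repr(trim_adapter(ADAPTER, ADAPTER)))          # read that is all adapter
```

In a real FASTQ file the quality string has to be cut to the same length, and a read that is adapter from its first base trims down to nothing, so a minimum-length check after trimming is usually needed.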

                      rgds

                      m

                      Comment

                      • arrchi
                        Member
                        • Mar 2011
                        • 46

                        #12
                        Thanks, mgg.

                        Sorry, I think I did not describe my question clearly.

                        The 57% is from fastQC "Overrepresented sequences" section. The first two rows look like this:

                        #Sequence                                            Count    Percentage           Possible Source
                        AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG   329332   57.09431522083974    TruSeq Adapter, Index 6 (100% over 49bp)
                        GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC   69354    12.023487355696135   TruSeq Adapter, Index 6 (100% over 50bp)

                        Column 3 just gives the percentage. Did you conclude that "you have adapter contamination" by looking at this column and/or column 4 (Possible Source)? Could you please let me know what the "possible source" was for the "5.1%" in your data? Is it the same as (or similar to) mine?
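Since the same figures are available in the fastqc_data text file, they can be pulled out and summed per possible source. The Python sketch below assumes the tab-separated layout of that file, with each module opened by a ">>" line and closed by ">>END_MODULE". The function names are illustrative, and the two data rows are the ones quoted in this post.

```python
def parse_overrepresented(text):
    """Return (sequence, count, percentage, source) tuples from the
    'Overrepresented sequences' module of a fastqc_data.txt dump."""
    rows = []
    in_module = False
    for line in text.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line and not line.startswith("#"):
            seq, count, pct, source = line.split("\t")
            rows.append((seq, int(count), float(pct), source))
    return rows


def percent_by_source(rows):
    """Sum the percentage column per source, ignoring the '(x% over ybp)' suffix."""
    totals = {}
    for _seq, _count, pct, source in rows:
        key = source.split(" (")[0]
        totals[key] = totals.get(key, 0.0) + pct
    return totals


example = "\n".join([
    ">>Overrepresented sequences\tfail",
    "#Sequence\tCount\tPercentage\tPossible Source",
    "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG\t329332\t"
    "57.09431522083974\tTruSeq Adapter, Index 6 (100% over 49bp)",
    "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC\t69354\t"
    "12.023487355696135\tTruSeq Adapter, Index 6 (100% over 50bp)",
    ">>END_MODULE",
])
rows = parse_overrepresented(example)
for source, pct in percent_by_source(rows).items():
    print(f"{source}: {pct:.2f}% of reads")
```

The two rows are the same adapter offset by one base, so grouping by source gives the combined share of reads attributed to it.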

                        Comment

                        • mgg
                          Member
                          • Nov 2011
                          • 12

                          #13
                          Originally posted by arrchi View Post
                          Thanks, mgg.

                          Sorry, I think I did not describe my question clearly.

                          The 57% is from fastQC "Overrepresented sequences" section. The first two rows look like this:



                          Column 3 just says the percentage, did you conclude that "you have adapter contamination" by looking at this column and/or column 4 (Possible source)? Could you please let me know what the "possible source" of your data corresponding to "5.1%"? Is it the same (or similar) as mine?
                          Well there was also evidence from the kmer analysis, which given your 57% figure I would guess would also be the case for your dataset. But yes, column 3 & 4 were the source for my (rounded) figure.

                          If your data look anything like mine, take a good look at the kmer analysis; I had a series of peaks from the left side - if you look at the legend for each, you can discern the sequence of the adapter itself.

                          best

                          m

                          Comment

                          • arrchi
                            Member
                            • Mar 2011
                            • 46

                            #14
                            Great. Thanks. But the k-mer plot shows peaks on the right side instead of the left side.

                            Do you know how to derive the adapter sequence? I thought an Illumina adapter was one of 12 fixed 6-bp sequences. Am I wrong?
                            Attached Files
                            Last edited by arrchi; 12-06-2011, 01:10 PM.

                            Comment

                            • mgg
                              Member
                              • Nov 2011
                              • 12

                              #15
                              Originally posted by arrchi View Post
                              Great. Thanks. But the k-mer plot shows peaks on the right side instead of the left side.

                              Do you know how to derive the adapter sequence? I thought an Illumina adapter was one of 12 fixed 6-bp sequences. Am I wrong?
                              The position of these peaks in the kmer plot is a function of read length. Your reads are about the length of the adapter, so you've got some to the right of the plot. (My experience is solely with some 105 nt read length libraries, so I'm more used to seeing these to the left.)

                              You're absolutely right about the indexing (though I think there are 27 of them rather than just 12). It's straightforward enough to derive the adapter sequence, although having a rubbish library does make this easier. Your kmer plot is much cleaner than mine so it's more of a challenge. Nonetheless, your plot has ...

                              ... on the left
                              CGTCT (pink, centered at 17),
                              GTCTG (red, centered at 18) ...
                              I read that as CGTCTG, which is nuc 16..21 of any of the indexed adapter oligos

                              To the right you have
                              TATCT (yellow, at 40)
                              TCTCG (black)
                              CTCGT (green)
                              GTATG (blue)
                              I read that as TATCTCGTATG, which is nuc 39..49 of index 2, 6 or 10
                              (the leading 'T' is the last position of the 6 nuc index, which is T for 2, 6, 10)
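The reading-off of overlapping k-mers can be checked mechanically. The Python sketch below chains the 5-mers in the order listed, joining each on its longest overlap with the sequence built so far, and then locates the result in the 50-mer that FastQC reported for Index 6 earlier in the thread. The function name is hypothetical, and the approach assumes the k-mers are supplied in left-to-right plot order.

```python
def merge_kmers(kmers):
    """Chain k-mers in the given order, joining each one on its longest
    overlap with the end of the sequence built so far."""
    seq = kmers[0]
    for kmer in kmers[1:]:
        for ov in range(len(kmer), 0, -1):
            if seq.endswith(kmer[:ov]):
                seq += kmer[ov:]
                break
        else:
            raise ValueError(f"{kmer} does not overlap {seq}")
    return seq


# 50-mer reported by FastQC earlier in the thread (TruSeq Adapter, Index 6)
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC"

for kmers in (["CGTCT", "GTCTG"], ["TATCT", "TCTCG", "CTCGT", "GTATG"]):
    merged = merge_kmers(kmers)
    start = ADAPTER.find(merged) + 1  # 1-based; 0 would mean "not found"
    print(merged, f"{start}..{start + len(merged) - 1}")
```

For the Index 6 sequence this places the left-hand group at 16..21 and the right-hand group at 39..49, with the leading T falling on the last base of the 6 nt index.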
                              best

                              m

                              Comment
