  • FastQC; overrepresented sequences versus a grep

    Hi,

    I have an odd observation with fastQC's figure for over-represented sequences versus the number I get out when I do a simple egrep for adapter sequences in the .fastq file.

    FastQC tells me I have adapter contamination. And how much. Excellent tool!

    When I take this info and do a simple egrep for the Universal adapter sequence and for the library-appropriate indexed TruSeq adapter sequence, I get waaaaaaay more 'hits' than FastQC reports.

    for example, FastQC says 5.31% adapter
    egrep says 21%


    There must be a simple explanation?! Suggestions welcome.

    M

  • #2
    What are your percentages? Proportion of reads containing at least one adapter? Proportion of total bases matching adapters?

    How are you counting the grep matches? One per line (i.e. one per sequence), or might it count multiple matches per line?



    • #3
      Originally posted by maubp View Post
      What are your percentages?
      Proportion of reads containing at least one adapter?
      The numbers from egrep and FastQC are (each column is a library):

      Adapters as % from egrep     0.48  2.93  2.31  2.45  0.93  3.90  4.43 21.20  8.60  5.77
      Adapters as % from FastQC    0.27  2.07  1.32  1.37  0.47  1.62  2.39  4.15  3.45  3.25


      Originally posted by maubp View Post
      Proportion of total bases matching adapters?
      FastQC's overrepresented sequences module generally reports matches of >97% over the length.


      Originally posted by maubp View Post
      How are you counting the grep matches? One per line (i.e. one per sequence), or might it count multiple matches per line?

      I'm using egrep in a bash script and counting with the -c option. I also count with
      a ^start-anchored pattern to see where the adapter is.

      adapter=${indexseq[${libindex[${sample}]}]}
      total=$(egrep -c "$adapter" "$pathFastq$sn_1")
      atstart=$(egrep -c "^$adapter" "$pathFastq$sn_1")

      (that hideous expression ${indexseq[${libindex[${sample}]}]}
      pulls from an array the indexed adapter sequence appropriate to the library)


      I'm new to this, so quite possibly this can/could count multiple matches/line.
      But I don't think that's the source of the observation; the ^start-anchored
      egrep returns figures which with but one exception show the vast majority of
      adapters are at the start of the reads.
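
A note on the counting question: the -c option counts matching lines, so a read with two adapter copies is still counted once. What -c does not do is restrict the search to sequence lines or supply a denominator. The following is a minimal sketch, assuming a plain 4-line-per-record FASTQ with unwrapped sequences; the function name `adapter_percent` is made up for illustration.

```shell
# adapter_percent FASTQ PATTERN
# Prints "hits total percent": reads whose sequence line matches PATTERN,
# total reads, and the percentage. Assumes 4 lines per FASTQ record.
adapter_percent() {
    LC_ALL=C awk -v pat="$2" '
        NR % 4 == 2 {                 # line 2 of each record is the sequence
            total++
            if ($0 ~ pat) hits++      # a read counts once, however many matches
        }
        END {
            pct = total ? 100 * hits / total : 0
            printf "%d %d %.2f\n", hits, total, pct
        }' "$1"
}

# Example (hypothetical file name):
#   adapter_percent sample_1.fastq 'AGATCGGAAGAGC'
#   adapter_percent sample_1.fastq '^AGATCGGAAGAGC'
```

Dividing by the number of reads rather than the number of lines in the file keeps the percentage comparable with a per-read figure such as the one FastQC reports.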

      still baffled ...
      m



      • #4
        Here's a little from the documentation...



        Don't know if that really helps, though. You might want to contact the author.

        From my email with him, I was asking him "what does the "(96% over 25bp)" mean?"

        "the program does a simple ungapped matching to find the best region of match to a known contaminant. The hit description simply means that the match found covered only 25bp of the original sequence, but that this had 96% identity to the sequence in the contaminants file."



        • #5
          When you are grepping with your adapter sequence are you putting in a pattern which runs the whole length of your read? The most obvious reason for the discrepancy is that there are more reads which start with adapter than have adapter over their whole length.

          The overrepresented sequences report in FastQC requires an exact match over either the whole read length or the first 50bp (whichever is shorter). If you have only partial adapter sequences in some reads, or if you have a high level of base miscalls then the value reported by FastQC would be less than the true amount of adapter.
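
To see how far apart the two ways of counting can be, the sketch below counts, for the same file, (a) reads whose first 50 bp are identical to one full sequence and (b) reads that merely contain a shorter adapter fragment. It is a simplified model of the exact-match behaviour described above, not FastQC's own code; the function name and arguments are assumptions for illustration, and a 4-line-per-record FASTQ is assumed.

```shell
# exact_vs_substring FASTQ FULLSEQ FRAGMENT
# Prints "exact substring":
#   exact     = reads whose first 50 bp equal the first 50 bp of FULLSEQ
#   substring = reads containing FRAGMENT anywhere (what a plain grep finds)
exact_vs_substring() {
    LC_ALL=C awk -v full="$2" -v frag="$3" '
        BEGIN { key = substr(full, 1, 50) }
        NR % 4 == 2 {
            if (substr($0, 1, 50) == key) exact++
            if (index($0, frag) > 0)      partial++
        }
        END { printf "%d %d\n", exact, partial }' "$1"
}
```

Reads carrying a partial adapter, an adapter at a different offset, or a single miscalled base fall into the second count but not the first, which is the direction of the discrepancy reported in this thread.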



          • #6
            Originally posted by mgg View Post
            Hi,

            I have an odd observation with fastQC's figure for over-represented sequences versus the number I get out when I do a simple egrep for adapter sequences in the .fastq file.

            There must be a simple explanation?! Suggestions welcome.

            M
            Simple it is.

            From FastQC's manual:

            To conserve memory only sequences which appear in the first 200,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.



            • #7
              Originally posted by analyst View Post
              Simple it is.

              From FastQC's manual:

              To conserve memory only sequences which appear in the first 200,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.
              Except that that wouldn't explain getting different numbers. If a sequence is seen in the first 200,000 then it will be tracked right through the file and the final count should be accurate. This might explain a sequence being absent altogether, but if it's there the numbers should match up.
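
The distinction can be made concrete with a toy model of the tracking rule quoted from the manual. This is a sketch under stated assumptions, not FastQC's implementation: sequences are keyed on their first 50 bp, new keys are only admitted while the read number is within a limit (200,000 in the manual; a parameter here), and the function name `track_first_n` is invented.

```shell
# track_first_n FASTQ LIMIT
# Prints "count<TAB>sequence" for every tracked sequence, highest count first.
# A sequence is tracked only if it first appears within the first LIMIT reads,
# but once tracked it is counted through to the end of the file.
track_first_n() {
    LC_ALL=C awk -v limit="$2" '
        NR % 4 == 2 {
            reads++
            key = substr($0, 1, 50)
            if (key in count)        count[key]++     # already tracked: keep counting
            else if (reads <= limit) count[key] = 1   # new keys admitted only early on
        }
        END { for (k in count) printf "%d\t%s\n", count[k], k }' "$1" |
    LC_ALL=C sort -k1,1nr -k2,2
}
```

In this model a tracked sequence always gets its full count, and a late-appearing sequence is simply missing from the output, which matches the point made above: the cut-off can hide a sequence but cannot shrink its count.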



              • #8
                Originally posted by simonandrews View Post
                Except that that wouldn't explain getting different numbers. If a sequence is seen in the first 200,000 then it will be tracked right through the file and the final count should be accurate. This might explain a sequence being absent altogether, but if it's there the numbers should match up.
                True, and I agree with your earlier explanation as well.
                Since the grepped pattern probably would not correspond to the whole read, which is what FastQC reports, the counts won't match. However, if it could run beyond 200,000, a number of other reads could turn up containing the same adapter, so he would come close to the grep count. That's what I saw in one of my datasets.

                btw, will appreciate if anyone has any comment re: this
                http://seqanswers.com/forums/showthread.php?t=15716



                • #9
                  Hi mgg,

                  How did you find this information from fastQC report "FastQC says 5.31% adapter"? Thanks.

                  In the meantime, I have a question for anybody. We have human RNA-seq data generated on a HiSeq, and a FastQC report showing that the percentage of one sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG) is more than 50%. The possible source is "TruSeq Adapter, Index 6". I wonder if this means that this sequence contains adapter sequence. Should I filter out the redundant sequences or should I trim the adapter from these sequences?

                  Thanks again.



                  • #10
                    Originally posted by arrchi View Post
                    Hi mgg,

                    How did you find this information from fastQC report "FastQC says 5.31% adapter"? Thanks.
                    It's in the web page, under over-represented sequences, column 3, and is also available from the fastqc_data text file output.
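
For anyone who wants to pull that figure out of the text output rather than read it off the web page, here is a minimal sketch. It assumes the usual fastqc_data.txt layout: tab-separated, each module opened by a `>>Module name` line and closed by `>>END_MODULE`, with the percentage in column 3 and the possible source in column 4. The function name is made up.

```shell
# overrepresented_adapters FASTQC_DATA
# Prints "percentage<TAB>possible source" for each overrepresented sequence
# whose possible source mentions an adapter, then the summed percentage.
overrepresented_adapters() {
    LC_ALL=C awk -F '\t' '
        /^>>Overrepresented sequences/ { in_module = 1; next }
        /^>>END_MODULE/                { in_module = 0 }
        in_module && !/^#/ && tolower($4) ~ /adapter/ {
            printf "%.2f\t%s\n", $3, $4
            sum += $3
        }
        END { printf "%.2f\ttotal\n", sum }' "$1"
}

# Example (hypothetical path):
#   overrepresented_adapters sample_1_fastqc/fastqc_data.txt
```

Each row is a distinct read sequence, so summing the rows gives the share of reads that FastQC attributes to adapter in this module.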

                    Rgds

                    m



                    • #11
                      Originally posted by arrchi View Post
                      @ arrchi

                      In the meantime, I have a question for anybody. We have human RNA-seq data generated on a HiSeq, and a FastQC report showing that the percentage of one sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG) is more than 50%. The possible source is "TruSeq Adapter, Index 6". I wonder if this means that this sequence contains adapter sequence. Should I filter out the redundant sequences or should I trim the adapter from these sequences?

                      Thanks again.
                      The sequences attached to the Index 6 TruSeq adapter may not be redundant; it's more likely that only the TruSeq adapter itself is over-represented. I'd be inclined to trim these adapter sequences off, rather than using them as a handle to filter the entire reads out (which would lose you 50% of your reads).
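
As an illustration of trimming rather than filtering, the sketch below cuts each read at the first exact occurrence of the adapter and shortens the quality string to match. It is deliberately naive: exact matches only, no mismatches, no partial adapters at the read end, and a 4-line-per-record FASTQ is assumed. Dedicated trimmers such as cutadapt handle those cases; the function name here is invented.

```shell
# trim_adapter FASTQ ADAPTER
# Writes a FASTQ to stdout in which every read is cut just before the first
# exact occurrence of ADAPTER; reads without the adapter pass through unchanged.
trim_adapter() {
    LC_ALL=C awk -v ad="$2" '
        NR % 4 == 2 {
            pos  = index($0, ad)                      # 1-based start, 0 if absent
            keep = (pos > 0) ? pos - 1 : length($0)
            print substr($0, 1, keep)
            next
        }
        NR % 4 == 0 { print substr($0, 1, keep); next }   # qualities, same length
        { print }' "$1"
}
```

A read that begins with the adapter comes out with an empty sequence line, so in practice a minimum-length filter would be needed afterwards.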

                      rgds

                      m



                      • #12
                        Thanks, mgg.

                        Sorry, I think I did not describe my question clearly.

                        The 57% is from fastQC "Overrepresented sequences" section. The first two rows look like this:

                        #Sequence Count Percentage Possible Source
                        AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG 329332 57.09431522083974 TruSeq Adapter, Index 6 (100% over 49bp)
                        GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC 69354 12.023487355696135 TruSeq Adapter, Index 6 (100% over 50bp)

                        Column 3 just gives the percentage. Did you conclude that "you have adapter contamination" by looking at this column and/or column 4 (Possible Source)? Could you please let me know what the "possible source" was for the "5.1%" in your data? Is it the same as (or similar to) mine?



                        • #13
                          Originally posted by arrchi View Post
                          Thanks, mgg.

                          Sorry, I think I did not describe my question clearly.

                          The 57% is from fastQC "Overrepresented sequences" section. The first two rows look like this:



                          Column 3 just gives the percentage. Did you conclude that "you have adapter contamination" by looking at this column and/or column 4 (Possible Source)? Could you please let me know what the "possible source" was for the "5.1%" in your data? Is it the same as (or similar to) mine?
                          Well there was also evidence from the kmer analysis, which given your 57% figure I would guess would also be the case for your dataset. But yes, column 3 & 4 were the source for my (rounded) figure.

                          If your data look anything like mine, take a good look at the kmer analysis; I had a series of peaks from the left side - if you look at the legend for each, you can discern the sequence of the adapter itself.

                          best

                          m



                          • #14
                            Great. Thanks. But the k-mer plot shows peaks at the right side instead of left side.

                            Do you know how to derive the adapter sequence? I thought an adapter from Illumina is 12 fixed 6-bp sequences. Am I wrong?
                            Last edited by arrchi; 12-06-2011, 01:10 PM.



                            • #15
                              Originally posted by arrchi View Post
                              Great. Thanks. But the k-mer plot shows peaks at the right side instead of left side.

                              Do you know how to derive the adapter sequence? I thought an adapter from Illumina is 12 fixed 6-bp sequences. Am I wrong?
                              The position of these peaks in the kmer plot is a function of read length. Your reads are about the length of the adapter, so you've got some to the right of the plot. (My experience is solely with some 105 nt read length libraries, so I'm more used to seeing these to the left.)

                              You're absolutely right about the indexing (though I think there are 27 of them rather than just 12). It's straightforward enough to derive the adapter sequence, although having a rubbish library does make this easier. Your kmer plot is much cleaner than mine so it's more of a challenge. Nonetheless, your plot has ...

                              ... on the left
                              CGTCT (pink, centered at 17),
                              GTCTG (red, centered at 18) ...
                              I read that as CGTCTG, which is nuc 16..21 of any of the indexed adapter oligos.

                              To the right you have
                              TATCT (yellow, at 40)
                              TCTCG (black)
                              CTCGT (green)
                              GTATG (blue)
                              I read that as TATCTCGTATG, which is nuc 39..49 of index 2 or 10
                              (the leading 'T' is the last position of the 6 nuc index, which is T for 2, 6, 10).
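
The reading-off of k-mers described here amounts to chaining overlapping 5-mers. A minimal sketch of that step follows, assuming the k-mers are supplied in left-to-right plot order and that each one overlaps the end of what has been assembled so far; the function name `merge_kmers` is invented, and k-mers with no overlap are simply appended.

```shell
# merge_kmers KMER...
# Joins k-mers left to right, each time using the longest overlap between
# the end of the assembled string and the start of the next k-mer.
merge_kmers() {
    printf '%s\n' "$@" | LC_ALL=C awk '
        NR == 1 { merged = $0; next }
        {
            len = length($0)
            if (len > length(merged)) len = length(merged)
            # shrink until the tail of merged equals the head of this k-mer
            while (len > 0 && substr(merged, length(merged) - len + 1) != substr($0, 1, len))
                len--
            merged = merged substr($0, len + 1)
        }
        END { print merged }'
}

# Example with the k-mers read from the plot legend:
#   merge_kmers CGTCT GTCTG
#   merge_kmers TATCT TCTCG CTCGT GTATG
```

With those two inputs the function reproduces the CGTCTG and TATCTCGTATG strings worked out by hand above.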
                              best

                              m
