**maubp** · 11-22-2011, 06:02 AM

What are your percentages? Proportion of reads containing at least one adapter? Proportion of total bases matching adapters?

How are you counting the grep matches? One per line (i.e. one per sequence), or might it count multiple matches per line?

**mgg** · 11-22-2011, 06:41 AM

Originally posted by maubp View Post

What are your percentages?
Proportion of reads containing at least one adapter?

The numbers from egrep and FastQC are (each column is a library):

Adapters as % from egrep 0.48 2.93 2.31 2.45 0.93 3.90 4.43 21.20 8.60 5.77

Adapters as % from FastQC 0.27 2.07 1.32 1.37 0.47 1.62 2.39 4.15 3.45 3.25

Originally posted by maubp View Post

Proportion of total bases matching adapters?

FastQC over represented sequences tool generally reports matches of >97% over the length

Originally posted by maubp View Post

How are you counting the grep matches? One per line (i.e. one per sequence), or might it count multiple matches per line?

I'm using egrep in a bash script and counting with the -c option. I also count with
the pattern anchored to the start of the line (^) to see where the adapter is.
total=$(egrep -c "${indexseq[${libindex[${sample}]}]}" "$pathFastq$sn_1")

atstart=$(egrep -c "^${indexseq[${libindex[${sample}]}]}" "$pathFastq$sn_1")

(that hideous expression in the middle, ${indexseq[${libindex[${sample}]}]},
pulls from an array the indexed adapter sequence appropriate to the library)

I'm new to this, so quite possibly this can/could count multiple matches/line.
But I don't think that's the source of the observation; the ^start-anchored
egrep returns figures which with but one exception show the vast majority of
adapters are at the start of the reads.

still baffled ...
m
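To make the counting question concrete, here is a minimal sketch on a made-up three-read FASTQ (the reads and the `ACGTAC` "adapter" are hypothetical, not from the libraries discussed). `grep -c` counts matching lines, so a read containing the adapter twice is counted once; `grep -o` piped to `wc -l` counts every occurrence. Restricting the search to the sequence line of each 4-line record avoids accidental hits in header or quality lines.

```shell
# Toy FASTQ (hypothetical data): read1 has the adapter twice,
# read2 has it once (not at the start), read3 has none.
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
ACGTACTTTTACGTAC
+
IIIIIIIIIIIIIIII
@read2
TTACGTACGGGGGGGG
+
IIIIIIIIIIIIIIII
@read3
GGGGGGGGGGGGGGGG
+
IIIIIIIIIIIIIIII
EOF
adapter=ACGTAC

# Keep only the sequence line (2nd line) of each 4-line record.
seqs=$(awk 'NR % 4 == 2' "$fq")

# Reads containing at least one match (grep -c counts lines, not matches).
reads_with=$(printf '%s\n' "$seqs" | grep -c "$adapter")
# Reads that start with the adapter.
reads_start=$(printf '%s\n' "$seqs" | grep -c "^$adapter")
# Every non-overlapping occurrence, one per output line.
occurrences=$(printf '%s\n' "$seqs" | grep -o "$adapter" | wc -l | tr -d ' ')

echo "reads with adapter: $reads_with"
echo "reads starting with adapter: $reads_start"
echo "adapter occurrences: $occurrences"
rm -f "$fq"
```

On this toy input the per-read and per-occurrence counts differ, which is the distinction being asked about.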

**mgogol** · 11-22-2011, 07:20 AM

Here's a little from the documentation...

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/10%20Overrepresented%20Sequences.html

Don't know if that really helps, though. You might want to contact the author.

In my email exchange with him, I asked "what does the "(96% over 25bp)" mean?" His reply:

"the program does a simple ungapped matching to find the best region of match to a known contaminant. The hit description simply means that the match found covered only 25bp of the original sequence, but that this had 96% identity to the sequence in the contaminants file."
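As an illustration of what "simple ungapped matching" could look like, the sketch below slides a query along a reference at every offset, counts identical bases in the overlap, and reports the best overlap as "N% over Mbp". This is an assumption about the general idea only, not FastQC's actual code; the function name `best_hit`, the minimum-overlap parameter, and both sequences are made up.

```shell
# usage: best_hit QUERY REFERENCE MIN_OVERLAP   (all names hypothetical)
best_hit() {
  awk -v q="$1" -v r="$2" -v minlen="$3" 'BEGIN {
    lq = length(q); lr = length(r)
    # off is the shift of the query relative to the reference; no gaps are allowed.
    for (off = 1 - lq; off < lr; off++) {
      len = 0; same = 0
      for (i = 1; i <= lq; i++) {
        j = off + i                       # reference position under query base i
        if (j < 1 || j > lr) continue     # outside the overlap
        len++
        if (substr(q, i, 1) == substr(r, j, 1)) same++
      }
      # Keep the overlap with the most identical bases (first one wins on ties).
      if (len >= minlen && same > best) { best = same; bestlen = len }
    }
    if (bestlen) printf "%d%% over %dbp\n", 100 * best / bestlen, bestlen
  }'
}

# The query sits inside the reference with a single mismatch.
hit=$(best_hit ACGTTGCAAT GGGACGTAGCAATCCC 10)
echo "$hit"
```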

**simonandrews** · 11-23-2011, 01:43 AM

When you are grepping with your adapter sequence are you putting in a pattern which runs the whole length of your read? The most obvious reason for the discrepancy is that there are more reads which start with adapter than have adapter over their whole length.

The overrepresented sequences report in FastQC requires an exact match over either the whole read length or the first 50bp (whichever is shorter). If you have only partial adapter sequences in some reads, or if you have a high level of base miscalls then the value reported by FastQC would be less than the true amount of adapter.
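A small sketch of why the two numbers can differ, assuming the behaviour described above (exact identity over the first 50 bp, or the whole read if shorter). The reads and the 8-base "adapter" are hypothetical. A grep counts every read that contains the adapter, while grouping by exact sequence only adds up reads that are identical throughout.

```shell
# Hypothetical sequence lines, one read per line.
adapter=AGATCGGA
reads='AGATCGGAAGAGCACACGTC
AGATCGGAAGAGCACACGTC
AGATCGGATTTTTTTTTTTT
AGATCGGACCCCCCCCCCCC
GGGGGGGGGGGGGGGGGGGG'

# grep-style count: any read containing the adapter.
grep_count=$(printf '%s\n' "$reads" | grep -c "$adapter")

# Exact-duplicate count: group reads by their first 50 bases (the whole read
# if shorter) and report the largest group whose key contains the adapter.
dup_count=$(printf '%s\n' "$reads" |
  awk -v a="$adapter" '
    { n[substr($0, 1, 50)]++ }
    END {
      for (k in n) if (index(k, a) && n[k] > best) best = n[k]
      print best + 0
    }')

echo "reads containing adapter: $grep_count"
echo "largest group of identical adapter reads: $dup_count"
```

Here four reads start with the adapter, but only two of them are identical over their whole length, so the exact-match figure is half the grep figure.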

**analyst** · 11-23-2011, 01:49 PM

Originally posted by mgg View Post

Hi,

I have an odd observation with fastQC's figure for over-represented sequences versus the number I get out when I do a simple egrep for adapter sequences in the .fastq file.

There must be a simple explanation?! Suggestions welcome.

M

Simple it is.

From FastQC's manual:

To conserve memory only sequences which appear in the first 200,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.

**simonandrews** · 11-24-2011, 12:30 AM

Originally posted by analyst View Post

Simple it is.

From FastQC's manual:

To conserve memory only sequences which appear in the first 200,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.

Except that that wouldn't explain getting different numbers. If a sequence is seen in the first 200,000 then it will be tracked right through the file and the final count should be accurate. This might explain a sequence being absent altogether, but if it's there the numbers should match up.
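The tracking rule quoted from the manual can be sketched as follows. This is a toy model under stated assumptions, not FastQC's implementation: `limit=3` stands in for the 200,000 in the manual, and the reads are made up. Sequences first seen within the limit are counted to the end of the input; sequences that first appear later are never admitted.

```shell
# Toy input, one read per line (hypothetical).
reads='AAAA
CCCC
AAAA
GGGG
AAAA
GGGG
GGGG'

counts=$(printf '%s\n' "$reads" |
  awk -v limit=3 '
    NR <= limit     { tracked[$0] = 1 }   # new sequences are only admitted early on
    ($0 in tracked) { n[$0]++ }           # tracked sequences are counted to the end
    END { for (s in n) print s, n[s] }' | sort)

echo "$counts"
```

`AAAA` is counted in full even though most copies come after the limit, while `GGGG` is missing entirely because it first appears too late. That matches the point above: the rule can hide a sequence, but it should not undercount one that is tracked.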

**analyst** · 11-25-2011, 01:25 PM

Originally posted by simonandrews View Post

Except that that wouldn't explain getting different numbers. If a sequence is seen in the first 200,000 then it will be tracked right through the file and the final count should be accurate. This might explain a sequence being absent altogether, but if it's there the numbers should match up.

True, and I agree with your earlier explanation as well.
Since the grepped pattern probably would not correspond to the whole read, which is what FastQC reports, the counts won't match. However, if it could run beyond 200,000, a number of other reads could turn up containing the same adapter, so he would come close to the grep count. That's what I saw in one of my datasets.

btw, will appreciate if anyone has any comment re: this

http://seqanswers.com/forums/showthread.php?t=15716

**arrchi** · 12-06-2011, 08:44 AM

Hi mgg,

How did you find this information from fastQC report "FastQC says 5.31% adapter"? Thanks.

In the meantime, I have a question for anybody. We have human RNA-seq data generated on a HiSeq, and a FastQC report showing that the percentage of one sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG) is more than 50%. The possible source is "TruSeq Adapter, Index 6". I wonder if this means that this sequence contains adapter sequence. Should I filter out the redundant sequences or should I trim the adapter from these sequences?

Thanks again.

**mgg** · 12-06-2011, 08:56 AM

Originally posted by arrchi View Post

Hi mgg,

How did you find this information from fastQC report "FastQC says 5.31% adapter"? Thanks.

It's in the web page, under overrepresented sequences, column 3, and is also available from the fastqc_data text file output.

Rgds

m

**mgg** · 12-06-2011, 09:03 AM

Originally posted by arrchi View Post

@ arrchi

In the meantime, I have a question for anybody. We have human RNA-seq data generated on a HiSeq, and a FastQC report showing that the percentage of one sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG) is more than 50%. The possible source is "TruSeq Adapter, Index 6". I wonder if this means that this sequence contains adapter sequence. Should I filter out the redundant sequences or should I trim the adapter from these sequences?

Thanks again.

The sequences attached to the Index 6 TruSeq Adapter may not be redundant; it's more likely that only the TruSeq adapter itself is over-represented. I'd be inclined to trim these adapter sequences off, rather than using them as a handle to filter the entire reads out (which would lose you 50% of your reads).

rgds

m
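A rough sketch of the trim-rather-than-filter idea, assuming exact matches and an adapter at the 3' end of the insert. The reads, the `minlen` cutoff, and the shortened adapter string are hypothetical; dedicated trimmers such as cutadapt also handle mismatches and partial adapters, which this does not.

```shell
# Exact-match adapter trimming on a toy FASTQ (hypothetical reads).
adapter=AGATCGGAAGAGC
fq=$(mktemp)
cat > "$fq" <<'EOF'
@r1
TTTTGGGGCCAGATCGGAAGAGCACAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIII
@r2
AGATCGGAAGAGCACACGTCTGAACTC
+
IIIIIIIIIIIIIIIIIIIIIIIIIII
@r3
ACGTACGTACGTACGTACGTACGTACG
+
IIIIIIIIIIIIIIIIIIIIIIIIIII
EOF

trimmed=$(awk -v a="$adapter" -v minlen=5 '
  NR % 4 == 1 { head = $0 }
  NR % 4 == 2 {
    seq = $0
    p = index(seq, a)                       # 1-based start of the adapter, 0 if absent
    keep = (p ? p - 1 : length(seq))        # bases to keep in front of the adapter
  }
  NR % 4 == 0 {
    # Cut sequence and quality to the same length; drop reads left too short.
    if (keep >= minlen) {
      printf "%s\n%s\n+\n%s\n", head, substr(seq, 1, keep), substr($0, 1, keep)
    }
  }' "$fq")

echo "$trimmed"
rm -f "$fq"
```

In this toy run r1 keeps its 10 insert bases, r3 is untouched, and r2 (adapter from the first base) is dropped because nothing is left after trimming.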

**arrchi** · 12-06-2011, 12:10 PM

Thanks, mgg.

Sorry, I think I did not describe my question clearly.

The 57% is from fastQC "Overrepresented sequences" section. The first two rows look like this:

| Sequence | Count | Percentage | Possible Source |
|---|---|---|---|
| AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG | 329332 | 57.09431522083974 | TruSeq Adapter, Index 6 (100% over 49bp) |
| GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC | 69354 | 12.023487355696135 | TruSeq Adapter, Index 6 (100% over 50bp) |

Column 3 just says the percentage, did you conclude that "you have adapter contamination" by looking at this column and/or column 4 (Possible source)? Could you please let me know what the "possible source" of your data corresponding to "5.1%"? Is it the same (or similar) as mine?
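For what column 3 represents, a quick arithmetic check may help, assuming the percentage is simply 100 × Count / total reads counted (an assumption about the report, not something stated in the thread). Under that assumption both rows should imply the same total. The helper name `total_from` is made up; the counts and percentages are copied from the two rows above.

```shell
# Total reads implied by one row: count / (percentage / 100).
total_from() {
  awk -v c="$1" -v p="$2" 'BEGIN { printf "%.0f\n", c * 100 / p }'
}

t1=$(total_from 329332 57.09431522083974)
t2=$(total_from 69354 12.023487355696135)
echo "implied total reads: $t1 (row 1), $t2 (row 2)"

# Combined share of the two adapter-like sequences.
awk 'BEGIN { printf "combined: %.2f%%\n", 57.09431522083974 + 12.023487355696135 }'
```

If the two implied totals agree, the percentage column is just the count expressed as a share of all reads, and the two rows can be added to get the overall share of adapter-like reads.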

**mgg** · 12-06-2011, 12:20 PM

Originally posted by arrchi View Post

Thanks, mgg.

Sorry, I think I did not describe my question clearly.

The 57% is from fastQC "Overrepresented sequences" section. The first two rows look like this:

Column 3 just says the percentage, did you conclude that "you have adapter contamination" by looking at this column and/or column 4 (Possible source)? Could you please let me know what the "possible source" of your data corresponding to "5.1%"? Is it the same (or similar) as mine?

Well there was also evidence from the kmer analysis, which given your 57% figure I would guess would also be the case for your dataset. But yes, column 3 & 4 were the source for my (rounded) figure.

If your data look anything like mine, take a good look at the kmer analysis; I had a series of peaks from the left side - if you look at the legend for each, you can discern the sequence of the adapter itself.

best

m

**arrchi** · 12-06-2011, 01:05 PM

Great. Thanks. But the k-mer plot shows peaks at the right side instead of left side.

Do you know how to derive the adapter sequence? I thought the Illumina adapters were a set of 12 fixed 6-bp sequences. Am I wrong?

Attached Files

kmer_profiles.png (37.3 KB, 316 views)

**mgg** · 12-06-2011, 01:52 PM

Originally posted by arrchi View Post

Great. Thanks. But the k-mer plot shows peaks at the right side instead of left side.

Do you know how to derive the adapter sequence? I thought the Illumina adapters were a set of 12 fixed 6-bp sequences. Am I wrong?

The position of these peaks in the kmer plot is a function of read length. Your reads are about the length of the adapter, so you've got some peaks to the right of the plot. (My experience is solely with some 105 nt read-length libraries, so I'm more used to seeing these to the left.)

You're absolutely right about the indexing (though I think there are 27 of them rather than just 12). It's straightforward enough to derive the adapter sequence, although having a rubbish library does make this easier. Your kmer plot is much cleaner than mine so it's more of a challenge. Nonetheless, your plot has ...

... on the left

CGTCT (pink) centered at 17,

 GTCTG (red) centered at 18 ...

I read that as CGTCTG, which is nuc 16..21 of any of the indexed adapter oligos



To the right you have

TATCT (yellow) at 40

  TCTCG (black)

   CTCGT (green)

      GTATG (blue)         

I read that as TATCTCGTATG, which is nuc 39..49 of index 2, 6 or 10

(the leading 'T' is the last position of the 6 nuc index, which is T for 2, 6, 10)

best

m
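The k-mer reading above can be reproduced with a short sketch: merge the k-mers left to right by their longest suffix/prefix overlap, then look up where the merged string sits in an adapter sequence. The reference used here is the second overrepresented sequence quoted earlier in the thread (the one starting `GATCGG`); treating it as the adapter oligo for numbering purposes is an assumption, and the function names `assemble` and `locate` are made up.

```shell
ref=GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC

# usage: assemble KMER1 KMER2 ...   (k-mers in left-to-right plot order)
assemble() {
  printf '%s\n' "$@" | awk '
    function merge(a, b,   k) {
      # Try the longest possible suffix/prefix overlap first.
      for (k = (length(a) < length(b) ? length(a) : length(b)); k > 0; k--) {
        if (substr(a, length(a) - k + 1) == substr(b, 1, k)) {
          return a substr(b, k + 1)
        }
      }
      return a b                          # no overlap: plain concatenation
    }
    NR == 1 { s = $0; next }
    { s = merge(s, $0) }
    END { print s }'
}

# Prints the 1-based start..end of $1 within $ref (nothing if absent).
locate() {
  awk -v r="$ref" -v q="$1" 'BEGIN {
    p = index(r, q)
    if (p) print p ".." (p + length(q) - 1)
  }'
}

left=$(assemble CGTCT GTCTG)
right=$(assemble TATCT TCTCG CTCGT GTATG)
echo "$left at $(locate "$left")"
echo "$right at $(locate "$right")"
```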

FastQC; overrepresented sequences versus a grep
