SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Overrepresented kmers at the start of reads kentk Bioinformatics 20 07-23-2014 02:23 AM
fastQC papori RNA Sequencing 3 02-04-2012 02:48 PM
Velvet Assembler: expected coverage versus estimated coverage versus effective covera DMCH Bioinformatics 1 11-30-2011 05:21 AM
interpretation of FASTQC Overrepresented Kmers mattanswers Bioinformatics 1 09-20-2011 01:40 PM
fastqc - overrepresented sequences PFS Bioinformatics 3 07-05-2011 07:18 PM

Reply
 
Thread Tools
Old 11-22-2011, 04:22 AM   #1
mgg
Member
 
Location: London, UK

Join Date: Nov 2011
Posts: 12
Default FastQC; overrepresented sequences versus a grep

Hi,

I have an odd observation with fastQC's figure for over-represented sequences versus the number I get out when I do a simple egrep for adapter sequences in the .fastq file.

FastQC tells me I have adapter contamination. And how much. Excellent tool!

When I take this info and do a simple egrep for the Universal adapter sequence & for the library-appropriate indexed TruSeq adapter sequence I get waaaaaaay more 'hits' than FastQC reports

for example, FastQC says 5.31% adapter
egrep says 21%


There must be a simple explanation?! Suggestions welcome.

M
mgg is offline   Reply With Quote
Old 11-22-2011, 06:02 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

What are your percentages? Proportion of reads containing at least one adapter? Proportion of total bases matching adapters?

How are you counting the grep matches? One per line (i.e. one per sequence), or might it count multiple matches per line?
maubp is offline   Reply With Quote
Old 11-22-2011, 06:41 AM   #3
mgg
Member
 
Location: London, UK

Join Date: Nov 2011
Posts: 12
Default

Quote:
Originally Posted by maubp View Post
What are your percentages?
Proportion of reads containing at least one adapter?
The numbers from egrep and FastQC are (each col is a library

Adapters as % 0.48 2.93 2.31 2.45 0.93 3.90 4.43 21.20 8.60 5.77

from FastQC 0.27 2.07 1.32 1.37 0.47 1.62 2.39 4.15 3.45 3.25


Quote:
Originally Posted by maubp View Post
Proportion of total bases matching adapters?
FastQC over represented sequences tool generally reports matches of >97% over the length


Quote:
Originally Posted by maubp View Post
How are you counting the grep matches? One per line (i.e. one per sequence), or might it count multiple matches per line?
[/QUOTE]

I'm using egrep in bash script. I count using -c option. I also count with
pattern ^start anchored to see where the adapter is.
total=`egrep ${indexseq[${libindex[${sample}]}]} $pathFastq$sn_1 -c`

atstart=`egrep ^${indexseq[${libindex[${sample}]}]} $pathFastq$sn_1 -c`

(that hideous expression in the middle ${indexseq[${libindex[${sample}]}]}
pulls from an array the indexed adapter sequence appropriate to the library )


I'm new to this, so quite possibly this can/could count multiple matches/line.
But I don't think that's the source of the observation; the ^start-anchored
egrep returns figures which with but one exception show the vast majority of
adapters are at the start of the reads.

still baffled ...
m
mgg is offline   Reply With Quote
Old 11-22-2011, 07:20 AM   #4
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 197
Default

Here's a little from the documentation...

http://www.bioinformatics.bbsrc.ac.u...Sequences.html

Don't know if that really helps, though. You might want to contact the author.

From my email with him, I was asking him "what does the "(96% over 25bp)" mean?"

"the program does a simple ungapped matching to find the best region of match to a known contaminant. The hit description simply means that the match found covered only 25bp of the original sequence, but that this had 96% identity to the sequence in the contaminants file."
mgogol is offline   Reply With Quote
Old 11-23-2011, 01:43 AM   #5
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

When you are grepping with your adapter sequence are you putting in a pattern which runs the whole length of your read? The most obvious reason for the discrepancy is that there are more reads which start with adapter than have adapter over their whole length.

The overrepresented sequences report in FastQC requires an exact match over either the whole read length or the first 50bp (whichever is shorter). If you have only partial adapter sequences in some reads, or if you have a high level of base miscalls then the value reported by FastQC would be less than the true amount of adapter.
simonandrews is offline   Reply With Quote
Old 11-23-2011, 01:49 PM   #6
analyst
Member
 
Location: US

Join Date: Jan 2011
Posts: 18
Default

Quote:
Originally Posted by mgg View Post
Hi,

I have an odd observation with fastQC's figure for over-represented sequences versus the number I get out when I do a simple egrep for adapter sequences in the .fastq file.

There must be a simple explanation?! Suggestions welcome.

M
Simple it is.

From FastQC's manual:

To conserve memory only sequences which appear in the first 200,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.
analyst is offline   Reply With Quote
Old 11-24-2011, 12:30 AM   #7
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by analyst View Post
Simple it is.

From FastQC's manual:

To conserve memory only sequences which appear in the first 200,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.
Except that that wouldn't explain getting different numbers. If a sequence is seen in the first 200,000 then it will be tracked right through the file and the final count should be accurate. This might explain a sequence being absent all together, but it's there the numbers should match up.
simonandrews is offline   Reply With Quote
Old 11-25-2011, 01:25 PM   #8
analyst
Member
 
Location: US

Join Date: Jan 2011
Posts: 18
Default

Quote:
Originally Posted by simonandrews View Post
Except that that wouldn't explain getting different numbers. If a sequence is seen in the first 200,000 then it will be tracked right through the file and the final count should be accurate. This might explain a sequence being absent all together, but it's there the numbers should match up.
true, and i agree with your earlier explanation as well.
since the grepped pattern probably would not correspond to the whole read, which is what FastQC reports, counts wont match. However, if it could run beyond 200,000, a number of other reads could turn up containing the same adapter, so he would come close to grep count. Thats what I saw in one of my datasets.

btw, will appreciate if anyone has any comment re: this
http://seqanswers.com/forums/showthread.php?t=15716
analyst is offline   Reply With Quote
Old 12-06-2011, 08:44 AM   #9
arrchi
Member
 
Location: ma

Join Date: Mar 2011
Posts: 46
Default

Hi mgg,

How did you find this information from fastQC report "FastQC says 5.31% adapter"? Thanks.

At the meantime, I have a question for anybody. We have a human RNA seq data generated by Hiseq, and a fastQC report showing that the percentage of one sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG) is more than 50%. The possible source is "TruSeq Adapter, Index 6). I wonder if this means that this sequence contains adapter sequence. Should I filter out the redundant sequences or should I trim the adapter from these sequences?

Thanks again.
arrchi is offline   Reply With Quote
Old 12-06-2011, 08:56 AM   #10
mgg
Member
 
Location: London, UK

Join Date: Nov 2011
Posts: 12
Default

Quote:
Originally Posted by arrchi View Post
Hi mgg,

How did you find this information from fastQC report "FastQC says 5.31% adapter"? Thanks.
It's in the Web page, over-represented sequences, column3, and also available from the fastqc_data text file output.

Rgds

m
mgg is offline   Reply With Quote
Old 12-06-2011, 09:03 AM   #11
mgg
Member
 
Location: London, UK

Join Date: Nov 2011
Posts: 12
Default

Quote:
Originally Posted by arrchi View Post
@ arrchi

At the meantime, I have a question for anybody. We have a human RNA seq data generated by Hiseq, and a fastQC report showing that the percentage of one sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG) is more than 50%. The possible source is "TruSeq Adapter, Index 6). I wonder if this means that this sequence contains adapter sequence. Should I filter out the redundant sequences or should I trim the adapter from these sequences?

Thanks again.
The sequences attached to the Index6 TruSeq Adapter may not be redundant; its more likely that only the TruSeq adapter itself is over-represented. I'd be inclined to trim these adapter sequences off, rather than using them as a handle to filter the entire reads out (which would lose you 50% of your reads).

rgds

m
mgg is offline   Reply With Quote
Old 12-06-2011, 12:10 PM   #12
arrchi
Member
 
Location: ma

Join Date: Mar 2011
Posts: 46
Default

Thanks, mgg.

Sorry, I think I did not describe my question clearly.

The 57% is from fastQC "Overrepresented sequences" section. The first two rows look like this:

Quote:
#Sequence Count Percentage Possible Source
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG 329332 57.09431522083974 TruSeq Adapter, Index 6 (100% over 49bp)
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC 69354 12.023487355696135 TruSeq Adapter, Index 6 (100% over 50bp)
Column 3 just says the percentage, did you conclude that "you have adapter contamination" by looking at this column and/or column 4 (Possible source)? Could you please let me know what the "possible source" of your data corresponding to "5.1%"? Is it the same (or similar) as mine?
arrchi is offline   Reply With Quote
Old 12-06-2011, 12:20 PM   #13
mgg
Member
 
Location: London, UK

Join Date: Nov 2011
Posts: 12
Default

Quote:
Originally Posted by arrchi View Post
Thanks, mgg.

Sorry, I think I did not describe my question clearly.

The 57% is from fastQC "Overrepresented sequences" section. The first two rows look like this:



Column 3 just says the percentage, did you conclude that "you have adapter contamination" by looking at this column and/or column 4 (Possible source)? Could you please let me know what the "possible source" of your data corresponding to "5.1%"? Is it the same (or similar) as mine?
Well there was also evidence from the kmer analysis, which given your 57% figure I would guess would also be the case for your dataset. But yes, column 3 & 4 were the source for my (rounded) figure.

If your data look anything like mine, take a good look at the kmer analysis; I had a series of peaks from the left side - if you look at the legend for each, you can discern the sequence of the adapter itself.

best

m
mgg is offline   Reply With Quote
Old 12-06-2011, 01:05 PM   #14
arrchi
Member
 
Location: ma

Join Date: Mar 2011
Posts: 46
Default

Great. Thanks. But the k-mer plot shows peaks at the right side instead of left side.

Did you know how to derive the adapter sequence? I thought an adapter from Illumina is 12 fixed 6-bp sequences. Am I wrong?
Attached Images
File Type: png kmer_profiles.png (37.3 KB, 78 views)

Last edited by arrchi; 12-06-2011 at 01:10 PM.
arrchi is offline   Reply With Quote
Old 12-06-2011, 01:52 PM   #15
mgg
Member
 
Location: London, UK

Join Date: Nov 2011
Posts: 12
Default

Quote:
Originally Posted by arrchi View Post
Great. Thanks. But the k-mer plot shows peaks at the right side instead of left side.

Did you know how to derive the adapter sequence? I thought an adapter from Illumina is 12 fixed 6-bp sequences. Am I wrong?
The position of these peaks in the kmer plot is a function of read lengths. Your reads are ~ the length of the adapter, so you've got some to the right of the plot. (my experience is solely with some 105nu read length libraries, so I'm more used to seeing these to the left)

You're absolutely right about the indexing (though I think there are 27 of them rather than just 12). It's straighforward enough to derive the adapter sequence, although having a rubbish the library does make this easier. Your kmer plot is much cleaner than mine so it's more of a challenge. Nontheless, your plot has ...

PHP Code:
... on the left
CGTCT 
(pinkcentered at 17,
 
GTCTG (redcentered at 18 ...
I read that as CGTCTGwhich is nuc 16..21 of any of the indexed adapter oligos

To the right you have
TATCT 
(yellowat 40
  TCTCG 
(black)
   
CTCGT (green)
      
GTATG (blue)         
I read that as TATCTCGTATGwhich is nuc 39..40 of index 2or 10
(the leading 'T' is the last position of the 6 nuc indexwhich is T for 2610
best

m
mgg is offline   Reply With Quote
Old 12-06-2011, 02:20 PM   #16
arrchi
Member
 
Location: ma

Join Date: Mar 2011
Posts: 46
Default

Thanks again. I think these sequences (CGTCTG and TATCTCGTATG) will be removed if i do adapter trimming?

Do you have experience of using any software to trim adapter sequence?
arrchi is offline   Reply With Quote
Old 12-23-2011, 02:51 AM   #17
moritzhess
Member
 
Location: freiburg

Join Date: Apr 2010
Posts: 25
Default

In my opinion the most convenient and sensitive but also potentially slowest way is to align the illumina adapters using e.g. SSAHA2 which will even detect adapters when there is a rather high error rate in the data. The SSAHA2 output can then be parsed to cut the first base in a read with aligned adapter or to set the PHRED Score to zero from this position on.

UPDATE:

In fact it is not that slow: 2 min per 1M reads.

Last edited by moritzhess; 12-23-2011 at 02:57 AM.
moritzhess is offline   Reply With Quote
Reply

Tags
adapter contamination, fastqc

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:44 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO