#1 |
Member
Location: London, UK Join Date: Nov 2011
Posts: 12
Hi,
I have an odd observation: FastQC's figure for overrepresented sequences differs from the number I get when I do a simple egrep for adapter sequences in the .fastq file.

FastQC tells me I have adapter contamination, and how much. Excellent tool! But when I take this info and egrep for the Universal adapter sequence and for the library-appropriate indexed TruSeq adapter sequence, I get waaaaaaay more 'hits' than FastQC reports. For example:

FastQC says 5.31% adapter
egrep says 21%

There must be a simple explanation?! Suggestions welcome.
M
#2 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
What are your percentages? Proportion of reads containing at least one adapter? Proportion of total bases matching adapters?
How are you counting the grep matches? One per line (i.e. one per sequence), or might it count multiple matches per line?
#3 | ||
Member
Location: London, UK Join Date: Nov 2011
Posts: 12
Quote:
Adapters as %:             0.48 2.93 2.31 2.45 0.93 3.90 4.43 21.20 8.60 5.77
Adapters as % from FastQC: 0.27 2.07 1.32 1.37 0.47 1.62 2.39  4.15 3.45 3.25

The FastQC overrepresented sequences tool generally reports matches of >97% over the length.

Quote:
I'm using egrep in a bash script. I count using the -c option. I also count with the pattern anchored to the ^start to see where the adapter is.

    total=$(egrep -c "${indexseq[${libindex[${sample}]}]}" "${pathFastq}${sn_1}")
    atstart=$(egrep -c "^${indexseq[${libindex[${sample}]}]}" "${pathFastq}${sn_1}")

(That hideous expression in the middle, ${indexseq[${libindex[${sample}]}]}, pulls the indexed adapter sequence appropriate to the library from an array.)

I'm new to this, so quite possibly this could count multiple matches per line. But I don't think that's the source of the observation; the ^start-anchored egrep returns figures which, with but one exception, show that the vast majority of adapters are at the start of the reads.

still baffled ...
m
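For reference, `grep -c` reports the number of matching lines rather than the number of individual matches, so a read containing the adapter twice is still counted once. A minimal Python sketch of the same two counts follows, restricted to the sequence line of each record. It assumes plain four-line FASTQ records; the function name `count_adapter_reads` and the toy adapter string are hypothetical, not taken from the poster's script.

```python
from typing import Iterable, Tuple


def count_adapter_reads(fastq_lines: Iterable[str], adapter: str) -> Tuple[int, int, int]:
    """Return (reads, reads containing adapter, reads starting with adapter).

    Assumes plain four-line FASTQ records, so the sequence is line 2 of 4.
    """
    n_reads = containing = at_start = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:                 # skip header, '+' and quality lines
            continue
        seq = line.strip()
        n_reads += 1
        if adapter in seq:             # one count per read, like `grep -c`
            containing += 1
        if seq.startswith(adapter):    # like `grep -c '^ADAPTER'`
            at_start += 1
    return n_reads, containing, at_start


# Toy data; the adapter string is a made-up placeholder.
records = [
    "@r1", "AGATCGGAAGTTTT", "+", "IIIIIIIIIIIIII",
    "@r2", "TTTTAGATCGGAAG", "+", "IIIIIIIIIIIIII",
    "@r3", "ACGTACGTACGTAC", "+", "IIIIIIIIIIIIII",
    "@r4", "AGATCGGAAGAGATCGGAAG", "+", "IIIIIIIIIIIIIIIIIIII",
]
total, containing, at_start = count_adapter_reads(records, "AGATCGGAAG")
print(f"{containing / total:.0%} contain adapter, {at_start / total:.0%} start with it")
```

Dividing by the number of reads rather than the number of lines in the file matters here: a FASTQ file has four lines per read, so the denominator changes the percentage by a factor of four.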
#4 |
Senior Member
Location: Kansas City Join Date: Mar 2008
Posts: 197
Here's a little from the documentation...
http://www.bioinformatics.bbsrc.ac.u...Sequences.html

Don't know if that really helps, though. You might want to contact the author. In my email with him, I asked what the "(96% over 25bp)" means, and he replied:

"the program does a simple ungapped matching to find the best region of match to a known contaminant. The hit description simply means that the match found covered only 25bp of the original sequence, but that this had 96% identity to the sequence in the contaminants file."
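To make the "(96% over 25bp)" description concrete, here is a rough sketch of ungapped matching: the read is slid along the contaminant with no insertions or deletions, and the offset with the most matching bases is reported. This is an illustration under assumptions, not FastQC's actual code; the function name `best_ungapped_match` and the `min_overlap` parameter are hypothetical.

```python
def best_ungapped_match(read: str, contaminant: str, min_overlap: int = 20):
    """Slide `read` along `contaminant` without gaps and return
    (matching_bases, overlap_length, percent_identity) for the best offset,
    or None if no overlap reaches `min_overlap`."""
    best = None
    # offset = position of read[0] relative to contaminant[0]; may be negative
    for offset in range(-len(read) + 1, len(contaminant)):
        start_r = max(0, -offset)
        start_c = max(0, offset)
        overlap = min(len(read) - start_r, len(contaminant) - start_c)
        if overlap < min_overlap:
            continue
        matches = sum(
            read[start_r + k] == contaminant[start_c + k] for k in range(overlap)
        )
        if best is None or matches > best[0]:
            best = (matches, overlap, 100.0 * matches / overlap)
    return best


contaminant = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
read = "AGATCGGAAGAGTACACGTCTGAAC"   # first 25 bases with one miscall (C -> T)
matches, overlap, identity = best_ungapped_match(read, contaminant)
print(f"{identity:.0f}% over {overlap}bp")   # prints "96% over 25bp"
```

Because no gaps are allowed, a single inserted or deleted base in the read would shift everything after it out of register and sharply reduce the reported identity.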
#5 |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
When you are grepping with your adapter sequence are you putting in a pattern which runs the whole length of your read? The most obvious reason for the discrepancy is that there are more reads which start with adapter than have adapter over their whole length.
The overrepresented sequences report in FastQC requires an exact match over either the whole read length or the first 50bp (whichever is shorter). If you have only partial adapter sequences in some reads, or if you have a high level of base miscalls, then the value reported by FastQC would be less than the true amount of adapter.
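The gap between the two numbers can be sketched with toy data. The code below contrasts an exact-match count keyed on the first `key_len` bases of each read (the rule described above) with a grep-like substring count. The function names and the short toy sequences are hypothetical, and this is only an illustration of the idea, not FastQC's implementation.

```python
from collections import Counter
from typing import List


def percent_exact(reads: List[str], sequence: str, key_len: int = 50) -> float:
    """Percentage of reads whose first `key_len` bases (or whole read, if
    shorter) are exactly `sequence` truncated the same way."""
    keys = Counter(r[:key_len] for r in reads)
    return 100.0 * keys[sequence[:key_len]] / len(reads)


def percent_containing(reads: List[str], adapter: str) -> float:
    """Percentage of reads containing `adapter` anywhere (a grep-like count)."""
    return 100.0 * sum(adapter in r for r in reads) / len(reads)


adapter = "AGATCGGAAG"              # toy adapter, shortened for readability
full = adapter + "AGCACA"           # a read that is adapter over its whole length
reads = [
    full,                           # counted by both methods
    adapter + "TTTTTT",             # starts with adapter, then diverges
    "CC" + adapter + "GGGG",        # adapter in the middle of the read
    "ACGTACGTACGTACGT",
    "TTTTGGGGCCCCAAAA",
]
print(percent_exact(reads, full), percent_containing(reads, adapter))
```

Only the first read is identical to the overrepresented sequence, while three reads contain the adapter, so the substring count is three times the exact-match figure in this toy set.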
#6 | |
Member
Location: US Join Date: Jan 2011
Posts: 18
Quote:
From FastQC's manual: "To conserve memory only sequences which appear in the first 200,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module."
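The tracking rule quoted from the manual can be sketched as follows. This is a simplified illustration, assuming whole reads are used as keys; the function name `track_overrepresented` and the `observe_first` parameter are hypothetical and not part of FastQC.

```python
from collections import Counter
from typing import Iterable


def track_overrepresented(reads: Iterable[str], observe_first: int = 200_000) -> Counter:
    """Count reads, but only for sequences first seen within the first
    `observe_first` reads; novel sequences appearing later are ignored."""
    counts: Counter = Counter()
    for i, read in enumerate(reads):
        # `read in counts` checks membership without creating a new key
        if i < observe_first or read in counts:
            counts[read] += 1
    return counts


# Toy run with a window of 3 reads instead of 200,000.
toy = ["AAAA", "CCCC", "AAAA", "GGGG", "GGGG", "GGGG", "AAAA"]
counts = track_overrepresented(toy, observe_first=3)
print(counts)
```

In the toy run, "GGGG" occurs three times but first appears after the window closes, so it is never tracked, while the late occurrence of "AAAA" is still counted because that sequence was already being followed.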
#7 | |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
Quote:
#8 | |
Member
Location: US Join Date: Jan 2011
Posts: 18
Quote:
Since the grepped pattern probably would not correspond to the whole read, which is what FastQC reports, the counts won't match. However, if FastQC could track new sequences beyond the first 200,000 reads, a number of other reads containing the same adapter could turn up, so he would come close to the grep count. That's what I saw in one of my datasets.

Btw, I would appreciate it if anyone has any comment on this: http://seqanswers.com/forums/showthread.php?t=15716
#9 |
Member
Location: ma Join Date: Mar 2011
Posts: 46
Hi mgg,
How did you find this information in the FastQC report: "FastQC says 5.31% adapter"? Thanks.

In the meantime, I have a question for anybody. We have human RNA-seq data generated on a HiSeq, and a FastQC report showing that one sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATG) accounts for more than 50% of the reads. The possible source is listed as "TruSeq Adapter, Index 6". I wonder if this means that this sequence contains adapter sequence. Should I filter out the redundant sequences, or should I trim the adapter from these sequences? Thanks again.
#10 |
Member
Location: London, UK Join Date: Nov 2011
Posts: 12
#11 | |
Member
Location: London, UK Join Date: Nov 2011
Posts: 12
Quote:
rgds m
#12 | |
Member
Location: ma Join Date: Mar 2011
Posts: 46
Thanks, mgg.
Sorry, I think I did not describe my question clearly. The 57% is from fastQC "Overrepresented sequences" section. The first two rows look like this: Quote:
#13 | |
Member
Location: London, UK Join Date: Nov 2011
Posts: 12
Quote:
If your data look anything like mine, take a good look at the kmer analysis; I had a series of peaks from the left side. If you look at the legend for each, you can discern the sequence of the adapter itself.
best
m
#14 |
Member
Location: ma Join Date: Mar 2011
Posts: 46
Great. Thanks. But the k-mer plot shows peaks on the right side instead of the left side.
Do you know how to derive the adapter sequence? I thought an Illumina adapter is one of 12 fixed 6-bp sequences. Am I wrong?

Last edited by arrchi; 12-06-2011 at 01:10 PM.
#15 | |
Member
Location: London, UK Join Date: Nov 2011
Posts: 12
Quote:
You're absolutely right about the indexing (though I think there are 27 of them rather than just 12). It's straightforward enough to derive the adapter sequence, although having a rubbish library does make this easier. Your kmer plot is much cleaner than mine, so it's more of a challenge. Nonetheless, your plot has ...
PHP Code:
m
#16 |
Member
Location: ma Join Date: Mar 2011
Posts: 46
Thanks again. I think these sequences (CGTCTG and TATCTCGTATG) will be removed if I do adapter trimming?
Do you have experience using any software to trim adapter sequences?
#17 |
Member
Location: freiburg Join Date: Apr 2010
Posts: 25
In my opinion, the most convenient and sensitive, but also potentially slowest, way is to align the Illumina adapters using e.g. SSAHA2, which will detect adapters even when there is a rather high error rate in the data. The SSAHA2 output can then be parsed to cut the read from the first base of the aligned adapter, or to set the PHRED score to zero from this position on.
UPDATE: In fact it is not that slow: 2 min per 1M reads.

Last edited by moritzhess; 12-23-2011 at 02:57 AM.
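The two post-processing options mentioned above (cutting the read at the adapter, or zeroing the quality from that position on) might look roughly like this. It is a sketch under assumptions: it takes the adapter start position as given, uses an exact string search as a stand-in for parsing real SSAHA2 output, and assumes Phred+33 quality encoding. The function name `trim_or_mask` is hypothetical.

```python
from typing import Tuple


def trim_or_mask(seq: str, qual: str, adapter_start: int, mask: bool = False) -> Tuple[str, str]:
    """Given the position where the adapter alignment begins, either cut the
    read there or keep the bases and set their qualities to Phred 0.

    Assumes Phred+33 encoding, where '!' is quality 0.
    """
    if adapter_start < 0:            # no adapter found: leave the read alone
        return seq, qual
    if mask:
        return seq, qual[:adapter_start] + "!" * (len(qual) - adapter_start)
    return seq[:adapter_start], qual[:adapter_start]


seq = "ACGTACGTAGATCGGAAG"
qual = "I" * len(seq)
start = seq.find("AGATCGGAAG")       # exact search standing in for an aligner hit

print(trim_or_mask(seq, qual, start))
print(trim_or_mask(seq, qual, start, mask=True))
```

Masking keeps read lengths unchanged, which some downstream tools expect, while cutting removes the adapter bases outright; which is preferable depends on the rest of the pipeline.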
Tags: adapter contamination, fastqc