#1 |
Member
Location: Philippines Join Date: Dec 2011
Posts: 17
|
I recently discovered FastQC and ran it on one of our datasets that is proving difficult to assemble. I was wondering how to interpret this part of the FastQC output:
[FastQC screenshot not preserved]
Any ideas? |
#2 |
Senior Member
Location: Boston Join Date: Nov 2009
Posts: 224
|
Is this RNA-Seq? If so, this looks like it could be the result of random hexamer priming. Does the nucleotide distribution look off at the beginning too?
Hansen, K. D., S. E. Brenner, et al. (2010). "Biases in Illumina transcriptome sequencing caused by random hexamer priming." Nucleic Acids Research 38(12): e131. |
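One way to answer the question about nucleotide distribution without relying on the FastQC plot is to tabulate the base composition at each of the first few read positions. The following is a minimal sketch, assuming an uncompressed FASTQ file with four lines per record; the function names and the default of 20 positions are illustrative choices, not anything FastQC itself exposes.

```python
from collections import Counter


def read_fastq_sequences(path):
    """Yield the sequence line of each record in an uncompressed FASTQ file."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of every 4-line record
                yield line.strip()


def base_composition(reads, n_positions=20):
    """Return one dict per position giving the fraction of A, C, G and T."""
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counts[i][base] += 1
    fractions = []
    for position_counts in counts:
        # Only A/C/G/T are counted in the denominator; N calls are ignored.
        total = sum(position_counts[b] for b in "ACGT")
        fractions.append(
            {b: (position_counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return fractions
```

Calling `base_composition(read_fastq_sequences("reads.fastq"))` (the file name is a placeholder) gives per-position fractions that can be compared with the composition further into the reads; a skew confined to the first positions is the pattern being asked about here.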
#3 |
Member
Location: Philippines Join Date: Dec 2011
Posts: 17
|
![]() |
#4 |
Senior Member
Location: Boston Join Date: Nov 2009
Posts: 224
|
I have seen Nextera libraries show a very similar bias. My guess is that this is just an artifact of the library prep. In the past, I would trim off these regions before mapping, but then I found that it didn't make a big difference, so I just left them there.
|
#5 |
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,318
|
I agree. Probably reflects a sequence bias for the transposase used by Nextera. It will have its own agenda -- and it may not correspond perfectly with yours. But is it good enough? Assemble and see...
-- Phillip |
#6 |
Member
Location: Boston Join Date: Oct 2009
Posts: 65
|
Looking at the positions of the sequences, I would check whether CAGCACCAGCA or CAGCACCACC is part of your primers.
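A quick way to run this check is to search each primer or adapter sequence for the two motifs on both strands. This is only a sketch: the `primers` dictionary is supplied by the user, and no real primer sequences are assumed here.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def find_motifs_in_primers(motifs, primers):
    """Return (motif, primer_name, strand) for every motif found in a primer.

    `primers` maps a primer name to its sequence. Strand "+" means the motif
    occurs as written; "-" means its reverse complement occurs.
    """
    hits = []
    for motif in motifs:
        for name, primer in primers.items():
            primer = primer.upper()
            if motif in primer:
                hits.append((motif, name, "+"))
            if reverse_complement(motif) in primer:
                hits.append((motif, name, "-"))
    return hits


# The two sequences suggested in the post above.
MOTIFS = ["CAGCACCAGCA", "CAGCACCACC"]
```

An empty result from `find_motifs_in_primers(MOTIFS, primers)` would argue against the primers as the source of the overrepresented sequences.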
|
#7 | |
Junior Member
Location: new zealand Join Date: Feb 2012
Posts: 6
|
![]() Quote:
I have the same issue with 80 multiplexed Nextera libraries run on a HiSeq. Their QC graphs all look the same for the first 13bp. I'm wondering if I should just trim them? |
|
#8 |
Senior Member
Location: Boston Join Date: Nov 2009
Posts: 224
|
I wouldn't bother trimming them. You could always take a sample of your reads and map them trimmed and untrimmed to see which works better. Whenever I did this, I never saw big differences.
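The comparison suggested here can be set up with a few lines of Python: draw a random sample of reads, then write one copy untrimmed and one copy with the first bases removed. The sketch below assumes plain 4-line FASTQ records; the helper names are made up for illustration, and the mapping step itself is left to whatever aligner you already use.

```python
import random


def parse_fastq(lines):
    """Group an iterable of FASTQ lines into (header, seq, plus, qual) tuples."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def subsample(records, fraction, seed=0):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [record for record in records if rng.random() < fraction]


def trim_5prime(record, n_bases):
    """Remove the first n_bases from the sequence and its quality string."""
    header, seq, plus, qual = record
    return (header, seq[n_bases:], plus, qual[n_bases:])
```

For the case described above, `trim_5prime(record, 13)` would remove the 13 bp that look biased. Mapping both the trimmed and untrimmed samples and comparing alignment rates then shows whether trimming matters for your data.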
|
#9 |
Member
Location: Ithaca, NY Join Date: Jun 2012
Posts: 38
|
Hello All,
I've been pursuing a question similar to the one in the initial post and have found a variety of perspectives on the matter, but none really does the problem justice. It appears to be a far-reaching phenomenon that shows up across a variety of samples from a variety of users. I was able to find four different postings on the subject, and EVERY single FastQC graph they show has an identical, or nearly identical, pattern. I summarized all of the information in a blog post. I will be forwarding it to Illumina for their response. BUT, please comment if you think I'm missing something obvious.
In short, I find the pattern too consistent to be explained by transposon bias alone. I would expect more variability in such an effect, so that it would not be equally prominent in all four of the four publicly reported cases.
Thanks!
Last edited by roliwilhelm; 05-02-2014 at 07:10 PM. |
#10 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
Yeah, the random hexamer priming effect is almost always identical, regardless of who makes the library. This is unsurprising since the library prep. components are identical.
|
#11 |
Member
Location: Ithaca, NY Join Date: Jun 2012
Posts: 38
|
I didn't think that the Nextera kits used random hexamers for amplification? I assumed that the tagmentation step inserted the sequence needed for annealing. Am I incorrect? Here's the best description of the process I could find.
You do make a good point, since all of the recurring sequences are hexamers. Still, how would the hexamers which are initiating strand amplification end up included in the read during extension? Why would that occur more frequently and predictably at the start of the read? Obviously these answers aren't completely relevant to the technical concerns of processing the data for assembly, but I would like to know more. Last edited by roliwilhelm; 05-02-2014 at 11:36 PM. |
#12 | |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,143
|
![]() Quote:
|
|
#13 |
Member
Location: Ithaca, NY Join Date: Jun 2012
Posts: 38
|
Thanks for your comment GenoMax, I would give you a penny if we had any left up here in Canada.
Perhaps I wasn't completely clear, but I'm not using multiple displacement amplification of my DNA, nor do I believe that there are any random hexamer priming steps in the Nextera library prep that I used. The information you linked to is related to those forms of sequencing prep. But, I am in doubt about my understanding of the Nextera process, especially since the repeats appear to be random hexamers! (Also: I couldn't find any examples of this on the FastQC help page, even though there was some suggestion there would be) |
#14 |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
Have you had a look at this paper "Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition", Adey et al. Genome Biology 2010, 11:R119? I would draw your attention to Supplementary Figure 1. The authors show a consistent base composition bias in the region surrounding the transposition site. This composition is found in both E. coli and H. sapiens gDNA. Despite the bias in locations of transposase activity the authors did not detect any bias in genome coverage in E. coli, H. sapiens or D. melanogaster compared to physical fragmentation (sonication) or endonuclease cleavage.
I don't really follow your argument that consistency of the base composition suggests that the effect is not due to the transposase. Such may be true in the case of the other fragmentation methods (and the authors of the above paper suggest this) as they include post fragmentation steps such as end repair and A-tailing which may introduce their own biases. The Nextera protocol includes only a PCR amplification, which primes off the inserted transposon, post fragmentation. An argument could be made that the PCR amplification of the fragmented DNA could contribute to a composition bias downstream of the fragmentation site but can not explain the composition bias upstream of the site as that chunk of DNA is long gone by the time PCR happens. |
#15 |
Jafar Jabbari
Location: Melbourne Join Date: Jan 2013
Posts: 1,248
|
I would like to make a distinction between the 5’ bias observed in TruSeq RNA libraries and that in transposon-based Nextera libraries. During first-strand synthesis, random hexamers with higher GC content are more likely to stay paired with their complementary bases for long enough to prime cDNA synthesis, so there is a tendency toward higher GC in the six 5’ nucleotides. I have seen this trend with the EpiGnome kit, used for library prep from bisulfite-converted DNA, which uses random hexamers to prime complementary-strand synthesis. Mapping reads from non-converted libraries prepared with that kit also reveals more mismatches at the initial 1-4 nucleotides, indicating that full complementarity along the template is not required for synthesis to proceed and that the two 3’-end nucleotides of the hexamer provide enough contact for polymerase activity.
Tn5 transposase, and by extension the Nextera transposase, uses a cut-and-paste mechanism to integrate its recognition sequence into DNA. During transposition, a 9-base single-stranded gap is left in the fragments, which results in duplication of the termini. This gap is filled during the initial 3 min incubation at 72°C before PCR cycling. If all the fragments in a library are sequenced to saturation (deeper sequencing or limited template use), the duplicated region could be recognised, and I think that Moleculo uses this to stitch short-read fragments back together into longer synthetic reads. The unbalanced 5’ region observed in FastQC graphs extends 9 bases in Nextera library reads, and end duplication in combination with insertion-site bias might explain this observation. |
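The end duplication described in this post can be illustrated with a toy model. The sketch below is a deliberate simplification and not a model of the actual Nextera chemistry: it assumes each insertion site marks the start of a 9-base staggered cut, and that after gap filling the fragments on either side both carry those 9 bases. The function name and the example sequence are hypothetical.

```python
GAP = 9  # length of the single-stranded gap mentioned in the post


def tagment(sequence, insertion_sites, gap=GAP):
    """Toy model: return gap-filled fragments between consecutive insertions.

    An insertion at position p is assumed to duplicate sequence[p:p + gap],
    so the fragment between insertions at `left` and `right` spans
    sequence[left:right + gap].
    """
    sites = sorted(insertion_sites)
    return [sequence[left:right + gap] for left, right in zip(sites, sites[1:])]


# Hypothetical 36-base sequence and insertion positions, for illustration only.
example = "ACGTTGCAAGGCTTAACCGGATCGATCGTTAGCAAT"
fragments = tagment(example, [2, 12, 24])

# Neighbouring fragments share the 9 bases at the insertion site between them.
shared = fragments[0][-GAP:] == fragments[1][:GAP]
```

Under this model, the last 9 bases of one fragment equal the first 9 bases of its neighbour, which is the duplicated region that could in principle be used to link adjacent fragments.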
#16 |
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,318
|
A couple of points:
(1) Transposases commonly have target site preferences. Already said, but apparently needs to be repeated. There is nothing surprising about a transposase retaining those site preferences as it inserts into the DNA of a variety of different species. DNA is DNA, right?
(2) I think this preference makes it non-ideal for the construction of genomic shotgun libraries. But let's not exaggerate the situation. The deflections from perfect randomness look to be in the 10-20% range. Most assemblers probably work better with less biased end points. But there are lots of deviations from the ideal in our data sets. You assess the pros and cons and move on.
-- Phillip
Last edited by pmiguel; 05-05-2014 at 04:33 AM. |
#17 | |
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,318
|
![]() Quote:
[…] shows an increase in A composition towards the end of your reads. I think this usually means that there is a high frequency of very short amplicon reads in your data set. That is, many of them have read through the insert, the right adapter and into the polyA (or polyT, depending on your strand of reference) attachment of the flow cell oligos to the surface of the flowcell. Did you run FastQC on the clipped reads? If so, my guess is that your clipper is missing lots of adapters.
By the way, one factor that makes the default settings for FastQC a poor choice for this sort of analysis is the unequal bin widths it uses. Yeah, I know it isn't convenient to scroll right really far in your browser to see the whole image, but given the distortion the binning causes I prefer to have to do that.
-- Phillip |
|
#18 |
Member
Location: Ithaca, NY Join Date: Jun 2012
Posts: 38
|
@kmcarr: That paper was very useful; thanks for sharing it. It is also the same paper the Illumina representative referenced. It enabled me to match some of the recurring sequences in the first 14bp of my reads to the Tn5 recognition site they cite.
I also realized that the proportion of reads with this bias is quite small (0.3%), though initially I thought the effect was far greater. This misconception was due to a miscalculation on my part. I summed the "counts" column for the top 7 overrepresented k-mers in the FastQC report, divided by the total number of sequences in my library, and came up with > 95% of reads containing "over-represented" sequences. In reality, the "counts" column is the total observed frequency, not the number of occurrences at the start of the read, so this was a vast overestimate.
Thank you all for your thoughtful responses. |
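The difference between the two quantities can be made concrete by counting them separately. The sketch below is an illustration only and does not reproduce how FastQC computes its own k-mer table; the function names and the 14-base window (taken from the "first 14bp" mentioned above) are assumptions.

```python
def count_overlapping(seq, kmer):
    """Count every occurrence of kmer in seq, including overlapping ones."""
    count = 0
    start = seq.find(kmer)
    while start != -1:
        count += 1
        start = seq.find(kmer, start + 1)
    return count


def kmer_stats(reads, kmer, window=14):
    """Compare total k-mer occurrences with reads carrying it near the 5' end."""
    stats = {
        "n_reads": 0,
        "total_occurrences": 0,    # every hit, anywhere in any read
        "reads_with_kmer": 0,      # reads with at least one hit
        "reads_with_kmer_at_start": 0,  # hit lies entirely in first `window` bases
    }
    for read in reads:
        stats["n_reads"] += 1
        occurrences = count_overlapping(read, kmer)
        stats["total_occurrences"] += occurrences
        if occurrences:
            stats["reads_with_kmer"] += 1
        if kmer in read[:window]:
            stats["reads_with_kmer_at_start"] += 1
    return stats
```

Dividing `total_occurrences` by `n_reads` can exceed the true fraction of affected reads, because one read may contain the k-mer several times and most hits may fall outside the start of the read. Summing such counts over several overlapping k-mers inflates the figure further, since a single read can contribute to more than one of them.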
#19 |
Junior Member
Location: London Join Date: Jul 2014
Posts: 4
|
Is there an explanation for k-mers in the middle part of the sequence?
The capture is Nextera whole exome, sequenced on an Illumina HiSeq, paired-end 100 bp. The k-mers persist after Trimmomatic. The quality of the data according to FastQC is better after the trimming. This pattern occurs in multiple samples. I asked Illumina 2 weeks ago but am still waiting for an answer.
Thanks |
#20 |
Senior Member
Location: Spain Join Date: Jul 2009
Posts: 133
|
Hi,
We are seeing a similar issue using the Agilent QXT kit, on captured and whole genome experiments. This kit also uses transposases.
HTH
Dave |
Tags |
fastqc, miseq |