SEQanswers

Old 01-30-2012, 05:38 AM   #1
kentk
Member
 
Location: Philippines

Join Date: Dec 2011
Posts: 17
Overrepresented kmers at the start of reads

I recently discovered FastQC and ran it on one of our datasets that is proving difficult to assemble. I was wondering how to interpret this part of the FastQC results.



Any ideas?
Old 01-30-2012, 09:40 AM   #2
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

Is this RNA-Seq? If so, this looks like it could be the result of random hexamer priming. Does the nucleotide distribution look off at the beginning too?

Hansen, K. D., S. E. Brenner, et al. (2010). "Biases in Illumina transcriptome sequencing caused by random hexamer priming." Nucleic Acids Research 38(12): e131.
Old 01-30-2012, 03:16 PM   #3
kentk
Member
 
Location: Philippines

Join Date: Dec 2011
Posts: 17
Default

Quote:
Originally Posted by pbluescript View Post
Is this RNA-Seq?
It's a bacterial genome run prepared using Nextera. And yes, the %A, %T, %C, %G graph also looks like the kmer graph.
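
For readers who want to look at the per-base composition outside FastQC, the following is a minimal Python sketch. It is not FastQC's own code; the function name `per_position_composition` and the toy reads are made up for illustration. It tallies the fraction of A, C, G and T at each of the first few read positions, which is roughly what the per-base sequence content plot displays.

```python
from collections import Counter

def per_position_composition(reads, n_positions):
    """Return a list (one entry per position) of {base: fraction} dicts.

    Hypothetical helper for illustration; reads is an iterable of
    sequence strings. Reads shorter than n_positions simply contribute
    nothing to the later positions.
    """
    counters = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions].upper()):
            counters[i][base] += 1

    result = []
    for counts in counters:
        total = sum(counts.values())
        # Report only A/C/G/T; other symbols (e.g. N) count toward the total.
        result.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return result

# Toy input, not real data.
toy_reads = ["ACGT", "AAGT", "ACG"]
composition = per_position_composition(toy_reads, 4)
for position, fractions in enumerate(composition, start=1):
    print(position, fractions)
```

A strong deviation from the genome-wide base frequencies confined to the first handful of positions is the pattern discussed in this thread.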
Old 01-31-2012, 03:47 AM   #4
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

I have seen Nextera libraries show a very similar bias. My guess is that this is just an artifact of the library prep. In the past, I would trim off these regions before mapping, but then I found that it didn't make a big difference, so I just left them there.
Old 02-02-2012, 05:17 AM   #5
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,313
Default

I agree. Probably reflects a sequence bias for the transposase used by Nextera. It will have its own agenda -- and it may not correspond perfectly with yours. But is it good enough? Assemble and see...

--
Phillip
Old 02-02-2012, 01:32 PM   #6
mattanswers
Member
 
Location: Boston

Join Date: Oct 2009
Posts: 65
Default

Looking at the positions of the sequences, I would check whether the sequences CAGCACCAGCA or CAGCACCACC are part of your primers.
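
One way to run that check is a plain substring search on both strands. The sketch below is only an illustration: the two entries in `primers` are invented placeholder sequences, not real Nextera primers, so substitute the primer and adapter sequences from your own kit documentation.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(table)[::-1]

def find_in_primers(query, primers):
    """Return (primer_name, strand, offset) for every primer containing
    the query or its reverse complement. Only the first match per strand
    is reported for each primer."""
    hits = []
    forward = query.upper()
    for name, primer in primers.items():
        primer = primer.upper()
        for strand, candidate in (("+", forward), ("-", reverse_complement(forward))):
            offset = primer.find(candidate)
            if offset != -1:
                hits.append((name, strand, offset))
    return hits

# Placeholder sequences, NOT real primers: replace with your own.
primers = {
    "example_primer_1": "TTGACAGCACCAGCAGG",
    "example_primer_2": "GGTGGTGCTGAA",
}

for kmer in ("CAGCACCAGCA", "CAGCACCACC"):
    print(kmer, find_in_primers(kmer, primers))
```

An empty hit list for your real primers would argue against primer carry-over as the source of the overrepresented sequences.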
Old 04-18-2012, 10:40 PM   #7
mxr1895
Junior Member
 
Location: new zealand

Join Date: Feb 2012
Posts: 6
Default

Quote:
Originally Posted by pbluescript View Post
I have seen Nextera libraries show a very similar bias. My guess is that this is just an artifact of the library prep. In the past, I would trim off these regions before mapping, but then I found that it didn't make a big difference, so I just left them there.
Hi, what were you using your reads for?
I have the same issue with 80 multiplexed Nextera libraries run on a HiSeq. Their QC graphs all look the same for the first 13bp.
I'm wondering if I should just trim them?
Attached Images
File Type: png kmer_content.png (112.7 KB, 95 views)
File Type: png per_base_sequence_content.png (78.4 KB, 85 views)
File Type: png per_base_GC_content.png (47.4 KB, 55 views)
Old 04-19-2012, 04:11 AM   #8
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

Quote:
Originally Posted by mxr1895 View Post
Hi, what were you using your reads for?
I have the same issue with 80 multiplexed Nextera libraries run on a HiSeq. Their QC graphs all look the same for the first 13bp.
I'm wondering if I should just trim them?
I wouldn't bother trimming them. You could always take a sample of your reads and map them trimmed and untrimmed to see which works better. Whenever I did this, I never saw big differences.
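
To set up that trimmed-versus-untrimmed comparison, something like the following Python sketch could be used to draw a random subset of reads and clip a fixed number of bases from the 5' end. All function names here are hypothetical, the parser assumes plain four-line FASTQ records, and the 13-base trim length is just the figure mentioned above. The mapping step itself is left to whatever aligner you normally use.

```python
import io
import random

def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def trim_5prime(record, n):
    """Remove the first n bases and the matching quality values."""
    header, seq, qual = record
    return header, seq[n:], qual[n:]

def sample_records(records, fraction, seed=0):
    """Keep each record with the given probability (seeded for repeatability)."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

def format_fastq(record):
    header, seq, qual = record
    return f"{header}\n{seq}\n+\n{qual}\n"

# Toy in-memory FASTQ standing in for a real file.
toy = io.StringIO("@r1\nACGTACGTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIII\n")
records = list(read_fastq(toy))
subset = sample_records(records, fraction=1.0)
trimmed = [trim_5prime(r, 13) for r in subset]
print(format_fastq(trimmed[0]), end="")
```

Writing `subset` and `trimmed` to two files gives matched inputs, so any difference in mapping rate can be attributed to the trimming alone.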
Old 05-02-2014, 07:04 PM   #9
roliwilhelm
Member
 
Location: Ithaca, NY

Join Date: Jun 2012
Posts: 38
New Evidence of Strangeness re: a consistent k-mer bias for various Nextera preps

Hello All,

Well, I've actively pursued a question similar to the one in the initial post and have found a variety of perspectives on the matter, but none really do the problem justice. It appears to be a far-reaching phenomenon that shows up across a variety of samples from a variety of users. I was able to find four different postings on the subject, and EVERY single FastQC graph they show has an identical, or nearly identical, pattern. I summarized all of the information in a blog post. I will be forwarding it to Illumina for their response. BUT, please comment if you think I'm missing something obvious. In short, I find the pattern too consistent for just transposon bias. I would expect such an effect to be more variable, and not equally prominent in four out of four publicly reported cases.

Thanks!

Last edited by roliwilhelm; 05-02-2014 at 07:10 PM.
Old 05-02-2014, 11:01 PM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Yeah, the random hexamer priming effect is almost always identical, regardless of who makes the library. This is unsurprising since the library prep. components are identical.
Old 05-02-2014, 11:19 PM   #11
roliwilhelm
Member
 
Location: Ithaca, NY

Join Date: Jun 2012
Posts: 38
Default

I didn't think that the Nextera kits used random hexamers for amplification? I assumed that the tagmentation step inserted the sequence needed for annealing. Am I incorrect? Here's the best description of the process I could find.

You do make a good point, since all of the recurring sequences are hexamers.

Still, how would the hexamers which are initiating strand amplification end up included in the read during extension? Why would that occur more frequently and predictably at the start of the read?

Obviously these answers aren't completely relevant to the technical concerns of processing the data for assembly, but I would like to know more.

Last edited by roliwilhelm; 05-02-2014 at 11:36 PM.
Old 05-03-2014, 01:58 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,974
Default

Quote:
Originally Posted by roliwilhelm View Post
Obviously these answers aren't completely relevant to the technical concerns of processing the data for assembly, but I would like to know more.
See posts #261 and 263: http://seqanswers.com/forums/showthr...t=4846&page=14
Old 05-03-2014, 08:08 AM   #13
roliwilhelm
Member
 
Location: Ithaca, NY

Join Date: Jun 2012
Posts: 38
Default

Thanks for your comment GenoMax, I would give you a penny if we had any left up here in Canada.

Perhaps I wasn't completely clear, but I'm not using multiple displacement amplification of my DNA, nor do I believe that there are any random hexamer priming steps in the Nextera library prep that I used. The information you linked to is related to those forms of sequencing prep.

But, I am in doubt about my understanding of the Nextera process, especially since the repeats appear to be random hexamers!

(Also: I couldn't find any examples of this on the FastQC help page, even though there was some suggestion there would be)
Old 05-03-2014, 12:38 PM   #14
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,165
Default

Have you had a look at this paper "Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition", Adey et al. Genome Biology 2010, 11:R119? I would draw your attention to Supplementary Figure 1. The authors show a consistent base composition bias in the region surrounding the transposition site. This composition is found in both E. coli and H. sapiens gDNA. Despite the bias in locations of transposase activity the authors did not detect any bias in genome coverage in E. coli, H. sapiens or D. melanogaster compared to physical fragmentation (sonication) or endonuclease cleavage.

I don't really follow your argument that consistency of the base composition suggests that the effect is not due to the transposase. Such may be true in the case of the other fragmentation methods (and the authors of the above paper suggest this), as they include post-fragmentation steps such as end repair and A-tailing which may introduce their own biases. The Nextera protocol includes only a PCR amplification, which primes off the inserted transposon, after fragmentation. An argument could be made that the PCR amplification of the fragmented DNA could contribute to a composition bias downstream of the fragmentation site, but it cannot explain the composition bias upstream of the site, as that chunk of DNA is long gone by the time PCR happens.
Old 05-03-2014, 06:36 PM   #15
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,224
Default

I would like to draw a distinction between the 5' bias observed in TruSeq RNA libraries and the one seen in transposon-based Nextera libraries. During first-strand synthesis, random hexamers with higher GC content are more likely to stay paired with their complementary bases long enough to prime cDNA synthesis, and therefore there is a tendency toward higher GC in the six 5' nucleotides. I have seen this trend with the EpiGnome kit, used for library prep from bisulfite-converted DNA, which uses random hexamers to prime complementary-strand synthesis. Mapping reads from a non-converted library prepared with that kit also reveals more mismatches at the initial 1-4 nucleotides, indicating that full complementarity along the template is not required for synthesis to proceed and that the two 3'-end nucleotides of the hexamer provide enough contact for polymerase activity.

Tn5 transposase, and by extension the Nextera transposase, uses a cut-and-paste mechanism to integrate its recognition sequence into DNA. During transposition a 9-base single-stranded gap is left in the fragments, which results in duplication of the termini. This gap is filled during the initial 3 min incubation at 72 °C before PCR cycling. If all the fragments in a library are sequenced to saturation (deeper sequencing or limited template use), the duplicated regions could be recognised, and I think that Moleculo uses this to stitch short-read fragments back together into longer synthetic reads. The unbalanced 5' region observed in FastQC graphs extends 9 bases in Nextera library reads, and end duplication in combination with insertion-site bias might explain this observation.
Old 05-05-2014, 04:09 AM   #16
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,313
Default

A couple of points:
(1) Transposases commonly have target site preferences. Already said, but apparently needs to be repeated. There is nothing surprising about a transposase retaining those site preferences as it inserts into the DNA of a variety of different species. DNA is DNA, right?
(2) I think this preference makes it non-ideal for the construction of genomic shotgun libraries. But, let's not exaggerate the situation. The deflections from perfect randomness look to be in the 10-20% range. Most assemblers probably work better with less biased end points. But there are lots of fluctuations from the non-ideal in our data sets. You assess the pros and cons and move on.

--
Phillip

Last edited by pmiguel; 05-05-2014 at 04:33 AM.
Old 05-05-2014, 04:32 AM   #17
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,313
Default

Quote:
Originally Posted by roliwilhelm View Post
Hello All,

I summarized all of the information in a blog post.

Thanks!
By the way, the image from your blog:


shows an increase in A composition towards the end of your reads. I think this usually means that there is a high frequency of very short amplicons in your data set. That is, many of the reads have run through the insert, the right adapter and into the polyA (or polyT, depending on your strand of reference) that attaches the flow cell oligos to the surface of the flowcell.

Did you run FastQC on the clipped reads? If so, my guess is that your clipper is missing lots of adapters.
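
A quick way to gauge how common such read-through is would be to count the reads that end in a long run of A. The Python sketch below is an assumption-laden illustration: the helper names and the 10-base threshold are arbitrary choices, and it says nothing about whether the adapter itself is still present in the read.

```python
def trailing_run_length(seq, base="A"):
    """Length of the uninterrupted run of `base` at the 3' end of seq."""
    return len(seq) - len(seq.rstrip(base))

def fraction_with_tail(reads, base="A", min_run=10):
    """Fraction of reads whose 3' end is a run of at least min_run bases.

    min_run=10 is an arbitrary illustrative threshold, not a standard value.
    """
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for s in reads if trailing_run_length(s.upper(), base) >= min_run)
    return hits / len(reads)

# Toy reads, not real data.
toy_reads = [
    "ACGT" + "A" * 12,   # long poly-A tail
    "ACGTACGT",          # no tail
    "GG" + "A" * 3,      # short run, below threshold
    "C" + "A" * 10,      # exactly at threshold
]
tail_fraction = fraction_with_tail(toy_reads)
print(tail_fraction)
```

If a sizeable fraction of clipped reads still ends this way, the adapter clipping step is worth revisiting.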

By the way, one factor that makes the default settings for FastQC a poor choice for this sort of analysis is the unequal bin widths it uses. Yeah, I know it isn't convenient to scroll right really far in your browser to see the whole image, but given the distortion it causes I prefer to have to do that.

--
Phillip
Old 05-09-2014, 10:33 AM   #18
roliwilhelm
Member
 
Location: Ithaca, NY

Join Date: Jun 2012
Posts: 38
Default

@kmcarr: That paper was very useful; thanks for sharing it. It is also the same paper the Illumina representative referenced. It enabled me to match some of the recurring sequences in the first 14bp of my reads to the Tn5 recognition site they cite.

I also realized that the proportion of reads with this bias is quite small (0.3%), though initially I thought the effect was far greater. This misconception was due to a miscalculation on my part. I summed the "counts" column for the top 7 overrepresented k-mers in the FastQC report, divided by the total number of sequences in my library, and came up with > 95% of reads containing "over-represented" sequences. In reality, the "counts" column is the total observed frequency, not the number of occurrences at the start of the read, so this was a vast overestimate.
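
The difference between the two numbers can be shown with a small Python sketch. This is not how FastQC computes its table; the function name, the 14-base window and the toy reads are assumptions chosen to mirror the description above. It counts every occurrence of a k-mer anywhere in the reads and, separately, the number of reads that carry the k-mer near the 5' end.

```python
def kmer_counts(reads, kmer, window=14):
    """Return (total_occurrences, reads_with_kmer_in_first_window).

    total_occurrences counts overlapping matches anywhere in each read;
    the second value counts each read at most once, and only if the
    k-mer lies entirely within the first `window` bases.
    """
    k = len(kmer)
    total = 0
    reads_with_start_hit = 0
    for seq in reads:
        total += sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer)
        if kmer in seq[:window]:
            reads_with_start_hit += 1
    return total, reads_with_start_hit

# Toy reads, not real data.
toy_reads = [
    "CAGCACTTTTCAGCAC",        # k-mer at the start and again at the end
    "TTTTTTTTTTTTTTTTCAGCAC",  # k-mer only far from the start
    "AAAAAAAA",                # no k-mer
]
total, at_start = kmer_counts(toy_reads, "CAGCAC")
print(total, at_start, at_start / len(toy_reads))
```

Dividing the total by the number of reads overstates the affected fraction whenever the k-mer also occurs elsewhere in the reads, which is the overestimate described above.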

Thank you all for your thoughtful responses.
Old 07-22-2014, 10:50 AM   #19
Kmok
Junior Member
 
Location: London

Join Date: Jul 2014
Posts: 4
Kmers in mid part of sequence

Is there an explanation for Kmers in the mid part of the sequence?
The capture is Nextera whole exome, sequenced on an Illumina HiSeq with paired-end 100 bp reads.
The Kmers persist after Trimmomatic. The quality of the data reported by FastQC is better after trimming. The same pattern occurs in multiple samples. I asked Illumina two weeks ago but am still waiting for an answer.

Thanks
Attached Images
File Type: png pre.PNG (131.5 KB, 26 views)
File Type: png post.PNG (131.0 KB, 23 views)
File Type: png post_perbase.PNG (89.5 KB, 16 views)
Old 07-23-2014, 12:43 AM   #20
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

Hi,

We are seeing a similar issue using the Agilent QXT kit, in both capture and whole-genome experiments. This kit also uses transposases.

HTH

Dave
Tags
fastqc, miseq
