#1 |
Member
Location: USA Join Date: Feb 2009
Posts: 15
Hi,
I am wondering if anyone here can answer my question. I am working on a couple of WGS data sets; the libraries were prepared using the Illumina Nextera and TruSeq WG amplification kits. When the initial post-sequencing QC was done, the Nextera samples showed a weird nucleotide distribution in the first 14 bp (5' end), unlike the TruSeq samples. (Please see the attachments here.)

We checked the data for adapter/primer contamination against the Illumina Nextera / Epicentre Nextera sequences using various tools (fastx, cross_match allowing 2 mismatches), expecting some of them to map to these first 14 bp or more. But none, or at most a few thousand, of the reads matched, indicating that these first (5') 14 bp are not adapter/primer products.

Has any user here experienced something similar with Nextera kits, or could anyone give me a clue as to what these 5' 14 bp could be?

Thanks in advance,
Aparna

Last edited by aparna; 06-04-2012 at 02:11 PM. Reason: no attachements
#2 |
--Site Admin--
Location: SF Bay Area, CA, USA Join Date: Oct 2007
Posts: 1,358
Disclaimer: I haven't used Nextera (but my graduate work involved recombinase specificity in the human genome).

I would bet that this is insertion site bias. Tn5's footprint is ~13-15 bp, so this doesn't surprise me at all (in fact I would be amazed if Epicentre had evolved it out of their production enzyme). Check out the figure in this paper: it has a consensus insertional bias for Tn5. (I'd love to see you run something like MEME on the starts of all your reads.)
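For anyone wanting a quick look before setting up MEME, here is a minimal Python sketch that tabulates per-position base frequencies and the most common start sequences for the first 14 bases of each read. It is a simple stand-in, not a motif finder. It assumes plain 4-line FASTQ records with unwrapped sequence lines; the function names and the tiny in-memory example are made up for illustration.

```python
from collections import Counter


def read_starts(fastq_lines, n=14):
    """Yield the first n bases of each read from an iterable of FASTQ lines.

    Assumes strict 4-line records (header, sequence, '+', qualities).
    Reads shorter than n bases are skipped.
    """
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # second line of each record is the sequence
            seq = line.strip().upper()
            if len(seq) >= n:
                yield seq[:n]


def position_frequencies(starts, n=14):
    """Return one dict per position mapping base -> fraction of reads."""
    counts = [Counter() for _ in range(n)]
    total = 0
    for start in starts:
        total += 1
        for pos, base in enumerate(start[:n]):
            counts[pos][base] += 1
    if total == 0:
        return []
    return [{b: counts[pos][b] / total for b in "ACGTN"} for pos in range(n)]


def top_starts(starts, k=5):
    """Return the k most common start sequences with their counts."""
    return Counter(starts).most_common(k)


# Tiny in-memory example (hypothetical reads, not real data).
example = [
    "@r1", "CCCTAACCCTAACCAAAA", "+", "IIIIIIIIIIIIIIIIII",
    "@r2", "GGGTTAGGGTTAGGTTTT", "+", "IIIIIIIIIIIIIIIIII",
]
starts = list(read_starts(example))
for pos, freqs in enumerate(position_frequencies(starts), start=1):
    print(pos, {b: round(f, 2) for b, f in freqs.items() if f > 0})
print(top_starts(starts))
```

For a real file you would pass an open file handle instead of the list, e.g. `read_starts(open("reads.fastq"))`, where the file name is a placeholder. A strongly skewed table over the first ~14 positions that flattens out afterwards would be consistent with a start-site bias.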
#3 |
Member
Location: Sydney, Australia Join Date: May 2010
Posts: 65
Hi aparna,
Yes, the Nextera kits use an engineered Tn5 that has a target site preference. In addition to the paper ECO mentioned, you might look at Supplementary Figure 1 of Adey et al. 2010: http://genomebiology.com/content/sup...12-r119-s2.pdf where they show the nucleotide distribution for the first several sites sequenced with those kits and compare it to sonication.

Whether this bias is a bane or a boon really depends on your application. For de novo assembly it may be troublesome. As I mentioned in another post, the group I work in has found (this is not yet published! treat as anecdote!) that overdigesting with Tn5 and eliminating the small fragments, e.g. <350 nt, helps to reduce the target site bias in the resulting library. If you are doing this kind of assembly you might want to investigate some of the newer assemblers designed for data with uneven coverage, such as whole genome amplification data. IDBA-UD, diginorm, and one of the newer euler releases come to mind; there is probably also something from the group at UMD/Johns Hopkins that would work. We developed our own pipeline called A5 for these assemblies. The paper is in press.

On the other hand, if you want to do SNP profiling the bias might be helpful, in the same way that people use RAD-seq to focus sequencing reads to assay polymorphisms in a subset of the genome.
#4 |
Senior Member
Location: Ohio Join Date: Jan 2010
Posts: 144
We have seen the Nextera bias as well. Our data has seemed to assemble OK though. At any rate, the effect is nothing like whole genome amplification: it is orders of magnitude smaller.

For a 51 SE library in a reference-guided assembly we got an N50 of 14 kb and 94% coverage of a reference genome (almost all of the missing 6% appeared to be mobile DNA, presumably strain polymorphisms). This is a 44% GC bacterium with a 2 Mb genome. I haven't compared it directly to TruSeq and would be very interested if someone has. But in my experience it shouldn't totally break de novo assembly.
#5 |
Member
Location: USA Join Date: Feb 2009
Posts: 15
Hi Koadman and ECO,
Thank you so much for your valuable insights and for the attachments. I have not used MEME yet, but it looks like those 5' 14 bp are indeed insertion site bias. I see the first 14 bp in the paired-end sequencing data as CCCTAACCCTAACC or GGGTTAGGGTTAGG.

We are comparing the Nextera and TruSeq WG amplification methods. As part of that I am also interested in variant calling, to see the differences and take it from there. I originally mapped this data to the human reference hg19 using bwa with default settings. The difference is quite noticeable in mapping and mate pairing.

Nextera untrimmed:
1,214,797,540 in total
91,017,689 duplicates
915,682,948 mapped (75.38%)
843,158,634 properly paired (69.41%)
17,398,062 singletons (1.43%)

Nextera trimmed (using bwa aln -B 14):
1,214,797,540 in total
90,178,338 duplicates
900,457,327 mapped (74.12%)
735,216,938 properly paired (60.52%)
21,143,784 singletons (1.74%)

With trimming we were expecting good mapping, comparable to the TruSeq data, which had about 94% mapped reads with 92% pairing. But no. I am wondering what went wrong. Do you suggest anything else?
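To make the two runs easier to compare side by side, here is a small Python sketch that recomputes the percentages from the counts quoted above and reports the change after trimming. The counts are copied from the post; the dictionary layout and function names are just illustrative choices.

```python
TOTAL_READS = 1_214_797_540  # same total in both runs

# Counts as quoted in the post above.
RUNS = {
    "untrimmed": {
        "duplicates": 91_017_689,
        "mapped": 915_682_948,
        "properly_paired": 843_158_634,
        "singletons": 17_398_062,
    },
    "trimmed": {
        "duplicates": 90_178_338,
        "mapped": 900_457_327,
        "properly_paired": 735_216_938,
        "singletons": 21_143_784,
    },
}


def percent(count, total):
    """Percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


def summarize(counts, total=TOTAL_READS):
    """Express every count as a percentage of all reads."""
    return {name: percent(value, total) for name, value in counts.items()}


def paired_among_mapped(counts):
    """Properly paired reads as a percentage of mapped reads only."""
    return percent(counts["properly_paired"], counts["mapped"])


untrimmed = summarize(RUNS["untrimmed"])
trimmed = summarize(RUNS["trimmed"])
# Change in percentage points after trimming (negative means a drop).
delta = {k: round(trimmed[k] - untrimmed[k], 2) for k in untrimmed}

for name in untrimmed:
    print(f"{name:16s} {untrimmed[name]:6.2f}% -> {trimmed[name]:6.2f}% ({delta[name]:+.2f})")
print("paired among mapped:",
      paired_among_mapped(RUNS["untrimmed"]), "->",
      paired_among_mapped(RUNS["trimmed"]))
```

Read this way, trimming barely moves the mapped fraction, while the properly paired fraction falls by roughly nine percentage points, so the pairing step looks like the place to dig first.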
#6 |
Member
Location: Sydney, Australia Join Date: May 2010
Posts: 65
Have you filtered out Nextera adapter sequences from the reads with something like tagdust or scythe for 3' contamination? If not, what does your insert size distribution look like? Do you have a Bioanalyzer trace? What method did you use for size selection?

In our early attempts at Nextera, where we relied on the AMPure XP beads for size selection, we would see high rates of adapter contamination. We now do a broad-swath gel cut of 320-600 nt for all Nextera libraries and the adapter contamination rates are much lower, usually 1% or less. Illumina has finally shared their Nextera adapter sequences, so you could try filtering those reads and see whether your mapping rate goes up.

The apparent duplicate rate of 10% is also a bit worrisome, although with Nextera libraries this number can also be influenced by transposition bias and not just PCR cycles. If the tagmentation is heavily biased, two read pairs that are not PCR duplicates will be much more likely to start at the same positions.

Last edited by koadman; 06-05-2012 at 08:31 PM.
#7 |
Member
Location: USA Join Date: Feb 2009
Posts: 15
Hi Koadman,
I had around 3 million reads that were 3' contaminants, and that's about it. As mentioned in my initial post, we have tried to look for Illumina/Epicentre adapters/primers in our data and found only a few hundred of them. I need to ask in the lab about the size selection and traces. The post-mapping median insert size was about 200 bp, compared to 400 bp for TruSeq. This data is really puzzling to me. We will figure it out and post an update here if possible.
#8 |
Member
Location: Sydney, Australia Join Date: May 2010
Posts: 65
Oh, sorry, somehow I missed or didn't understand from your first two posts that the read mapping results in the 2nd post had been adapter filtered. Thanks for clarifying. 3 million reads out of 1.2 billion is not bad for adapter.

As for the pairing and insert distribution issue, thanks for telling us the median insert size, but what does the entire distribution look like? If you did not do a gel cut during library prep there might be a long tail to this distribution. I am not sure exactly what threshold bwa uses to decide whether a pairing is "proper" or not; I wonder if many of your reads are mapping just barely too far apart for bwa to call them proper.

As for the first 14 bp, this is indeed puzzling. I notice that the two sequences you're observing are reverse complements of each other, up to a one-base shift, and that they also contain a 6 nt direct tandem repeat. I wonder if this might be some kind of PCR artefact, but I really don't have much of a clue. Does the remaining portion of those reads contain the expected target sequence (human?)? If so, I wonder how the mapping looks with those sequences trimmed.

Last edited by koadman; 06-06-2012 at 10:54 PM. Reason: oops mistakenly read bowtie as read mapper instead of bwa
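The observation about the two 14-mers is easy to check by hand or with a few lines of Python. The sketch below (helper names are illustrative) tests for an exact reverse-complement match, finds the smallest repeat period of each sequence, and then asks whether the repeat units are reverse complements of each other once rotation of the unit is allowed.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def smallest_period(seq):
    """Smallest p such that seq[i] == seq[i + p] for every valid i."""
    for p in range(1, len(seq)):
        if all(seq[i] == seq[i + p] for i in range(len(seq) - p)):
            return p
    return len(seq)


def rotations(unit):
    """All cyclic rotations of a repeat unit."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}


a = "CCCTAACCCTAACC"
b = "GGGTTAGGGTTAGG"

exact_match = reverse_complement(a) == b
unit_a = a[:smallest_period(a)]
unit_b = b[:smallest_period(b)]
same_repeat_opposite_strands = reverse_complement(unit_a) in rotations(unit_b)

print("reverse complement of a:", reverse_complement(a))
print("exact reverse complements:", exact_match)
print("repeat units:", unit_a, unit_b)
print("same repeat on opposite strands:", same_repeat_opposite_strands)
```

Under this check the two 14-mers are not exact reverse complements: the reverse complement of the first is GGTTAGGGTTAGGG, one base out of phase with the second. Both have a 6 nt period, and the repeat units CCCTAA and GGGTTA are reverse complements up to rotation, so the two read starts look like the same tandem repeat sequenced from opposite strands.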
#9 |
Junior Member
Location: Oregon Join Date: Dec 2012
Posts: 7
Aparna,
I'm doing RNA-seq with Nextera and observed exactly the same pattern in the first 15 bp, with overrepresented sequences. I'm doing de novo assembly, and thought about running the assembly both with this portion trimmed and with it left in. I would be interested to know what you finally did. Thanks!
#10 | |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,087
|
![]() Quote:
|
|
#11 |
Junior Member
Location: Oregon Join Date: Dec 2012
Posts: 7
Thank you so much GenoMax! I'm a newbie to RNA-seq and all that information has been really helpful...
#12 |
Member
Location: Davis, CA Join Date: Mar 2009
Posts: 82
Does anyone happen to have a motif file for the insertion site described in the reference above?
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3292447/

I'm trying to determine why there are some holes in coverage for a Nextera library mapped to our reference sequences, and thought it might be useful to search for the abundance of the transposase insertion site motifs.