SEQanswers

Old 06-04-2012, 01:03 PM   #1
aparna
Member
 
Location: USA

Join Date: Feb 2009
Posts: 15
Help with Nextera WGS data

Hi,

I am wondering if anyone here can provide me an answer to my question.

I am working with a couple of WGS datasets. The libraries were prepared with the Illumina Nextera and TruSeq WG amplification kits. In the initial post-sequencing QC, the Nextera samples showed an unusual nucleotide distribution over the first 14 bp (5' end), unlike the TruSeq samples (please see the attachments).

We checked the data for adapter/primer contamination against the Illumina Nextera and Epicentre Nextera sequences using several tools (fastx, cross_match allowing 2 mismatches), expecting some of them to match these first 14 bp or more. But none, or at most a few thousand, of the reads matched, indicating that these first (5') 14 bp are not adapter or primer products.

Has anyone here seen something similar with Nextera kits, or can anyone give me a clue as to what these 5' 14 bp could be?

Thanks in advance,
Aparna

Last edited by aparna; 06-04-2012 at 01:11 PM. Reason: no attachements
Old 06-04-2012, 03:42 PM   #2
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,358

Disclaimer: I haven't used Nextera (but my graduate work involved recombinase specificity in the human genome).

I would bet that this is insertion site bias. Tn5's footprint is ~13-15bp...so this doesn't surprise me at all (in fact I would be amazed if Epicentre evolved it out of their production enzyme...).

Check out the figure in this paper...it has a consensus insertional bias for Tn5...(I'd love to see you run something like MEME on the starts of all your reads...)
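
Short of a full MEME run, a per-position base tally over the read starts is enough to see whether the first ~14 bp deviate from the rest of the read. The sketch below is only an illustration: the function names and the toy reads are made up here, and a real run would feed in the sequences from the FASTQ file instead.

```python
from collections import Counter


def read_fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def position_composition(reads, n=14):
    """Return per-position base fractions for the first n bases of each read."""
    counts = [Counter() for _ in range(n)]
    for seq in reads:
        for i, base in enumerate(seq[:n].upper()):
            counts[i][base] += 1
    table = []
    for c in counts:
        total = sum(c.values())
        table.append({b: (c[b] / total if total else 0.0) for b in "ACGTN"})
    return table


if __name__ == "__main__":
    # Toy reads (hypothetical); for real data use
    # read_fastq_sequences(open(path_to_fastq)) instead.
    toy_reads = [
        "CCCTAACCCTAACCGATTACA",
        "GGGTTAGGGTTAGGACGTACG",
        "ACGTACGTACGTACGTACGTA",
    ]
    for pos, fractions in enumerate(position_composition(toy_reads), start=1):
        row = " ".join(f"{b}={fractions[b]:.2f}" for b in "ACGTN")
        print(f"pos {pos:2d}: {row}")
```

Raising n past 14 (say to 30) makes it easier to see where the skew stops and the composition flattens out to the genome background.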
Old 06-04-2012, 07:54 PM   #3
koadman
Member
 
Location: Sydney, Australia

Join Date: May 2010
Posts: 65

Hi aparna,

Yes, the Nextera kits use an engineered Tn5 that has a target site preference. In addition to the paper ECO mentioned you might look at the Supplementary Figure 1 for Adey et al 2010:
http://genomebiology.com/content/sup...12-r119-s2.pdf
where they show the nucleotide distribution for the first several sites sequenced with those kits and compare it to sonication.

Whether this bias is a bane or a boon really depends on your application. For de novo assembly it may be troublesome. As I mentioned in another post, the group I work in has found (this is not yet published! treat as anecdote!) that overdigesting with Tn5 and eliminating the small fragments, e.g. <350 nt, helps to reduce the target site bias in the resulting library. If you are doing this kind of assembly, you might want to investigate some of the newer assemblers designed for data with uneven coverage, such as whole genome amplification data. IDBA-UD, diginorm, and one of the newer EULER releases come to mind; there is probably also something from the group at UMD/Johns Hopkins that would work. We developed our own pipeline called A5 for these assemblies. The paper is in press.

On the other hand if you want to do SNP profiling the bias might be helpful in the same way that people use RAD-seq to focus sequencing reads to assay polymorphisms in a subset of the genome.
Old 06-05-2012, 07:11 AM   #4
cliffbeall
Senior Member
 
Location: Ohio

Join Date: Jan 2010
Posts: 144

We have seen the Nextera bias as well. Our data has seemed to assemble ok though. At any rate, the effect is nothing like whole genome amplification - orders of magnitude different.

For a 51 bp SE library in a reference-guided assembly we got an N50 of 14 kb and 94% coverage of a reference genome (almost all of the missing 6% appeared to be mobile DNA, presumably strain polymorphisms). This is a 44% GC bacterium with a 2 Mb genome.

I haven't compared it directly to Tru-Seq, would be very interested if someone has. But it shouldn't totally break de novo assembly from my experience.
Old 06-05-2012, 10:37 AM   #5
aparna
Member
 
Location: USA

Join Date: Feb 2009
Posts: 15

Hi Koadman and ECO,
Thank you so much for your valuable insights and for the attachments.
I have not used MEME yet, but it looks like those 5' 14 bp are indeed insertion site (IS) bias.
I see the first 14 bp in the paired-end sequencing data as CCCTAACCCTAACC or GGGTTAGGGTTAGG.

We are comparing the Nextera and TruSeq WG amplification methods. As part of that, I am also interested in variant calling, to see the differences and take it from there. For this I originally mapped the data to the human reference hg19 using bwa with default settings. The difference in mapping and mate pairing is quite noticeable.

Nextera Untrimmed:

1214797540 in total
91017689 duplicates
915682948 mapped (75.38%)
843158634 properly paired (69.41%)
17398062 singletons (1.43%)

Nextera Trimmed (using bwa aln -B 14):

1214797540 in total
90178338 duplicates
900457327 mapped (74.12%)
735216938 properly paired (60.52%)
21143784 singletons (1.74%)

With trimming we were expecting mapping comparable to the TruSeq data, which had about 94% mapped reads with 92% properly paired, but that is not what we got. I am wondering what went wrong. Would you suggest anything else?
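
The percentages in these two reports can be recomputed directly from the raw counts, which also makes it easy to look at other ratios, such as duplicates relative to mapped reads rather than to all reads. This is just arithmetic on the numbers posted above; the dictionary layout and function names are made up for the example.

```python
# Counts copied from the two mapping reports in this post.
REPORTS = {
    "Nextera untrimmed": {
        "total": 1214797540,
        "duplicates": 91017689,
        "mapped": 915682948,
        "properly_paired": 843158634,
        "singletons": 17398062,
    },
    "Nextera trimmed": {
        "total": 1214797540,
        "duplicates": 90178338,
        "mapped": 900457327,
        "properly_paired": 735216938,
        "singletons": 21143784,
    },
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def summarize(counts):
    """Express each category as a percentage of total (and duplicates of mapped)."""
    total = counts["total"]
    return {
        "mapped_pct": percent(counts["mapped"], total),
        "properly_paired_pct": percent(counts["properly_paired"], total),
        "singletons_pct": percent(counts["singletons"], total),
        "duplicates_pct_of_total": percent(counts["duplicates"], total),
        "duplicates_pct_of_mapped": percent(counts["duplicates"], counts["mapped"]),
    }


if __name__ == "__main__":
    for name, counts in REPORTS.items():
        print(name)
        for key, value in summarize(counts).items():
            print(f"  {key}: {value:.2f}")
```

For the untrimmed data this puts duplicates at about 7.5% of all reads and about 9.9% of mapped reads.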
Old 06-05-2012, 07:28 PM   #6
koadman
Member
 
Location: Sydney, Australia

Join Date: May 2010
Posts: 65

Have you filtered out Nextera adapter sequences from the reads with something like TagDust or Scythe for 3' contamination? If not, what does your insert size distribution look like? Do you have a Bioanalyzer trace? What method did you use for size selection?

In our early attempts at Nextera, where we relied on AMPure XP beads for size selection, we would see high rates of adapter contamination. We now do a broad-swath gel cut of 320-600 nt for all Nextera libraries, and the adapter contamination rates are much lower, usually 1% or less.

Illumina has finally shared their Nextera adapter sequences, so you could try filtering those reads and see whether your mapping rate goes up.

The apparent duplicate rate of 10% is also a bit worrisome, although with Nextera libraries this number can also be influenced by transposition bias and not just PCR cycles. If the tagmentation is heavily biased, two read pairs that are not PCR duplicates will be much more likely to start at the same positions.

Last edited by koadman; 06-05-2012 at 07:31 PM.
Old 06-06-2012, 04:20 AM   #7
aparna
Member
 
Location: USA

Join Date: Feb 2009
Posts: 15

Hi Koadman,

I had around 3 million reads that were 3' contaminants, and that's about it. As mentioned in my initial post, we looked for Illumina/Epicentre adapters/primers in our data and found only a few hundred.
I need to ask in the lab about the size selection and traces.
Post-mapping, the median insert size was about 200 bp, compared to 400 bp for TruSeq.

This data is really puzzling to me. We will figure it out and post an update here if possible.
Old 06-06-2012, 12:32 PM   #8
koadman
Member
 
Location: Sydney, Australia

Join Date: May 2010
Posts: 65

Oh, sorry, somehow I missed or didn't understand between your first two posts that the read mapping results in the 2nd post had been adapter filtered. Thanks for clarifying. 3 million reads out of 1.2 billion is not bad for adapter.

As for the pairing and insert distribution issue, thanks for telling us the median insert size but what does the entire distribution look like? If you did not do a gel cut during library prep there might be a long tail to this distribution. I am not sure exactly what threshold bwa uses to decide whether a pairing is "proper" or not, I wonder if many of your reads are mapping just barely too far apart for bwa to call them proper.
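
One way to look at the whole distribution rather than just the median is to pull the template length (TLEN, column 9) out of the SAM records and tabulate it, split by whether the aligner flagged the pair as proper (flag bit 0x2). A rough sketch, using made-up SAM lines and hypothetical function names; a real run would iterate over the text output of samtools view instead.

```python
import statistics

PROPER_PAIR = 0x2     # SAM flag bit: pair aligned as a proper pair
FIRST_IN_PAIR = 0x40  # SAM flag bit: first read of the pair

# Toy SAM records (hypothetical): one tight proper pair, one distant improper pair.
TOY_SAM = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t250\t200\t*\t*",
    "r1\t147\tchr1\t250\t60\t50M\t=\t100\t-200\t*\t*",
    "r2\t65\tchr1\t100\t60\t50M\t=\t5000\t4950\t*\t*",
    "r2\t129\tchr1\t5000\t60\t50M\t=\t100\t-4950\t*\t*",
]


def insert_sizes(sam_lines):
    """Yield (abs(TLEN), is_proper) once per pair, using the first mate only."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        flag, tlen = int(fields[1]), int(fields[8])
        # TLEN is 0 when the mates are not on the same reference or one is unmapped.
        if flag & FIRST_IN_PAIR and tlen != 0:
            yield abs(tlen), bool(flag & PROPER_PAIR)


def summarize(pairs):
    """Summarize insert sizes and the fraction of pairs flagged proper."""
    pairs = list(pairs)
    if not pairs:
        return {"pairs": 0}
    sizes = sorted(size for size, _ in pairs)
    proper = sum(1 for _, ok in pairs if ok)
    return {
        "pairs": len(sizes),
        "median": statistics.median(sizes),
        "max": sizes[-1],
        "proper_fraction": proper / len(sizes),
    }


if __name__ == "__main__":
    print(summarize(insert_sizes(TOY_SAM)))
```

Binning the sizes of the pairs that are not flagged proper would show whether they sit just past the tail of the main distribution or somewhere else entirely.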

As for the first 14bp, this is indeed puzzling. I notice that the two sequences you're observing are reverse complements of each other (apart from a one-base shift in frame), and that they also contain a 6nt direct tandem repeat. I wonder if this might be some kind of PCR artefact, but I really don't have much of a clue. Does the remaining portion of those reads contain the expected target sequence (human?)? If so, I wonder how the mapping looks with those sequences trimmed.
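
The relationship between the two 14-mers is quick to check in a few lines of code. The sketch below (function names are made up for the example) takes the reverse complement of one, finds the shortest repeating unit of each, and tests whether the units are rotations of one another. Run on the two sequences from the earlier post, it shows that they are built from the same 6 nt repeat on opposite strands: the exact reverse complement of CCCTAACCCTAACC is GGTTAGGGTTAGGG, which is the observed GGGTTAGGGTTAGG shifted by one base.

```python
def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def repeat_unit(seq):
    """Return the shortest prefix whose period explains the whole sequence."""
    for period in range(1, len(seq)):
        if all(seq[i] == seq[i + period] for i in range(len(seq) - period)):
            return seq[:period]
    return seq


def is_rotation(a, b):
    """True if b is a cyclic rotation of a."""
    return len(a) == len(b) and b in a + a


if __name__ == "__main__":
    first = "CCCTAACCCTAACC"
    second = "GGGTTAGGGTTAGG"
    rc = revcomp(first)
    print("revcomp of first:    ", rc)
    print("exact revcomp match: ", rc == second)
    print("repeat units:        ", repeat_unit(rc), repeat_unit(second))
    print("same repeat, shifted:", is_rotation(repeat_unit(rc), repeat_unit(second)))
```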

Last edited by koadman; 06-06-2012 at 09:54 PM. Reason: oops mistakenly read bowtie as read mapper instead of bwa
Old 01-16-2013, 10:05 AM   #9
mariruilo
Junior Member
 
Location: Oregon

Join Date: Dec 2012
Posts: 7

Aparna,

I'm doing RNAseq with Nextera and observed exactly the same pattern in the first 15 bp, with overrepresented sequences. I'm doing de novo assembly, and thought about running the assembly both with and without this portion trimmed. I would be interested to know what you finally did.

Thanks!
Old 01-17-2013, 03:49 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

Quote:
Originally Posted by mariruilo View Post
I'm doing RNAseq with Nextera and observed exactly the same pattern in the first 15 bp, with overrepresented sequences. I'm doing de novo assembly, and thought about running the assembly both with and without this portion trimmed. I would be interested to know what you finally did.

Thanks!
This is a known observation in RNAseq experiments. See this thread (there are possibly others) for additional information: http://seqanswers.com/forums/showthread.php?t=11843
Old 01-17-2013, 08:51 AM   #11
mariruilo
Junior Member
 
Location: Oregon

Join Date: Dec 2012
Posts: 7

Thank you so much, GenoMax! I'm a newbie to RNAseq, and all that information has been really helpful...
Old 06-04-2013, 11:30 AM   #12
ewilbanks
Member
 
Location: Davis, CA

Join Date: Mar 2009
Posts: 82

Does anyone happen to have a motif file for the insertion site described in the reference mentioned above?
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3292447/

I'm trying to determine why there are some holes in coverage for a Nextera library mapped to our reference sequences, and thought it might be useful to search for the abundance of the transposase insertion site motifs.