**ECO** · 06-04-2012, 03:42 PM

Disclaimer: I haven't used Nextera (but my graduate work involved recombinase specificity in the human genome).

I would bet that this is insertion site bias. Tn5's footprint is ~13-15bp...so this doesn't surprise me at all (in fact I would be amazed if Epicentre evolved it out of their production enzyme...).

Check out the figure in this paper...it has a consensus insertional bias for Tn5...(I'd love to see you run something like MEME on the starts of all your reads...)
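As a rough first look before running a motif finder such as MEME, one can tabulate the base frequencies at each of the first few read positions. The sketch below is only an illustration: the function name `start_base_frequencies` and the toy reads are invented here and do not come from anyone's data in this thread.

```python
from collections import Counter


def start_base_frequencies(reads, k=15):
    """Return one {base: frequency} dict per position for the first k bases."""
    counts = [Counter() for _ in range(k)]
    for read in reads:
        for i, base in enumerate(read[:k].upper()):
            counts[i][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        # Positions with no data (reads shorter than k) get an empty dict.
        freqs.append({b: c[b] / total for b in "ACGT"} if total else {})
    return freqs


# Toy reads, made up for illustration only.
toy_reads = ["ACGTAC", "ACGTTT", "ACCTAG", "TCGTAA"]
profile = start_base_frequencies(toy_reads, k=4)
for pos, f in enumerate(profile, start=1):
    print(pos, {b: round(v, 2) for b, v in f.items()})
```

A strongly skewed profile over the first ~15 positions, flattening out afterwards, would be consistent with an insertion site preference rather than a sequencing artefact.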

**koadman** · 06-04-2012, 07:54 PM

Hi aparna,

Yes, the Nextera kits use an engineered Tn5 that has a target site preference. In addition to the paper ECO mentioned, you might look at Supplementary Figure 1 of Adey et al. 2010:

http://genomebiology.com/content/supplementary/gb-2010-11-12-r119-s2.pdf

where they show the nucleotide distribution for the first several sites sequenced with those kits and compare it to sonication.

Whether this bias is a bane or a boon really depends on your application. For de novo assembly it may be troublesome. As I mentioned in another post, the group I work in has found (this is not yet published! treat as anecdote!) that overdigesting with Tn5 and eliminating the small fragments, e.g. <350 nt, helps to reduce the target site bias in the resulting library. If you are doing this kind of assembly, you might want to investigate some of the newer assemblers designed for data with uneven coverage, such as whole genome amplification data. IDBA-UD, diginorm, and one of the newer Euler releases come to mind; there is probably also something from the group at UMD/Johns Hopkins that would work. We developed our own pipeline, called A5, for these assemblies. The paper is in press.

On the other hand if you want to do SNP profiling the bias might be helpful in the same way that people use RAD-seq to focus sequencing reads to assay polymorphisms in a subset of the genome.

**cliffbeall** · 06-05-2012, 07:11 AM

We have seen the Nextera bias as well. Our data has seemed to assemble ok though. At any rate, the effect is nothing like whole genome amplification - orders of magnitude different.

For a 51 bp SE library in a reference-guided assembly we got an N50 of 14 kb and 94% coverage of the reference genome (almost all of the missing 6% appeared to be mobile DNA, presumably strain polymorphisms). This is a 44% GC bacterium with a 2 Mb genome.

I haven't compared it directly to TruSeq and would be very interested if someone has. But in my experience it shouldn't totally break de novo assembly.

**aparna** · 06-05-2012, 10:37 AM

Hi Koadman and ECO,
Thank you so much for your valuable insights and for the attachments.
I have not used MEME yet, but it looks like those 14 bp at the 5' end are indeed insertion site (IS) bias.
I see the first 14 bp in the paired-end sequencing data as CCCTAACCCTAACC or GGGTTAGGGTTAGG.

We are comparing Nextera vs TruSeq WG amplification methods. As part of this I am also interested in variant calling, to see the differences and take it from there. I originally mapped this data to the human reference hg19 using bwa with default settings. The difference in mapping and mate pairing is quite noticeable.

Nextera Untrimmed:

1,214,797,540 in total
91017689 duplicates
915682948 mapped (75.38%)
843158634 properly paired (69.41%)
17398062 singletons (1.43%)

Nextera Trimmed (using bwa aln -B 14):

1,214,797,540 in total
90178338 duplicates
900457327 mapped (74.12%)
735216938 properly paired (60.52%)
21143784 singletons (1.74%)

With trimming we were expecting good mapping, comparable to the TruSeq data, which had about 94% mapped reads with 92% properly paired, but that is not what we got. I am wondering what went wrong. Do you suggest anything else?

**koadman** · 06-05-2012, 07:28 PM

Have you filtered out Nextera adapter sequences from the reads with something like tagdust or scythe for 3' contamination? If not, what does your insert size distribution look like? Do you have a Bioanalyzer trace? What method did you use for size selection?

In our early attempts at Nextera, where we relied on the AMPure XP beads for size selection, we would see high rates of adapter contamination. We now do a broad-swath gel cut of 320-600 nt for all Nextera libraries and the adapter contamination rates are much lower, usually 1% or less.

Illumina has finally shared their Nextera adapter sequences, so you could try filtering those reads and see whether your mapping rate goes up.
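For anyone who wants a quick estimate of 3' adapter contamination before reaching for a dedicated tool such as tagdust or scythe, here is a minimal exact-match sketch. It assumes the adapter read-through sequence is `CTGTCTCTTATACACATCT`; please verify that against Illumina's own adapter document for your kit. The function name and the `min_overlap` parameter are made up for this example, and real trimmers also tolerate mismatches, which this does not.

```python
# Assumed Nextera read-through sequence; check Illumina's adapter document.
ADAPTER = "CTGTCTCTTATACACATCT"


def find_adapter_start(read, adapter=ADAPTER, min_overlap=5):
    """Return the index where 3' adapter sequence begins, or None if absent.

    Exact matching only: first a full adapter anywhere in the read, then a
    prefix of the adapter (at least min_overlap bases) at the read's 3' end.
    """
    idx = read.find(adapter)
    if idx != -1:
        return idx
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return len(read) - overlap
    return None


# Toy reads, invented for illustration.
full_hit = "ACGTACGTAC" + ADAPTER + "GG"
partial_hit = "ACGTACGTACGT" + ADAPTER[:8]
clean = "ACGTACGTACGTACGT"

for name, read in [("full", full_hit), ("partial", partial_hit), ("clean", clean)]:
    print(name, find_adapter_start(read))
```

Counting the reads for which this returns a position gives a lower bound on contamination, since sequencing errors inside the adapter are missed.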

The apparent duplicate rate of 10% is also a bit worrisome, although with Nextera libraries this number can be influenced by transposition bias and not just by PCR cycles. If the tagmentation is heavily biased, two read pairs that are not PCR duplicates will be much more likely to start at the same positions.
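The arithmetic behind the duplicate rate can be checked directly from the untrimmed counts posted above. Which denominator is the right one depends on how the duplicates were marked, which the thread does not say, so both are shown here as an assumption-free calculation on the posted numbers only.

```python
# Counts copied from the "Nextera Untrimmed" statistics in this thread.
total_reads = 1_214_797_540
mapped_reads = 915_682_948
duplicate_reads = 91_017_689

dup_of_total = duplicate_reads / total_reads
dup_of_mapped = duplicate_reads / mapped_reads

print(f"duplicates / total reads:  {dup_of_total:.2%}")
print(f"duplicates / mapped reads: {dup_of_mapped:.2%}")
```

That works out to roughly 7.5% of all reads, or just under 10% of mapped reads.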

**aparna** · 06-06-2012, 04:20 AM

Hi Koadman,

I had around 3 million reads that were 3' contaminants and that's about it. As mentioned in my initial post, we looked for Illumina/Epicentre adapters and primers in our data and found only a few hundred of them.
I need to ask in the lab about the size selection and traces.
After mapping, the median insert size was around 200 bp, compared to 400 bp for TruSeq.

This data is really puzzling to me. We will figure it out and post an update here if possible.

**koadman** · 06-06-2012, 12:32 PM

Oh, sorry, somehow I missed or didn't understand between your first two posts that the read mapping results in the 2nd post had been adapter filtered. Thanks for clarifying. 3 million reads out of 1.2 billion is not bad for adapter.

As for the pairing and insert distribution issue, thanks for telling us the median insert size, but what does the entire distribution look like? If you did not do a gel cut during library prep there might be a long tail to this distribution. I am not sure exactly what threshold bwa uses to decide whether a pairing is "proper" or not; I wonder if many of your reads are mapping just barely too far apart for bwa to call them proper.
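One way to look beyond the median is to summarise the quartiles and the fraction of pairs in the long tail. The sketch below assumes you have already extracted the template lengths (for example the TLEN column of the SAM records) into a list. The `upper_limit` value is a purely hypothetical cutoff chosen for illustration; it is not bwa's actual rule for calling a pair "proper".

```python
import statistics


def summarize_insert_sizes(sizes, upper_limit):
    """Summarise absolute insert sizes; zeros (unpaired/unknown) are skipped."""
    clean = sorted(abs(s) for s in sizes if s != 0)
    q1, median, q3 = statistics.quantiles(clean, n=4)
    beyond = sum(1 for s in clean if s > upper_limit) / len(clean)
    return {
        "n": len(clean),
        "q1": q1,
        "median": median,
        "q3": q3,
        "fraction_beyond_limit": beyond,
    }


# Toy template lengths, invented for illustration (negative = reverse mate).
toy_sizes = [150, 180, 200, 200, 210, 250, 400, 900, -190, 0]
summary = summarize_insert_sizes(toy_sizes, upper_limit=500)  # hypothetical cutoff
print(summary)
```

A small median combined with a large upper quartile or a noticeable fraction beyond the cutoff would point to the long tail described above.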

As for the first 14 bp, this is indeed puzzling. I notice that the two sequences you're observing are reverse complements of each other, and that they also contain a 6 nt direct tandem repeat. I wonder if this might be some kind of PCR artefact, but I really don't have much of a clue. Does the remaining portion of those reads contain the expected target sequence (human?)? If so, I wonder how the mapping looks with those sequences trimmed.
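The reverse-complement and tandem-repeat observations are easy to check on the two 14-mers exactly as they were typed in the thread. The helper names below are invented for this sketch. Taken literally, the exact reverse complement of CCCTAACCCTAACC is GGTTAGGGTTAGGG, so the second sequence matches it only after a one-base shift; both have a 6 nt period (CCCTAA / TTAGGG, which is also the unit of the human telomeric repeat).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def smallest_period(seq):
    """Smallest p such that seq[i] == seq[i + p] for every valid i."""
    for p in range(1, len(seq) + 1):
        if all(seq[i] == seq[i + p] for i in range(len(seq) - p)):
            return p


seq_a = "CCCTAACCCTAACC"
seq_b = "GGGTTAGGGTTAGG"
rc_a = reverse_complement(seq_a)

print("revcomp of seq_a:", rc_a)
print("exact match to seq_b:", rc_a == seq_b)
print("match after 1-base shift:", seq_b[1:] == rc_a[:-1])
print("periods:", smallest_period(seq_a), smallest_period(seq_b))
```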

**mariruilo** · 01-16-2013, 11:05 AM

Aparna,

I'm doing RNA-seq with Nextera and observed exactly the same pattern in the first 15 bp, with overrepresented sequences. I'm doing de novo assembly, and thought about running the assembly both with and without trimming this portion. I would be interested to know what you finally did.

Thanks!

**GenoMax** · 01-17-2013, 04:49 AM

Originally posted by mariruilo:

I'm doing RNA-seq with Nextera and observed exactly the same pattern in the first 15 bp, with overrepresented sequences. I'm doing de novo assembly, and thought about running the assembly both with and without trimming this portion. I would be interested to know what you finally did.

Thanks!

This is a known observation in RNA-seq experiments. You can see this thread (there are possibly others) for additional information: http://seqanswers.com/forums/showthread.php?t=11843

**mariruilo** · 01-17-2013, 09:51 AM

Thank you so much, GenoMax! I'm a newbie to RNA-seq and all that information has been really helpful...

**ewilbanks** · 06-04-2013, 11:30 AM

Does anyone happen to have a motif file for the insertion site described in the reference mentioned earlier in this thread?

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3292447/

I'm trying to determine why there are some holes in coverage for a Nextera library mapped to our reference sequences, and thought it might be useful to search for the abundance of the transposase insertion site motifs.
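If a consensus is available as an IUPAC string, counting its occurrences along a reference can be sketched as below. The motif `GYRC` used here is a placeholder invented for the example and is NOT the Tn5 insertion consensus; substitute the consensus from the paper. The function name is also made up, and only the forward strand is scanned.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_positions(sequence, motif):
    """Return 0-based start positions of all (overlapping) motif matches."""
    body = "".join(IUPAC[c] for c in motif.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=" + body + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Toy sequence and placeholder motif, for illustration only.
toy_reference = "ACGTGCACGCGT"
placeholder_motif = "GYRC"  # hypothetical, not the real Tn5 consensus
hits = motif_positions(toy_reference, placeholder_motif)
print(hits)
```

Binning the hit positions in windows along the reference and comparing them with the coverage profile would show whether low-coverage regions are also poor in matches.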

Help with Nextera WGS data