SEQanswers


Old 02-16-2011, 08:27 AM   #1
rcorbett
Member
 
Location: canada

Join Date: Sep 2009
Posts: 29
Default "ideal" insert size

Has anyone come across a study or formal recommendation of some sort that gives reasons for choosing one ideal insert size for paired-end sequencing on human samples? I have been asked this by our laboratory staff, and all I can tell them is that a really narrow distribution would be good; as for the insert distance itself, I have little information to go on.
We do both alignment and assembly on our data.

Any help appreciated.
Old 02-17-2011, 12:33 AM   #2
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

You don't mention the platform you're using, but I'd imagine the major constraint is going to be the technical limitations of your sequencer. On Illumina systems longer insert lengths will result in larger, dimmer spots reducing both the amount and quality of data you can obtain. We've run libraries with insert sizes up to about 1kb but I'm not sure I'd want to go much higher than that. There's often no point in having really short inserts either since you'll end up reading through the insert and into the adapter in a significant proportion of your reads.
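
The read-through trade-off above can be sketched numerically (not from the thread; the function names are my own): once the read length exceeds the insert size, every read runs off the end of the fragment and into the adapter.

```python
def reads_through_adapter(read_length, insert_size):
    """A paired-end read runs into the adapter whenever the read is
    longer than the insert (the fragment between the adapters)."""
    return read_length > insert_size

def adapter_bases_read(read_length, insert_size):
    """Number of adapter bases appearing at the 3' end of each read
    (zero if the insert is at least as long as the read)."""
    return max(0, read_length - insert_size)
```

For example, 100 bp reads on an 80 bp insert put 20 bases of adapter at the 3' end of every read, whereas the same reads on a 300 bp insert never touch the adapter. In practice inserts are a distribution rather than a single length, so some fraction of short fragments will read through even when the mean is comfortably above the read length.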

The other big issue which may or may not be a factor for you is the amount of material you have. If you perform a very tight size selection then you're reducing the amount of material you have to create your library and you run the risk of getting a big pile of PCR artefacts if you start amplifying from too little material.

I'm sure there are other considerations specific to your biological application. If you're doing assemblies you might want to look at mate pair libraries which allow the generation of paired sequences separated by much longer distances (2-5kb) whilst still keeping to the insert size limitations of the sequencing platform.
Old 02-17-2011, 08:04 AM   #3
rcorbett
Member
 
Location: canada

Join Date: Sep 2009
Posts: 29
Default

Thanks for your input,
Specifically, I've been asked this by the group responsible for our Illumina sequencing.

They have cited the trade-off between tight distribution and yield, which makes sense to me.

What befuddles me is that when I'm asked "if you could have any insert size, what would it be?", I don't have much to go on other than that we don't want to sequence through the fragment twice. We have restrictions from WTSS, etc., which are driven by the sample, but for WGSS I'm looking for a bioinformatic reason to choose one size over another.

Shouldn't there be some feature of hg18/hg19, like SINEs/LINEs, that would necessitate a larger or smaller insert size for WGSS libraries, so that we can make more use of them bioinformatically (alignment and assembly)?
Old 02-18-2011, 12:40 AM   #4
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

This is eventually going to come down to your use case. If you're doing some kind of ChIP experiment then you won't want to increase your insert size too much, since you'll lose resolution in your feature detection. I don't do much assembly, but my recollection from those who do is that it's useful to have a range of insert sizes (though maybe in separate libraries?) to allow for spanning of both short and long repeats.

Our experience has been that longer read lengths are negating many of the problems of duplicated alignments in remapping experiments. Once you're up to 50bp or so (either paired or single end) then a surprisingly high proportion of 'repeat' sequence is actually mappable. We work in backcrossed strains with no SNPs though, so maybe this is more of an issue if you have more diversity. These days most of the sequences we can't map come from regions not present in the genome assembly (telomeres and centromeres mostly), so there's not much we can do about that.
Old 02-18-2011, 08:18 AM   #5
JohnK
Senior Member
 
Location: Los Angeles, China.

Join Date: Feb 2010
Posts: 106
Default

I think your ideal insert size would be the maximum insert and read length that lets you maximize the throughput of your sequencing platform without saturating your data.
Old 02-18-2011, 02:24 PM   #6
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

I think a lot of these answers are good.

The optimal insert size depends on your experiment and goals.

I'm assuming you're not talking about ChIP-seq (which is often best done single-end).

For exome-seq, something around 200-350 bp is more than adequate for hitting >99% of the targets and assessing variants. Based on what I've seen, that's probably >4 exomes per HiSeq lane.

For whole genome, a combination of tightly distributed 200- and 2000-base inserts is optimal for human (for the sake of SV detection). The 2 kb insert reads can be fairly low depth; they'll make up for the mapping issues over LINEs that you alluded to.
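
The SV-detection rationale here can be illustrated with a simplified sketch (my own, not from the thread): paired-end SV callers flag read pairs whose mapped distance deviates far from the library's insert-size distribution, and a tight distribution makes small deviations detectable.

```python
def is_discordant(observed_insert, lib_mean, lib_sd, n_sd=4):
    """Flag a read pair whose mapped insert size deviates more than
    n_sd standard deviations from the library mean -- the basic
    signal used for deletion/insertion calling from paired ends."""
    return abs(observed_insert - lib_mean) > n_sd * lib_sd

# A pair spanning a 1.5 kb deletion in a 200 bp library (sd 20 bp)
# maps ~1700 bp apart and is flagged; with a sloppy sd of 400 bp,
# the same event would hide inside the normal insert distribution.
```

This is why a tightly distributed short-insert library plus a long-insert (2 kb) library is a good combination: the tight distribution catches small events, while the long inserts span repeats that short pairs can't bridge.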

If you don't care about having the optimal SV detection rate, you can go with 200-350bp whole genome similar to exome without much issue (though the cost may be an issue).

For the sake of phasing, a less tightly distributed 2000-3000-base insert would be great (expecting about 1 SNV per 1 kb).
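
A quick back-of-the-envelope check of that phasing argument (my sketch, assuming heterozygous SNVs fall roughly as a Poisson process at ~1 per kb): a fragment is informative for phasing only if it covers at least one heterozygous site at each end's reach.

```python
import math

def p_spans_het(insert_bp, het_rate_per_bp=1 / 1000):
    """Probability that a fragment of length insert_bp covers at
    least one heterozygous SNV, under a Poisson model with the
    given per-base heterozygosity rate (~1 per kb in human)."""
    return 1 - math.exp(-insert_bp * het_rate_per_bp)
```

At 2.5 kb this gives 1 - e^(-2.5), about 0.92, so the large majority of fragments touch a heterozygous site, whereas a 300 bp insert covers one only about a quarter of the time. That's the case for the longer, looser inserts mentioned above.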
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]