  • rcorbett
    Member
    • Sep 2009
    • 29

    "ideal" insert size

    Has anyone come across a study or formal recommendation that gives a reason for choosing one ideal insert size for paired-end sequencing of human samples? Our laboratory staff have asked me this, and all I can tell them is that a really narrow distribution would be good; as for the insert size itself, I have little information to go on.
    We do both alignment and assembly on our data.

    Any help appreciated.
  • simonandrews
    Simon Andrews
    • May 2009
    • 870

    #2
    You don't mention the platform you're using, but I'd imagine the major constraint is going to be the technical limitations of your sequencer. On Illumina systems longer insert lengths will result in larger, dimmer spots reducing both the amount and quality of data you can obtain. We've run libraries with insert sizes up to about 1kb but I'm not sure I'd want to go much higher than that. There's often no point in having really short inserts either since you'll end up reading through the insert and into the adapter in a significant proportion of your reads.
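To make the read-through point concrete, here is a minimal sketch. It assumes "insert size" means the fragment length between the adapters and that both reads are the same length; the function name and the returned keys are illustrative only, not taken from any library.

```python
def pair_geometry(insert_size: int, read_length: int) -> dict:
    """Describe how two paired reads sit on one fragment.

    Assumption: insert_size is the fragment length between the adapters,
    and both reads have the same length.
    """
    # Bases at the end of each read that run past the fragment into adapter.
    adapter_bases = max(0, read_length - insert_size)
    # Read 1 covers [0, r1_end); read 2 covers [r2_start, insert_size).
    r1_end = min(read_length, insert_size)
    r2_start = max(0, insert_size - read_length)
    # Fragment bases covered by both reads (sequenced twice).
    overlap = max(0, r1_end - r2_start)
    # Fragment bases between the two reads that are never sequenced.
    gap = max(0, r2_start - r1_end)
    return {
        "adapter_bases_per_read": adapter_bases,
        "overlap": overlap,
        "unsequenced_gap": gap,
    }


for size in (80, 150, 500):
    print(size, pair_geometry(size, read_length=100))
```

With 100 bp reads, an 80 bp insert puts 20 adapter bases at the end of each read, a 150 bp insert has the two reads overlapping by 50 bp, and a 500 bp insert leaves 300 bp unsequenced between the mates.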

    The other big issue which may or may not be a factor for you is the amount of material you have. If you perform a very tight size selection then you're reducing the amount of material you have to create your library and you run the risk of getting a big pile of PCR artefacts if you start amplifying from too little material.

    I'm sure there are other considerations specific to your biological application. If you're doing assemblies you might want to look at mate pair libraries which allow the generation of paired sequences separated by much longer distances (2-5kb) whilst still keeping to the insert size limitations of the sequencing platform.

    • rcorbett
      Member
      • Sep 2009
      • 29

      #3
      Thanks for your input,
      Specifically, I've been asked this by the group here that is responsible for Illumina sequencing.

      They have cited the trade-off between tight distribution and yield, which makes sense to me.
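A rough way to put numbers on that trade-off is sketched below. It assumes, purely for illustration, that fragment lengths before size selection are normally distributed (real sheared DNA is usually skewed), and the mean, SD, and window values are made up rather than measured.

```python
import math


def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def fraction_retained(low: float, high: float, mean: float, sd: float) -> float:
    """Fraction of fragments whose length falls inside the window [low, high].

    Assumption: pre-selection fragment lengths are normal(mean, sd).
    """
    return normal_cdf(high, mean, sd) - normal_cdf(low, mean, sd)


# Hypothetical sheared sample: mean 300 bp, SD 100 bp.
for low, high in ((200, 400), (250, 350), (280, 320)):
    kept = fraction_retained(low, high, mean=300, sd=100)
    print(f"{low}-{high} bp window keeps {kept:.1%} of fragments")
```

Under these assumptions, narrowing the window from 200-400 bp to 250-350 bp cuts the retained material from roughly 68% to roughly 38%, which is the yield cost of a tighter distribution.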

      What befuddles me is that when I'm asked the question "if you could have any insert size, what would it be?" I don't have much to go on, other than that we don't want to sequence through the fragment twice. We have restrictions for WTSS, etc. that are driven by the sample, but for WGSS I'm looking for a bioinformatic reason to choose one size over another.

      Shouldn't there be some feature of hg18/hg19, such as SINEs/LINEs, that would necessitate a larger or smaller insert size for WGSS libraries, so that we can make more use of them bioinformatically (alignment and assembly)?

      • simonandrews
        Simon Andrews
        • May 2009
        • 870

        #4
        This is eventually going to come down to your use case. If you're doing some kind of ChIP experiment then you won't want to increase your insert size too much, since you'll lose resolution in your feature detection. I don't do much assembly, but my recollection from those who do is that it's useful to have a range of insert sizes (though maybe in separate experiments?) to allow for spanning of short and long repeats.

        Our experience has been that longer read lengths are negating many of the problems of duplicated alignments in remapping experiments. Once you're up to 50bp or so (either paired or single end) then a surprisingly high proportion of 'repeat' sequence is actually mappable. We work in backcrossed strains with no SNPs though, so maybe this is more of an issue if you have more diversity. These days most of the sequences we can't map come from regions not present in the genome assembly (telomeres and centromeres mostly), so there's not much we can do about that.

        • JohnK
          Senior Member
          • Feb 2010
          • 106

          #5
          I think your ideal insert size would be somewhere along the lines of the maximum insert and read length that allows you to maximize the throughput of your sequencing platform without saturating your data.

          • Michael.James.Clark
            Senior Member
            • Apr 2009
            • 207

            #6
            I think a lot of these answers are good.

            The optimal insert size depends on your experiment and goals.

            I'm assuming you're not talking about ChIP-seq (which is often best done single-end).

            For exome-seq, something around 200-350 bp is more than adequate for hitting >99% of the targets and assessing variants. Based on what I've seen, you can probably run >4 exomes per HiSeq lane this way.

            For whole genome, a combination of tightly distributed 200- and 2000-base inserts is optimal for human (for the sake of SV detection). The 2kb insert reads can be fairly low depth; they'll make up for the issues mapping over LINEs that you alluded to.
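As a sketch of why a tight distribution helps SV detection: a common simplification is to flag a read pair as discordant when its mapped insert size falls more than k standard deviations from the library mean. The cutoff k=3 and the library numbers below are assumptions for illustration; real SV callers use more elaborate models.

```python
def discordant_bounds(mean: float, sd: float, k: float = 3.0) -> tuple:
    """Insert-size range treated as concordant: mean +/- k * sd.

    Assumption: k = 3 is an arbitrary illustrative cutoff.
    """
    return (mean - k * sd, mean + k * sd)


def is_discordant(observed: float, mean: float, sd: float, k: float = 3.0) -> bool:
    """True if a pair's mapped insert size lies outside the concordant range."""
    low, high = discordant_bounds(mean, sd, k)
    return observed < low or observed > high


# A 200 bp fragment spanning a 100 bp deletion in the sample maps with an
# apparent insert size of about 300 bp on the reference.
apparent = 200 + 100
print(is_discordant(apparent, mean=200, sd=20))  # tight library
print(is_discordant(apparent, mean=200, sd=50))  # broad library
```

With an SD of 20 bp the concordant range is 140-260 bp, so the 300 bp pair stands out; with an SD of 50 bp the range is 50-350 bp and the same deletion is hidden inside the normal spread.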

            If you don't care about having the optimal SV detection rate, you can go with 200-350bp whole genome similar to exome without much issue (though the cost may be an issue).

            For the sake of phasing, a less tightly distributed insert with a mean of 2,000-3,000 bases would be great (expecting about 1 SNV/1kb).
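A back-of-the-envelope sketch of the phasing argument, assuming SNVs fall along the genome as a uniform Poisson process at the 1 SNV/kb density mentioned above (a simplification; real variant density is uneven). Under that model the gap between adjacent SNVs is exponentially distributed, so the fraction of gaps a given span can bridge is 1 - exp(-span * density). The function name is illustrative.

```python
import math


def fraction_of_gaps_bridged(span_bp: float, snvs_per_bp: float = 1 / 1000) -> float:
    """Fraction of adjacent-SNV gaps no longer than span_bp.

    Assumption: SNV positions follow a uniform Poisson process, so gaps
    between neighbouring SNVs are exponential with mean 1 / snvs_per_bp.
    """
    return 1.0 - math.exp(-span_bp * snvs_per_bp)


for span in (200, 1000, 2500):
    print(f"{span} bp span bridges {fraction_of_gaps_bridged(span):.0%} of gaps")
```

Under these assumptions a 200 bp insert can link only about 18% of neighbouring SNV pairs, while a 2.5 kb insert can link about 92%. A broader spread of insert sizes also means different pairs link SNVs at different distances.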
            Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
            Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
            Projects: U87MG whole genome sequence [Website] [Paper]
