SEQanswers

Old 03-07-2017, 08:01 AM   #1
SDPA_Pet
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 204
Default Illumina HiSeq library insert size

Hello, I recently ran Illumina HiSeq 2x150 bp metagenomic sequencing and have some questions.

1> I got my sequencing report back from the sequencing center. The report says the average insert size is about 600 bp, which I take to mean that the majority of fragments prepared for sequencing are around 600 bp. I am confused about this. After I got my fastq files back (R1 and R2), I first merged the paired ends, and >60% of the read pairs joined successfully. I don't see how this could happen. Since the method only sequences 150 bp from each end and the fragment is 600 bp, there should be no overlap (150 x 2 = 300 bp << 600 bp). Why can I still get so many reads joined? And if I want to join more paired-end reads, the fragment size should be designed to be less than 300 bp, right?

2> The report also says "300 cycles using the HiSeq system". This is straightforward: I suppose R1 and R2 are 150 cycles each. Each cycle adds one nucleotide, so 150 cycles gives 150 bp. The sequencing center says they can also do a maximum of 500 cycles, which means 2x250 bp sequencing. I was wondering why they don't run more cycles, such as 1000 cycles, so we could get 2x500 bp. This would give us longer reads. Which factors restrict Illumina read lengths? From the report, it seems we could simply increase the number of cycles to get longer reads.

Thanks,
Old 03-07-2017, 08:20 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,794
Default

Use BBMap to estimate insert sizes. There are two methods described here. That estimate of 600 bp is clearly wrong since you would not have been able to merge the R1/R2 reads otherwise.

2x250 is maximum supported length on HiSeq 2500 and 2 x 300 on MiSeq. One can't get longer sequencing lengths on currently available Illumina sequencing kits. One could run asymmetric runs (e.g. 1 x 600 bp) but that is not generally recommended due to drops in quality you are bound to experience towards the end of such runs.

Last edited by GenoMax; 03-07-2017 at 08:25 AM.
Old 03-07-2017, 08:32 AM   #3
SDPA_Pet
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 204
Default

Hi GenoMax,

Yes, I know the bioinformatics tool BBMap. According to their report, they determined the size of the library using an Agilent 2100 Bioanalyzer. I have never used a Bioanalyzer. I would guess it is a kind of instrument that makes a physical measurement (not a bioinformatics tool). Are you suggesting that their report or measurements are wrong and that I should use bioinformatics tools to check? Is it common for a Bioanalyzer to give a wrong number?

So I am correct, right? To join 2x150 bp reads, most inserts should be less than 300 bp.
Old 03-07-2017, 09:03 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,794
Default

BBMap is going to give you an absolute answer by actually using the data that is there. There is no ambiguity involved. It will work with or without a reference. The only case where it won't work is if you have reads that don't merge and you don't have a reference available.

If you are able to join the PE reads then there are some inserts there that are smaller than 300 bp.

While your library may have had fragments in the 600 bp range, if there were any of a smaller size (as indicated by tails on Bioanalyzer traces; you don't get an absolute answer from a Bioanalyzer, AFAIK), then those fragments will preferentially bind and form clusters.
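
For anyone who wants to check this on their own data without a reference: once a pair has been merged, the length of the merged read is the insert size of that fragment. Below is a minimal Python sketch of that idea. It is not what BBMap/BBMerge does internally; it assumes a plain 4-line-per-record FASTQ of already merged reads, and the toy records and the 260 bp cutoff are made up for illustration.

```python
import io
from collections import Counter

def merged_length_histogram(fastq_handle):
    """Count merged-read lengths; for a merged pair, length == insert size."""
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # sequence line of each 4-line FASTQ record
            counts[len(line.strip())] += 1
    return counts

def fraction_below(counts, cutoff):
    """Fraction of merged reads whose insert size is below `cutoff` bp."""
    total = sum(counts.values())
    below = sum(n for length, n in counts.items() if length < cutoff)
    return below / total if total else 0.0

# Toy merged reads (hypothetical data, three records)
demo = io.StringIO(
    "@r1\n" + "A" * 180 + "\n+\n" + "I" * 180 + "\n"
    "@r2\n" + "C" * 250 + "\n+\n" + "I" * 250 + "\n"
    "@r3\n" + "G" * 290 + "\n+\n" + "I" * 290 + "\n"
)
hist = merged_length_histogram(demo)
print(sorted(hist.items()))
print(fraction_below(hist, 260))
```

Note that this only describes the pairs that merged. Pairs that failed to merge have longer inserts and are invisible here, which is why a reference is needed to size those.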

Last edited by GenoMax; 03-07-2017 at 09:10 AM.
Old 03-07-2017, 10:24 AM   #5
SDPA_Pet
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 204
Default

Hi Genomax,

Thanks. What you said makes me think the sequencing center sent me a wrong report. They might mean the largest fragment. It doesn't make any sense for them to build such large fragments. 2x150 bp can only sequence 300 bp at most. If they build a library with a size of 600 bp, there are 300 bp gaps. The coverage won't be very good.

Thanks,
Old 03-07-2017, 11:46 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,794
Default

Quote:
Originally Posted by SDPA_Pet View Post
Hi Genomax,

Thanks. What you said makes me think the sequencing center sent me a wrong report. They might mean the largest fragment. It doesn't make any sense for them to build such large fragments. 2x150 bp can only sequence 300 bp at most. If they build a library with a size of 600 bp, there are 300 bp gaps. The coverage won't be very good.

Thanks,
2X150bp only can sample 300 bp from a fragment (for what ever size fragment, as long as it can get sequenced). You also need to keep in mind that there will always be a "normal" distribution of fragment sizes in your library with some tailing on both sides. How those tails look may determine the outcome of what preferentially clusters (small fragments would) on the flowcell.

Choice of insert sizes depends on what you are trying to do. If you have a reference available then making the libraries so the two ends do not overlap makes sense since you can sample a larger region. If you must have the entire region covered by the two reads (i.e. reads need to overlap) then you would want to make inserts smaller.
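
The two cases can be made concrete with a bit of arithmetic. This is only a sketch of the geometry of one read pair on one insert; the function name and the example insert sizes are illustrative, and it ignores sequencing errors and any minimum overlap a merging tool would require.

```python
def pair_geometry(insert_size, read_length=150):
    """Return (overlap, gap, covered) in bp for one read pair on one insert."""
    # A read cannot extend past the end of the insert (it would run into adapter)
    effective = min(read_length, insert_size)
    overlap = max(0, 2 * effective - insert_size)  # bases seen by both reads
    gap = max(0, insert_size - 2 * effective)      # unsequenced middle
    covered = insert_size - gap                    # insert bases seen at least once
    return overlap, gap, covered

for size in (100, 250, 300, 600):
    overlap, gap, covered = pair_geometry(size)
    print(f"insert {size} bp: overlap {overlap}, gap {gap}, covered {covered}")
```

With 2x150 reads, a 250 bp insert gives a 50 bp overlap (mergeable), while a 600 bp insert leaves a 300 bp unsequenced gap but spans a larger region when mapped to a reference.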

Which of these two cases were you wanting to do?
Old 03-07-2017, 11:48 AM   #7
SDPA_Pet
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 204
Default

Quote:
Originally Posted by GenoMax View Post
2X150bp only can sample 300 bp from a fragment (for what ever size fragment, as long as it can get sequenced).

Choice of insert sizes depends on what you are trying to do. If you have a reference available then making the libraries so the two ends do not overlap makes sense since you can sample a larger region. If you must have the entire region covered by the two reads (i.e. reads need to overlap) then you would want to make inserts smaller.

Which of these two cases were you wanting to do?
I do not have a reference genome. My samples are environmental samples from soils. As I said, I got a quite good join ratio (> 50%), which surprised me because the report told me the average insert size is 600 bp.
Old 03-07-2017, 11:57 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,794
Default

What kind of samples are these and what will you be doing with them (assembly?) downstream?
Old 03-07-2017, 12:41 PM   #9
SDPA_Pet
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 204
Default

Quote:
Originally Posted by GenoMax View Post
What kind of samples are these and what will you be doing with them (assembly?) downstream?
They are soil samples. I did shotgun metagenomics and I am interested in microbial communities. I will not assemble the data, because normally less than 1% of reads can be assembled. My plan is to join whatever reads can be joined to get longer reads, then annotate using the longer merged reads. These samples are from the environment and you don't really have prior knowledge about what is in them. The workflow is different from that for a model organism.

PS: I don't understand the part of your previous post that says "If you have a reference available then making the libraries so the two ends do not overlap makes sense since you can sample a larger region". Just curious. I don't work with model organisms, so normally there is no reference database. However, suppose they chose 2x150 bp and have a reference database, but use 600 bp inserts. You can only sequence 150 bp from either end, so you still can't get information about the 300 bp in the middle of the fragment. Why would they build a larger-fragment library?
Old 03-07-2017, 08:13 PM   #10
atcghelix
Member
 
Location: CA

Join Date: Jul 2013
Posts: 74
Default

Are you sure they subtracted the adapter length from the fragment sizes to get the insert sizes (meaning, are you sure they're reporting insert size from the bioanalyzer?)? If the fragments themselves are an average of 600bp with a fairly wide distribution, it wouldn't be surprising if 60% of your reads merged with 150bp PE.
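
To put rough numbers on that: a Bioanalyzer trace sizes the whole library molecule, adapters included, so the insert is shorter than the reported fragment. The sketch below assumes about 120 bp of total adapter sequence and a 10 bp minimum overlap for merging; both values are assumptions for illustration and depend on the kit and the merging tool.

```python
ADAPTER_TOTAL_BP = 120  # assumption: combined length of both adapters; check your kit

def insert_from_fragment(fragment_bp, adapter_total_bp=ADAPTER_TOTAL_BP):
    """Insert size = library fragment size minus the adapters on both ends."""
    return fragment_bp - adapter_total_bp

def pair_can_overlap(insert_bp, read_length=150, min_overlap=10):
    """True if two reads of `read_length` overlap by at least `min_overlap` bp."""
    return insert_bp <= 2 * read_length - min_overlap

for fragment in (400, 600):
    insert = insert_from_fragment(fragment)
    print(f"fragment {fragment} bp -> insert {insert} bp, "
          f"mergeable with 2x150: {pair_can_overlap(insert)}")
```

Under these assumptions a 600 bp fragment is a 480 bp insert, which still cannot merge with 2x150, so the adapter correction alone narrows the gap but only the short side of the distribution produces merged pairs.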

That said, we've (very rarely) had libraries that gave drastically different results between bioanalyzer, fragment analyzer, and tapestation, with the empirical insert size distributions determined after sequencing not agreeing with any of them.
Old 03-08-2017, 04:07 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,794
Default

Quote:
Originally Posted by SDPA_Pet View Post
They are soil samples. I did shotgun metagenomics and I am interested in microbial communities. I will not assemble the data, because normally less than 1% of reads can be assembled. My plan is to join whatever reads can be joined to get longer reads, then annotate using the longer merged reads. These samples are from the environment and you don't really have prior knowledge about what is in them. The workflow is different from that for a model organism.
SPAdes has a meta option for doing assemblies with metagenomes. I am sure there are other options for this type of assemblies. You would need access to a server with ample RAM but it should be possible to assemble the data you have to some extent. Unless you have already tried this and are reporting 1% assembly based on that.

Quote:
PS: I don't understand the part of your previous post that says "If you have a reference available then making the libraries so the two ends do not overlap makes sense since you can sample a larger region". Just curious.
As long as you can map the two ends of a fragment on a reference genome at the expected distance you could consider that region as "sampled". Since you will have reads that will randomly cover the genome, you should get reads mapping/spanning across entire genome.
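
One way to see the difference is to compare sequence coverage (bases actually read) with physical coverage (bases spanned by a mapped pair). The Python sketch below uses made-up numbers (one million pairs, a 5 Mb genome) purely for illustration.

```python
def coverage(n_pairs, insert_size, read_length, genome_size):
    """Return (sequence_coverage, physical_coverage) as fold coverage."""
    # Bases read per pair cannot exceed the insert itself
    sequenced_per_pair = min(2 * read_length, insert_size)
    sequence_cov = n_pairs * sequenced_per_pair / genome_size
    # A properly mapped pair spans the whole insert, sequenced or not
    physical_cov = n_pairs * insert_size / genome_size
    return sequence_cov, physical_cov

# Hypothetical run: 1,000,000 pairs of 2x150 on a 5 Mb genome
for insert in (250, 600):
    seq_cov, phys_cov = coverage(1_000_000, insert, 150, 5_000_000)
    print(f"insert {insert} bp: sequence {seq_cov:.0f}x, physical {phys_cov:.0f}x")
```

In this toy case the 600 bp inserts give twice the physical coverage of their sequence coverage, which is the sense in which longer inserts "sample a larger region" when a reference is available.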
Old 03-08-2017, 04:22 AM   #12
Markiyan
Senior Member
 
Location: Cambridge

Join Date: Sep 2010
Posts: 115
Lightbulb Do not forget the clustering efficiency change vs insert size.

Also do not forget the dependency of clustering efficiency on insert size.

Basically, although your library has 600 bp fragments, they would cluster less efficiently (~10x?) than the 200-300 bp fragments present in the sample. As a result, one gets a peak on the FLASH histogram in the area that is ~1/3x on the rising side of the bell curve produced by the Bioanalyzer. (You get enrichment of the smaller fragments during the clustering stage.)

PS: with the latest iteration of Illumina instruments (HiSeq 4000/NovaSeq) they seem to continue to support libraries with up to 350 bp insert size. Shorter inserts give you smaller and brighter clusters/wells and are less likely to be long enough to jump to neighbouring wells, so they can be sequenced at higher densities. As a result, we get a maximum of 2x150 bp support from HiSeq 4000/NovaSeq. If you need 2x250, stick with HiSeq 2500 or MiSeq.
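
The enrichment effect can be illustrated with a toy model. Everything numeric here is an assumption: a normal library distribution with mean 600 bp and SD 150 bp, and a made-up efficiency curve tuned only so that 600 bp fragments cluster about 10x less efficiently than 250 bp ones (the "~10x?" guess above). It is a sketch of the direction of the bias, not a calibrated model of any flowcell.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def relative_efficiency(size_bp):
    # Hypothetical model: 10x drop in clustering efficiency from 250 bp to 600 bp
    return 10 ** (-(size_bp - 250) / 350)

def fraction_mergeable(mean, sd, cutoff=300, weighted=False):
    """Fraction of inserts shorter than `cutoff`, optionally weighted by clustering."""
    sizes = range(50, 1201)
    weights = [
        normal_pdf(s, mean, sd) * (relative_efficiency(s) if weighted else 1.0)
        for s in sizes
    ]
    total = sum(weights)
    short = sum(w for s, w in zip(sizes, weights) if s < cutoff)
    return short / total

library = fraction_mergeable(600, 150)
clusters = fraction_mergeable(600, 150, weighted=True)
print(f"inserts < 300 bp in library:  {library:.3f}")
print(f"inserts < 300 bp in clusters: {clusters:.3f}")
```

Under these assumptions the share of sub-300 bp inserts among clusters is several times higher than in the library itself, so a merge rate well above what the Bioanalyzer trace suggests is not by itself evidence of a wrong report.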
Old 03-08-2017, 05:22 AM   #13
yzzhang
Member
 
Location: florida

Join Date: Jan 2013
Posts: 66
Default

For our soil samples, the assembled reads normally account for ~50% of the original reads. BTW, our data is >10 Gb per sample.
Old 03-08-2017, 05:41 AM   #14
SDPA_Pet
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 204
Default

Hi, thanks. Can you explain more about the clustering stage? I don't know many details about HiSeq. By clustering stage, do you mean a step of library building, or bridge amplification?

Quote:
Originally Posted by Markiyan View Post
Also do not forget the dependency of clustering efficiency on insert size.

Basically, although your library has 600 bp fragments, they would cluster less efficiently (~10x?) than the 200-300 bp fragments present in the sample. As a result, one gets a peak on the FLASH histogram in the area that is ~1/3x on the rising side of the bell curve produced by the Bioanalyzer. (You get enrichment of the smaller fragments during the clustering stage.)

PS: with the latest iteration of Illumina instruments (HiSeq 4000/NovaSeq) they seem to continue to support libraries with up to 350 bp insert size. Shorter inserts give you smaller and brighter clusters/wells and are less likely to be long enough to jump to neighbouring wells, so they can be sequenced at higher densities. As a result, we get a maximum of 2x150 bp support from HiSeq 4000/NovaSeq. If you need 2x250, stick with HiSeq 2500 or MiSeq.
Old 03-08-2017, 06:25 AM   #15
Markiyan
Senior Member
 
Location: Cambridge

Join Date: Sep 2010
Posts: 115
Lightbulb Clustering = Bridge Amplification (for pre ex amp).

Clustering means bridge amplification for pre-ExAmp (non-patterned flowcells): in situ PCR on the flow cell surface oligo lawn. It follows similar rules/laws to regular PCR, only the product stays in situ, forming a forest of DNA strands.

For ExAmp Chemistry (patterned flowcells) - Clustering means cluster formation using Isothermal Amplification.
(In theory only on the occupied nanowell, in practice, especially at low loading concentrations a few neighbours may join in too...).

Have a read about ExAmp & Hiseq4000:
http://core-genomics.blogspot.co.uk/...d-to-know.html
Old 03-08-2017, 06:27 AM   #16
SDPA_Pet
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 204
Default

Quote:
Originally Posted by Markiyan View Post
Clustering means bridge amplification for pre-ExAmp (non-patterned flowcells): in situ PCR on the flow cell surface oligo lawn. It follows similar rules/laws to regular PCR, only the product stays in situ, forming a forest of DNA strands.

For ExAmp Chemistry (patterned flowcells) - Clustering means cluster formation using Isothermal Amplification.
(In theory only on the occupied nanowell, in practice, especially at low loading concentrations a few neighbours may join in too...).

Have a read about ExAmp & Hiseq4000:
http://core-genomics.blogspot.co.uk/...d-to-know.html
Thank you. I did my sequencing on the older HiSeq 2500 platform.
Old 03-08-2017, 07:33 AM   #17
fanli
Senior Member
 
Location: California

Join Date: Jul 2014
Posts: 198
Default

Out of curiosity, why are you joining the read pairs? A lot of the metagenomics software out there now supports paired end reads as input. The metaSPAdes assembler @GenoMax mentioned requires paired end data IIRC.
Old 03-08-2017, 07:37 AM   #18
SDPA_Pet
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 204
Default

Quote:
Originally Posted by fanli View Post
Out of curiosity, why are you joining the read pairs? A lot of the metagenomics software out there now supports paired end reads as input. The metaSPAdes assembler @GenoMax mentioned requires paired end data IIRC.
Hey, I did try metaSPAdes; less than 1% of total reads assembled. A lot of people have tried an alternative method: join the paired ends to get longer reads, but don't assemble them. Then they use the long merged reads for BLAST or other annotation.
Old 03-08-2017, 07:39 AM   #19
fanli
Senior Member
 
Location: California

Join Date: Jul 2014
Posts: 198
Default

Would something like kraken or CLARK not be helpful? Are you trying to assemble and annotate de novo genomes? Or trying to figure out the microbial composition and functional content? I guess my point is you would discard ~40% of your data in the joining process, which may not be necessary depending on your task of interest.
Old 03-08-2017, 07:43 AM   #20
SDPA_Pet
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 204
Default

Quote:
Originally Posted by fanli View Post
Would something like kraken or CLARK not be helpful? Are you trying to assemble and annotate de novo genomes? Or trying to figure out the microbial composition and functional content? I guess my point is you would discard ~40% of your data in the joining process, which may not be necessary depending on your task of interest.
I am not interested in a specific genome in the soil community. Basically, I am just interested in the microbial composition and functional content. There is another approach that people commonly use: they don't assemble or merge pairs, and they just BLAST the raw data. However, I think BLAST with ~150 bp reads is worse. Some publications show that intermediate-length reads (merged pairs) are better than assembled longer reads or unassembled short reads for the question I am asking.