  • Illumina HiSeq library insert size

    Hello, I recently did an Illumina HiSeq 2x150 bp metagenomic sequencing run. I have some questions.

    1> I have got my sequencing report back from the sequencing center. It says the average insert size is about 600 bp, which I take to mean that most of the fragments prepared for sequencing were around 600 bp. I am confused by this. After I got my fastq files back (R1 and R2), I first merged the paired ends, and more than 60% of the read pairs joined successfully. I don't see how this could happen. Since the run only sequences 150 bp from each end and the fragment is 600 bp, there should be no overlap (150 x 2 = 300 bp << 600 bp). Why can I still join so many reads? And if I want to join more paired-end reads, the fragment size should be designed to be less than 300 bp, right?

    2> The report also says "300 cycles using the HiSeq system". This is straightforward: I suppose R1 and R2 get 150 cycles each. Each cycle adds one nucleotide, so 150 cycles give 150 bp. The sequencing center says they can also do a maximum of 500 cycles, which means 2x250 bp sequencing. I was wondering why they don't run more cycles, such as 1000, so we could get 2x500 bp and longer reads. Which factors restrict Illumina read length? From the report, it seems we could simply increase the number of cycles to get longer reads.

    Thanks,

  • #2
    Use BBMap to estimate insert sizes. There are two methods described here. That estimate of 600 bp is clearly wrong since you would not have been able to merge the R1/R2 reads otherwise.
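
    A minimal sketch of how an empirical insert-size histogram could be summarized once a tool has written one. It assumes a hypothetical two-column, tab-separated layout (insert size, count) with `#` comment lines; the function name and the example numbers are illustrative and are not actual BBMap output.

```python
from io import StringIO


def summarize_insert_histogram(lines):
    """Return (mean, median, mode) from 'size<TAB>count' lines.

    Lines that are empty or start with '#' are skipped.
    The two-column layout is an assumption; adapt it to your tool's output.
    """
    hist = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        size, count = line.split("\t")[:2]
        hist[int(size)] = hist.get(int(size), 0) + int(count)

    total = sum(hist.values())
    if total == 0:
        raise ValueError("histogram is empty")

    mean = sum(size * count for size, count in hist.items()) / total
    mode = max(hist, key=hist.get)

    # Median: walk the cumulative counts until half of the pairs are covered.
    running = 0
    median = None
    for size in sorted(hist):
        running += hist[size]
        if running * 2 >= total:
            median = size
            break
    return mean, median, mode


# Made-up example data, only to show the call.
example = StringIO("#InsertSize\tCount\n200\t10\n250\t30\n300\t10\n")
print(summarize_insert_histogram(example))
```

    Comparing the median or mode from the reads themselves against the 600 bp in the report shows directly how far the two disagree.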

    2x250 is the maximum supported length on HiSeq 2500, and 2x300 on MiSeq. Longer reads are not possible with currently available Illumina sequencing kits. One could do asymmetric runs (e.g. 1 x 600 bp), but that is not generally recommended because of the drop in quality you are bound to see towards the end of such runs.
    Last edited by GenoMax; 03-07-2017, 09:25 AM.


    • #3
      Hi GenoMax,

      Yes, I know the bioinformatics tool BBMap. According to their report, they determined the library size using an Agilent 2100 Bioanalyzer. I have never used a Bioanalyzer; I would guess it is an instrument that makes a physical measurement (not a bioinformatics tool). Are you suggesting that their report or measurement is wrong and that I should use bioinformatics tools to check it? Is it common for a Bioanalyzer to give a wrong number?

      So, I am correct, right? To join 2x150 bp reads, most inserts should be shorter than 300 bp.


      • #4
        BBMap is going to give you an absolute answer by actually using the data that is there. There is no ambiguity involved. It will work with or without a reference. The only case where it won't work is if your reads don't merge and you don't have a reference available.

        If you are able to join the PE reads then there are some inserts there that are smaller than 300 bp.
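
        The arithmetic behind this can be sketched as follows. The function names and the `min_overlap=12` threshold are hypothetical choices for illustration; real merging tools apply their own overlap and mismatch criteria.

```python
def overlap_or_gap(read_length, insert_size):
    """Bases covered by both R1 and R2 (positive) or the unsequenced gap (negative).

    For inserts shorter than one read, both reads cover the whole insert,
    so the overlap is capped at the insert size.
    """
    return min(insert_size, 2 * read_length - insert_size)


def can_merge(read_length, insert_size, min_overlap=12):
    """True if the pair overlaps by at least min_overlap bases (assumed threshold)."""
    return overlap_or_gap(read_length, insert_size) >= min_overlap


for insert in (250, 295, 600):
    print(insert, overlap_or_gap(150, insert), can_merge(150, insert))
```

        With 2x150 reads, a 600 bp insert leaves a 300 bp gap and cannot merge, so every merged pair must come from an insert comfortably below 300 bp.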

        While your library may have had fragments in the 600 bp range, if there were any of a smaller size (as indicated by tails on Bioanalyzer traces; you don't get an absolute answer from a Bioanalyzer, AFAIK), then those fragments will preferentially bind and form clusters.
        Last edited by GenoMax; 03-07-2017, 10:10 AM.


        • #5
          Hi Genomax,

          Thanks. What you said makes me think the sequencing center sent me a wrong report. They might mean the largest fragment. It doesn't make sense for them to build such large fragments. 2x150 bp can only sequence 300 bp at most. If they build a library with 600 bp inserts, there is a 300 bp gap in each fragment. The coverage won't be very good.

          Thanks,


          • #6
            Originally posted by SDPA_Pet View Post
            Hi Genomax,

            Thanks. What you said makes me think the sequencing center sent me a wrong report. They might mean the largest fragment. It doesn't make sense for them to build such large fragments. 2x150 bp can only sequence 300 bp at most. If they build a library with 600 bp inserts, there is a 300 bp gap in each fragment. The coverage won't be very good.

            Thanks,
            2x150 bp can only sample 300 bp from a fragment (whatever the fragment size, as long as it can get sequenced). You also need to keep in mind that there will always be a roughly "normal" distribution of fragment sizes in your library, with some tailing on both sides. How those tails look may determine what preferentially clusters on the flowcell (small fragments would).
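
            As a rough illustration of how much of such a distribution sits in the lower tail, here is a sketch that treats the library as an ideal normal distribution. The 600 bp mean comes from the report discussed in this thread; the 150 bp standard deviation and the 290 bp cut-off are assumptions chosen only for the example.

```python
import math


def fraction_below(threshold, mean, sd):
    """P(X < threshold) for X ~ Normal(mean, sd), via the error function."""
    z = (threshold - mean) / (sd * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


# Assumed values: mean from the report, sd and cut-off picked for illustration.
library_mean = 600.0
library_sd = 150.0
mergeable_cutoff = 290.0

share = fraction_below(mergeable_cutoff, library_mean, library_sd)
print(f"share of library below {mergeable_cutoff:.0f} bp: {share:.3f}")
```

            Under these assumptions only a small share of the library is short enough to merge, so a merge rate above 50% would point to something other than the nominal size distribution, such as preferential clustering of the small fragments or a reported size that is not the true insert size.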

            Choice of insert sizes depends on what you are trying to do. If you have a reference available then making the libraries so the two ends do not overlap makes sense since you can sample a larger region. If you must have the entire region covered by the two reads (i.e. reads need to overlap) then you would want to make inserts smaller.

            Which of these two cases were you wanting to do?


            • #7
              Originally posted by GenoMax View Post
              2X150bp only can sample 300 bp from a fragment (for what ever size fragment, as long as it can get sequenced).

              Choice of insert sizes depends on what you are trying to do. If you have a reference available then making the libraries so the two ends do not overlap makes sense since you can sample a larger region. If you must have the entire region covered by the two reads (i.e. reads need to overlap) then you would want to make inserts smaller.

              Which of these two cases were you wanting to do?
              I do not have a reference genome. My samples are environmental samples from soil. As I said, I got a quite good join rate (> 50%), which surprised me, because the report told me the average insert size is 600 bp.


              • #8
                What kind of samples are these and what will you be doing with them (assembly?) downstream?


                • #9
                  Originally posted by GenoMax View Post
                  What kind of samples are these and what will you be doing with them (assembly?) downstream?
                  They are soil samples. I did shotgun metagenomics and I am interested in microbial communities. I will not assemble the data, because normally less than 1% of reads can be assembled. My plan is to join whatever reads can be joined to get longer reads, and then annotate using those longer reads. These samples are from the environment, and you don't really have prior knowledge about what is in them. The workflow is different from that for a model organism.

                  PS: I don't understand the part of your previous post that says "If you have a reference available then making the libraries so the two ends do not overlap makes sense since you can sample a larger region". Just curious. I don't work on model organisms, so normally there is no reference database. But suppose they chose 2x150 bp, had a reference database, and used 600 bp inserts. You can only sequence 150 bp from either end, so you still get no information about the 300 bp in the middle of the fragment. Why would they build a library with larger fragments?


                  • #10
                    Are you sure they subtracted the adapter length from the fragment sizes to get the insert sizes (meaning, are you sure they're reporting insert size from the bioanalyzer?)? If the fragments themselves are an average of 600bp with a fairly wide distribution, it wouldn't be surprising if 60% of your reads merged with 150bp PE.
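
                    The subtraction in question is simple, sketched here with a hypothetical combined adapter length of 120 bp. The real value depends on the library prep kit and indexing scheme, so treat it as an assumption and check with the sequencing center.

```python
def insert_from_fragment(fragment_size, adapter_total=120):
    """Insert size = measured fragment size minus both adapters combined.

    adapter_total=120 is an assumed, kit-dependent value, not a measured one.
    """
    if adapter_total < 0 or fragment_size < adapter_total:
        raise ValueError("fragment must be at least as long as the adapters")
    return fragment_size - adapter_total


# A 600 bp peak on the trace would then correspond to a shorter insert.
print(insert_from_fragment(600))
```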

                    That said, we've (very rarely) had libraries that gave drastically different results between bioanalyzer, fragment analyzer, and tapestation, with the empirical insert size distributions determined after sequencing not agreeing with any of them.


                    • #11
                      Originally posted by SDPA_Pet View Post
                      They are soil samples. I did shotgun metagenomics and I am interested in microbial communities. I will not assemble the data, because normally less than 1% of reads can be assembled. My plan is to join whatever reads can be joined to get longer reads, and then annotate using those longer reads. These samples are from the environment, and you don't really have prior knowledge about what is in them. The workflow is different from that for a model organism.
                      SPAdes has a meta option for assembling metagenomes, and I am sure there are other options for this type of assembly. You would need access to a server with ample RAM, but it should be possible to assemble the data you have to some extent. Unless you have already tried this and are reporting 1% assembly based on that.

                      PS, I don't understand in your previous post about "If you have a reference available then making the libraries so the two ends do not overlap makes sense since you can sample a larger region". Just curious.
                      As long as you can map the two ends of a fragment onto a reference genome at the expected distance, you can consider that region as "sampled". Since the reads will cover the genome randomly, you should get read pairs mapping to or spanning the entire genome.
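
                      One way to put numbers on this is to compare the coverage from sequenced bases alone with the coverage from whole fragments spanned by properly mapped pairs (often called physical or span coverage). The sketch below uses made-up values for the number of pairs and the genome size; only the 2x150 read length and 600 bp insert come from this thread.

```python
def sequence_coverage(n_pairs, read_length, genome_size):
    """Average depth from bases actually sequenced (both reads of each pair)."""
    return n_pairs * 2 * read_length / genome_size


def physical_coverage(n_pairs, insert_size, genome_size):
    """Average depth counting the whole fragment spanned by each mapped pair."""
    return n_pairs * insert_size / genome_size


# Hypothetical run: 1 million pairs on a 5 Mb genome.
n_pairs = 1_000_000
genome_size = 5_000_000

print(sequence_coverage(n_pairs, 150, genome_size))
print(physical_coverage(n_pairs, 600, genome_size))
```

                      For the same number of pairs, longer inserts span more of the genome even though the number of sequenced bases is unchanged, which is why non-overlapping pairs can make sense when a reference is available.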


                      • #12
                        Also, do not forget that clustering efficiency depends on insert size.

                        Basically, even though your library has 600 bp fragments, they would cluster less efficiently (~10x?) than the 200-300 bp fragments present in the sample. As a result, one gets a peak on the FLASH histogram at about 1/3 of the way up the rising side of the bell curve produced by the Bioanalyzer. (You get enrichment of the smaller fragments during the clustering stage.)

                        PS: With the latest iteration of Illumina instruments (HiSeq 4000/NovaSeq), they seem to continue to support libraries with up to 350 bp insert size. Shorter inserts give you smaller and brighter clusters/wells and are less likely to be long enough to jump to neighbouring wells, so they can be sequenced at higher densities. As a result, HiSeq 4000/NovaSeq support 2x150 bp at most. If you need 2x250, stick with HiSeq 2500 or MiSeq.
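
                        A toy model of this enrichment, under loudly stated assumptions: the library is an ideal normal distribution (mean 600 bp, sd 150 bp), and clustering efficiency drops 10x for every 350 bp of extra length, so that 600 bp fragments cluster about 10x worse than 250 bp ones. The decay function and all of its parameters are hypothetical, not measured Illumina behaviour.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def relative_efficiency(size, ref_size=250.0, tenfold_span=350.0):
    """Hypothetical clustering efficiency: 10x lower per tenfold_span bp above ref_size."""
    return 10.0 ** (-(size - ref_size) / tenfold_span)


def clustered_distribution(sizes, mean, sd):
    """Library size distribution reweighted by clustering efficiency, normalized to 1."""
    weights = [normal_pdf(s, mean, sd) * relative_efficiency(s) for s in sizes]
    total = sum(weights)
    return [w / total for w in weights]


sizes = list(range(100, 1201, 10))
library_mean, library_sd = 600.0, 150.0  # assumed library shape
probs = clustered_distribution(sizes, library_mean, library_sd)

clustered_mean = sum(s * p for s, p in zip(sizes, probs))
clustered_mode = sizes[max(range(len(sizes)), key=probs.__getitem__)]
print(f"library mean {library_mean:.0f} bp, clustered mean {clustered_mean:.0f} bp, "
      f"clustered mode {clustered_mode} bp")
```

                        Even this crude reweighting moves the peak of what actually gets sequenced well below the library mean, which is the direction of the effect described above; the size of the shift depends entirely on the assumed efficiency curve.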


                        • #13
                          For our soil samples, assembled reads normally account for ~50% of the original reads. BTW, our data is >10 Gb per sample.


                          • #14
                            Hi, thanks. Can you explain more about the clustering stage? I don't know many details about HiSeq. By clustering stage, do you mean a step of library building, or bridge amplification?

                            Originally posted by Markiyan View Post
                            Also do not forget the clustering efficiency dependency on insert size.

                            Basically, even though your library has 600 bp fragments, they would cluster less efficiently (~10x?) than the 200-300 bp fragments present in the sample. As a result, one gets a peak on the FLASH histogram at about 1/3 of the way up the rising side of the bell curve produced by the Bioanalyzer. (You get enrichment of the smaller fragments during the clustering stage.)

                            PS: With the latest iteration of Illumina instruments (HiSeq 4000/NovaSeq), they seem to continue to support libraries with up to 350 bp insert size. Shorter inserts give you smaller and brighter clusters/wells and are less likely to be long enough to jump to neighbouring wells, so they can be sequenced at higher densities. As a result, HiSeq 4000/NovaSeq support 2x150 bp at most. If you need 2x250, stick with HiSeq 2500 or MiSeq.


                            • #15
                              Clustering = bridge amplification (for pre-ExAmp).

                              Clustering means bridge amplification for pre-ExAmp (non-patterned) flowcells: in situ PCR on the lawn of oligos on the flow cell surface. It follows similar rules to a regular PCR, only the product stays in situ, forming a forest of DNA strands.

                              For ExAmp chemistry (patterned flowcells), clustering means cluster formation using isothermal amplification.
                              (In theory this happens only in the occupied nanowell; in practice, especially at low loading concentrations, a few neighbours may join in too...)

                              Have a read about ExAmp & Hiseq4000:
                              http://core-genomics.blogspot.co.uk/...d-to-know.html
