  • NextSeq Data

    We recently acquired a NextSeq machine and are not very impressed with the data. I've uploaded a spreadsheet containing some of the statistics here:



    The first tab is a HiSeq2000 2x150bp run. The insert size was below target, so I trimmed adapters before analyzing the data (no other preprocessing was run). The HS2000 is not really spec'd for 2x150, so, as you might imagine, the quality suffers toward the end. Regardless, it's pretty good: looking at the mapping stats, 99.55% of the reads mapped, and overall 79.85% of the reads were error-free.

    The next two tabs contain a couple of lanes of NextSeq bacterial sequence. Lane 1 generally seems to be the best, with quality dropping to a minimum at lane 4. But even for lane 1, only 96.47% of the reads mapped and 49.3% were perfect matches; by lane 4, 95.91% mapped and 38.91% were perfect. So the fraction of reads with errors rose from 20.15% on HS2000 (which does not support 2x150bp runs) to between 50.7% and 61.09% on NextSeq (which supposedly does), roughly a 2.5x to 3x increase. As you can see on the "Average Quality by Position" and "Error Rate vs Read Position" graphs, the comparison would be brutal - an order of magnitude or more - if you consider 2x100bp reads.

    Also, if you look at the "Quality Score Accuracy" graph, the HS2000 quality scores are fairly accurate and typically underestimate quality, while the NextSeq ones are inaccurate, overestimate quality by about 10 dB (that is, they understate the error rate about tenfold), and are quantized, so you can't easily quality-trim the NextSeq data to improve it.
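
The arithmetic behind those comparisons can be checked with a short sketch. It assumes, as the post does, that every read that is not a perfect match counts as a read with errors, and it uses the standard Phred definition Q = -10 * log10(P); the function names are illustrative and not part of any tool mentioned here.

```python
# Percent of error-free (perfect-match) reads, copied from the post.
perfect_pct = {
    "HiSeq2000": 79.85,
    "NextSeq lane 1": 49.3,
    "NextSeq lane 4": 38.91,
}


def reads_with_errors(perfect):
    """Percent of reads that are not perfect matches (assumption: 100 - perfect)."""
    return 100.0 - perfect


def fold_increase(platform, baseline="HiSeq2000"):
    """How many times more reads contain errors than in the baseline run."""
    return reads_with_errors(perfect_pct[platform]) / reads_with_errors(perfect_pct[baseline])


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10.0)


for name, perfect in perfect_pct.items():
    print(f"{name}: {reads_with_errors(perfect):.2f}% with errors, "
          f"{fold_increase(name):.2f}x the HiSeq2000 rate")

# A score overestimated by 10 units (claimed Q30, actual Q20) understates
# the error probability by a factor of ten.
print(phred_to_error_prob(20) / phred_to_error_prob(30))
```

Because the Phred scale is logarithmic, a constant 10-unit overestimate means a quality-trimming threshold chosen for Q20 would in practice be keeping bases closer to Q10.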

    The "Library Uniqueness" graph, generated by sampling a kmer from each read and hashing it to see if it was seen before, is also very odd for NextSeq: it is wavy. The graph should decrease monotonically, and any increase indicates a sudden error burst. My guess is that the period (~625,000 reads) corresponds to an image frame, and that the clusters around the edges of the frame are blurry, as one might expect from low-quality or miscalibrated optics.
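
The idea behind such a uniqueness curve can be sketched as follows. This is a minimal illustration, not the implementation used by bbcountunique.sh: it assumes the sampled kmer is simply the first `k` bases of each read, uses a Python set as the hash table, and the names `uniqueness_curve` and `rising_intervals` are hypothetical.

```python
def uniqueness_curve(reads, k=25, interval=25000):
    """Return [(reads_seen, percent_unique_in_interval), ...].

    For each read, take its first k-mer (an assumption of this sketch) and
    check whether that k-mer has been seen in any earlier read.
    """
    seen = set()
    curve = []
    unique = total = 0
    for n, read in enumerate(reads, start=1):
        if len(read) >= k:
            kmer = read[:k]
            total += 1
            if kmer not in seen:
                seen.add(kmer)
                unique += 1
        if n % interval == 0:
            curve.append((n, 100.0 * unique / total if total else 0.0))
            unique = total = 0  # start counting the next interval
    return curve


def rising_intervals(curve):
    """Read counts at which uniqueness went up relative to the previous interval."""
    return [n for (_, prev), (n, cur) in zip(curve, curve[1:]) if cur > prev]


# Tiny toy input: the third interval contains only never-seen k-mers,
# which is what a burst of sequencing errors would look like.
toy_reads = ["AAAA", "CCCC", "AAAA", "CCCC", "GGGG", "TTTT"]
toy_curve = uniqueness_curve(toy_reads, k=4, interval=2)
print(toy_curve)
print(rising_intervals(toy_curve))
```

As sequencing depth grows, more kmers have already been seen, so the curve should only fall; a kmer containing a sequencing error looks novel, which is why a rise points to an error burst.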

    The "Base Frequency vs Position" graph is also interesting: NextSeq has a clear A/T ratio bias that is not present in the HS data. The 3bp-wavelength sawtooth pattern probably has something to do with codon structure.
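
A base-frequency-by-position table of this kind can be computed directly from the reads. The sketch below is an illustration only (it is not how the bhist output of BBMap is produced), and the helper names `at_ratio` and `mean_by_phase` are hypothetical; the latter averages a base's frequency over positions sharing the same offset modulo 3, which is one simple way to expose a 3bp sawtooth.

```python
from collections import Counter


def base_frequency_by_position(reads):
    """Return one dict per read position mapping A/C/G/T/N to its frequency."""
    counts = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: c[b] / total for b in "ACGTN"})
    return freqs


def at_ratio(freq):
    """A-to-T frequency ratio at one position; 1.0 means no A/T bias."""
    return freq["A"] / freq["T"] if freq["T"] else float("inf")


def mean_by_phase(freqs, base, period=3):
    """Average frequency of `base` at positions 0, 1, 2 (mod period)."""
    sums = [0.0] * period
    n = [0] * period
    for i, f in enumerate(freqs):
        sums[i % period] += f[base]
        n[i % period] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, n)]


toy = base_frequency_by_position(["ACGT", "AAGT", "ATG"])
print(toy[1], at_ratio(toy[1]))
print(mean_by_phase(base_frequency_by_position(["ACGACG"]), "A"))
```

In double-stranded data without bias, A and T frequencies at a given position are expected to be close, so an `at_ratio` that stays away from 1.0 across positions is the kind of signal described above.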

    Does anyone else have data they'd like to share on NextSeq machines?

    P.S. Command lines I used:

    Code:
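    # Library uniqueness curve from up to 100M reads
    bbcountunique.sh in=reads.fq.gz reads=100000000 out=uniqueness.txt
    
    # Adapter-trim the first 4M reads against the Nextera and TruSeq adapter sequences
    bbduk.sh in=reads.fq.gz reads=4000000 ktrim=r k=25 hdist=1 mink=12 tbo tpe ref=nextera.fa,truseq.fa out=ktrimmed.fq.gz ow
    
    # Map the trimmed reads and write the histograms (no ref= is given, so this presumably relies on an index built earlier)
    bbmap.sh in=ktrimmed.fq.gz reads=4000000 mhist=mhist.txt ihist=ihist.txt bhist=bhist.txt idhist=idhist.txt ehist=ehist.txt qhist=qhist.txt idbins=200 qahist=qahist.txt aqhist=aqhist.txt indelhist=indelhist.txt gchist=gchist.txt
    
    # Insert-size histogram from overlapping read pairs
    bbmerge.sh in=ktrimmed.fq.gz reads=4000000 ihist=ihist_merge.txt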

  • #2
    Thanks, Brian, for posting your analysis results. I wonder whether the HiSeq reads are also from a bacterial DNA library, prepared using the same protocol as the NextSeq ones.



    • #3
      The HiSeq reads are bacterial, but from a collection of 26 different isolates mixed together to form a synthetic metagenomic community. I don't know much about the preparation protocols, but certainly the insert sizes differ substantially, so at least size selection was probably different; maybe shearing too.



      • #4
        Interesting, thanks very much for the detailed analysis and your thoughts. I agree that the data looks a little worse than HiSeq, but they're at an early stage with the NextSeq chemistry. Far more serious would be the use of low-quality optics, which would be understandable at that price point.

        Any thoughts or observations on de novo assembly or SNP calling? I believe I saw a post on SeqAnswers saying SNP calling works fine on the NextSeq at the expense of a few more indel errors (compared to HiSeq data).

        We are interested in a direct comparison against the Ion Proton. These details indicate that the indel error rate here is a lot lower than what I've heard comes off the Proton. This is very important for getting good de novo assemblies, of course.

        Thanks again.



        • #5
          Originally posted by Brian Bushnell
          Hi Brian,

          We are looking into purchasing a NextSeq, but we have a concern regarding the quality of the reads it generates. Has your experience with the NextSeq improved since then?

          Your input is highly appreciated.

          James



          • #6
            V2 chemistry has substantially higher quality than V1; it's basically fine. However, it still has some issues with the barcode-reading cycles, which have caused problems with multiplexed runs; we've had some runs in which certain barcodes are misread ~95% of the time, and thus get demultiplexed into the unknown bin. Last I heard, Illumina was aware of this issue and working on it; I'm not sure what the current status is.
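
To illustrate why misread barcodes end up in the unknown bin, here is a minimal demultiplexing sketch. It is not Illumina's demultiplexer: it assumes a read is assigned to a sample only when its index read is within `max_mismatches` of exactly one expected barcode, and the sample names and barcode sequences are made up for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, barcodes, max_mismatches=1):
    """Return the sample whose barcode matches the index read, else 'unknown'.

    `barcodes` maps sample name -> expected barcode sequence (hypothetical values).
    """
    hits = [
        sample
        for sample, bc in barcodes.items()
        if len(bc) == len(index_read) and hamming(index_read, bc) <= max_mismatches
    ]
    # Ambiguous or unmatched index reads both go to the unknown bin.
    return hits[0] if len(hits) == 1 else "unknown"


# Hypothetical barcodes, not real TruSeq index sequences.
expected = {"sample1": "ACGTAC", "sample2": "TGCATG"}
print(assign_sample("ACGTAC", expected))  # exact match
print(assign_sample("ACGTAA", expected))  # one mismatch, still assigned
print(assign_sample("AAGTAA", expected))  # two mismatches, goes to unknown
```

Under a scheme like this, a barcode that is read with several errors in most clusters cannot be rescued by the mismatch tolerance, so nearly all of its reads land in the unknown bin.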



            • #7
              Brian,

              Thanks for your reply. Were the misread barcodes from Illumina, or were they custom ones prepared by you or your end user?

              Thanks

              James



              • #8
                I think they were Illumina TruSeq, but it's possible they were custom. They worked fine on HiSeq and MiSeq, though, and on NextSeq with V1 chemistry.
