
  • MiSeq Amplified Amplicons sequencing: good Qscore, bad error rate

    Hi All,

    I have been working on sequencing these PCR amplified amplicons on the MiSeq. I think the final roadblock I'm running into is the sequencing quality/error of these libraries on the MiSeq (2x150).

    I should add that these are low-complexity libraries. There is a built-in UMI in the first 12 nucleotides of read 1, after which there should be diversity. On read 2, about 19 nucleotides are identical across all the amplicons, and there should be diversity afterwards.
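The read 1 layout described above (12 nt UMI, then the variable insert) can be sketched as follows. This is a minimal illustration and not the poster's pipeline; the function names `split_umi` and `count_umis` are hypothetical, and the only value taken from the post is the 12-nucleotide UMI length.

```python
from collections import Counter

UMI_LENGTH = 12  # from the post: the first 12 nt of read 1 are the UMI


def split_umi(read1_seq, umi_length=UMI_LENGTH):
    """Split a read 1 sequence into (umi, insert)."""
    if len(read1_seq) < umi_length:
        raise ValueError("read is shorter than the UMI")
    return read1_seq[:umi_length], read1_seq[umi_length:]


def count_umis(read1_seqs, umi_length=UMI_LENGTH):
    """Count how often each UMI occurs across read 1 sequences."""
    return Counter(split_umi(seq, umi_length)[0] for seq in read1_seqs)
```

A small number of distinct UMIs relative to the number of reads would be consistent with the heavy duplication discussed later in the thread.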

    Knowing that the libraries have low diversity, I've been increasing the amount of PhiX spike in (most recently up to 35%).

    The run initially looks good: the Q-scores are high, and so is the per-sequence quality (I attached the FastQC graphs for read 1; read 2 is worse, but not significantly). Overall %>Q30 is about 89%.
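For reference, a %>Q30 figure like the one quoted above can be approximated directly from FASTQ quality strings. The sketch below assumes Phred+33 encoding (the standard for current Illumina FASTQ output); the function name is hypothetical, and the instrument software may count bases differently (for example, only clusters passing filter).

```python
def fraction_at_or_above_q(quality_strings, threshold=30, offset=33):
    """Fraction of base calls whose Phred quality is >= threshold.

    quality_strings: iterable of FASTQ quality lines (Phred+33 assumed).
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for char in qual:
            total += 1
            # Phred score = ASCII code minus the encoding offset
            if ord(char) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0
```

Since a Phred score Q corresponds to an estimated error probability of 10^(-Q/10), Q30 implies roughly 1 error in 1000 bases, which is why an 8% mapped error rate alongside 89% >Q30 looks contradictory.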

    The issue is that when the PhiX spike-in is mapped back to the genome, the reported error rate is 8%. I don't understand why the error rate is so high. I've been told that the mapped error rate is the more believable one.
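The idea behind a mapped error rate can be sketched as a mismatch count between aligned reads and the reference. This is a simplified illustration, not the instrument software's actual calculation: it assumes each read has already been aligned without gaps and paired with the matching reference segment, and it counts any difference (including N calls) as a mismatch. The function name and input format are hypothetical.

```python
def mismatch_rates(aligned_pairs):
    """Return (per_cycle_rates, overall_rate) for ungapped alignments.

    aligned_pairs: iterable of (read, reference_segment) strings of
    equal length, both in read orientation.
    """
    mismatches = []  # mismatch count per cycle
    totals = []      # number of reads covering each cycle
    for read, ref in aligned_pairs:
        if len(read) != len(ref):
            raise ValueError("read and reference segment differ in length")
        for cycle, (called, expected) in enumerate(zip(read, ref)):
            if cycle >= len(totals):
                totals.append(0)
                mismatches.append(0)
            totals[cycle] += 1
            if called != expected:
                mismatches[cycle] += 1
    per_cycle = [m / t for m, t in zip(mismatches, totals)]
    overall = sum(mismatches) / sum(totals) if totals else 0.0
    return per_cycle, overall
```

Looking at the per-cycle values rather than only the overall number shows whether errors accumulate late in the read or are high from the start.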

    The other thing I can think of is that the per base percentage content also doesn't look great (attached), which would suggest that even with 35% spike in, there's still not enough diversity on the loaded sample.
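A per-base content plot like the one mentioned above boils down to counting bases at each read position. A minimal sketch, with a hypothetical function name and assuming the reads are plain sequence strings:

```python
from collections import Counter


def per_base_content(reads, alphabet="ACGT"):
    """Return, for each read position, the fraction of each base.

    Depth at a position counts every read that reaches it, so N calls
    lower the A/C/G/T fractions rather than being ignored.
    """
    counts = []
    for read in reads:
        for pos, base in enumerate(read.upper()):
            if pos >= len(counts):
                counts.append(Counter())
            counts[pos][base] += 1
    content = []
    for counter in counts:
        depth = sum(counter.values())
        content.append({base: counter[base] / depth for base in alphabet})
    return content
```

Positions where one base approaches a fraction of 1.0 are the low-diversity cycles; a well-balanced library stays near 0.25 per base throughout.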

    Would spiking in more PhiX help with the situation? I haven't been using staggered sequencing primers. How would those work?
    Attached Files
    Last edited by SunPenguin; 12-09-2015, 09:28 AM.

  • #2
    I don't think increasing PhiX above 35% would do that much for you; according to Illumina, 10% should be sufficient. Is the 35% spike-in at least producing an alignment % of approximately the same value?

    The Q30 number looks good, but what do the other run stats look like? What's the cluster density, what's the PF rate, how does the FWHM graph look?

    The error rate for PhiX, in my experience, usually does not go quite so high when all the other metrics are normal. I have some amplicon data (more base pair diversity than yours so not directly comparable) but my error rates are all still under 1%, even when the % base composition deviates from expected.

    Have you been in contact with Illumina tech support at all?

    • #3
      Hi thank you for your reply. I'll try to get more information when I have access to the sav file tomorrow.

      I've been talking to the sequencing core here, and quite frankly they also are not quite sure. I remember them telling me that the cluster density is at around 900K/mm2, which is usually okay, though they thought it may be a bit high for low diversity libraries.

      I included an example AA trace of the libraries I was sequencing. The major product is at around 600 bp, though there is a minor product at around 450 bp.

      The rest of the fastqc is also here. The only other thing that I really didn't like about the sequencing run is how many repeated sequences there are. That's not too surprising, since this is one of the older libraries that probably went through too many PCR cycles. I don't really think that would interfere with PhiX sequencing quality though.

      We actually originally aimed for 40% spike-in, but the alignment ended up with 35%.
      Attached Files
      Last edited by SunPenguin; 12-09-2015, 04:59 PM.

      • #4
        It looks like the cluster density is fine, and your PhiX alignment isn't really that far off your target. Plus your sequencing core is probably watching out for anything hardware related that might occur during a run, so the issue probably isn't any of that.

        Looking at your (additional) data and comparing it against my amplicon runs, I'm seeing two things:

        1) my base pair composition resembles yours when I'm sequencing through a common sequence that exists on literally all of my reads. I'm attaching a screenshot of my base pair composition graph so you can see what I mean. The common region is in the first ~20bp of the run. Once I get through that common region, my base pair composition more closely resembles a more diverse sample.

        2) the kmer content graph you posted only has spikes in the first five positions. I don't think I've ever seen that, regardless of the library type. I'm used to seeing spikes across the board. I'm also attaching a screen cap of my kmer content graph so you can see. Apologies for the cropping, but I was trying to keep the image under the 146.5kb limit for the forum attachments.

        Anyway, these two things together are leading me to focus more on the low diversity of your sample. The number of amplicons you're sequencing must be very low (also supported by the amount of repeated sequence you're seeing?). It would make sense to me that if you have more amplicon library than PhiX, and the library is extremely low diversity, the error rates on the PhiX sequence might be inflated quite a lot. It's possible that the amplicon reads are interfering with estimates of phasing and dye crosstalk, and that this is pushing up error rates, even if the data isn't necessarily problematic. Have you tried using your amplicon sequence for any downstream analyses? I'd be interested to know if what you're seeing is an artifact or if the data actually has more mismatches in it.

        An interesting troubleshooting experiment would be to load a lot of PhiX (upwards of 50 or 60%) with your library and see if that corrects the issue.

        Also, as you mentioned these were older libraries and possibly had been overamplified in PCR, are you seeing similar results for runs with newer libraries where the number of PCR cycles has been adjusted down? I expect that might change the amount of overrepresented and/or duplicated sequence, if nothing else.
        Attached Files

        • #5
          Hi SunPenguin. I agree with Jessica that it is likely a low-complexity issue, but would add that it might be exacerbated by densities above 900k. We have run a lot of low-complexity amplicon libraries and some perform better than others (even with a healthy PhiX spike-in). There is a definite 'cliff edge' with these libraries, and they can impact PF% and sequence quality. A couple of things to try: 1) For amplicons we try to aim for a V2 density of ~700k. 2) Think about including a single-source sample in your sample set (e.g. E. coli DNA if doing 16S bacteria); this will give you an error rate for the amplicon you are generating.

          Is the library clean of dimer etc? - I ask because the Q-score drop off is quite abrupt after 100bp. If the library has a lot of short artefacts it might not be helping the situation either.

          • #6
            Hi all,

            Thank you guys for helping. I got my hands on the run file, and you guys were right; there were some funky metrics with the run.

            The PF% was only 74%, which is a lot lower than I had thought. The exact density is 923 K/mm2.

            The PhiX error rate also shoots up after 100 bp (though by cycle 100 the error rate is already at 3%). The FWHM is slightly large, I think, with the C and T signals drifting from 3.2 to 3.5 throughout the PE run, and A and G from about 2.8 to 3.

            I did clean the library with 0.7X SPRI, but as you can see on the AA trace I uploaded above, there are some 400 bp products. I agree that it's weird that the library seems to have a lot of short artifacts (I can see that in the repeated sequences in FastQC as well, which show some adapter sequences).

            I have salvaged some data for downstream analysis. I definitely was able to recover some sequences that match back to nblast, but yes the number of reads that I ended up salvaging was low compared to what came off of the sequencer, and the diversity was very low (overall maybe <100,000 unique species).

            I'm in the process of sequencing our new and improved library, which should be much more diverse, with fewer PCR cycles and steps. I'm also looking to pool several dissimilar libraries together in addition to PhiX. I'm looking through this data set here to make sure I don't make the same sequencing mistake again...

            • #7
              I run lots of amplicons (that's my main thing) and do at least 10% PhiX + 10% Nextera genomes or, if I don't have any genomes, I'll do 20% PhiX. Same as bunce, I aim for ~700k clusters. Your PF is technically out of spec, so Illumina will likely replace your kit, but they'll also tell you that the problem is your library.
              Microbial ecologist, running a sequencing core. I have lots of strong opinions on how to survey communities, pretty sure some are even correct.

              • #8
                I spoke with our sequencing core today again, and found that the reason for the high cluster density is apparently QC related. They had loaded the lane originally aiming for 600-700k/mm2, but it ended up being 900K/mm2.

                The QC was done by qPCR, so I'm really not quite sure what happened there... especially considering that the bioA/AA trace looks very clean.
                Last edited by SunPenguin; 12-14-2015, 05:39 PM.

                • #9
                  Hi All,

                  I have a similar question and I think this thread is the best fit for it (compared to the other posts I have checked).

                  I submitted some libraries for sequencing using MiSeq 600 V3, PE.
                  My sample has low diversity.
                  The length of my construct is 322 bp.
                  The core thought that doing 250 cycles would be better, so my reads are 251 bp long and read through into the adapter.
                  The core used a 15% PhiX spike-in (which I agreed with).
                  I got about 17M reads and the quality is relatively good, with 89% or so over Q30.
                  I saw in the SAV file that the error percentage is 1%.
                  There are important differences between the files of the two reads (number of mismatches, total number of sequences).

                  I have mapped some sequences to my known reference using Geneious, to get a visual idea of quality. In the attached figures you can see:
                  1) Lots of darker blue spots, which are mismatches.
                  2) Lots of errors in the adapter region.

                  I am puzzled whether:
                  1) it would be OK to remove the sequences with lots of errors in the adapters and keep the others for downstream analysis, or
                  2) this really indicates that this data set is not to be trusted for SNP calling. I mean, some reads would have errors in the sequence and not in the adapter; how can I tell? For long tracts of errors it is relatively easy, but for SNPs?
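One way to approach the question above is to look at how often each reference position carries a non-reference base across all reads: random sequencing errors tend to be spread thinly over positions, while a real variant recurs at the same position. The sketch below is only an illustration of that idea and not an established SNP caller. It assumes reads are already aligned without gaps to the start of the reference; the function name is hypothetical; `background_error=0.01` is taken from the 1% SAV error figure in the post, while `fold` and `min_depth` are arbitrary choices. It cannot separate real variants from PCR errors introduced early in library preparation.

```python
from collections import Counter


def candidate_variants(reads, reference, background_error=0.01,
                       fold=5.0, min_depth=10):
    """List (position, alt_base, frequency) for positions whose most
    common non-reference base exceeds fold * background_error."""
    counts = [Counter() for _ in reference]
    for read in reads:
        # Ignore any read-through past the end of the reference
        for pos, base in enumerate(read[:len(reference)]):
            counts[pos][base] += 1
    candidates = []
    for pos, counter in enumerate(counts):
        depth = sum(counter.values())
        if depth < min_depth:
            continue
        alt_counts = {b: n for b, n in counter.items() if b != reference[pos]}
        if not alt_counts:
            continue
        alt_base, alt_n = max(alt_counts.items(), key=lambda item: item[1])
        frequency = alt_n / depth
        if frequency >= fold * background_error:
            candidates.append((pos, alt_base, frequency))
    return candidates
```

Comparing the two reads of a pair at the same position, where they overlap, would be a further check, since an independent sequencing error is unlikely to occur identically in both.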

                  ----
                  EDIT:
                  Looking at these alignments in more detail, I noticed that in the primer binding region, the last part of the alignment (from nucleotide 153 on, 17 nt, not included in the reference), there don't seem to be that many errors. I think the errors there are the ones created during PCR for library preparation. The ones in the adapter are still a black box to me: why are there more there?
                  ----

                  A bit more detail if you need it:
                  The reference is a synthetic DNA of 187 bp. I transcribe it and subject the transcripts to cycles of evolution under different treatments. The sequence data are PCR products obtained from the evolved transcripts. If you want to know more, you can check the Continuous in vitro Evolution technique.

                  Please, let me know if you need more detail.

                  C


                  Last edited by cosmarium; 08-08-2016, 10:38 PM.
