  • Illumina Nextera Paired-End Sequence Content Bias: Is Trimming Required for De Novo Assembly?

    I'm working on a bacterial data set that I was having difficulty assembling.

    Illumina, 300 bp paired-end reads, Nextera library prep.

    The FastQC per-base sequence content chart (attached) shows strong sequence content bias in the first 15-20 positions. Initially, I thought it was adapter contamination and tried a variety of trimming tools (Trimmomatic and others) to remove what I thought were adapters. I then found a blog post (https://www.instapaper.com/read/496731324) suggesting that this is a library artifact of Nextera kits.

    After running the data through Trimmomatic, I took the paired output (ignoring the unpaired output for the time being) and manually trimmed the first 20 positions from the subset of reads showing the sequence bias. I was finally able to get a reasonable assembly.

    Questions:
    1) Does the sequence bias in the first 20 bases point to a problem with the library prep, or is it typical of Nextera and nothing to worry about?

    2) For de novo assembly, is it necessary to trim off the first ~20 bases? Is there a recommended tool or process, rather than just arbitrarily clipping the first 20 bases?

    3) I noticed Trimmomatic separates the reads into paired and unpaired sets. For de novo assembly, is there any reason NOT to include the unpaired data?

    Thanks in advance

  • #2
    Nextera libraries have highly nonuniform base composition in the first ~20 bp, but that is neither adapter sequence nor errors; it is just a fragmentation site bias. You don't need to trim it. If you did want to trim it, though, the only way would be to trim a fixed number of bases (the first X) from the start of every read.
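For anyone who wants to see what "trim the first X bases" amounts to, here is a minimal Python sketch. It assumes plain 4-line FASTQ records; `parse_fastq` and `head_crop` are hypothetical helper names made up for illustration, not part of any tool. Trimmomatic's `HEADCROP` step does the same kind of fixed-length crop, so in practice you would likely use that rather than rolling your own.

```python
from typing import Iterable, Iterator, Tuple

# (header, sequence, quality) for one read; illustrative type alias
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from strict 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def head_crop(record: FastqRecord, n: int) -> FastqRecord:
    """Drop the first n bases and their quality characters together."""
    header, seq, qual = record
    return header, seq[n:], qual[n:]


# Tiny made-up record, cropped by 4 bases
example = ["@read1\n", "ACGTACGTAC\n", "+\n", "IIIIIHHHHH\n"]
cropped = [head_crop(r, 4) for r in parse_fastq(example)]
print(cropped)  # [('@read1', 'ACGTAC', 'IHHHHH')]
```

The key point is that the sequence and quality strings must be cropped by the same amount, otherwise the record is no longer valid FASTQ.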

    For assembly, if you use a pair-aware assembler and have sufficient data, it's best to assemble from paired reads. Some assemblers allow you to specify both paired and unpaired reads in the same assembly, in which case you could use both. But if the assembler only allows you to give it paired OR unpaired reads, it's probably best to give it the paired reads only, rather than mixing all the reads together, which would require you running the data as unpaired. There is no strict answer that will be correct for all assemblers, as they make use of pairing data differently, or possibly not at all.
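To make the paired/unpaired distinction concrete, here is a rough Python sketch of how a trimmer could end up with both sets. It assumes reads are already trimmed and that a read is kept only if it is at least `min_len` bases long; `split_pairs` and the `min_len` default are hypothetical, chosen only for illustration, and this is not a description of any particular tool's internals.

```python
from typing import List, Tuple


def split_pairs(
    pairs: List[Tuple[str, str]], min_len: int = 36
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Partition trimmed read pairs into still-paired and orphaned reads."""
    paired: List[Tuple[str, str]] = []
    unpaired: List[str] = []
    for r1, r2 in pairs:
        keep1 = len(r1) >= min_len
        keep2 = len(r2) >= min_len
        if keep1 and keep2:
            paired.append((r1, r2))   # both mates survived: stays a pair
        elif keep1:
            unpaired.append(r1)       # only read 1 survived
        elif keep2:
            unpaired.append(r2)       # only read 2 survived
        # if neither survived, the pair is dropped entirely
    return paired, unpaired


# Made-up trimmed sequences, with a small min_len to keep the example short
trimmed = [("ACGTACGT", "TTGGCCAA"), ("ACGTACGT", "TT"), ("A", "C")]
paired, unpaired = split_pairs(trimmed, min_len=5)
print(paired)    # [('ACGTACGT', 'TTGGCCAA')]
print(unpaired)  # ['ACGTACGT']
```

The orphaned reads are still good sequence; they have just lost their mate, which is why a pair-aware assembler needs them passed in separately rather than mixed in with the pairs.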


    • #3
      Thanks for your help Brian.

      Your feedback that it isn't necessary to trim the first 15-20 bases due to fragmentation site bias led me to revisit my QC results.

      Another question: would you be willing to comment on the quality of the reverse read? Would you consider this a good run, or just an OK run? Do you typically see such a large quality range in the first few bases of the reverse read? The lab is tuning its protocols. Does this point to anything that might need to be changed?

      Adding this in case it helps others in the future.

      Working with Illumina Nextera-prepped, paired-end 300 bp reads.

      I have typically been taking a quick glance at the FastQC results. If the results looked good, I didn't bother with trimming/filtering the data before de novo assembly (I was relying on the assembler to make use of the quality scores).

      However, when I tried to assemble this data, the assemblies (from a variety of assemblers) were all terrible (thousands of small contigs). Mapping results looked fine.

      I was able to get a good assembly after running the data through Trimmomatic first. As Brian suggested, it is not necessary to trim off the first 15-20 bases due to fragmentation site bias...

      • #4
        I have never worked with 2x300 bp data; so far, we only go up to 2x250. So I'm not sure how typical that quality is for the last bases of read 2, but it certainly looks like it should be trimmed. Overall, the quality variability for read 2 seems higher than it should be, but I don't work on the wet-lab side, so I'm not sure what it might indicate.

        If you have plenty of data, you might experiment with throwing away reads with average quality below some threshold (or specifically, pairs in which either read is below the threshold), and see if that improves your assembly.
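A minimal Python sketch of that pair-level filter, assuming Phred+33 quality strings. `mean_quality`, `filter_pairs`, and the threshold of 20 are hypothetical names and values picked for illustration. It also takes a plain arithmetic mean of the Phred scores, which is a simplification; averaging error probabilities instead would give a stricter result.

```python
from typing import List, Tuple


def mean_quality(qual: str, offset: int = 33) -> float:
    """Arithmetic mean of Phred scores decoded from a quality string."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_pairs(
    pairs: List[Tuple[str, str, str]], min_mean_q: float = 20.0
) -> List[Tuple[str, str, str]]:
    """Keep a pair only if BOTH reads meet the mean-quality threshold.

    Each item is (pair_name, read1_quality, read2_quality).
    """
    return [
        (name, q1, q2)
        for name, q1, q2 in pairs
        if mean_quality(q1) >= min_mean_q and mean_quality(q2) >= min_mean_q
    ]


# Made-up quality strings: 'I' = Q40, '5' = Q20, '#' = Q2 (Phred+33)
pairs = [
    ("pairA", "IIII", "IIII"),  # both reads high quality
    ("pairB", "IIII", "####"),  # read 2 is poor, so the whole pair goes
    ("pairC", "5555", "IIII"),  # read 1 sits exactly at the threshold
]
kept = [name for name, _, _ in filter_pairs(pairs)]
print(kept)  # ['pairA', 'pairC']
```

Dropping the whole pair when either read fails keeps the remaining data strictly paired, which matters if the assembler expects matched read 1 and read 2 files.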


        • #5
          Since FastQC groups positions into larger intervals for long reads, it is difficult to see what may be going on with R2. You could turn off the grouping on the command line and check whether the tail end of R2 truly requires major trimming or discarding of reads.
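If it helps to see what an ungrouped view looks like, here is a small Python sketch that computes the mean quality at every single position rather than per interval. It assumes Phred+33 quality strings; `per_position_mean_quality` is a hypothetical helper, not FastQC's implementation. (FastQC itself has a `--nogroup` command-line option for this.)

```python
from typing import Iterable, List


def per_position_mean_quality(quals: Iterable[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position, with no binning of positions.

    Reads may differ in length; each position is averaged over the reads
    that are long enough to cover it.
    """
    totals: List[int] = []
    counts: List[int] = []
    for qual in quals:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read to reach this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / n for t, n in zip(totals, counts)]


# Made-up quality strings: 'I' = Q40, '5' = Q20 (Phred+33)
means = per_position_mean_quality(["II5", "I5"])
print(means)  # [40.0, 30.0, 20.0]
```

With one value per position, it becomes easy to find the exact base where R2 quality drops off, instead of guessing from an interval that spans several bases.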

          If this is a bacterial genome, I would suggest trying SPAdes, if you have not already done so.


          • #6
            In my experience, the FastQC quality plots look similar to what we see with TruSeq libraries.
            However, I always trim for adapters and quality.
            Especially with Nextera, the bead size selection, and 2x300 bp reads, you might end up with some adapter sequence in your read data.

            Do you do the trimming on the MiSeq directly or separately afterwards? To get a feel for the adapter contamination, I would recommend turning off the adapter trimming function on the MiSeq.

            Concerning the first 20 bp, I agree with Brian; it looks the same for the Nextera libraries we have sequenced so far.
