  • Merging 16S reads with FLASH - parameters?

    I have 300 bp paired-end Illumina reads generated on the MiSeq using Illumina's V3V4 16S protocol. The amplicon size is 460 bp.

    As the first step in my analysis, I'm using FLASH to merge these reads. I'm using the following command line:

    FLASH --min-overlap=20 --max-overlap=140 --read-len=300 --fragment-len=460 --fragment-len-stddev=1 --output-directory=MERGED --output-prefix=MERGED 612A-plate-1-H04_S88_L001_R1_001.fastq 612A-plate-1-H04_S88_L001_R2_001.fastq

    After FLASH completes, it gives the following warning:

    [FLASH] WARNING: An unexpectedly high proportion of combined pairs (62.47%) overlapped by more than 140 bp, the --max-overlap (-M) parameter. Consider increasing this parameter. (As-is, FLASH is penalizing overlaps longer than 140 bp when considering them for possible combining!)

    Since the theoretical maximum overlap should be 140 bp (2 × 300 - 460 = 140), that's what I set the max-overlap parameter to. How is it possible that so many reads overlap by significantly more than 140 bp? After running a few iterations, I found that I have to set 'max-overlap' to 159 to eliminate this warning.

    Just trying to understand how this parameter actually works. Maybe my amplicon is a little smaller than expected?

    EDIT: I just realized that I'm using both the 'read-len'/'fragment-len'/'fragment-len-stddev' parameters together with 'max-overlap' above, so the first three are ignored. If I use them without 'max-overlap', the calculated max-overlap is 152. I used 'max-overlap' to determine that 159 eliminates the warning.
    Last edited by cheezemeister; 05-19-2015, 01:45 PM.
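The overlap arithmetic in the post above can be sketched as follows. This is a minimal illustration, not FLASH's internal calculation: it assumes both reads start at opposite ends of the same fragment, that neither read has been trimmed, and that the fragment is at least as long as each read. The function names `expected_overlap` and `implied_fragment_len` are hypothetical.

```python
def expected_overlap(read1_len: int, read2_len: int, fragment_len: int) -> int:
    """Bases shared by two reads sequenced inward from opposite fragment ends.

    Assumes untrimmed reads and fragment_len >= each read length.
    """
    return max(0, read1_len + read2_len - fragment_len)


def implied_fragment_len(read1_len: int, read2_len: int, overlap: int) -> int:
    """Fragment length that would produce the given overlap (same assumptions)."""
    return read1_len + read2_len - overlap


# Values taken from the thread: 300 bp reads, nominal 460 bp amplicon.
print(expected_overlap(300, 300, 460))      # 140
# An observed overlap of 159 bp would correspond to a shorter fragment.
print(implied_fragment_len(300, 300, 159))  # 441
```

Under these assumptions, an overlap of 159 bp would correspond to a fragment of about 441 bp, which fits the poster's guess that the amplicon may be somewhat shorter than the nominal 460 bp. That is one possible reading, not a diagnosis.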

  • #2
    Have you scanned this data (with a trimming program) to see how much adapter-dimer contamination or read-through it has? Did FastQC indicate this as a possibility?

    • #3
      Haven't done that; however, adapters are trimmed at source by the MiSeq. I haven't quality-trimmed the data yet, since everything I've read says that merging first is the preferred method.

      Not sure why I would have read-through on a 460 bp amplicon using a 300 bp read.

      I can run FastQC and see.

      • #4
        Wasn't asking about quality trimming. You certainly want to merge first and then trim (if needed, for quality). Since we don't use onboard MiSeq analysis, I tend to forget that adapters may have already been trimmed. In that case you probably no longer have uniform 300 bp reads: trimmed reads could be short and will overlap more than you expect them to. FastQC will tell you about the size spread.

        Give BBMerge a try as well (from BBMap).
        Last edited by GenoMax; 05-19-2015, 03:06 PM.

        • #5
          Just selecting a representative file, FastQC reports my sequence length as 35-300 bp, though 70% are 300 bp and pretty much 100% are >280 bp.

          Since setting max-overlap to 159 eliminates the warning, and increasing it beyond that does not increase the % merged, that seems to jibe with essentially 100% of reads being 280 bp or longer.

          I'll also try BBmerge. Do you happen to know if BBmerge can do batch processing (I've got several thousand samples of data) and output the %merge in a table?
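The read-length spread that FastQC reports above can also be tallied directly from a FASTQ file. The sketch below is a hedged illustration that assumes standard four-line FASTQ records (no wrapped sequence lines); `length_counts` and `fraction_at_least` are hypothetical helper names, not part of FastQC or FLASH.

```python
from collections import Counter
from typing import Iterable


def length_counts(lines: Iterable[str]) -> Counter:
    """Count read lengths from FASTQ lines (assumes 4 lines per record)."""
    counts: Counter = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record holds the sequence
            counts[len(line.rstrip("\r\n"))] += 1
    return counts


def fraction_at_least(counts: Counter, min_len: int) -> float:
    """Fraction of reads whose length is at least min_len."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for length, n in counts.items() if length >= min_len) / total


def summarize(fastq_path: str, min_len: int = 280) -> None:
    """Print the share of reads at least min_len long for one uncompressed FASTQ."""
    with open(fastq_path, "rt") as handle:
        counts = length_counts(handle)
    print(f"{fastq_path}\t{fraction_at_least(counts, min_len):.3f}")
```

Calling `summarize("612A-plate-1-H04_S88_L001_R1_001.fastq")` would print the fraction of reads of 280 bp or more for that file; the 280 bp default simply mirrors the threshold mentioned in the post.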

          • #6
            Originally posted by cheezemeister:
            "Just selecting a representative file, FastQC reports my sequence length as 35-300 bp, though 70% are 300 bp and pretty much 100% are >280 bp."
            To clarify, was adapter-trimming by the machine the only trimming done? There should not really be anything in the 280-299 bp range if trimming was done correctly and the library was made correctly. Adapter-trimming is not necessary prior to merging; the position of adapters (if any) is obvious based on the overlap, and a good read-merger will trim them if present. I suggest you turn it off in this case unless you first generate an insert-size histogram and specifically note adapter sequence. If ~30% are getting trimmed to between 280 and 299 bp (when it should be 0%), perhaps the algorithm being used is a greedy one that matches even 1 bp. The end result will be inferior merging, as the overlap region is unnecessarily reduced.
            "I'll also try BBmerge. Do you happen to know if BBmerge can do batch processing (I've got several thousand samples of data) and output the %merge in a table?"
            BBMerge does not have a batch mode; you'd have to script that. It does print the percent merged for each dataset, though, which can be parsed from stderr.
            Last edited by Brian Bushnell; 05-19-2015, 04:36 PM.
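The scripting suggested above might look something like the following. This is a rough sketch under several assumptions: `bbmerge.sh` is on the PATH, files follow the `_R1_001.fastq` / `_R2_001.fastq` naming seen in the first post, and the summary on stderr contains a line of the form `Joined: <count> <percent>%`. That line format is an assumption and should be checked against the output of your BBMerge version. The helper names (`parse_joined`, `find_pairs`, `merge_pair`, `main`) and the output directory name are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# ASSUMED stderr summary line, e.g. "Joined:   123456   62.470%".
JOINED_RE = re.compile(r"^Joined:\s+(\d+)\s+([\d.]+)%", re.MULTILINE)


def parse_joined(stderr_text: str):
    """Return (pairs_joined, percent_joined), or None if no such line is found."""
    match = JOINED_RE.search(stderr_text)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def find_pairs(directory: str):
    """Pair each *_R1_001.fastq file with its *_R2_001.fastq mate, if present."""
    pairs = []
    for r1 in sorted(Path(directory).glob("*_R1_001.fastq")):
        r2 = r1.with_name(r1.name.replace("_R1_001", "_R2_001"))
        if r2.exists():
            pairs.append((r1, r2))
    return pairs


def merge_pair(r1: Path, r2: Path, out_dir: Path):
    """Run BBMerge on one pair and return (sample_name, parsed stats or None)."""
    sample = r1.name.replace("_R1_001.fastq", "")
    merged = out_dir / f"{sample}_merged.fastq"
    cmd = ["bbmerge.sh", f"in1={r1}", f"in2={r2}", f"out={merged}"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return sample, parse_joined(result.stderr)


def main(directory: str = ".", out_name: str = "MERGED_BBMERGE") -> None:
    """Merge every pair in a directory and print a tab-separated summary table."""
    out_dir = Path(out_name)
    out_dir.mkdir(exist_ok=True)
    print("sample\tpairs_joined\tpercent_joined")
    for r1, r2 in find_pairs(directory):
        sample, stats = merge_pair(r1, r2, out_dir)
        joined, pct = stats if stats is not None else ("NA", "NA")
        print(f"{sample}\t{joined}\t{pct}")
```

Calling `main()` from the directory holding the FASTQ files would process the pairs one at a time. For several thousand samples, running pairs in parallel may be worth considering, but that is left out to keep the sketch short.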

            • #7
              For future FLASH use, this should be noted:

              --read-len (-r) has no effect when --max-overlap (-M) is also specified!

              --fragment-len-stddev (-s) has no effect when --max-overlap (-M) is also specified!
