  • Dereplication tools needed

    It's been a while since I got my hands on shotgun Illumina metagenomic data. I've found that it's quite important to dereplicate before doing any downstream analysis, to avoid problems with assembly and inaccurate quantification. The last time around I used usearch --derep_fulllength on a subset of the data to filter out artificial replicate reads, but it is choking on the larger datasets I have now. My approach was to identify a high-quality subsection of R1, dereplicate on that, and then remove the corresponding replicate reads from the raw data. The reason is that a single cycle can often have high error, and error is always higher at the end of the read, so some true replicates would be missed if the whole read were used.

    Can anyone recommend a good current tool for dereplicating Illumina reads? My datasets are about 20-30 million reads each. I came across Fulcrum through a Google search; does anyone have experience with it? (paper)
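For anyone who wants to see the windowed approach described above spelled out, here is a minimal Python sketch. It is not usearch or Fulcrum, only an illustration: the function names `read_fastq` and `derep_by_window` and the default window positions are made up for this example, and it assumes plain 4-line FASTQ records. It keeps the first read seen for each distinct R1 window and holds one key per distinct window in memory.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:  # end of file (or a truncated trailing record)
            return
        yield tuple(record)


def derep_by_window(handle, start=10, end=60):
    """Keep the first read seen for each distinct subsequence seq[start:end].

    `start` and `end` are hypothetical defaults; pick them from your own
    per-cycle quality profile. Reads shorter than `end` are passed through
    unchanged, because their window would not be comparable.
    """
    seen = set()
    for header, seq, plus, qual in read_fastq(handle):
        if len(seq) < end:
            yield header, seq, plus, qual
            continue
        key = seq[start:end]
        if key in seen:
            continue  # replicate of an earlier read within the window
        seen.add(key)
        yield header, seq, plus, qual


# Tiny demonstration: r2 differs from r1 only in its last two bases,
# so it counts as a replicate when the window is the first 8 bases.
demo = io.StringIO(
    "@r1\nACGTACGTAA\n+\nIIIIIIIIII\n"
    "@r2\nACGTACGTCC\n+\nIIIIIIII##\n"
    "@r3\nTTGTACGTAA\n+\nIIIIIIIIII\n"
)
kept = [rec[0] for rec in derep_by_window(demo, start=0, end=8)]
print(kept)
```

The headers that survive can then be used to pull the matching records out of the raw R1 and R2 files, so the pairs stay in sync.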

  • #2
    Here is a Python script that I wrote to dereplicate larger FASTA files: github
    The only requirement is that Biopython is on your Python path. Use the -h option for more information. An example usage would be derep_seqs.py -i somefasta.fna

    I hope it helps you!

    • #3
      The BBMap package contains a tool for dereplication. It's intended for dereplicating assemblies, but it is written so that it also works with paired-end FASTQ reads.

      Usage:
      dedupe.sh in=reads.fq out=fixed.fq maxsubs=0 int=t ac=f

      If your OS does not process bash shell scripts, you can replace "dedupe.sh" with "java -Xmx31g -cp /path/to/current jgi.Dedupe", where 31g should be adjusted to around 80% of your physical memory.

      "maxsubs=0" means that only exact matches are allowed; you can make that number higher if you want. "int=t" is used to indicate that the data is paired and interleaved. If your data is paired and in 2 files, you need to interleave it first:

      reformat.sh in1=reads1.fq in2=reads2.fq out=interleaved.fq
      (or java -Xmx200m -cp /path/to/current jgi.ReformatReads)
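To make clear what "interleaved" means here: in an interleaved FASTQ file each R1 record is followed immediately by its R2 mate. The Python sketch below produces that layout from two files. It is only an illustration of the file format, not how reformat.sh is implemented, and the function name `interleave_fastq` is made up for this example. It assumes plain 4-line records and that both files list the pairs in the same order.

```python
import io
from itertools import islice


def interleave_fastq(r1_handle, r2_handle, out_handle):
    """Write R1 and R2 records alternately; return the number of pairs written."""
    pairs = 0
    while True:
        rec1 = list(islice(r1_handle, 4))
        rec2 = list(islice(r2_handle, 4))
        if not rec1 and not rec2:
            return pairs  # both files ended together
        if len(rec1) != 4 or len(rec2) != 4:
            raise ValueError("R1 and R2 differ in length or a record is truncated")
        for line in rec1 + rec2:
            # Normalize line endings so a missing final newline cannot
            # glue two records together.
            out_handle.write(line.rstrip("\n") + "\n")
        pairs += 1


# Tiny demonstration with in-memory files.
r1 = io.StringIO("@p1/1\nACGT\n+\nIIII\n@p2/1\nGGCC\n+\nIIII\n")
r2 = io.StringIO("@p1/2\nTTTT\n+\nIIII\n@p2/2\nAAAA\n+\nIIII\n")
out = io.StringIO()
n_pairs = interleave_fastq(r1, r2, out)
headers = out.getvalue().splitlines()[::4]
print(n_pairs, headers)
```

With real data you would pass open file handles instead of the `io.StringIO` objects used in the demonstration.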

      Other options are random subsampling to reduce the volume of data, and error correction to help detect duplicates more reliably. The BBMap package contains an error-correction program (ecc.sh), though its effectiveness depends on the makeup of the metagenomic community and on data volume, since it relies on high k-mer depth. 30 million reads for a real metagenome is pretty small.
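As a rough illustration of the random subsampling option, the Python sketch below keeps each read pair of an interleaved FASTQ file with a fixed probability, so mates are always kept or dropped together. This is a generic sketch, not a BBMap tool; the function name `subsample_pairs` and its parameters are made up for this example, and it assumes 4-line records, so 8 lines per pair.

```python
import io
import random
from itertools import islice


def subsample_pairs(handle, fraction, seed=0):
    """Yield interleaved pairs (lists of 8 lines), each kept with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    while True:
        pair = list(islice(handle, 8))
        if len(pair) < 8:  # end of file (or a truncated trailing pair)
            return
        if rng.random() < fraction:
            yield pair


# Tiny demonstration on 100 synthetic pairs.
text = "".join(
    f"@p{i}/1\nACGT\n+\nIIII\n@p{i}/2\nTTTT\n+\nIIII\n" for i in range(100)
)
all_kept = list(subsample_pairs(io.StringIO(text), 1.0))
none_kept = list(subsample_pairs(io.StringIO(text), 0.0))
half_a = list(subsample_pairs(io.StringIO(text), 0.5, seed=1))
half_b = list(subsample_pairs(io.StringIO(text), 0.5, seed=1))
print(len(all_kept), len(none_kept), len(half_a))
```

The number of pairs kept varies around `fraction` times the input size rather than hitting it exactly, which is usually fine for reducing data volume.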
      Last edited by Brian Bushnell; 02-18-2015, 10:27 AM.
