  • Optimal De Novo Assembly Parameters

    Hi All,

    I'm sequencing 10 kb PCR products using Illumina 2 x 150 bp paired-end reads. I'm trying to optimize a de novo assembly workflow and was hoping to find some help here. I outline the process below. My goal is to de novo assemble the PCR product into a single, accurate contig. Questions are in red, interleaved with the step in the protocol they refer to. Thanks for any help and feedback.

    Program: Geneious
    Input: Trimmed reads (fastq). The sequencing core trims the adapters and barcodes for me.

    1) Pair the reads. This generates a single file, in which the pairs are now interleaved.

    2) BBNorm (Default settings). Normalize reads to 100x coverage.
    Would error correcting be beneficial?

    3) BBMerge. Merge the reads remaining after normalization. Merge rate is set to "normal".

    4) De Novo Assembly. I'm currently using the Geneious assembler. There are a lot of parameters that can be manipulated. The ones I'm using are attached as a screenshot. Please share if you think these parameters could be improved, and how.
    How frequent are miscalls in Illumina sequencing? I'm not sure how much overlap between neighboring reads I should require or, within this overlap, how many mismatches I should allow. Also, how often are insertions created during the Illumina process? Should I allow gaps within reads?

    5) Extract contig consensus sequences. Minimum Coverage = 15, otherwise a gap is called.
    What is an acceptable minimum coverage? Since I'm sequencing a PCR product, I imagine I could increase this significantly. The benefit would be removing possible contaminating sequences.

    6) Map to reference. In an ideal world, only 1 of the contigs maps to the reference, and the others are background genomic DNA (my PCR reaction starts with a small amount of genomic DNA as template). If multiple contigs map, it could be because there were multiple viral genomes in the initial PCR reaction. In that case I have to throw the data out, as the individual genomes are too similar to differentiate accurately. So it's important that the de novo assembly is stringent, yet not so stringent that true neighboring reads cannot be assembled into a single contig.
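For the depth figures in steps 2 and 5, a back-of-the-envelope calculation can help judge how much headroom a 10 kb amplicon has. The sketch below is plain read-depth arithmetic (depth = read pairs x 2 x read length / target length) and assumes untrimmed reads spread uniformly over the product; it is not how BBNorm itself estimates depth, and the function names are made up for illustration.

```python
import math


def mean_coverage(read_pairs: int, read_length: int, target_length: int) -> float:
    """Expected mean depth, assuming untrimmed reads spread uniformly over the target."""
    return read_pairs * 2 * read_length / target_length


def pairs_for_coverage(coverage: float, read_length: int, target_length: int) -> int:
    """Smallest number of read pairs that reaches the requested mean depth."""
    return math.ceil(coverage * target_length / (2 * read_length))


if __name__ == "__main__":
    amplicon = 10_000  # 10 kb PCR product
    read_len = 150     # 2 x 150 bp paired-end reads

    print(pairs_for_coverage(100, read_len, amplicon))  # pairs needed for 100x
    print(pairs_for_coverage(15, read_len, amplicon))   # pairs needed for 15x
    print(mean_coverage(50_000, read_len, amplicon))    # depth from 50,000 pairs
```

Under these assumptions, even a modest number of read pairs gives a depth far above the 15x consensus threshold, which supports the idea that the minimum coverage could be raised for a PCR product.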

    After this I intend to look for Open Reading Frames.
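As a rough illustration of the ORF search mentioned here, the sketch below scans all six reading frames of a contig for ATG-to-stop stretches. It assumes the standard stop codons and ATG as the only start codon, and the `find_orfs` name and `min_codons` parameter are hypothetical; it is a minimal sketch, not a replacement for the ORF finder in Geneious or any other tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30):
    """Return (strand, frame, start, end) tuples for ATG...stop ORFs.

    Coordinates are 0-based and end-exclusive on the scanned strand, and
    the end includes the stop codon. min_codons counts the ATG and all
    following codons up to, but not including, the stop codon.
    ORFs without a stop codon before the end of the sequence are skipped.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF in this frame
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ATG after this stop
    return orfs


if __name__ == "__main__":
    toy_contig = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
    for orf in find_orfs(toy_contig, min_codons=3):
        print(orf)
```

Because a single miscalled base can create or remove a stop codon, running a scan like this before and after any polishing step is one way to see whether the ORF calls are sensitive to the assembly settings.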

    Thanks again for any feedback/suggestions!

    Jake
    Last edited by JVGen; 09-27-2016, 12:26 PM. Reason: Attached image

  • #2
    Originally posted by JVGen
    Would error correcting be beneficial?
    [...]
    How frequent are miscalls in Illumina sequencing?
    Miscalls in Illumina sequencing are not very frequent (per-base substitution error rates are typically well below 1%, and indels are rarer still), so they should probably not be a consideration at this step.
    As a QC measure, you can use the tool Pilon for error correction afterwards.
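To put a number on "not very frequent" for a given read, the quality string in the FASTQ file can be converted to error probabilities using the Phred definition, Q = -10 * log10(p). The sketch below assumes Phred+33 encoding, which is an assumption about the input files; the function names are illustrative.

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def expected_errors(quality_string: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities for one read (assumes Phred+33 by default)."""
    return sum(phred_to_error_prob(ord(c) - offset) for c in quality_string)


if __name__ == "__main__":
    # A 150 bp read where every base has quality 'I' (Q40 under Phred+33).
    print(expected_errors("I" * 150))
    # The same read length at '5' (Q20 under Phred+33).
    print(expected_errors("5" * 150))
```

The expected number of miscalls per read gives a rough basis for deciding how many mismatches to tolerate in an overlap of a given length.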



    General comment: there is usually no single optimal setting. It's normally best to test a range of values that you think could be reasonable.
    (We have a pipeline for this that pushes the data through 5 assemblers with different parameters and then evaluates which assembly is potentially the best. There is at least one published pipeline for this out there, but I don't recall the name right now.)
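The "test a range and evaluate afterwards" idea can be sketched with a simple contiguity metric. The example below ranks assemblies by N50 with contig count as a tie-breaker; this is only one possible criterion and not the evaluation used by the pipeline mentioned above. The run names and contig lengths are invented for illustration, and N50 alone says nothing about misassemblies.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, longest first, reaches half the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def rank_assemblies(assemblies):
    """Sort run names: higher N50 first, fewer contigs as the tie-breaker."""
    return sorted(
        assemblies,
        key=lambda name: (-n50(assemblies[name]), len(assemblies[name])),
    )


if __name__ == "__main__":
    # Hypothetical contig lengths from three parameter settings (made-up numbers).
    runs = {
        "setting_a": [6000, 3000, 900],
        "setting_b": [9950, 400],
        "setting_c": [5000, 4800, 300, 200],
    }
    for name in rank_assemblies(runs):
        print(name, n50(runs[name]), len(runs[name]))
```

For a single 10 kb amplicon, a natural extra check would be whether the longest contig is close to the expected product length.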



    • #3
      Thanks Bastian. I'm learning this is quite a complex process. I intend to look for open reading frames, so misassembly/miscalls are quite worrisome. They could introduce nonsense mutations, and we wouldn't know the difference (because we're sequencing viruses which are highly mutated). We need a bioinformatician :P



      • #4
        Yeah, that can certainly happen during assembly; we've seen it in some comparative genomics tests.
        But as suggested, use Pilon for error correction afterwards. It needs the reads mapped to the assembly; it then checks whether the majority of the reads agree with the assembly and corrects the assembly where they do not.
        The tool is relatively easy to use if you're familiar with the command line and know what BAM and FASTA files are.
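The majority-vote idea described above can be illustrated with a toy example. This is not Pilon's actual algorithm: it handles substitutions only, and the pileup input (one string of observed read bases per assembly position) is a hypothetical simplification of what would normally come from a BAM file. The thresholds are arbitrary illustrative defaults.

```python
from collections import Counter


def polish_by_majority(assembly, pileup, min_depth=10, min_fraction=0.5):
    """Replace assembly bases that a clear majority of mapped reads disagree with.

    assembly: consensus sequence as a string.
    pileup:   one string per assembly position holding the read bases seen there.
    Returns the polished sequence and a list of (position, old_base, new_base).
    """
    if len(assembly) != len(pileup):
        raise ValueError("pileup must have one column per assembly base")

    polished = []
    changes = []
    for pos, (ref_base, column) in enumerate(zip(assembly, pileup)):
        counts = Counter(base.upper() for base in column)
        depth = sum(counts.values())
        new_base = ref_base
        if depth >= min_depth:  # leave low-coverage positions untouched
            top_base, top_count = counts.most_common(1)[0]
            if top_base != ref_base.upper() and top_count / depth > min_fraction:
                new_base = top_base
                changes.append((pos, ref_base, top_base))
        polished.append(new_base)
    return "".join(polished), changes


if __name__ == "__main__":
    contig = "ACGT"
    columns = ["A" * 12, "C" * 3, "T" * 11 + "G", "T" * 6 + "A" * 6]
    print(polish_by_majority(contig, columns))
```

In this toy input only the third position is changed: the second column is below the depth threshold, and the fourth has no strict majority against the assembly.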



        • #5
          Thanks Bastian. Which de novo assemblers do you use? Many do not appear to save the reads; the output is only a contig consensus sequence (Tadpole and Velvet, for instance). The only one I've used that saves the aligned reads with the assembled contig is the assembler within Geneious. I'd be interested to hear what you use.

          Thanks!
