
  • Optimal De Novo Assembly Parameters

    Hi All,

    I'm sequencing 10 kb PCR products using Illumina 150 x 2 paired-end reads. I'm trying to optimize a de novo assembly workflow, and was hoping that I would find some help here. I outline the process below. My goal is to de novo assemble the PCR product into a single, accurate contig. Questions are in red, interleaved with the step in the protocol they refer to. Thanks for any help and feedback.

    Program: Geneious
    Input: Trimmed reads (fastq). The sequencing core trims the adapters and barcodes for me.

    1) Pair the reads. This generates a single file, in which the pairs are now interleaved.

    2) BBNorm (Default settings). Normalize reads to 100x coverage.
    Would error correcting be beneficial?

    3) BBMerge. Merge the reads remaining after normalization. Merge rate is set to "normal".

    4) De Novo Assembly. I'm currently using the Geneious assembler. There are a lot of parameters that can be manipulated. The ones I'm using are attached as a screenshot. Please share if you think these parameters could be improved, and how.
    How frequent are miscalls in Illumina sequencing? I'm not sure how much overlap between neighboring reads I should require, or how many mismatches I should allow within that overlap. Also, how often are insertions introduced during the Illumina process? Should I allow gaps within reads?

    5) Extract contig consensus sequences. Minimum Coverage = 15, otherwise a gap is called.
    What is an acceptable minimum coverage? Since I'm sequencing a PCR product, I imagine I could increase this significantly. The benefit would be removing possible contaminating sequences.

    6) Map to reference. In an ideal world, only 1 of the contigs maps to the reference, and the others are background genomic DNA (my PCR reaction starts with a small amount of genomic DNA as template). If multiple contigs map, it could be because there were multiple viral genomes in the initial PCR reaction. In that case I have to throw the data out, as the individual genomes are too similar to differentiate accurately. So it's important that the de novo assembly is stringent, yet not so stringent that true neighboring reads cannot be assembled into a single contig.
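As a rough check on the numbers in steps 2 and 5, the sketch below estimates how many 2 x 150 read pairs correspond to 100x mean depth on a 10 kb amplicon, and masks consensus positions that fall below a minimum coverage. The function names, the per-position depth list, and the use of `N` as the gap character are assumptions made for illustration; this is not how Geneious or BBNorm implement these steps, and merging overlapping pairs would lower the effective depth.

```python
import math


def pairs_for_depth(amplicon_len: int, read_len: int, depth: float) -> int:
    """Read pairs needed so that mean depth (total bases / amplicon length) >= depth."""
    return math.ceil(amplicon_len * depth / (2 * read_len))


def mask_low_coverage(consensus: str, depths: list, min_cov: int = 15,
                      gap_char: str = "N") -> str:
    """Replace consensus bases whose depth is below min_cov with gap_char.

    gap_char="N" is an assumption; other tools may use a different symbol.
    """
    if len(consensus) != len(depths):
        raise ValueError("consensus and depths must have the same length")
    return "".join(
        base if depth >= min_cov else gap_char
        for base, depth in zip(consensus, depths)
    )


if __name__ == "__main__":
    # 10 kb amplicon, 150 bp reads, 100x target depth
    print(pairs_for_depth(10_000, 150, 100))
    # toy example: two positions fall below the threshold of 15
    print(mask_low_coverage("ACGTAC", [20, 15, 14, 3, 30, 15]))
```

Raising `min_cov` makes the consensus more conservative, at the cost of more masked positions at the amplicon ends where depth usually drops.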

    After this I intend to look for Open Reading Frames.
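For the open reading frame step, a minimal sketch of an ORF scan over all six frames might look like the following. It assumes the standard stop codons (TAA, TAG, TGA), requires an ATG start, and treats any codon containing an ambiguous base as an ordinary non-stop codon; `find_orfs` and `min_codons` are hypothetical names, not part of any tool mentioned in this thread.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30):
    """Return (strand, start, end) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and refer to the scanned strand
    (the reverse complement for strand "-"). min_codons counts codons from
    the ATG up to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
    print(find_orfs(toy, min_codons=3))
```

A single miscalled base that creates a premature stop would shorten or split an ORF in this scan, which is why the consensus should be checked against the reads before interpreting the result.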

    Thanks again for any feedback/suggestions!

    Jake
    Last edited by JVGen; 09-27-2016, 12:26 PM. Reason: Attached image

  • #2
    Originally posted by JVGen
    Would error correcting be beneficial?
    [...]
    How frequent are miscalls in Illumina sequencing?
    Miscalls in Illumina sequencing are not very frequent, and probably should not be a consideration at this step.
    As a QC measure, you can use the tool Pilon for error correction afterwards.

    General comment: There is normally no single optimal setting. It's usually best to test a range of values that you think could be reasonable.
    (We have a pipeline for this, which pushes the data through 5 assemblers with different parameters and then evaluates which assembly is potentially the best. There's at least one published pipeline for this out there, but I don't know the name right now.)
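To illustrate the "test a range and evaluate afterwards" idea, here is a small sketch that ranks candidate assemblies by simple contiguity metrics. The metrics (contig count, longest contig, N50) and the ranking order are assumptions chosen for a single-amplicon target; they are not the criteria used by the pipeline mentioned above, and the run labels are hypothetical.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def rank_assemblies(assemblies):
    """Rank assemblies, given as {label: [contig lengths]}.

    Fewer contigs rank first; ties are broken by the longest contig.
    This ordering is an illustrative assumption for a single-amplicon target.
    """
    def key(item):
        _, lengths = item
        return (len(lengths), -max(lengths, default=0))

    return [
        (label, len(lengths), max(lengths, default=0), n50(lengths))
        for label, lengths in sorted(assemblies.items(), key=key)
    ]


if __name__ == "__main__":
    # hypothetical contig lengths from three parameter settings
    runs = {
        "run_a": [6000, 3000, 1000],
        "run_b": [10000],
        "run_c": [4000, 3000, 2000, 1000],
    }
    for row in rank_assemblies(runs):
        print(row)
```

Contiguity alone says nothing about base accuracy, so a ranking like this would only be a first filter before checking the assembly against the mapped reads.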


    • #3
      Thanks Bastian. I'm learning this is quite a complex process. I intend to look for open reading frames, so misassembly/miscalls are quite worrisome. They could introduce nonsense mutations, and we wouldn't know the difference (because we're sequencing viruses which are highly mutated). We need a bioinformatician :P


      • #4
        Yeah, that can certainly happen during assembly; we've seen it in some comparative genomics tests.
        But as suggested, use Pilon for error correction afterwards. It needs the reads mapped to the assembly. It then checks whether the majority of the reads agree with the assembly and corrects the assembly where they do not.
        The tool is relatively easy to use if you're familiar with the command line and know what BAM and FASTA files are.
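The majority-agreement idea described above can be sketched in a few lines. This is a simplified illustration, not Pilon's actual algorithm: it handles substitutions only, takes a pre-computed pileup (one string of read bases per assembly position) instead of a BAM file, and the `min_depth` and `min_fraction` thresholds are assumptions.

```python
from collections import Counter


def polish_by_majority(assembly, pileup, min_depth=10, min_fraction=0.5):
    """Replace an assembly base when a clear majority of reads disagree with it.

    pileup[i] holds the read bases aligned to assembly position i.
    Returns the polished sequence and a list of (position, old, new) changes.
    """
    if len(assembly) != len(pileup):
        raise ValueError("assembly and pileup must have the same length")
    polished = []
    changes = []
    for pos, (ref_base, column) in enumerate(zip(assembly.upper(), pileup)):
        column = column.upper()
        new_base = ref_base
        if len(column) >= min_depth:
            base, count = Counter(column).most_common(1)[0]
            # require a strict majority so ties never trigger a change
            if count / len(column) > min_fraction and base != ref_base:
                new_base = base
                changes.append((pos, ref_base, base))
        polished.append(new_base)
    return "".join(polished), changes


if __name__ == "__main__":
    seq, edits = polish_by_majority(
        "ACGT", ["AAAA", "CCTT", "TTTG", "T"], min_depth=4
    )
    print(seq, edits)
```

Positions with too little depth or no clear majority are left unchanged, so a low-coverage region keeps whatever the assembler called.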


        • #5
          Thanks Bastian. Which de novo assemblers do you use? Many do not appear to save the reads; the output is only a contig consensus sequence (Tadpole and Velvet, for instance). The only one I've used that saves the aligned reads along with the assembled contig is the assembler within Geneious. I'd be interested to hear what you use.

          Thanks!
