SEQanswers

09-27-2016, 12:22 PM   #1
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 39
Optimal De Novo Assembly Parameters

Hi All,

I'm sequencing 10 kb PCR products using Illumina 2 x 150 bp paired-end reads. I'm trying to optimize a de novo assembly workflow and was hoping to find some help here. The process is outlined below. My goal is to de novo assemble the PCR product into a single, accurate contig. My questions are interleaved with the protocol steps they refer to. Thanks for any help and feedback.

Program: Geneious
Input: Trimmed reads (fastq). The sequencing core trims the adapters and barcodes for me.

1) Pair the reads. This generates a single file, in which the pairs are now interleaved.

2) BBNorm (Default settings). Normalize reads to 100x coverage.
Would error correcting be beneficial?

3) BBMerge. Merge the reads remaining after normalization. Merge rate is set to "normal".

4) De Novo Assembly. I'm currently using the Geneious assembler. There are a lot of parameters that can be manipulated. The ones I'm using are attached as a screenshot. Please share if you think these parameters could be improved, and how.
How frequent are miscalls in Illumina sequencing? I'm not sure how much overlap between neighboring reads I should require, or how many mismatches I should allow within that overlap. Also, how often does Illumina sequencing introduce insertions? Should I allow gaps within reads?

5) Extract contig consensus sequences. Minimum Coverage = 15, otherwise a gap is called.
What is an acceptable minimum coverage? Since I'm sequencing a PCR product, I imagine I could increase this significantly. The benefit would be removing possible contaminating sequences.

6) Map to reference. In an ideal world, only 1 of the contigs maps to the reference, and the others are background genomic DNA (my PCR reaction starts with a small amount of genomic DNA as template). If multiple contigs map, it could be because there were multiple viral genomes in the initial PCR reaction. In that case I have to throw the data out, as the individual genomes are too similar to differentiate accurately. So it's important that the de novo assembly is stringent, but not so stringent that true neighboring reads cannot be assembled into a single contig.
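
For the miscall question in step 4, one rough way to reason about it: every base in a FASTQ file carries a Phred score Q, which by definition corresponds to an error probability of 10^(-Q/10). The sketch below turns that into an expected number of miscalls across an overlap. The 50-base, all-Q30 overlap is a made-up illustration, not a measurement from this run, and the function names exist only for this example.

```python
def phred_to_error_prob(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def expected_miscalls(quals: list[int]) -> float:
    """Expected number of miscalled bases in a stretch with these scores."""
    return sum(phred_to_error_prob(q) for q in quals)


# Hypothetical 50-base overlap in which every base is Q30 (assumption).
overlap_quals = [30] * 50
print(f"{expected_miscalls(overlap_quals):.3f}")  # prints 0.050
```

Summing the per-base probabilities over the real quality strings from the trimmed reads would give a data-driven starting point for the allowed-mismatch setting.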

After this I intend to look for Open Reading Frames.
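
A minimal sketch of what that ORF scan could look like, assuming the standard stop codons, ATG as the only start codon, and the forward strand only. The function name and the `min_codons` cutoff are hypothetical, and a real viral analysis would likely also need the reverse strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) coordinates, 0-based and end-exclusive, of
    forward-strand ORFs that run from an ATG to the next in-frame stop.
    min_codons counts the codons before the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None
    return orfs
```

A premature stop introduced by a miscall would show up here as an ORF that ends earlier than expected, which is why consensus accuracy matters before this step.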

Thanks again for any feedback/suggestions!

Jake
Attached Images
File Type: png Screen Shot 2016-09-27 at 4.25.46 PM.png (130.3 KB, 4 views)

Last edited by JVGen; 09-27-2016 at 12:26 PM. Reason: Attached image
10-06-2016, 06:54 AM   #2
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98

Quote:
Originally Posted by JVGen
Would error correcting be beneficial?
[...]
How frequent are miscalls in Illumina sequencing?

Miscalls in Illumina sequencing are not very frequent, and probably shouldn't be a consideration at this step.
As a QC measure, you can run Pilon afterwards for error correction.



General comment: There is normally no single optimal setting. It's usually best to test a range of values that you think could be reasonable.
(We have a pipeline for this, which pushes the data through 5 assemblers with different parameters and then evaluates which assembly is potentially the best. There's at least one published pipeline for this out there, but I don't recall the name right now.)
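
The "test a range and compare" idea can be sketched as follows. This is not the pipeline mentioned above: the run names, contig lengths, and ranking rule (fewest contigs first, then largest N50) are all assumptions, chosen because the target here is a single 10 kb amplicon.

```python
def n50(lengths: list[int]) -> int:
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def rank_assemblies(assemblies: dict[str, list[int]]) -> list[str]:
    """Order run names: fewest contigs first, ties broken by larger N50."""
    return sorted(
        assemblies,
        key=lambda name: (len(assemblies[name]), -n50(assemblies[name])),
    )


# Hypothetical contig lengths from three parameter settings (made up).
runs = {
    "overlap_25": [6200, 3900, 450],
    "overlap_50": [10050],
    "overlap_75": [5100, 4900],
}
print(rank_assemblies(runs))  # prints ['overlap_50', 'overlap_75', 'overlap_25']
```

Contig count and N50 only describe contiguity, so the top-ranked assembly would still need checking against the reference and the reads.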
10-06-2016, 06:57 AM   #3
JVGen

Quote:
Originally Posted by bastianwur
Miscalls in Illumina sequencing are not very frequent, and probably shouldn't be a consideration at this step.
As a QC measure, you can run Pilon afterwards for error correction.

General comment: There is normally no single optimal setting. It's usually best to test a range of values that you think could be reasonable.
(We have a pipeline for this, which pushes the data through 5 assemblers with different parameters and then evaluates which assembly is potentially the best. There's at least one published pipeline for this out there, but I don't recall the name right now.)

Thanks Bastian. I'm learning this is quite a complex process. I intend to look for open reading frames, so misassemblies and miscalls are quite worrisome. They could introduce nonsense mutations, and we wouldn't know the difference (because we're sequencing viruses, which are highly mutated). We need a bioinformatician :P
10-06-2016, 07:19 AM   #4
bastianwur

Yeah, that can definitely happen during assembly; we've seen it in some comparative genomics tests.
But as suggested, use Pilon for error correction afterwards. It needs the reads mapped to the assembly; it then checks whether the majority of the reads agree with the assembly and corrects it where they don't.
The tool is relatively easy to use if you're familiar with the command line and know what BAM and FASTA files are.
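
A rough sketch of the commands one polishing round might involve, assuming bwa, samtools, Java and a Pilon jar are installed and that the reads are in a single interleaved FASTQ. All file names, the `pilon_commands` helper and the `pilon.jar` path are placeholders. The function only builds the command strings; it does not run anything.

```python
import shlex


def pilon_commands(assembly: str, reads_interleaved: str,
                   prefix: str = "polished",
                   pilon_jar: str = "pilon.jar") -> list[str]:
    """Return, in order, the shell commands for mapping reads back to an
    assembly and running Pilon on the resulting BAM."""
    bam = f"{prefix}.sorted.bam"
    a, r, b, p, j = (shlex.quote(x) for x in
                     (assembly, reads_interleaved, bam, prefix, pilon_jar))
    return [
        f"bwa index {a}",                                # index the contigs
        f"bwa mem -p {a} {r} | samtools sort -o {b} -",  # -p: interleaved pairs
        f"samtools index {b}",                           # Pilon needs an indexed BAM
        f"java -jar {j} --genome {a} --frags {b} --output {p}",
    ]


for cmd in pilon_commands("contigs.fasta", "reads.fastq"):
    print(cmd)
```

With these placeholder names, Pilon would write the corrected sequence to `polished.fasta`.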
10-07-2016, 05:03 AM   #5
JVGen

Quote:
Originally Posted by bastianwur
Yeah, that can definitely happen during assembly; we've seen it in some comparative genomics tests.
But as suggested, use Pilon for error correction afterwards. It needs the reads mapped to the assembly; it then checks whether the majority of the reads agree with the assembly and corrects it where they don't.
The tool is relatively easy to use if you're familiar with the command line and know what BAM and FASTA files are.

Thanks Bastian. Which de novo assemblers do you use? Many do not appear to save the reads; the output is a contig consensus sequence (Tadpole & Velvet, for instance). The only one I've used that saves the aligned reads with the assembled contig is the assembler within Geneious. I'd be interested to hear what you use.

Thanks!
All times are GMT -8.

