SEQanswers

05-31-2011, 08:54 AM   #1
NGS_user
Junior Member
 
Location: Europe

Join Date: Nov 2010
Posts: 9
Large K-mer Velvet

Hi Folks,
I am using Velvet to assemble a number of genes where the reads are 75 bp long. An issue I am having is that some of these genes are the result of duplications, where the parent and duplicate gene are very similar. Am I right in thinking that a high k-mer length will reduce the chances of an assembly error (smaller k-mers being merged into one contig despite coming from reads generated from the duplicates)? I realize sequencing errors may be unavoidable; hopefully good coverage will help mitigate them. If longer k-mers are better for duplicates, would it be better to generate longer reads?
05-31-2011, 10:28 AM   #2
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260

Quote:
Originally Posted by NGS_user
Hi Folks,
I am using Velvet to assemble a number of genes where the reads are 75 bp long.
Are the reads paired?

Quote:
Originally Posted by NGS_user

An issue I am having is that some of these genes are the result of duplications, where the parent and duplicate gene are very similar. Am I right in thinking that a high k-mer length will reduce the chances of an assembly error (smaller k-mers being merged into one contig despite coming from reads generated from the duplicates)?
This will certainly matter for assemblers that use bubble-merging or bubble-popping approaches, such as Velvet or ABySS.

In general, increasing the k-mer length increases the uniqueness of k-mers in the resulting graph.

Two things limit the use of a very large k-mer length: the first is obviously the read length, and the second is the error rate.
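To make the uniqueness point concrete, here is a toy Python sketch (illustrative only; the random sequences and the ~98%-identity figure are made up, and this is not Velvet or ABySS code). It concatenates a random "gene" with a near-identical duplicate and shows that the fraction of k-mers occurring exactly once grows with k:

```python
# Toy demonstration: larger k => more k-mers are unique, which helps
# separate a gene from a near-identical duplicated copy.
from collections import Counter
import random

random.seed(1)
gene = "".join(random.choice("ACGT") for _ in range(1000))
# Duplicate copy with ~2% of positions resampled (so roughly 98% identity).
duplicate = "".join(
    random.choice("ACGT") if random.random() < 0.02 else base for base in gene
)
sequence = gene + duplicate

def unique_fraction(seq, k):
    """Fraction of distinct k-mers that occur exactly once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sum(1 for c in counts.values() if c == 1) / len(counts)

for k in (15, 31, 63):
    print(k, round(unique_fraction(sequence, k), 3))
```

K-mers falling entirely inside stretches shared by both copies occur twice and are ambiguous; only k-mers spanning a difference are unique, and a longer k-mer is more likely to span one.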


Quote:
Originally Posted by NGS_user
I realize sequencing errors may be unavoidable; hopefully good coverage will help mitigate them.
If sequencing errors occur randomly, they won't stack up at the same position and can therefore be weeded out to some extent. Different assemblers do this in different ways.

For example, in Ray (see http://denovoassembler.sf.net; I am the author), these errors are simply avoided, but are not removed from the graph.
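To illustrate why random errors that don't stack can be filtered by coverage, here is a toy Python sketch (an illustration with made-up numbers, not the actual algorithm of Velvet or Ray): at good coverage, true k-mers are observed many times while k-mers containing a random error stay rare, so a simple count threshold separates most of them:

```python
# Toy demonstration: at ~60x coverage, a k-mer count threshold keeps
# almost all true k-mers while discarding almost all error k-mers.
from collections import Counter
import random

random.seed(2)
genome = "".join(random.choice("ACGT") for _ in range(500))
read_length, n_reads, k = 75, 400, 21

def read_with_errors(start, error_rate=0.01):
    """A read copied from the genome with random substitutions."""
    return "".join(
        random.choice("ACGT") if random.random() < error_rate else base
        for base in genome[start:start + read_length]
    )

reads = [
    read_with_errors(random.randrange(len(genome) - read_length + 1))
    for _ in range(n_reads)
]

counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
true_kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}

kept = {kmer for kmer, c in counts.items() if c >= 3}  # coverage threshold
true_kept = len(kept & true_kmers)
errors_kept = len(kept - true_kmers)
print(true_kept, "of", len(true_kmers), "true k-mers kept;",
      errors_kept, "error k-mers kept")
```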


Quote:
Originally Posted by NGS_user
If longer k-mers are better for duplicates, would it be better to generate longer reads?
Longer reads are always better if the throughput scales as well.

This is one of the goals that Pacific Biosciences aims to achieve -- longer reads.


Maybe you can try Ray on your dataset. Ray does not merge similar paths in the assembly process, so that might help.


seb

06-01-2011, 01:51 AM   #3
NGS_user
Junior Member
 
Location: Europe

Join Date: Nov 2010
Posts: 9

The reads are single-end, but if I were to generate new data I could have paired-end reads of either 100 or 150 bp (GAII). I am just concerned that the high error rate will affect my assemblies, as I am not assembling a genome but rather a family of mammalian genes.
06-10-2011, 08:59 PM   #4
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260

Quote:
Originally Posted by NGS_user
The reads are single-end, but if I were to generate new data I could have paired-end reads of either 100 or 150 bp (GAII). I am just concerned that the high error rate will affect my assemblies, as I am not assembling a genome but rather a family of mammalian genes.
Perhaps you could first perform simulations on those genes (if they are known), or on closely related or similar genes.

You can do that with Ray right away.

First, you need these packages (available in all GNU/Linux distros):

make
g++
open-mpi
git (to get the development version of Ray)
boost (to compile the read simulator shipped with Ray)


What follows is the workflow you could use.

Install Ray and VirtualNextGenSequencer

Code:
git clone https://github.com/sebhtml/ray.git
cd ray
make PREFIX=build MAXKMERLENGTH=128 VIRTUAL_SEQUENCER=y
make install

Sequence your genes in silico


Code:
N=600000 #number of pairs of reads
readLength=75
errorRate=0.005 # 0.5%
ref=~/nuccore/genes.fasta
mean=400 # average insert size
sd=40 # standard deviation

./build/VirtualNextGenSequencer $ref $errorRate \
$mean $sd $N $readLength L1_1.fasta L1_2.fasta
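For intuition about the mean/sd/readLength parameters, here is a rough Python sketch of what a paired-read simulator does (an illustration only; the variable names mirror the shell script above, and this is not the actual VirtualNextGenSequencer code):

```python
# Toy paired-read simulator: draw an insert size from a normal
# distribution, cut a fragment from the reference, and report the two
# fragment ends as a read pair (the right mate reverse-complemented).
import random

random.seed(3)
ref = "".join(random.choice("ACGT") for _ in range(5000))
read_length, mean, sd, error_rate = 75, 400, 40, 0.005
complement = str.maketrans("ACGT", "TGCA")

def with_errors(seq):
    """Apply random substitutions at the given per-base error rate."""
    return "".join(
        random.choice("ACGT") if random.random() < error_rate else base
        for base in seq
    )

def simulate_pair():
    insert = max(2 * read_length, int(random.gauss(mean, sd)))
    start = random.randrange(len(ref) - insert + 1)
    fragment = ref[start:start + insert]
    left = with_errors(fragment[:read_length])
    right = with_errors(fragment[-read_length:].translate(complement)[::-1])
    return left, right

pairs = [simulate_pair() for _ in range(3)]
for left, right in pairs:
    print(left[:30], right[:30])
```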
Build an assembly
Code:
mpirun -np 64 ./build/Ray -k 70 -p L1_1.fasta L1_2.fasta \
 -o GeneBuild