SEQanswers

Old 04-05-2014, 08:33 AM   #1
genetics_jo
Member
 
Location: Corvallis, OR

Join Date: Feb 2014
Posts: 11
How long does velvetg take for large genomes?

Hi All,
I have a single lane of Illumina HiSeq 2000 101-bp paired-end reads from a single genome (hop; estimated genome size 2.8 Gb), along with two RNA-Seq experiments (same conditions as the genome sequencing), that I'm attempting to assemble using Velvet (in advance, please don't criticize the use of Velvet; I will be using other assembly packages in the future). All three "experiments" were done on different genotypes. I've run just the genome sequence data, with all ambiguities removed, as "single" reads and have successfully completed runs with Velvet. It took approximately 35 hours to complete the assembly, but of course the assembly using the above settings was not great (only 1/3 of the genome covered, with N50 = 270). I've also run the RNA-Seq experiments as "single" reads and have seen velvetg run to completion in a similar amount of time.

I have now processed all reads (removed all ambiguities and trimmed), separated out the orphaned reads resulting from paired-end read processing, and combined these orphaned reads into a single fastq.gz file. I submitted all the processed paired-end read files (as "shortPaired" reads) along with the orphaned read files (as "short" reads) from the genome sequencing and the two RNA-Seq experiments back on Tuesday (April 2nd) on a machine with 1000 GB of RAM, and velvetg has been running ever since (96 hours). The "top" command on the UNIX machine showed significant changes in the amount of RAM used through the early parts of the assembly, but the last 2 days have shown a consistent use of 640 GB of RAM with one processor running at 100%.

My question is this: is this length of time normal for velvetg when assembling such a large dataset, or has velvetg entered an endless loop that will never run to completion?

Last edited by genetics_jo; 04-05-2014 at 08:36 AM.
Old 04-05-2014, 11:23 AM   #2
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

I think if you re-compile velvet with 'OPENMP=1' it will be able to use more than 1 processor.
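
Editor's sketch of what that rebuild might look like. The `OPENMP=1` make option comes from the post above; the `OMP_NUM_THREADS` and `OMP_THREAD_LIMIT` variables are the standard OpenMP controls that, as far as I recall, the Velvet manual also points to. The source path, the `make clean` step, and the value 8 are assumptions for illustration only.

```shell
# Sketch: rebuild Velvet with OpenMP enabled, then set the thread count.
build_velvet_openmp() {
    # $1: path to the Velvet source tree (hypothetical location).
    # Running "make clean" first is an assumption, to force a full rebuild.
    ( cd "$1" && make clean && make 'OPENMP=1' )
}

# 8 threads is an arbitrary example value; match it to the cores you reserved.
export OMP_NUM_THREADS=8
export OMP_THREAD_LIMIT=8

# Usage (not executed here):
#   build_velvet_openmp "$HOME/src/velvet"
```

Even with an OpenMP build, only some stages are likely to run multithreaded, so a long single-core stretch does not by itself mean the build is wrong.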
Old 04-05-2014, 02:10 PM   #3
genetics_jo
Member
 
Location: Corvallis, OR

Join Date: Feb 2014
Posts: 11

Quote:
Originally Posted by mastal
I think if you re-compile velvet with 'OPENMP=1' it will be able to use more than 1 processor.
It's already compiled to use multiple cores and did so for the first two days of velvetg; it also used up to 880 GB of RAM during that time. Now it's only running on one core and using 660 GB of RAM. Someone said it's "trimming" now and thus not showing many changes in RAM or core use?
Old 04-05-2014, 02:16 PM   #4
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

Where is the velvet stderr output going?
Can you tell from that whether it's making any progress or just stuck?
Old 04-05-2014, 04:48 PM   #5
genetics_jo
Member
 
Location: Corvallis, OR

Join Date: Feb 2014
Posts: 11

Quote:
Originally Posted by mastal
Where is the velvet stderr output going?
Can you tell from that whether it's making any progress or just stuck?
Unfortunately, I submitted the job as an SGE_Batch script and cannot see the output to determine if it's stuck. In the past when things have gone awry, velvet has simply shut down.

Biggest thing I need to know is if this length of time (>4 days) is normal for a large genome assembly?
Old 04-06-2014, 04:52 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800

Quote:
Originally Posted by genetics_jo
Unfortunately, I submitted the job as an SGE_Batch script and cannot see the output to determine if it's stuck. In the past when things have gone awry, velvet has simply shut down.

Biggest thing I need to know is if this length of time (>4 days) is normal for a large genome assembly?
I am more familiar with LSF (which has the "bpeek" command to look at the output of ongoing jobs). I do not think there is an analogous command (besides qstat -f -u username) available in SGE.
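
Editor's sketch: on a stock SGE setup, a batch job's stdout and stderr are normally written to files named `<job_name>.o<job_id>` and `<job_name>.e<job_id>` unless `-o`/`-e` were given, so they can often be read while the job is still running. The helper name, job name, job ID, and directory below are hypothetical, and a site wrapper such as SGE_Batch may place the files somewhere else.

```shell
# Sketch: print the default SGE stdout/stderr file names for a job.
sge_default_logs() {
    # $1: job name, $2: job ID, $3: directory the job wrote to (default: .)
    job_name=$1
    job_id=$2
    dir=${3:-.}
    printf '%s/%s.o%s\n' "$dir" "$job_name" "$job_id"   # stdout file
    printf '%s/%s.e%s\n' "$dir" "$job_name" "$job_id"   # stderr file
}

# Usage (not executed here; names are placeholders):
#   tail -n 20 $(sge_default_logs velvet_hop 123456 "$HOME")
#   qstat -j 123456    # job details, including any explicit output paths
```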

You can check the specific nodes where your job is running to see if velvet related processes are actively running.
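
Editor's sketch of one way to do that check once logged in to the compute node. The function name is hypothetical; the `stat`, `etime`, `time`, and `rss` columns are the ones commonly supported by `ps` on Linux.

```shell
# Sketch: show state, elapsed time, CPU time and resident memory for a PID.
proc_status() {
    # $1: process ID of velvetg on the compute node
    ps -o pid= -o stat= -o etime= -o time= -o rss= -p "$1"
}

# Usage (not executed here): sample twice, some minutes apart.
#   proc_status 12345; sleep 600; proc_status 12345
```

If the state is R and the CPU time column keeps growing between samples, the process is at least computing. That cannot distinguish slow progress from an endless loop, but it does rule out a suspended or blocked job.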

It is difficult to define "normal" since all clusters have different hardware and queue limits, and your dataset is unique to you. If the job is "running" (i.e. not suspended or otherwise stopped), then the best thing to do is wait.

Last edited by GenoMax; 04-06-2014 at 05:11 AM.
Old 04-06-2014, 09:13 AM   #7
genetics_jo
Member
 
Location: Corvallis, OR

Join Date: Feb 2014
Posts: 11

Thanks Genomax! I'll hold my finger off the trigger (qdel) as long as possible. My one-week reservation of the cluster runs out on Tuesday so hopefully things will resolve by then.
Old 04-06-2014, 03:54 PM   #8
AdrianP
Senior Member
 
Location: Ottawa

Join Date: Apr 2011
Posts: 130

There may be a way to check on progress without the program output. Try doing
Code:
ls -al
every hour or so to see which files are growing and whether new files are being added. This will tell you what work is being done.
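
Editor's sketch of how the comparison could be made less manual: record file sizes in the output directory, wait, record them again, and diff the two snapshots. The function name and the directory name are hypothetical.

```shell
# Sketch: print "size<TAB>name" for every regular file in a directory.
snapshot_dir() {
    # $1: the velvet output directory to inspect
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "${f##*/}"
    done
}

# Usage (not executed here; "hop_assembly" is a placeholder directory):
#   snapshot_dir hop_assembly > snap1.txt
#   sleep 3600
#   snapshot_dir hop_assembly > snap2.txt
#   diff snap1.txt snap2.txt    # no output means nothing changed on disk
```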
Old 04-06-2014, 04:19 PM   #9
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

That's a good idea, but in my experience, velvetg creates most of the output files just before the run finishes.
Old 04-06-2014, 07:38 PM   #10
genetics_jo
Member
 
Location: Corvallis, OR

Join Date: Feb 2014
Posts: 11

Quote:
Originally Posted by mastal
That's a good idea, but in my experience, velvetg creates most of the output files just before the run finishes.
That's also what I've observed with the previous runs of velvet. The program is still "running" but RAM use and % of processor have remained the same now for several days. Would have thought if it wasn't going to work it would have crashed?
Old 04-07-2014, 02:07 AM   #11
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

I agree. I think if it's got through the part where it uses lots of memory and lots of processors without crashing it should be OK. Just keep your fingers crossed that it finishes before the time you have booked on the cluster runs out. I haven't assembled large genomes, so have no idea how long it should take, but I think it is not unusual for some assemblers to run for many days.
Old 04-10-2014, 05:43 PM   #12
genetics_jo
Member
 
Location: Corvallis, OR

Join Date: Feb 2014
Posts: 11

One other question...I've seen some folks say the paired end fastq files need to be merged together into a single file for "shortPaired" use in Velvet...and seen some say that the two paired end files need to be kept separate and let velvet read and coordinate reads. Which one is it? For example if I have files Humulus_lane1_read1_1.fastq and Humulus_lane1_read1_2.fastq, should these two files be merged together or kept separately for velvet to work properly?
Old 04-10-2014, 11:11 PM   #13
paa6
Member
 
Location: south korea

Join Date: Feb 2014
Posts: 68

Quote:
Originally Posted by genetics_jo
Unfortunately, I submitted the job as an SGE_Batch script and cannot see the output to determine if it's stuck. In the past when things have gone awry, velvet has simply shut down.

Biggest thing I need to know is if this length of time (>4 days) is normal for a large genome assembly?
I have also recently used Velvet for Illumina reads, but it took only a few seconds to generate an assembly. It's bacterial sequencing and a small genome, of course! But after seeing your post I am doubting my assembly time. Please suggest something!
Old 04-11-2014, 02:45 AM   #14
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

@genetics_jo
In recent versions of Velvet you can use either method, but you need to use the right parameters.

If you leave the reads in separate files, you should add the flag -separate,
so you would have

Code:
velveth .....   -fastq -shortPaired -separate read1.fastq  read2.fastq
If you don't use the '-separate' flag, then you need to produce a file where the reads are interleaved, using one of the shuffleSequences scripts that are in the contrib subdirectory in velvet.

Code:
velveth  ..... -fastq -shortPaired read1read2_shuffled.fastq
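
Editor's sketch of what interleaving means in practice. This is a stand-in written for illustration, not the shuffleSequences script from the contrib directory, and it assumes plain four-line FASTQ records with both files listing mates in the same order. The function name is hypothetical.

```shell
# Sketch: interleave two FASTQ files record by record (4 lines per record).
interleave_fastq() {
    # $1: read 1 file, $2: read 2 file; output goes to stdout
    awk -v mate="$2" '{
        print                                   # header line of the read 1 record
        for (i = 0; i < 3; i++)                 # sequence, "+", quality
            if ((getline) > 0) print
        for (i = 0; i < 4; i++)                 # matching record from read 2
            if ((getline line < mate) > 0) print line
    }' "$1"
}

# Usage (not executed here):
#   interleave_fastq read1.fastq read2.fastq > read1read2_shuffled.fastq
```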
By the way, did your run finish before the allotted time ran out?
Old 04-11-2014, 02:50 AM   #15
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

@paa6
If you're using a very powerful computer, and have only a relatively small number of reads for a small genome, velvet will run very quickly.

As long as it produced the right output files, it should be OK.
Old 04-11-2014, 04:33 PM   #16
samanta
Senior Member
 
Location: Seattle

Join Date: Feb 2010
Posts: 109

Quote:
Originally Posted by genetics_jo
My question is this, is this length of time normal for velvetg for assembling such a large dataset or has velvetg just run into a continuous loop and it will never run to completion?
I used to work on velvet several years back, and discussed the code/parameters in our blog many times in those days.

e.g.

http://www.homolog.us/blogs/blog/201...meter-exp_cov/

http://www.homolog.us/blogs/blog/201...-roadmap-file/

http://www.homolog.us/blogs/blog/201...ders-question/

The problem with Velvet (especially velvetg) is that it is not at all optimized for large genomes, and the time to get to output can be unpredictable. Moreover, you can trust its contig step, but not its scaffolding. However, contigs like those produced by Velvet can easily be generated by SOAPdenovo2 or Minia in much less time. For example, with the hardware you are describing, SOAPdenovo2 will give you the output in hours, not days.

I know this is not the answer you asked for, and you already mentioned using other assemblers.
__________________
http://homolog.us

Last edited by samanta; 04-11-2014 at 04:38 PM.
Old 04-11-2014, 04:35 PM   #17
samanta
Senior Member
 
Location: Seattle

Join Date: Feb 2010
Posts: 109

Quote:
Originally Posted by genetics_jo
One other question...I've seen some folks say the paired end fastq files need to be merged together into a single file for "shortPaired" use in Velvet...and seen some say that the two paired end files need to be kept separate and let velvet read and coordinate reads. Which one is it? For example if I have files Humulus_lane1_read1_1.fastq and Humulus_lane1_read1_2.fastq, should these two files be merged together or kept separately for velvet to work properly?
In the version I worked on two years back, while trying to assemble a large genome (~600 Mb in size), the paired reads needed to be merged into one file.

FASTA Line 1-2 (read1 left)
FASTA Line 3-4 (read1 right)

etc.
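
Editor's sketch of a quick sanity check for that layout. It assumes two-line FASTA records and mate names that differ only by a trailing /1 or /2, which may not hold for every read-naming scheme. The function name is hypothetical.

```shell
# Sketch: verify that a FASTA file alternates left and right mates.
check_interleaved_fasta() {
    # $1: interleaved FASTA file; exit status 0 if it looks properly paired
    awk '
        /^>/ {
            n++
            name = $1
            sub(/\/[12]$/, "", name)          # drop the /1 or /2 mate suffix
            if (n % 2 == 1) left = name       # odd records: remember left mate
            else if (name != left) bad++      # even records: must match it
        }
        END {
            if (n % 2 != 0 || bad > 0) { print "not properly interleaved"; exit 1 }
            print n / 2 " pairs look interleaved"
        }' "$1"
}

# Usage (not executed here):
#   check_interleaved_fasta read1read2_shuffled.fa
```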

The difficulty I faced was that the scaffolds were completely unpredictable based on small changes in input parameters (exp_cov). It is not as if you run everything once, press a button and trust the output.

That led me to move on to other assemblers. Also, I make sure I understand the code/algorithm of any assembler I use.

Last edited by samanta; 04-11-2014 at 04:41 PM.
Old 04-11-2014, 05:30 PM   #18
samanta
Senior Member
 
Location: Seattle

Join Date: Feb 2010
Posts: 109

Quote:
Originally Posted by genetics_jo
That's also what I've observed with the previous runs of velvet. The program is still "running" but RAM use and % of processor have remained the same now for several days. Would have thought if it wasn't going to work it would have crashed?

A large part of the work is in removing 'tips' and 'bubbles' and then simplifying the graph. That is when you do not see any input/output.
Tags
assembly, large genome, velvet

All times are GMT -8.