Old 05-12-2015, 12:37 AM   #1
Holinder
Junior Member
 
Location: Estonia

Join Date: Dec 2014
Posts: 6
Assembler for de novo metagenomes

I have two environmental samples that were sequenced on an Illumina MiSeq (2x250). After adapter removal and quality trimming, the first sample has about 200 Mbp (4M reads) and the second about 400 Mbp (12M reads). Both have very low coverage and a low duplication rate.
I have tried several assemblers (Velvet, IDBA-UD, MEGAHIT, MIRA, Ray Meta) to compare which one does the best job, but I assume that because coverage is so low and the communities are so diverse, none of the assemblers can handle these samples. Velvet and MEGAHIT produced low N50 values, and Ray Meta and MIRA failed altogether. IDBA-UD gave the best results for the first sample, but crashed on the second.
What other assemblers would you suggest I try? Or, if any of you have used IDBA-UD on a sample with a large number of reads, what options should I change to make it work?

Thanks!
Old 05-12-2015, 02:50 AM   #2
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372

Did you compile idba_ud for a larger max k-mer value (I would have)? Based on your message, we can only guess why it's crashing. I'm assuming you have processed your PE reads into an interleaved FASTA with the provided script, as described in the README. 12M reads isn't actually a lot, but you could be running out of memory anyway. How much RAM do you have available for the job? What error message did you get?
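
For reference, the interleaving is done with the fq2fa utility that ships with IDBA. A minimal sketch, with placeholder file names (the optional --filter flag drops reads containing N):

fq2fa --merge --filter read_1.fq read_2.fq reads.fa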
Old 05-12-2015, 03:17 AM   #3
Holinder
Junior Member
 
Location: Estonia

Join Date: Dec 2014
Posts: 6

I compiled idba_ud for a k-mer size of 600, and I used the provided script to make the interleaved FASTA file. We have a computer cluster at our university with 20 cores and 64 GB RAM, where I ran the job. If that is still not enough, I can run it on a server with 2 TB RAM, but I doubt it needs that much. It ended with a segfault.
Old 05-12-2015, 03:21 AM   #4
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372

Quote:
Originally Posted by Holinder
I compiled idba_ud for a k-mer size of 600, and I used the provided script to make the interleaved FASTA file. We have a computer cluster at our university with 20 cores and 64 GB RAM, where I ran the job. If that is still not enough, I can run it on a server with 2 TB RAM, but I doubt it needs that much. It ended with a segfault.
Your optimal max k-mer setting should be about your trimmed read length, so something like 250, not 600. As for memory: in the context of assembly, 64 GB is very little RAM, and that is the most likely reason for the segfault.
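
To make a max k near 250 usable at all, IDBA-UD usually has to be rebuilt, since stock builds cap the supported read length at a compile-time constant. A hedged sketch follows; the constant's name (kMaxShortSequence), its default (128), and its location are from source trees I have seen, so verify them in your own copy:

# raise the read-length cap in the IDBA-UD source, then rebuild
sed -i 's/kMaxShortSequence = 128/kMaxShortSequence = 512/' src/sequence/short_sequence.h
./configure && make
# then a max k near the trimmed read length becomes usable
bin/idba_ud -r reads.fa --mink 40 --maxk 240 --step 20 --num_threads 20 -o idba_out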
Old 05-12-2015, 07:06 PM   #5
student-t
Member
 
Location: Garvan Institute

Join Date: Mar 2015
Posts: 16

What k-mer size did you specify for Velvet?
Old 05-12-2015, 07:52 PM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,696

Quote:
Originally Posted by Holinder
I compiled idba_ud for a k-mer size of 600, and I used the provided script to make the interleaved FASTA file. We have a computer cluster at our university with 20 cores and 64 GB RAM, where I ran the job. If that is still not enough, I can run it on a server with 2 TB RAM, but I doubt it needs that much. It ended with a segfault.
Particularly when you have low coverage, you need to use shorter kmers: the longer the kmer, the lower your kmer coverage for a given read length, since the number of kmers you get from a read of length L is (L-K+1). Also, Velvet has some strange behavior at higher kmer lengths; I don't recommend going above K=95 (compiling it for max K=96).

You can often increase the number of long kmers you get from reads by merging them first (depending on the value of K, the insert size, and the read length). 2x250bp MiSeq libraries should merge very well; I suggest merging prior to assembly, then assembling both the merged reads and the unmerged pairs. If you merge, I recommend adapter-trimming first but NOT quality-trimming. Don't bother with this approach if you get a low merge rate, say under 30%.
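
With BBMap's companion tool BBMerge, that step might look like the sketch below (file names are placeholders; bbmerge.sh ships in the same BBMap package as the tools listed further down):

bbmerge.sh in1=read1.fq in2=read2.fq out=merged.fq outu1=unmerged1.fq outu2=unmerged2.fq

Both merged.fq and the unmerged pair would then be given to the assembler together.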

Ultimately, metagenomes benefit from high coverage. If you have a complex metagenome, a couple of MiSeq runs are absolutely insufficient; we often get terrible metagenome assemblies (L50 under 1 kb) from even multiple HiSeq lanes totaling over 1 billion reads. Still, you may be able to pull out the top 1 or 2 organisms if strain diversity is sufficiently low.

Plotting a kmer-frequency histogram, or a read-uniqueness histogram, is often useful for determining whether you have sequenced deeply enough. You can do both with BBMap:

kmercountexact.sh in1=read1.fq in2=read2.fq khist=khist.txt

bbcountunique.sh in1=read1.fq in2=read2.fq out=uhist.txt


Lastly: JGI previously used SOAPdenovo, but we now use MEGAHIT for metagenomes, as it seems to do the best job with the lowest resource utilization.
Old 06-18-2015, 07:42 AM   #7
vout
Junior Member
 
Location: Hong Kong

Join Date: Jun 2015
Posts: 4

Quote:
Originally Posted by Holinder
I have two environmental samples that were sequenced on an Illumina MiSeq (2x250). After adapter removal and quality trimming, the first sample has about 200 Mbp (4M reads) and the second about 400 Mbp (12M reads). Both have very low coverage and a low duplication rate.
I have tried several assemblers (Velvet, IDBA-UD, MEGAHIT, MIRA, Ray Meta) to compare which one does the best job, but I assume that because coverage is so low and the communities are so diverse, none of the assemblers can handle these samples. Velvet and MEGAHIT produced low N50 values, and Ray Meta and MIRA failed altogether. IDBA-UD gave the best results for the first sample, but crashed on the second.
What other assemblers would you suggest I try? Or, if any of you have used IDBA-UD on a sample with a large number of reads, what options should I change to make it work?

Thanks!
Hi Holinder,

We've just released a new version of MEGAHIT (0.3.0-beta), which incorporates IDBA-UD's local assembly algorithm. It produces assemblies of similar quality to IDBA-UD's, but is much faster and more memory-efficient. You may want to try it out!
Old 06-18-2015, 08:09 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,630

@vout: It would be useful if you could add download links for MEGAHIT.
Old 06-18-2015, 04:51 PM   #9
vout
Junior Member
 
Location: Hong Kong

Join Date: Jun 2015
Posts: 4

@GenoMax @Holinder
The source code is available at https://github.com/voutcn/megahit
The latest release can be downloaded at https://github.com/voutcn/megahit/releases
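
A typical invocation, for anyone trying it out (flags as in current MEGAHIT releases, which may differ in older betas; file names are placeholders):

# paired-end input
megahit -1 reads_1.fq -2 reads_2.fq -o megahit_out
# or, for an interleaved file
megahit --12 reads_interleaved.fa -o megahit_out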
Old 11-23-2015, 08:26 AM   #10
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 60

Tested MEGAHIT on a few fecal samples. It worked great compared with Minia, judging by contig stats.

Beyond realigning reads to the contigs, what is the best way to evaluate the quality of a metagenome assembly?
Old 11-23-2015, 09:07 AM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,696

You can also try annotating the contigs and examining the gene-completeness statistics (QUAST will do this). Generally, the more long genes are called, the better the assembly.
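
A sketch of how that might be run with QUAST's metagenome mode; metaquast.py ships with QUAST, the contig file name follows MEGAHIT's convention, and the gene-finding flag is from memory, so treat the details as assumptions:

metaquast.py final.contigs.fa --gene-finding -o quast_out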
Old 11-24-2015, 01:24 PM   #12
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 60

Thanks, Brian. I forgot I had QUAST installed on this workstation. Filtering through the output files now.
Old 04-19-2016, 07:31 AM   #13
marianalozada
Junior Member
 
Location: argentina

Join Date: Nov 2015
Posts: 2

Hello, I also have a question about memory requirements. I hope someone can guide me a bit, as I am new to assembly.
I have a dataset from complete sequencing of clones from a metagenomic library (expected coverage >25X). It consists of typical Illumina HiSeq reads: 330,000,000 paired-end reads (160 M in each fastq file). I am trying to assemble it de novo with Ray Meta on a shared cluster where the maximum amount of RAM I can access is 512 GB (8 nodes of 64 GB each). The maximum time allowed for a single job is 4 days, so I am somewhat constrained in memory and time.
I am now trying Ray Meta with a k-mer size of 21, using the following script (the cluster uses SLURM; I am not pasting the whole config script, only the part that runs Ray itself):

srun -n 144 ./ray/build-kMAX=64-packed/Ray -k 21 -i Sample_01_forassembly.fa.gz -o RayOutput_Sample01_k21

where -n 144 is the total number of processes (8 nodes x 18 processes per node).

Do you think a k-mer size of 21 will be enough to finish under these conditions? Is there a way to know or estimate how much memory and time it will consume? I know that Ray and other k-mer-based methods are memory-hungry. The cluster uses MPI and SLURM. If this is not feasible given the constraints, is there another program better suited to the job?
Mariana
Old 04-29-2016, 01:34 AM   #14
Markiyan
Member
 
Location: Cambridge

Join Date: Sep 2010
Posts: 99

Quote:
Originally Posted by marianalozada
Hello, I also have a question about memory requirements. I hope someone can guide me a bit, as I am new to assembly.
I have a dataset from complete sequencing of clones from a metagenomic library (expected coverage >25X). It consists of typical Illumina HiSeq reads: 330,000,000 paired-end reads (160 M in each fastq file). I am trying to assemble it de novo with Ray Meta on a shared cluster where the maximum amount of RAM I can access is 512 GB (8 nodes of 64 GB each). The maximum time allowed for a single job is 4 days, so I am somewhat constrained in memory and time.
I am now trying Ray Meta with a k-mer size of 21, using the following script (the cluster uses SLURM; I am not pasting the whole config script, only the part that runs Ray itself):

srun -n 144 ./ray/build-kMAX=64-packed/Ray -k 21 -i Sample_01_forassembly.fa.gz -o RayOutput_Sample01_k21

where -n 144 is the total number of processes (8 nodes x 18 processes per node).

Do you think a k-mer size of 21 will be enough to finish under these conditions? Is there a way to know or estimate how much memory and time it will consume? I know that Ray and other k-mer-based methods are memory-hungry. The cluster uses MPI and SLURM. If this is not feasible given the constraints, is there another program better suited to the job?
Mariana
First, get the environment right:
1. I hope it was a HiSeq run in 2x250 read mode; otherwise, a 2x300 or 2x250 MiSeq run would be a good addition to the dataset.
2. Run the reads through FLASH or PANDAseq to merge pairs (see the earlier posts in this thread).
3. Check the FastQC stats and assemble a subset first (100K, then 1M, then 10M reads) and inspect the results; any high-coverage contigs go into the vector/screening file. Remove the reads matching them, then repeat with 10x more reads (see the subsampling sketch at the end of this post).
4. Then try the whole dataset, without the high-copy-number bits.
5. Do not forget to add the "combined repeats (vector.seq)" sequences back to the final assembly results BEFORE annotation.

PS: I would try kmer=31 or 32 as a starting point.

This will give you a rough idea of the resources involved and of how time and RAM scale.
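
For the subsetting in step 3, one option among many is seqtk; a minimal sketch, using the same seed on both files so pairing is preserved:

# sample 100K read pairs reproducibly (-s sets the random seed)
seqtk sample -s100 reads_1.fq 100000 > sub_1.fq
seqtk sample -s100 reads_2.fq 100000 > sub_2.fq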
Old 04-29-2016, 05:25 AM   #15
marianalozada
Junior Member
 
Location: argentina

Join Date: Nov 2015
Posts: 2

Thank you, Markiyan.
I had not posted it before, but in fact I had already preprocessed the sequences with PRINSEQ and eliminated vector and E. coli genomic contaminant sequences with Bowtie2. When I ran the program with k=21, it actually finished in a short time (elapsed time: 1 day, 8 hours, 49 minutes, 59 seconds). Now I am going to see if I can do better with longer k-mers, as I know assembly quality depends on k-mer size, though it also takes more memory. I have found the "write checkpoints" option, which will allow me to resume if the process is cut off after 4 days by the cluster restrictions.
Thanks a lot for your help!
Mariana
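
For later readers, the checkpointing mentioned above would look roughly like this; the flag spellings (-write-checkpoints / -read-checkpoints) are from my reading of the Ray manual, so verify them against your version:

srun -n 144 ./ray/build-kMAX=64-packed/Ray -k 31 -i Sample_01_forassembly.fa.gz -o RayOutput_Sample01_k31 -write-checkpoints Checkpoints_k31

A killed job could then be restarted with the same command, swapping -write-checkpoints for -read-checkpoints Checkpoints_k31 and using a fresh -o directory.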