  • Assembler for de novo metagenomes

    I have two environmental samples that were sequenced on an Illumina MiSeq (2x250). The first sample has about 200 Mbp (4M reads) and the second about 400 Mbp (12M reads) after adapter removal and quality trimming. Both have very low coverage and a low duplication rate.
    I have tried several assemblers (Velvet, IDBA-UD, MEGAHIT, MIRA, Ray Meta) to compare which one does the best job, but I assume that because coverage is so low and the samples are very diverse, none of the assemblers can handle them. Velvet and MEGAHIT gave low N50 values, and Ray Meta and MIRA failed altogether. IDBA-UD gave the best results for the first sample but crashed on the second one.
    What other assemblers can you suggest I try? Or, if any of you have used IDBA-UD on a sample with a large number of reads, can you tell me which options I should change to make it work?

    Thanks!

  • #2
    Did you compile idba_ud for a larger max k-mer value (I would have)? Based on your message, we can only guess why it's crashing. I'm assuming you have processed your PE reads into an interleaved FASTA file with the provided script (as described in the README). 12M reads isn't actually a lot, but it could be that you're running out of memory anyway. How much RAM do you have available for the job? What kind of error message did you get?
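    For reference, that conversion is typically done with the fq2fa utility bundled with IDBA-UD, along these lines (file names here are just placeholders; check fq2fa's usage message for your version):

    fq2fa --merge --filter sample_R1.fq sample_R2.fq sample_interleaved.fa

    As far as I recall, --filter drops reads containing Ns, which idba_ud cannot use.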
    savetherhino.org



    • #3
      I compiled idba_ud with a maximum k-mer size of 600, and I used the provided script to make the interleaved FASTA file. We have a computer cluster at our university with 20 cores and 64 GB RAM, where I ran the job. If that is still not enough, I can run it on a server with 2 TB RAM, but I doubt it needs that much. It ended with a segfault.



      • #4
        Originally posted by Holinder:
        I compiled idba_ud with a maximum k-mer size of 600, and I used the provided script to make the interleaved FASTA file. We have a computer cluster at our university with 20 cores and 64 GB RAM, where I ran the job. If that is still not enough, I can run it on a server with 2 TB RAM, but I doubt it needs that much. It ended with a segfault.
        Your optimal max k-mer setting should be about your trimmed read length, so something like 250, not 600. As for memory: in the context of assembly, 64 GB is very little RAM, and running out of it is the most likely reason for the segfault.
        savetherhino.org



        • #5
          What k-mer size did you specify for Velvet?



          • #6
            Originally posted by Holinder:
            I compiled idba_ud with a maximum k-mer size of 600, and I used the provided script to make the interleaved FASTA file. We have a computer cluster at our university with 20 cores and 64 GB RAM, where I ran the job. If that is still not enough, I can run it on a server with 2 TB RAM, but I doubt it needs that much. It ended with a segfault.
            Particularly when you have low coverage, you need to use shorter kmers; the longer the kmer, the lower your kmer coverage for a given read length. The number of kmers you get from a read of length L is (L-K+1). Also, Velvet has some strange behavior at higher kmer lengths; I don't particularly recommend going above K=95 (compiling it for max K=96).
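            To put rough numbers on that (my own back-of-the-envelope illustration, not from the post above): with 250 bp reads, kmer coverage is read coverage multiplied by (L-K+1)/L, which you can tabulate with a quick awk one-liner:

            awk 'BEGIN{L=250; C=10; for (K=31; K<=211; K+=60) printf "K=%d -> kmer coverage ~ %.1fx\n", K, C*(L-K+1)/L}'

            For 10x read coverage this gives about 8.8x kmer coverage at K=31 but only about 1.6x at K=211, which is why long kmers hurt when coverage is already marginal.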

            You can often increase the number of long kmers you get from reads by merging them first (depending on the value of K, the insert size, and the read length). 2x250bp MiSeq libraries should merge very well; I suggest that you try merging them prior to assembly, but assemble including both the merged reads and unmerged pairs. If you merge, I recommend adapter-trimming but NOT quality-trimming first. Don't bother with this approach if you get a low merge rate of under, say, 30%.
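            The merging itself can be done, for example, with BBMerge from the same BBMap package (file names are placeholders; see bbmerge.sh's built-in help for the current options):

            bbmerge.sh in1=read1.fq in2=read2.fq out=merged.fq outu1=unmerged1.fq outu2=unmerged2.fq

            You can then give the assembler both merged.fq and the leftover unmerged pairs, as suggested above.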

            Ultimately, metagenomes benefit from high coverage. If you have a complex metagenome, a couple of MiSeq runs are absolutely insufficient; we often get terrible metagenome assemblies (L50 under 1 kb) from even multiple HiSeq lanes (over 1 billion reads total). You may still be able to pull out the top one or two organisms, if strain diversity is sufficiently low.

            Plotting a kmer-frequency histogram, or a read-uniqueness histogram, is often useful to determine whether you have sequenced enough. You can do both with BBMap:

            kmercountexact.sh in1=read1.fq in2=read2.fq khist=khist.txt

            bbcountunique.sh in1=read1.fq in2=read2.fq out=uhist.txt


            Lastly, JGI previously used SOAP, but now we use MEGAHIT for metagenomes, as it seems to do the best job with the lowest resource usage.
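            If you want to try it, a typical MEGAHIT run on paired-end reads looks roughly like this (exact options depend on the MEGAHIT version; file names are placeholders):

            megahit -1 read1.fq -2 read2.fq -o megahit_out

            MEGAHIT iterates over a range of kmer sizes on its own, so you usually don't need to pick K yourself.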



            • #7
              Originally posted by Holinder:
              I have two environmental samples that were sequenced on an Illumina MiSeq (2x250). The first sample has about 200 Mbp (4M reads) and the second about 400 Mbp (12M reads) after adapter removal and quality trimming. Both have very low coverage and a low duplication rate.
              I have tried several assemblers (Velvet, IDBA-UD, MEGAHIT, MIRA, Ray Meta) to compare which one does the best job, but I assume that because coverage is so low and the samples are very diverse, none of the assemblers can handle them. Velvet and MEGAHIT gave low N50 values, and Ray Meta and MIRA failed altogether. IDBA-UD gave the best results for the first sample but crashed on the second one.
              What other assemblers can you suggest I try? Or, if any of you have used IDBA-UD on a sample with a large number of reads, can you tell me which options I should change to make it work?

              Thanks!
              Hi Holinder,

              Today we released a new version of MEGAHIT (0.3.0-beta), which introduces IDBA-UD's local assembly algorithm. It basically provides assemblies of a similarly high standard to IDBA-UD's, but is much faster and more memory-efficient. You may want to try it out!



              • #8
                @vout: It would be useful if you could add links for the MEGAHIT download.



                • #9
                  @GenoMax @Holinder
                  Its source code is available at https://github.com/voutcn/megahit
                  The latest release can be downloaded from https://github.com/voutcn/megahit/releases



                  • #10
                    Tested MEGAHIT on a few fecal samples. It worked great compared to Minia, based on contig stats.

                    Beyond realigning reads to the contigs, what is the best way to evaluate the quality of a metagenome assembly?



                    • #11
                      You can also try annotating the contigs and examining the gene-completeness statistics (QUAST will do this). Generally, the more long genes are called, the better the assembly.
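                      For reference, a minimal run with gene finding enabled might look something like this (flag names can differ between QUAST versions, and metaquast.py is the metagenome-aware variant; contigs.fa is a placeholder):

                      quast.py contigs.fa --gene-finding -o quast_out

                      The report then lists the number of predicted genes above various length thresholds, which is the statistic being referred to here.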



                      • #12
                        Thanks, Brian. I forgot I had QUAST installed on this workstation. Filtering through the output files now.



                        • #13
                          Hello, I also have a question about memory requirements; I hope someone can guide me a bit, as I am new to assembly.
                          I have a sequencing dataset corresponding to the complete sequencing of clones from a metagenomic library (expected coverage >25X). The dataset consists of typical Illumina HiSeq reads: 330,000,000 paired-end reads (about 160 M reads in each FASTQ file). I am trying to do a de novo assembly of this dataset with Ray Meta on a shared cluster where the maximum amount of RAM I can access is 512 GB (8 nodes of 64 GB each). The maximum time allowed for a job is 4 days, so I am somewhat constrained in memory and time.
                          I am now trying Ray Meta with a k-mer size of 21, using the following command (the cluster uses Slurm; I am not pasting the whole submission script, only the part that corresponds to the Ray process itself):

                          srun -n 144 ./ray/build-kMAX=64-packed/Ray -k 21 -i Sample_01_forassembly.fa.gz -o RayOutput_Sample01_k21

                          where -n 144 is the total number of MPI processes (8 nodes × 18 processes per node, the maximum available).

                          Do you think a k-mer size of 21 will be enough to finish under these conditions? Is there a way to know or estimate the amount of memory and time that will be consumed?
                          I know that Ray and other k-mer-based methods are memory-hungry. The cluster uses MPI and Slurm; if this is not feasible under these constraints, is there another program better suited to the task?
                          Mariana



                          • #14
                            Originally posted by marianalozada:
                            Hello, I also have a question about memory requirements; I hope someone can guide me a bit, as I am new to assembly.
                            I have a sequencing dataset corresponding to the complete sequencing of clones from a metagenomic library (expected coverage >25X). The dataset consists of typical Illumina HiSeq reads: 330,000,000 paired-end reads (about 160 M reads in each FASTQ file). I am trying to do a de novo assembly of this dataset with Ray Meta on a shared cluster where the maximum amount of RAM I can access is 512 GB (8 nodes of 64 GB each). The maximum time allowed for a job is 4 days, so I am somewhat constrained in memory and time.
                            I am now trying Ray Meta with a k-mer size of 21, using the following command (the cluster uses Slurm; I am not pasting the whole submission script, only the part that corresponds to the Ray process itself):

                            srun -n 144 ./ray/build-kMAX=64-packed/Ray -k 21 -i Sample_01_forassembly.fa.gz -o RayOutput_Sample01_k21

                            where -n 144 is the total number of MPI processes (8 nodes × 18 processes per node, the maximum available).

                            Do you think a k-mer size of 21 will be enough to finish under these conditions? Is there a way to know or estimate the amount of memory and time that will be consumed?
                            I know that Ray and other k-mer-based methods are memory-hungry. The cluster uses MPI and Slurm; if this is not feasible under these constraints, is there another program better suited to the task?
                            Mariana
                            First, get the environment right:
                            1. I hope it was a HiSeq run in 2x250 bp read mode. Otherwise, a MiSeq run at 2x300 or 2x250 can be a good addition to the dataset.
                            2. Run the reads through FLASH/PANDAseq to merge overlapping pairs (see the previous posts in this thread).
                            3. Check the FastQC stats and assemble a subset first (say 100K, 1M, 10M reads) and check the results; any high-coverage contigs go into the vector/screening file. Remove the reads matching them, then repeat with a larger (10x) number of reads (example commands are at the end of this post).
                            4. Then try the whole dataset, without the high-copy-number bits.
                            5. Do not forget to add the "combined repeats (vector.seq)" sequences back to the final assembly results BEFORE annotation.

                            PS: I would try kmer = 31 or 32 as a starting point.

                            This will give you a rough idea of the resources involved and how time and RAM scale.
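                            As an illustration of steps 2 and 3 (these particular tools and file names are my suggestion, not something prescribed above), merging can be done with FLASH and subsetting with seqtk, roughly like this:

                            flash -o Sample_01_merged Sample_01_R1.fq Sample_01_R2.fq

                            seqtk sample -s100 Sample_01_R1.fq 1000000 > sub_1M_R1.fq
                            seqtk sample -s100 Sample_01_R2.fq 1000000 > sub_1M_R2.fq

                            Using the same seed (-s100) for both files keeps the read pairs in sync; change 1000000 to whatever subset size you want to test.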



                            • #15
                              Thank you Markiyan,
                              I had not mentioned it before, but in fact I had already preprocessed the sequences with PRINSEQ and removed vector and genomic E. coli contaminant sequences with Bowtie2. When I ran the program with k = 21, it actually finished in a short time (elapsed time: 1 day, 8 hours, 49 minutes, 59 seconds). Now I am going to see whether I can do better with longer k-mers, as I know assembly quality depends on k-mer size, but that also takes more memory. I have found the "write checkpoints" option, which will let me restart the run if the process is cut off after 4 days due to cluster restrictions.
                              Thanks a lot for your help!
                              Mariana
