  • Co-assemble a massive amount of data (>3TB)

    Hello, I am attempting to assemble a massive amount of Illumina 2x150bp paired-end read data (>3TB). I am considering using megahit, as it is the least resource-intensive assembler I have used and it still gives reasonably good results.

    What are the typical strategies if one wants to assemble a dataset whose size is beyond typical limits? I am thinking of dividing the reads into smaller pools, but of course that's not ideal. Thanks

  • #2
    There are a few options for this. First off, preprocessing can reduce the number of kmers present, which typically reduces memory requirements:

    Adapter-trimming
    Quality-trimming (at least to get rid of those Q2 trailing bases)
    Contaminant removal (even if your dataset is 0.1% human, that's still the whole human genome...)
    Normalization (helpful if you have a few organisms with extremely high coverage that constitute the bulk of the data; this happens in some metagenomes)
    Error-correction
    Read merging (useful for many assemblers, but generally has a negative impact on Megahit; it should still reduce the kmer space, though).
    Duplicate removal, if the library is PCR-amplified or for certain platforms like NextSeq, HiSeq3000/4000, or NovaSeq.

    All of these will reduce the data volume and kmer space somewhat. If they are not sufficient, you can also discard reads that won't assemble; for example, those with a kmer depth of 1 across the entire read. Dividing randomly is generally not a good idea, but there are some read-based binning tools that use features such as tetramers and depth that try to bin by organism prior to assembly. There are also some distributed assemblers, like Ray, Disco, and MetaHipMer that allow you to use memory across multiple nodes. Generating a kmer-depth histogram can help indicate what kind of preprocessing and assembly strategies might be useful.
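
    For reference, here is a minimal sketch of what such a preprocessing chain could look like with BBTools; the file names are placeholders and the trimming/filtering parameters are only illustrative defaults, so check each tool's help output before running it on real data:

    Code:
    # adapter- and quality-trimming (adapters.fa ships in the bbmap resources folder)
    bbduk.sh in=raw.fq.gz out=trimmed.fq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe qtrim=rl trimq=10
    # contaminant removal: keep only reads that do not map to a human reference (hg38.fa is a placeholder path)
    bbmap.sh ref=hg38.fa in=trimmed.fq.gz outu=clean.fq.gz minid=0.95 maxindel=3
    # error-correct, then remove duplicates
    tadpole.sh mode=correct in=clean.fq.gz out=ecc.fq.gz
    clumpify.sh dedupe=t in=ecc.fq.gz out=dedup.fq.gz
    # optional: normalize to a target depth (only useful if there is a high-depth component)
    bbnorm.sh in=dedup.fq.gz out=normalized.fq.gz target=50 passes=1 prefilter=t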



    • #3
      What is the expected genome size/ploidy level? Having massive oversampling of data will not guarantee good assemblies.



      • #4
        It's more like a collection of environmental microbiome datasets, so there are no real expectations for genome size/ploidy level.

        @Brian, I have found that at this size, normalization (I use BBNorm) becomes so difficult that it would almost certainly exceed the wall-time limit on my university's cluster (7 days); I could not finish the job even after down-sampling the total reads 10x. I suppose I could normalize each sample individually (because they were amplified individually), pool them together, and maybe try another round of normalization (in case any "duplication" happens between samples). It seems to me that read binning would basically achieve something very similar to normalization anyway (algorithmically, I can't see it being more time- and memory-efficient)?

        I was also not able to generate a kmer-depth histogram when dealing with the multi-TB dataset directly; perhaps you know of something much more efficient?

        Thanks



        • #5
          The difference between binning and normalization would be that binning seeks to divide the reads into different organisms prior to assembly, so they can be assembled independently, using less time and memory per job. Normalization simply attempts to reduce the coverage of high-depth organisms/sequences, but still keeps the dataset intact. With no high-depth component, normalization will basically do nothing (unless you configure it to throw away low-depth reads, which is BBNorm's default behavior), but binning should still do something.

          Working with huge datasets is tough when you have compute time limitations. But, BBNorm should process data at ~20Mbp/s or so (with "passes=1 prefilter", on a 20-core machine), which would be around 1.7Tbp/day, so it should be possible to normalize or generate a kmer-depth histogram from a several-Tbp sample in 7 days...

          But, another option is to assemble independently, deduplicate those assemblies together, then co-assemble only the reads that don't map to the deduplicated combined assembly. The results won't be as good as a full co-assembly, but it is more computationally tractable.
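
          As a rough sketch, generating the kmer-depth histogram (or normalizing) on a single large file might look like the commands below; khist.sh is the BBTools histogram wrapper, the flag names are from my reading of the BBNorm/khist documentation, and the file names are placeholders, so verify against khist.sh/bbnorm.sh --help before running. (For scale, 20 Mbp/s × 86,400 s/day ≈ 1.7 Tbp/day.)

          Code:
          # kmer-depth histogram only; no reads are written
          khist.sh in=all_reads.fq.gz hist=khist.txt prefilter=t
          # single-pass normalization to depth 50, with a prefilter to save memory
          bbnorm.sh in=all_reads.fq.gz out=normalized.fq.gz target=50 passes=1 prefilter=t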
          Last edited by Brian Bushnell; 04-24-2017, 10:39 AM.



          • #6
            Originally posted by Brian Bushnell:
            The difference between binning and normalization would be that binning seeks to divide the reads into different organisms prior to assembly, so they can be assembled independently, using less time and memory per job. Normalization simply attempts to reduce the coverage of high-depth organisms/sequences, but still keeps the dataset intact. With no high-depth component, normalization will basically do nothing (unless you configure it to throw away low-depth reads, which is BBNorm's default behavior), but binning should still do something.

            Working with huge datasets is tough when you have compute time limitations. But, BBNorm should process data at ~20Mbp/s or so (with "passes=1 prefilter", on a 20-core machine), which would be around 1.7Tbp/day, so it should be possible to normalize or generate a kmer-depth histogram from a several-Tbp sample in 7 days...

            But, another option is to assemble independently, deduplicate those assemblies together, then co-assemble only the reads that don't map to the deduplicated combined assembly. The results won't be as good as a full co-assembly, but it is more computationally tractable.
            This is excellent advice. I have never tried read binning, but I am very familiar with tetranucleotide signatures of genomes. Is such a thing possible even at the read level? If so, could you point me to the best software package to use, please?

            I have indeed tried BBNorm, but it's significantly slower for me and uses way more memory than I anticipated; however, I have not used "passes=1 prefilter", and that might be why. By the way, BBNorm says that memory should not be a hard cap for the program to run, but I am unsure how much memory I should request; the best I could do is probably 1.2TB. Could you please give me some advice on that?

            I have attached the error message I got after 4 days of running, even though I only specified 10 million reads (the whole thing is 100 times more). So it's way slower for me. Thanks so much!

            Code:
            java -ea -Xmx131841m -Xms131841m -cp /home/jiangch/software/bbmap/current/ jgi.KmerNormalize bits=32 in=mega.nonhuman.fastq interleaved=true threads=16 prefilter=t fixspikes=t target=50 out=faster.fastq prefiltersize=0.5 reads=10000000
            Executing jgi.KmerNormalize [bits=32, in=mega.nonhuman.fastq, interleaved=true, threads=16, prefilter=t, fixspikes=t, target=50, out=faster.fastq, prefiltersize=0.5, reads=10000000]
            
            BBNorm version 37.02
            Set INTERLEAVED to true
            Set threads to 16
            
               ***********   Pass 1   **********   
            
            
            Settings:
            threads:          	16
            k:                	31
            deterministic:    	true
            toss error reads: 	false
            passes:           	1
            bits per cell:    	16
            cells:            	24.16B
            hashes:           	3
            prefilter bits:   	2
            prefilter cells:  	193.29B
            prefilter hashes: 	2
            base min quality: 	5
            kmer min prob:    	0.5
            
            target depth:     	200
            min depth:        	3
            max depth:        	250
            min good kmers:   	15
            depth percentile: 	64.8
            ignore dupe kmers:	true
            fix spikes:       	false
            
            Made prefilter:   	hashes = 2   	 mem = 44.99 GB   	cells = 193.22B   	used = 47.819%
            Made hash table:  	hashes = 3   	 mem = 44.96 GB   	cells = 24.14B   	used = 87.988%
            Warning:  This table is very full, which may reduce accuracy.  Ideal load is under 60% used.
            For better accuracy, use the 'prefilter' flag; run on a node with more memory; quality-trim or error-correct reads; or increase the values of the minprob flag to reduce spurious kmers.  In practice you should still get good normalization results even with loads over 90%, but the histogram and statistics will be off.
            
            Estimated kmers of depth 1-3: 	51336579492
            Estimated kmers of depth 4+ : 	11502226335
            Estimated unique kmers:     	62838805828
            
            Table creation time:		290885.545 seconds.
            Writing interleaved.
            Last edited by Brian Bushnell; 04-24-2017, 11:59 AM.



            • #7
              The flags/documentation for BBNorm might be a little misleading in this case... BBNorm acts in 3 phases per pass:

              1) Prefilter construction (which finished). This uses all reads.
              2) Main hashtable construction (which finished). This uses all reads.
              3) Normalization (which started but did not finish). This uses the number of reads you specify with the "reads" flag.

              So, the reason it took so long was because it was using all reads in phases 1 and 2 (otherwise, the depths would be incorrect).

              To reduce the number of reads used in the first two phases, you would need to specify "tablereads"; e.g. "reads=10000000 tablereads=10000000". Or, most simply, just pull out the first 10m reads:

              Code:
              reformat.sh in=reads.fq out=10m.fq reads=10m
              bbnorm.sh prefilter=t passes=1 target=50 in=10m.fq out=normalized.fq
              For binning, you might try MetaBAT. Binning tools generally work better on assemblies, where tetramer frequencies become more accurate. But they can work with reads as well, particularly when you have multiple samples of the same metagenome (preferably from slightly different locations / conditions / times), as they can use read-depth covariance to assist in binning. I don't know how well it will work or how long it will take, though. Also, the longer the reads are, the better; so if the reads mostly overlap, it's prudent to merge them first (a quick sketch follows).
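
              A minimal sketch of that merging step, assuming interleaved pairs and placeholder file names:

              Code:
              # merge overlapping read pairs; pairs that don't overlap go to outu
              bbmerge.sh in=reads.fq.gz out=merged.fq.gz outu=unmerged.fq.gz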



              • #8
                I see, that makes a lot of sense; no wonder it was also slow when I specified only 1 million reads. I just tried adding tablereads= and it works so fast now!!! I noticed you are the author of BBTools, and apparently your toolkit already covers the purposes of multiple scripts I have written for myself, only better implemented and more comprehensive. RIP me! I guess it's still a great way for me to learn bioinformatics, though.

                Why don't you just go for a quick application note in Bioinformatics to publish your toolkit? Or do publications no longer matter to you?



                • #9
                  It's not like I don't want publications, it's just that it takes time away from development/support...

                  And I agree, there's no better way to learn the intricacies of a problem than by attempting to solve it yourself, whether or not the solution is optimal!



                  • #10
                    The only reason I was suggesting a publication is that it would make citation easier for me.

                    I am trying to use dedupe.sh on the aggregate of individually assembled contigs as an alternative strategy to co-assembly. Since it is already overlap-aware, I am wondering if it is capable of simply "merging" those overlapping contigs instead of reporting them in different clusters (kinda like Minimus2)? If not, would that be something you would consider adding to the module?



                    • #11
                      I agree, I have been planning to add merging to Dedupe for a long time! Hopefully I'll have a chance sometime this year...



                      • #12
                        All you need to do is package a release and put it on zenodo.org. That will give you a citeable DOI and "eternal" data record.

                        It's up to the journals (and reviewers) to accept such a citation in submitted research papers. For what it's worth, I haven't had a problem with this for F1000 research (in fact, they encourage it).



                        • #13
                          Originally posted by Brian Bushnell:
                          The difference between binning and normalization would be that binning seeks to divide the reads into different organisms prior to assembly, so they can be assembled independently, using less time and memory per job. Normalization simply attempts to reduce the coverage of high-depth organisms/sequences, but still keeps the dataset intact. With no high-depth component, normalization will basically do nothing (unless you configure it to throw away low-depth reads, which is BBNorm's default behavior), but binning should still do something.

                          Working with huge datasets is tough when you have compute time limitations. But, BBNorm should process data at ~20Mbp/s or so (with "passes=1 prefilter", on a 20-core machine), which would be around 1.7Tbp/day, so it should be possible to normalize or generate a kmer-depth histogram from a several-Tbp sample in 7 days...

                          But, another option is to assemble independently, deduplicate those assemblies together, then co-assemble only the reads that don't map to the deduplicated combined assembly. The results won't be as good as a full co-assembly, but it is more computationally tractable.
                          For the alternative strategy, which is to assemble the samples independently and deduplicate the assemblies using dedupe.sh: should I then map the normalized reads to the deduplicated contigs, or the original reads (10x bigger)? It kind of makes sense to me to just go with the normalized reads, since I wanted to co-assemble them afterwards anyway.

                          Thanks.



                          • #14
                            If you plan to normalize the reads after mapping but before assembly anyway, then yes, it's logical to just map with the normalized reads... but only if the reads were normalized together (which they probably weren't). If they were normalized independently, you need to either combine them all, then normalize, then map, or map, then combine them all, then normalize. Normalizing independently will lose the low-coverage part of each library (depending on your settings) that could still be assembled once the libraries are combined.
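
                            To make the ordering concrete, here is a sketch of the combine-then-normalize-then-map route; file names and parameters are placeholders, and megahit's --12 flag assumes interleaved paired reads:

                            Code:
                            # combine all libraries, then normalize them together
                            cat sample*.fq.gz > combined.fq.gz
                            bbnorm.sh in=combined.fq.gz out=combined_norm.fq.gz target=50 passes=1 prefilter=t
                            # make the independently assembled contigs non-redundant
                            dedupe.sh in=asm1.fa,asm2.fa,asm3.fa out=dedup_contigs.fa
                            # keep only reads that do not map to the deduplicated contigs
                            bbmap.sh ref=dedup_contigs.fa in=combined_norm.fq.gz outu=unmapped.fq.gz
                            # co-assemble just the unmapped fraction
                            megahit --12 unmapped.fq.gz -o coassembly_unmapped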



                            • #15
                              Originally posted by Brian Bushnell:
                              If you plan to normalize the reads after mapping but before assembly anyway, then yes, it's logical to just map with the normalized reads... but only if the reads were normalized together (which they probably weren't). If they were normalized independently, you need to either combine them all, then normalize, then map, or map, then combine them all, then normalize. Normalizing independently will lose the low-coverage part of each library (depending on your settings) that could still be assembled once the libraries are combined.
                              Thanks for the reply. The reads were indeed normalized together, following your suggestion. It took about 5 days to go through 7 TB of data, which is faster than I expected. I still want to approximate the "co-assembly" strategy as closely as I can, even if I can't assemble the normalized reads directly.
