Seqanswers Leaderboard Ad

**Brian Bushnell** · 04-18-2017, 04:20 PM

There are a few options for this. First off, preprocessing can reduce the number of kmers present, which typically reduces memory requirements:

Adapter-trimming
Quality-trimming (at least to get rid of those Q2 trailing bases)
Contaminant removal (even if your dataset is 0.1% human, that's still the whole human genome...)
Normalization (helpful if you have a few organisms with extremely high coverage that constitute the bulk of the data; this happens in some metagenomes)
Error-correction
Read merging (useful for many assemblers, but generally has a negative impact on Megahit. Still should reduce the kmer space though).
Duplicate removal, if the library is PCR-amplified or for certain platforms like NextSeq, HiSeq3000/4000, or NovaSeq.

All of these will reduce the data volume and kmer space somewhat. If they are not sufficient, you can also discard reads that won't assemble; for example, those with a kmer depth of 1 across the entire read. Dividing randomly is generally not a good idea, but there are some read-based binning tools that use features such as tetramers and depth that try to bin by organism prior to assembly. There are also some distributed assemblers, like Ray, Disco, and MetaHipMer that allow you to use memory across multiple nodes. Generating a kmer-depth histogram can help indicate what kind of preprocessing and assembly strategies might be useful.

**GenoMax** · 04-18-2017, 05:36 PM

What is the expected genome size/ploidy level? Having massive oversampling of data will not guarantee good assemblies.

**confurious** · 04-19-2017, 04:01 PM

It's more like an environmental microbiome dataset collections. So no real expectations for genome size/ploidy level.

@Brian, I have found out that at this size, normalization (I use BBnorm) becomes so difficult that it would almost certainly exceed the allowed time amount for my university's cluster (7 days), because I could not finish the job even by down-sampling 10x of total reads. I suppose I could actually normalize each sample (because they were amplified individually), and pool them together and maybe try another round of normalization (in case any "duplications" happen inbetween samples. It seems to me that reads binning would basically achieve something very similar to normalization anyway (algorithmically I can't see it being more time and memory efficient)?

I also was not able to generate a kmer-depth graph when dealing with the multi-TB datasets directly, or perhaps you know of something much more efficient?

Thanks

**Brian Bushnell** · 04-24-2017, 10:37 AM

The difference between binning and normalization would be that binning seeks to divide the reads into different organisms prior to assembly, so they can be assembled independently, using less time and memory per job. Normalization simply attempts to reduce the coverage of high-depth organisms/sequences, but still keeps the dataset intact. With no high-depth component, normalization will basically do nothing (unless you configure it to throw away low-depth reads, which is BBNorm's default behavior), but binning should still do something.

Working with huge datasets is tough when you have compute time limitations. But, BBNorm should process data at ~20Mbp/s or so (with "passes=1 prefilter", on a 20-core machine), which would be around 1.7Tbp/day, so it should be possible to normalize or generate a kmer-depth histogram from a several-Tbp sample in 7 days...

But, another option is to assemble independently, deduplicate those assemblies together, then co-assemble only the reads that don't map to the deduplicated combined assembly. The results won't be as good as a full co-assembly, but it is more computationally tractable.

**confurious** · 04-24-2017, 11:41 AM

Originally posted by Brian Bushnell View Post

The difference between binning and normalization would be that binning seeks to divide the reads into different organisms prior to assembly, so they can be assembled independently, using less time and memory per job. Normalization simply attempts to reduce the coverage of high-depth organisms/sequences, but still keeps the dataset intact. With no high-depth component, normalization will basically do nothing (unless you configure it to throw away low-depth reads, which is BBNorm's default behavior), but binning should still do something.

Working with huge datasets is tough when you have compute time limitations. But, BBNorm should process data at ~20Mbp/s or so (with "passes=1 prefilter", on a 20-core machine), which would be around 1.7Tbp/day, so it should be possible to normalize or generate a kmer-depth histogram from a several-Tbp sample in 7 days...

But, another option is to assemble independently, deduplicate those assemblies together, then co-assemble only the reads that don't map to the deduplicated combined assembly. The results won't be as good as a full co-assembly, but it is more computationally tractable.

This is excellent advice. I have never tried reads binning but I am very familiar with tetra-nucleotide signature of genomes. Is such a thing possible even at reads level? If so, could you point to me the best software package to use please?

I have indeed tried bbnorm but its significantly slower for me and uses way more memory than I anticipated, I have however not used passes = 1 filter and that might be why. By the way, bbnorm says that memory should not be a hard cap for the program to run but I am unsure how much memory should I request, the best I could do is probably 1.2TB, could you please give me some advice on that?

I have attached the error message which I got after 4 days of running while I only specified 10 million reads (the whole is thing 100 times more). So it's way slower for me. Thanks so much!

Code:

java -ea -Xmx131841m -Xms131841m -cp /home/jiangch/software/bbmap/current/ jgi.KmerNormalize bits=32 in=mega.nonhuman.fastq interleaved=true threads=16 prefilter=t fixspikes=t target=50 out=faster.fastq prefiltersize=0.5 reads=10000000
Executing jgi.KmerNormalize [bits=32, in=mega.nonhuman.fastq, interleaved=true, threads=16, prefilter=t, fixspikes=t, target=50, out=faster.fastq, prefiltersize=0.5, reads=10000000]

BBNorm version 37.02
Set INTERLEAVED to true
Set threads to 16

   ***********   Pass 1   **********   


Settings:
threads:          	16
k:                	31
deterministic:    	true
toss error reads: 	false
passes:           	1
bits per cell:    	16
cells:            	24.16B
hashes:           	3
prefilter bits:   	2
prefilter cells:  	193.29B
prefilter hashes: 	2
base min quality: 	5
kmer min prob:    	0.5

target depth:     	200
min depth:        	3
max depth:        	250
min good kmers:   	15
depth percentile: 	64.8
ignore dupe kmers:	true
fix spikes:       	false

Made prefilter:   	hashes = 2   	 mem = 44.99 GB   	cells = 193.22B   	used = 47.819%
Made hash table:  	hashes = 3   	 mem = 44.96 GB   	cells = 24.14B   	used = 87.988%
Warning:  This table is very full, which may reduce accuracy.  Ideal load is under 60% used.
For better accuracy, use the 'prefilter' flag; run on a node with more memory; quality-trim or error-correct reads; or increase the values of the minprob flag to reduce spurious kmers.  In practice you should still get good normalization results even with loads over 90%, but the histogram and statistics will be off.

Estimated kmers of depth 1-3: 	51336579492
Estimated kmers of depth 4+ : 	11502226335
Estimated unique kmers:     	62838805828

Table creation time:		290885.545 seconds.
Writing interleaved.

**Brian Bushnell** · 04-24-2017, 12:13 PM

The flags/documentation for BBNorm might be a little misleading in this case... BBNorm acts in 3 phases per pass:

1) Prefilter construction (which finished). This uses all reads.
2) Main hashtable construction (which finished). This uses all reads.
3) Normalization (which started but did not finish). This uses the number of reads you specify with the "reads" flag.

So, the reason it took so long was because it was using all reads in phases 1 and 2 (otherwise, the depths would be incorrect).

To reduce the number of reads from the first two phases, you would need to specify "tablereads"; e.g. "reads=10000000 tablereads=10000000". Or most simply just pull the first 10m reads first:

Code:

reformat.sh in=reads.fq out=10m.fq reads=10m
bbnorm.sh prefilter=t passes=1 target=50 in=10m.fq out=normalized.fq

For binning, you might try MetaBAT. Binning tools are generally better for assemblies, so that tetramer frequencies become more accurate. But, they can work with reads as well, particularly when you have multiple samples of the same metagenome (preferably from slightly different locations / conditions / times), as they can use read depth covariance to assist in binning. But I don't know how well it will work or how long it will take. Also, the longer the reads are, the better; so if the reads mostly overlap, it's prudent to merge them first.

**confurious** · 04-24-2017, 01:01 PM

I see, that makes a lot of senses why it was also slow when I specified only 1 million reads. I just tried to add tablereads= and it works so fast now!!! I noticed you are the author of BBtools and apparently your toolkit has covered the purposes of multiple scripts I have written for myself but always better implemented and covers more stuff. RIP me! I guess it's still a great way for me to learn bioinformatics though.

Why don't you just go for a quick application note on bioinformatics to publish your toolkit? Is that publications no longer matters to you any more

**Brian Bushnell** · 04-25-2017, 12:06 PM

It's not like I don't want publications, it's just that it takes time away from development/support...

And I agree, there's no better way to learn the intricacies of a problem than by attempting to solve it yourself, whether or not the solution is optimal!

**confurious** · 04-25-2017, 12:17 PM

The only reason I was suggesting a publication is because it makes citation easier for me

I am trying to use dedupe.sh to work on the aggregate of individually assembled contigs as an alternative strategy to co-assmebly. Since it already is over-lap aware, I am wondering if it is capable to simply "merge" those over-lapping contigs, instead of reporting them in different clusters (kinda like minimus 2)? If not, would that be something you consider to add to the module.

**Brian Bushnell** · 04-25-2017, 12:27 PM

I agree, I have been planning to add merging to Dedupe for a long time! Hopefully I'll have a chance sometime this year...

**gringer** · 04-29-2017, 12:53 AM

All you need to do is package a release and put it on zenodo.org. That will give you a citeable DOI and "eternal" data record.

It's up to the journals (and reviewers) to accept such a citation in submitted research papers. For what it's worth, I haven't had a problem with this for F1000 research (in fact, they encourage it).

**confurious** · 05-08-2017, 03:50 PM

Originally posted by Brian Bushnell View Post

The difference between binning and normalization would be that binning seeks to divide the reads into different organisms prior to assembly, so they can be assembled independently, using less time and memory per job. Normalization simply attempts to reduce the coverage of high-depth organisms/sequences, but still keeps the dataset intact. With no high-depth component, normalization will basically do nothing (unless you configure it to throw away low-depth reads, which is BBNorm's default behavior), but binning should still do something.

Working with huge datasets is tough when you have compute time limitations. But, BBNorm should process data at ~20Mbp/s or so (with "passes=1 prefilter", on a 20-core machine), which would be around 1.7Tbp/day, so it should be possible to normalize or generate a kmer-depth histogram from a several-Tbp sample in 7 days...

But, another option is to assemble independently, deduplicate those assemblies together, then co-assemble only the reads that don't map to the deduplicated combined assembly. The results won't be as good as a full co-assembly, but it is more computationally tractable.

For the alternative strategy which is to assemble them independently and deduplicate them using dedupe.sh . I am wondering should I then map normalized reads to the deduplicated contigs or original reads (10x bigger)? It kind of makes sense to me to just go with the normalized reads since I wanted to co-assemble them afterwards anyway.

Thanks.

**Brian Bushnell** · 05-08-2017, 04:00 PM

If you plan to normalize the reads after mapping but before assembly anyway, then yes, it's logical to just map with the normalized reads... but only if the reads were normalized together (which they probably weren't). If they were normalized independently, you need to either combine them all, then normalize, then map, or map, then combine them all, then normalize. Normalizing independently will lose the low-coverage part of each library (depending on your settings) that could still be assembled once the libraries are combined.

**confurious** · 05-08-2017, 04:05 PM

Originally posted by Brian Bushnell View Post

If you plan to normalize the reads after mapping but before assembly anyway, then yes, it's logical to just map with the normalized reads... but only if the reads were normalized together (which they probably weren't). If they were normalized independently, you need to either combine them all, then normalize, then map, or map, then combine them all, then normalize. Normalizing independently will lose the low-coverage part of each library (depending on your settings) that could still be assembled once the libraries are combined.

Thanks for the reply. The reads were indeed normalized together after following your suggestion. Took about 5 days to go through 7 TB of data which is faster than i expected. I still want to approximate "co-assemble" strategy as best as I can even if I can't assemble the normalized reads directly.

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

coassemble massive amount of data (>3TB)

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News