Seqanswers Leaderboard Ad

**Brian Bushnell** · 01-27-2015, 02:40 PM

The best way to remove low-coverage reads without a reference is to use kmer-counting to estimate read depth. BBNorm will do this:

bbnorm.sh in=reads.fq out=highpass.fq target=99999999 min=5 passes=1

This will remove all reads that have a depth below 5x. The "target=99999999" ensures that it will not remove high-depth reads unless the depth is above 99999999, which it won't be.

BBNorm is fast and does not use very much memory.

**horvathdp** · 01-27-2015, 02:46 PM

Excellent! I just downloaded your program and am unpacking my reads now. I take it that I don't need to specify a kmer length. Also, I have paired end reads, and would like to keep them that way for eventual assembly. Is there a good way to do that or should I just run the output through a re-pairing program afterwards?

**Brian Bushnell** · 01-27-2015, 03:03 PM

Originally posted by horvathdp View Post

Excellent! I just downloaded your program and am unpacking my reads now. I take it that I don't need to specify a kmer length. Also, I have paired end reads, and would like to keep them that way for eventual assembly. Is there a good way to do that or should I just run the output through a re-pairing program afterwards?

Pairing is retained. It takes 2 files (using in1= in2= out1= out2=) if you want, or pairs interleaved in a single file (alternating read 1 then read 2). Also, it reads/writes gzip (but not tar).

The default kmer length is 31; you can change that with e.g. "k=25". If you have short reads, you may want to reduce it, as the depth it targets is kmer depth, and once kmers get close to read length, kmer depth and read depth diverge.

**horvathdp** · 01-28-2015, 06:36 AM

Good to know. I'm working my way through the docs. It seems like a nice, fairly intuitive program. Incidentally, I tried to run the program last night, and it finished in seconds, but the output files were empty. My guess is that I don't have sufficient RAM or memory on my small instance. Can you take a guess as to the minimum I need to run this program on ~400 million 160 base fragments?

Dave

**Brian Bushnell** · 01-28-2015, 10:44 AM

BBNorm is fast, but not THAT fast - if 400 million reads finished in seconds, then I'm sure it crashed or aborted - could you post the exact command and console output (std error)?

BBNorm uses a constant amount of memory that is set when it starts, regardless of the input. Rather, the accuracy of its estimates of read depth declines with the input volume, and increases with the amount of memory available. And the amount of memory that is needed for optimal accuracy varies based on the error rate and genome size of the data - for a chipseq experiment, if only 5% of the genome is covered at all by reads, then the effective genome size would be very small.

Generally, I would say that the amount of memory needed for very high accuracy, with typical Illumina data, is around 15 bytes per covered reference base, regardless of the number of reads. In your case, since you only care about low coverage, you can use the flag "bits=4" (which caps the counters at a max of 15, and makes them use less memory) to increase the accuracy.

**horvathdp** · 01-29-2015, 06:58 AM

Here is the output when I try to generate a histogram using the command:
./khist.sh in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt

I decided this might be more informative since I have such histograms that were generated by jellyfish for this dataset (of spurge genomic illumina sequences) so I could compare them.

I got the histogram file, but it is empty except for the column headers.

I am running only 1 cpu with 4G of ram.

The output is below

horvathdp@vm64-16:~/myvol/Price/PriceSource130506/bbmap$ ./khist.sh in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
java -ea -Xmx679m -Xms679m -cp /home/horvathdp/myvol/Price/PriceSource130506/bbmap/current/ jgi.KmerNormalize bits=32 ecc=f passes=1 keepall dr=f prefilter hist=stdout minprob=0 minqual=0 mindepth=0 minkmers=1 hashes=3 in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
Executing jgi.KmerNormalize [bits=32, ecc=f, passes=1, keepall, dr=f, prefilter, hist=stdout, minprob=0, minqual=0, mindepth=0, minkmers=1, hashes=3, in=spurge_genome_all_trimmed_R1.fastq, hist=histogram.txt]

Settings:
threads: 1
k: 31
deterministic: false
toss error reads: false
passes: 1
bits per cell: 32
cells: 67.65M
hashes: 3
prefilter bits: 2
prefilter cells: 582.87M
prefilter hashes: 2
base min quality: 0
kmer min prob: 0.0

target depth: 40
min depth: 0
max depth: 40
min good kmers: 1
depth percentile: 54.0
ignore dupe kmers: true
fix spikes: false
histogram length: 1048576
print zero cov: false

Exception in thread "Thread-1" java.lang.IndexOutOfBoundsException: index -3230147
at java.util.concurrent.atomic.AtomicIntegerArray.rawIndex(AtomicIntegerArray.java:59)
at java.util.concurrent.atomic.AtomicIntegerArray.get(AtomicIntegerArray.java:112)
at kmer.KCountArray7MTA.incrementHashedLocal(KCountArray7MTA.java:517)
at kmer.KCountArray7MTA.increment(KCountArray7MTA.java:242)
at kmer.KmerCount7MTA$CountThread.addRead_Advanced(KmerCount7MTA.java:717)
at kmer.KmerCount7MTA$CountThread.count(KmerCount7MTA.java:498)
at kmer.KmerCount7MTA$CountThread.run(KmerCount7MTA.java:457)
Made prefilter: hashes = 2 mem = 138.88 MB cells = 582.51M used = 0.000%
Exception in thread "Thread-5" java.lang.IndexOutOfBoundsException: index -3230147
at java.util.concurrent.atomic.AtomicIntegerArray.rawIndex(AtomicIntegerArray.java:59)
at java.util.concurrent.atomic.AtomicIntegerArray.get(AtomicIntegerArray.java:112)
at kmer.KCountArray7MTA.readHashed(KCountArray7MTA.java:201)
at kmer.KCountArray7MTA.read(KCountArray7MTA.java:119)
at kmer.KCountArray7MTA.increment(KCountArray7MTA.java:231)
at kmer.KmerCount7MTA$CountThread.addRead_Advanced(KmerCount7MTA.java:717)
at kmer.KmerCount7MTA$CountThread.count(KmerCount7MTA.java:498)
at kmer.KmerCount7MTA$CountThread.run(KmerCount7MTA.java:457)
Made hash table: hashes = 3 mem = 258.06 MB cells = 67.65M used = 0.000%

Estimated kmers of depth 1-3: 1
Estimated kmers of depth 4+ : 0
Estimated unique kmers: 1

Table creation time: 1.279 seconds.
Exception in thread "Thread-10" java.lang.NoClassDefFoundError: java/util/concurrent/ThreadLocalRandom
at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2807)
Caused by: java.lang.ClassNotFoundException: java.util.concurrent.ThreadLocalRandom
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
... 1 more
Table read time: 0.185 seconds. 0.00 kb/sec
Total reads in: 0 NaN% Kept
Total bases in: 0 NaN% Kept
Error reads in: 0 NaN%
Error type 1: 0 NaN%
Error type 2: 0 NaN%
Error type 3: 0 NaN%

Wrote histogram to histogram.txt
Total kmers counted: 0
Total unique kmer count: 0
Includes forward kmers only.
The unique kmer estimate can be more accurate than the unique count, if the tables are very full.
The most accurate value is the greater of the two.

Percent unique: Infinity%
Depth average: NaN (unique kmers)
Depth median: 0 (unique kmers)
Depth standard deviation: NaN (unique kmers)

Depth average: NaN (all kmers)
Depth median: 0 (all kmers)
Depth standard deviation: 0.00 (all kmers)

Approx. read depth median: NaN

Total time: 2.074 seconds. 0.00 kb/sec

**Brian Bushnell** · 01-29-2015, 09:43 AM

Oh! I think I understand the problem now. Almost all of the programs in BBTools are compatible with Java 6, but BBNorm (ecc.sh, bbnorm.sh, and khist.sh) requires Java 7 or higher. Also, it needs a 64-bit OS and 64-bit version of Java if you want to be able to use all of your memory. If you install 64-bit Java 7, then you can add the flag "-Xmx3g" which will force the program to use 3 GB of ram.

**horvathdp** · 01-29-2015, 10:37 AM

So I loaded java7,

made sure it was the default java

Reading state information... Done
default-jre is already the newest version.
default-jre set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 127 not upgraded.\and I still get the following

./khist.sh -Xmx3g in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
java -ea -Xmx3g -Xms3g -cp /home/horvathdp/myvol/Price/PriceSource130506/bbmap/current/ jgi.KmerNormalize bits=32 ecc=f passes=1 keepall dr=f prefilter hist=stdout minprob=0 minqual=0 mindepth=0 minkmers=1 hashes=3 -Xmx3g in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
Executing jgi.KmerNormalize [bits=32, ecc=f, passes=1, keepall, dr=f, prefilter, hist=stdout, minprob=0, minqual=0, mindepth=0, minkmers=1, hashes=3, -Xmx3g, in=spurge_genome_all_trimmed_R1.fastq, hist=histogram.txt]

Settings:
threads: 1
k: 31
deterministic: false
toss error reads: false
passes: 1
bits per cell: 32
cells: 355.27M
hashes: 3
prefilter bits: 2
prefilter cells: 3060.78M
prefilter hashes: 2
base min quality: 0
kmer min prob: 0.0

target depth: 40
min depth: 0
max depth: 40
min good kmers: 1
depth percentile: 54.0
ignore dupe kmers: true
fix spikes: false
histogram length: 1048576
print zero cov: false

Exception in thread "Thread-1" java.lang.IndexOutOfBoundsException: index -19148990
at java.util.concurrent.atomic.AtomicIntegerArray.checkedByteOffset(AtomicIntegerArray.java:65)
at java.util.concurrent.atomic.AtomicIntegerArray.get(AtomicIntegerArray.java:112)
at kmer.KCountArray7MTA.incrementHashedLocal(KCountArray7MTA.java:517)
at kmer.KCountArray7MTA.increment(KCountArray7MTA.java:242)
at kmer.KmerCount7MTA$CountThread.addRead_Advanced(KmerCount7MTA.java:717)
at kmer.KmerCount7MTA$CountThread.count(KmerCount7MTA.java:498)
at kmer.KmerCount7MTA$CountThread.run(KmerCount7MTA.java:457)
Made prefilter: hashes = 2 mem = 729.14 MB cells = 3058.25M used = 0.000%
Exception in thread "Thread-5" java.lang.IndexOutOfBoundsException: index -19148990
at java.util.concurrent.atomic.AtomicIntegerArray.checkedByteOffset(AtomicIntegerArray.java:65)
at java.util.concurrent.atomic.AtomicIntegerArray.get(AtomicIntegerArray.java:112)
at kmer.KCountArray7MTA.readHashed(KCountArray7MTA.java:201)
at kmer.KCountArray7MTA.read(KCountArray7MTA.java:119)
at kmer.KCountArray7MTA.increment(KCountArray7MTA.java:231)
at kmer.KmerCount7MTA$CountThread.addRead_Advanced(KmerCount7MTA.java:717)
at kmer.KmerCount7MTA$CountThread.count(KmerCount7MTA.java:498)
at kmer.KmerCount7MTA$CountThread.run(KmerCount7MTA.java:457)
Made hash table: hashes = 3 mem = 1.32 GB cells = 355.16M used = 0.000%

Estimated kmers of depth 1-3: 1
Estimated kmers of depth 4+ : 0
Estimated unique kmers: 1

Table creation time: 9.133 seconds.
Exception in thread "Thread-11" java.lang.IndexOutOfBoundsException: index -19148990
at java.util.concurrent.atomic.AtomicIntegerArray.checkedByteOffset(AtomicIntegerArray.java:65)
at java.util.concurrent.atomic.AtomicIntegerArray.get(AtomicIntegerArray.java:112)
at kmer.KCountArray7MTA.readHashed(KCountArray7MTA.java:201)
at kmer.KCountArray7MTA.read(KCountArray7MTA.java:119)
at kmer.KCountArray7MTA.read(KCountArray7MTA.java:109)
at kmer.KCountArray.read(KCountArray.java:133)
at jgi.KmerNormalize.generateCoverage(KmerNormalize.java:2757)
at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:2857)
at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2811)
Table read time: 0.129 seconds. 0.00 kb/sec
Total reads in: 0 NaN% Kept
Total bases in: 0 NaN% Kept
Error reads in: 0 NaN%
Error type 1: 0 NaN%
Error type 2: 0 NaN%
Error type 3: 0 NaN%

Wrote histogram to histogram.txt
Total kmers counted: 0
Total unique kmer count: 0
Includes forward kmers only.
The unique kmer estimate can be more accurate than the unique count, if the tables are very full.
The most accurate value is the greater of the two.

Percent unique: Infinity%
Depth average: NaN (unique kmers)
Depth median: 0 (unique kmers)
Depth standard deviation: NaN (unique kmers)

Depth average: NaN (all kmers)
Depth median: 0 (all kmers)
Depth standard deviation: 0.00 (all kmers)

Approx. read depth median: NaN

Total time: 15.108 seconds. 0.00 kb/sec

**Brian Bushnell** · 01-29-2015, 10:46 AM

Well, sorry for the inconvenience, then... I have never seen that error before and I can't imagine why it is happening. khist.sh works fine for me when I run it with -Xmx679m or -Xmx3g, and 1 thread, which should be exactly the same as what you are doing.

I can only suggest that another similar tool, such as khmer or Trinity, might be able to accomplish the same thing.

**GenoMax** · 01-29-2015, 11:01 AM

@horvathdp: Do you have only 4G RAM? Are you running a 32-bit or 64-bit OS?

**horvathdp** · 01-29-2015, 12:40 PM

Thanks for trying. It always amazes me how helpful and patient you developers are with nubes like myself. Yes, just 4G of RAM on a linux (Ubuntu 12) 64 bit instance.

**horvathdp** · 02-04-2015, 12:14 PM

I managed to get the program to run on a different system. Not sure what was the trick, but all is well. I have another question though. When my data was run on Jellyfish, the histogram I got back had a peak at 15X coverage. However, when I ran khist.sh my peak is at about 35X coverage. To be fair, based on the predicted size of my genome, was expecting about 30X coverage. Any idea why Jellyfish indicates about half the coverage that your program gives?

**Brian Bushnell** · 02-04-2015, 01:59 PM

The most likely cause is that khist.sh stores only one copy of a kmer or its reverse-complement. I am not sure about Jellyfish's behavior, but this is the result I would expect if it stored forward and reverse kmers independently. In this case, you would expect roughly 30x coverage of reads on the genome, or 15x coverage of forward reads. But you'd have to check Jellyfish's documentation to see exactly how it processes kmers and their reverse-complements.

**horvathdp** · 02-06-2015, 01:01 PM

So I generated a bunch of lovely histograms and figured out my cutoffs for the samples, but I am getting an error of the hash table being too full when running bbnorm.sh- even when I cut my sample size in half. Are there any run parameters I could add to the command line to allow the program to work? The machine I am working on has a Tb of hard drive and 12 G of RAM, so I am surprised that it is running low.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 47 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

how to best remove low-coverage reads

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News