
**Brian Bushnell** · 04-12-2014, 08:45 PM

There's nothing wrong with those reads; they are valid. Maybe it doesn't like header data on the "+" lines, where it is optional, or maybe it doesn't like the text after the space in the header; you could try removing both of those.
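A minimal sketch of that cleanup, assuming plain 4-line FASTQ records with no blank lines. The function names `simplify_fastq_record` and `simplify_fastq` are hypothetical, not part of any tool mentioned here:

```python
def simplify_fastq_record(lines):
    """Return a 4-line FASTQ record with a bare '+' line and the header cut at the first space."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly 4 lines")
    header, bases, plus, quals = (line.rstrip("\r\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("malformed FASTQ record: " + repr(header))
    if len(bases) != len(quals):
        raise ValueError("sequence and quality lengths differ: " + repr(header))
    # Drop everything after the first space in the header (assumption: the
    # read name before the space is enough to identify the read).
    header = header.split(" ", 1)[0]
    # The text after '+' is optional in FASTQ, so a bare '+' is still valid.
    return [header, bases, "+", quals]


def simplify_fastq(in_path, out_path):
    """Rewrite a whole FASTQ file, one 4-line record at a time."""
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            dst.write("\n".join(simplify_fastq_record(record)) + "\n")
```

Be careful with paired reads: some headers keep the pair number after the space, and this sketch throws that away.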

But if you want to test error-correction software... I'd be happy if you included BBNorm! It's extremely fast, and designed very conservatively to avoid any false corrections, even in difficult cases like polyploid organisms, amplified single-cell data, highly repetitive organisms, contaminated libraries, and indels (it does not correct indels, but it doesn't make them worse, either). Handles fasta, fastq, paired, gzipped, ... pretty much anything.

Command (on a Linux machine with bash):

ecc.sh in=reads.fastq out=corrected.fastq -Xmx30g

(...where the -Xmx flag should be set to roughly 85% of the physical memory on the computer; -Xmx30g here assumes a machine with about 35GB of RAM).
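If you would rather compute that value than guess it, here is a small sketch. The helper names are hypothetical, the 85% figure is just the rule of thumb above, and the memory lookup assumes Linux:

```python
import os


def suggest_xmx(total_bytes, fraction=0.85):
    """Turn a physical-memory size in bytes into a -Xmx flag, rounded down."""
    mib = int(total_bytes * fraction) // (1024 ** 2)
    if mib >= 1024:
        return f"-Xmx{mib // 1024}g"
    return f"-Xmx{mib}m"


def physical_memory_bytes():
    """Total physical memory; these sysconf names are available on Linux."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


if __name__ == "__main__":
    print(suggest_xmx(physical_memory_bytes()))
```

This looks at total memory only, not free memory, so the result may still be too high if other programs are using a lot of RAM.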

P.S. The same package contains BBMap, which outputs useful statistics for evaluating error-correctors even if you don't produce a sam file. For example:

    mapped:              46.9273%      98545 reads
    unambiguous:         46.6302%      97921 reads

    perfect best site:   42.8053%
    semiperfect site:    46.2454%
    ambiguousMapping:     0.2971%   (Kept)
    low-Q discards:       0.0000%

    Match Rate:          99.3339%   48944396
    Error Rate:           0.0312%      15393
    Sub Rate:             0.0179%       8830
    Del Rate:             0.0090%       4455
    Ins Rate:             0.0043%       2108
    N Rate:               0.6349%     312833

(The formatting looks better on an actual console)
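To compare error-correctors with these numbers, it helps to parse them rather than read them by eye. The sketch below parses the rate lines above and checks that they are internally consistent. It assumes (this is inferred from the numbers, not from BBMap documentation) that each percentage is a count divided by match + error + N bases:

```python
import re

# Rate lines copied from the BBMap output above.
STATS = """\
Match Rate: 99.3339% 48944396
Error Rate: 0.0312% 15393
Sub Rate: 0.0179% 8830
Del Rate: 0.0090% 4455
Ins Rate: 0.0043% 2108
N Rate: 0.6349% 312833
"""

LINE = re.compile(r"^(?P<label>[^:]+):\s+(?P<pct>[\d.]+)%\s+(?P<count>\d+)")


def parse_rates(text):
    """Map each label to a (reported percent, base count) pair."""
    rates = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            rates[match["label"]] = (float(match["pct"]), int(match["count"]))
    return rates


rates = parse_rates(STATS)
counts = {label: count for label, (_, count) in rates.items()}

# Substitutions, deletions and insertions add up to the error count.
assert counts["Sub Rate"] + counts["Del Rate"] + counts["Ins Rate"] == counts["Error Rate"]

# Assumed denominator: every base is a match, an error, or an N.
total = counts["Match Rate"] + counts["Error Rate"] + counts["N Rate"]
for label, (pct, count) in rates.items():
    print(f"{label}: reported {pct:.4f}%, recomputed {100 * count / total:.4f}%")
```

Running the same check on output from uncorrected and corrected reads gives a quick before/after comparison of the Sub, Del and Ins counts.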

**gkamath** · 04-12-2014, 09:08 PM

Hi Brian,

Thanks for the reply. I was looking for a good error correction tool to pre-process data before I use my tool. I'll certainly try out BBNorm as well.

Thanks,
Govinda.

**gkamath** · 04-12-2014, 09:33 PM

Hi Brian,

I tried the BBMap-based utility ecc.sh. It gives the following error message.

govinda@govinda-ThinkPad-T430s:~/Documents/Research_code$ ./bbmap/ecc.sh in=test1nn.fastq out=corrected_bbnorm.fastq outt=thrown_bbnorm.fastq -Xmx30g
java -da -Xmx30g -cp /home/govinda/Documents/Research_code/bbmap/current/ jgi.KmerNormalize bits=16 ecc=t passes=1 keepall dr=f prefilter in=test1nn.fastq out=corrected_bbnorm.fastq outt=thrown_bbnorm.fastq -Xmx30g
Executing jgi.KmerNormalize [bits=16, ecc=t, passes=1, keepall, dr=f, prefilter, in=test1nn.fastq, out=corrected_bbnorm.fastq, outt=thrown_bbnorm.fastq, -Xmx30g]

Settings:
threads: 4
k: 31
deterministic: false
toss error reads: false
passes: 1
bits per cell: 16
cells: 6770.27M
hashes: 3
prefilter bits: 2
prefilter cells: 29.16B
prefilter hashes: 2
base min quality: 5
kmer min prob: 0.5

target depth: 40
min depth: 6
max depth: 40
min good kmers: 15
depth percentile: 54.0
ignore dupe kmers: true
fix spikes: false

Made prefilter: hashes = 2 mem = 6.79 GB cells = 29.16B used = 0.001%
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.atomic.AtomicIntegerArray.<init>(AtomicIntegerArray.java:69)
at kmer.KCountArray7MTA.<init>(KCountArray7MTA.java:79)
at kmer.KCountArray.makeNew(KCountArray.java:55)
at kmer.KmerCount7MTA.makeKca(KmerCount7MTA.java:182)
at jgi.KmerNormalize.runPass(KmerNormalize.java:970)
at jgi.KmerNormalize.main(KmerNormalize.java:707)

Do I have to install any java libraries?

**Brian Bushnell** · 04-12-2014, 09:41 PM

Govinda,

That's not a library problem, just a memory problem. How much memory does the machine have? You need to reduce -Xmx30g slightly (possibly due to other programs running in the background). If you have 32GB RAM, try... -Xmx25g. If you have 16GB RAM, try -Xmx12g.
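The log in the previous post lets you check the memory arithmetic. This sketch assumes that "M" and "B" in the cell counts mean 10^6 and 10^9, and that "GB" in the log means GiB; those assumptions reproduce the logged prefilter size of 6.79 GB. The function name `table_gib` is hypothetical:

```python
GIB = 1024 ** 3


def table_gib(cells, bits_per_cell):
    """Size in GiB of a table with the given cell count and cell width."""
    return cells * bits_per_cell / 8 / GIB


# Values taken from the "Settings" section of the log.
prefilter = table_gib(29.16e9, 2)   # prefilter cells: 29.16B, prefilter bits: 2
main = table_gib(6770.27e6, 16)     # cells: 6770.27M, bits per cell: 16

print(f"prefilter:  {prefilter:.2f} GiB")  # the log reports mem = 6.79 GB
print(f"main table: {main:.2f} GiB")
print(f"total:      {prefilter + main:.2f} GiB")
```

Under those assumptions the two tables together need a bit over 19 GiB, and the trace shows the failure while the main table is being allocated, right after the prefilter was made. That is consistent with the tables being sized for the requested 30g heap rather than for the memory the machine actually has free.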

The next revision I release will be better able to calculate how much memory is free automatically.

-Brian

P.S. In this case, I'm guessing you're just using a handful of reads, as the prefilter was only 0.001% full. Bear in mind that error correction relies on high coverage: with the default settings, BBNorm won't correct anything unless you have at least roughly 20x coverage, because there's not enough data to make confident corrections. Also, unlike some other programs, BBNorm purely uses a "count-min sketch" data structure. This does NOT grow in size with the input data; rather, you allocate the memory you want to use at the beginning, and then no matter how much data you feed it, it will never overflow; the accuracy just decreases. As a result, don't be concerned if it uses many gigabytes of memory even when you only process 1 read: the memory usage is fixed regardless of input.

So, the more memory you give it, the more accurate it will be, which is why it's best to give it as much memory as you have free on the machine. It will still run if you say "-Xmx2g"; the accuracy will just be reduced on very large datasets.
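For anyone unfamiliar with the idea, here is a toy count-min sketch for k-mer counting. It is an illustration of the general data structure only, not BBNorm's implementation; the class, the hashing scheme and the parameter names are all assumptions made for the example:

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter: estimates never undercount, but may overcount."""

    def __init__(self, width, depth=3):
        self.width = width    # cells per row; more cells means fewer collisions
        self.depth = depth    # number of rows, one hash function per row
        # All memory is allocated here and never grows afterwards.
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        data = item.encode()
        for row in range(self.depth):
            # A different salt per row gives a different hash function per row.
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.rows[row][col] += 1

    def estimate(self, item):
        # Collisions can only inflate a cell, so the smallest cell is the best guess.
        return min(self.rows[row][col] for row, col in self._cells(item))


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


if __name__ == "__main__":
    sketch = CountMinSketch(width=64)
    for kmer in kmers("ACGTACGTAC", 4):
        sketch.add(kmer)
    print("ACGT seen about", sketch.estimate("ACGT"), "times")
    print("cells in use:", sketch.width * sketch.depth, "(fixed)")
```

With a small `width` and a lot of input, more k-mers share cells and the estimates drift upward; with a large `width` they stay close to the true counts. That is the memory versus accuracy trade-off described above.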

Creating .fastq file from simulated reads
