Old 01-27-2015, 12:35 PM   #1
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default how to best remove low-coverage reads

Hi all,

I need to remove low-coverage reads from a ChIP-seq experiment before I attempt assembly (I don't have a reference genome to map back to, so I am improvising). It looks like there are several options, but I am not familiar enough with any of them to use them. I have access to iPlant Atmosphere for computing power, so my memory is going to be a bit limited. I am also somewhat of a novice: I have unpacked and run some programs, but I haven't run any scripts of my own. Any suggestions?
Old 01-27-2015, 01:40 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The best way to remove low-coverage reads without a reference is to use kmer-counting to estimate read depth. BBNorm will do this:

bbnorm.sh in=reads.fq out=highpass.fq target=99999999 min=5 passes=1

This will remove all reads that have a depth below 5x. The "target=99999999" ensures that it will not remove high-depth reads unless the depth is above 99999999, which it won't be.

BBNorm is fast and does not use very much memory.
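If you want a quick sanity check afterwards, reformat.sh (also in BBTools) will report read and base counts when given just an input file, so you can see how many reads survived the cutoff (the file name here is just a placeholder):

reformat.sh in=highpass.fq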
Old 01-27-2015, 01:46 PM   #3
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default

Excellent! I just downloaded your program and am unpacking my reads now. I take it that I don't need to specify a kmer length. Also, I have paired end reads, and would like to keep them that way for eventual assembly. Is there a good way to do that or should I just run the output through a re-pairing program afterwards?
Old 01-27-2015, 02:03 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by horvathdp
Excellent! I just downloaded your program and am unpacking my reads now. I take it that I don't need to specify a kmer length. Also, I have paired end reads, and would like to keep them that way for eventual assembly. Is there a good way to do that or should I just run the output through a re-pairing program afterwards?
Pairing is retained. It can take two files (using in1=, in2=, out1=, and out2=) or pairs interleaved in a single file (alternating read 1, then read 2). Also, it reads and writes gzip (but not tar).

The default kmer length is 31; you can change that with e.g. "k=25". If you have short reads, you may want to reduce it, as the depth it targets is kmer depth, and once kmers get close to read length, kmer depth and read depth diverge.
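For example, a paired, gzipped run with a shorter kmer might look like this (same flags as above; the file names are just placeholders):

bbnorm.sh in1=reads_R1.fq.gz in2=reads_R2.fq.gz out1=highpass_R1.fq.gz out2=highpass_R2.fq.gz target=99999999 min=5 passes=1 k=25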

Last edited by Brian Bushnell; 01-27-2015 at 03:18 PM.
Old 01-28-2015, 05:36 AM   #5
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default

Good to know. I'm working my way through the docs - it seems like a nice, fairly intuitive program. Incidentally, I tried to run the program last night, and it finished in seconds, but the output files were empty. My guess is that I don't have sufficient RAM on my small instance. Can you take a guess at the minimum I need to run this program on ~400 million 160-base fragments?

Dave
Old 01-28-2015, 09:44 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBNorm is fast, but not THAT fast - if 400 million reads finished in seconds, then I'm sure it crashed or aborted. Could you post the exact command and console output (stderr)?

BBNorm uses a constant amount of memory that is set when it starts, regardless of the input. What changes with the input volume is the accuracy of its depth estimates, which declines as the input grows and increases with the amount of memory available. The amount of memory needed for optimal accuracy also varies with the error rate and genome size of the data - for a ChIP-seq experiment, if only 5% of the genome is covered by reads at all, then the effective genome size is very small.

Generally, I would say that the amount of memory needed for very high accuracy, with typical Illumina data, is around 15 bytes per covered reference base, regardless of the number of reads. In your case, since you only care about low coverage, you can use the flag "bits=4" (which caps the counters at a max of 15, and makes them use less memory) to increase the accuracy.
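So for your case, the earlier high-pass command with capped counters would look something like this (file names are placeholders):

bbnorm.sh in=reads.fq out=highpass.fq target=99999999 min=5 passes=1 bits=4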
Old 01-29-2015, 05:58 AM   #7
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default

Here is the output when I try to generate a histogram using the command:
./khist.sh in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt

I decided this might be more informative, since I have histograms generated by Jellyfish for this dataset (spurge genomic Illumina sequences) that I can compare against.

I got the histogram file, but it is empty except for the column headers.

I am running only 1 CPU with 4 GB of RAM.

The output is below


horvathdp@vm64-16:~/myvol/Price/PriceSource130506/bbmap$ ./khist.sh in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
java -ea -Xmx679m -Xms679m -cp /home/horvathdp/myvol/Price/PriceSource130506/bbmap/current/ jgi.KmerNormalize bits=32 ecc=f passes=1 keepall dr=f prefilter hist=stdout minprob=0 minqual=0 mindepth=0 minkmers=1 hashes=3 in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
Executing jgi.KmerNormalize [bits=32, ecc=f, passes=1, keepall, dr=f, prefilter, hist=stdout, minprob=0, minqual=0, mindepth=0, minkmers=1, hashes=3, in=spurge_genome_all_trimmed_R1.fastq, hist=histogram.txt]


Settings:
threads: 1
k: 31
deterministic: false
toss error reads: false
passes: 1
bits per cell: 32
cells: 67.65M
hashes: 3
prefilter bits: 2
prefilter cells: 582.87M
prefilter hashes: 2
base min quality: 0
kmer min prob: 0.0

target depth: 40
min depth: 0
max depth: 40
min good kmers: 1
depth percentile: 54.0
ignore dupe kmers: true
fix spikes: false
histogram length: 1048576
print zero cov: false

Exception in thread "Thread-1" java.lang.IndexOutOfBoundsException: index -3230147
at java.util.concurrent.atomic.AtomicIntegerArray.rawIndex(AtomicIntegerArray.java:59)
at java.util.concurrent.atomic.AtomicIntegerArray.get(AtomicIntegerArray.java:112)
at kmer.KCountArray7MTA.incrementHashedLocal(KCountArray7MTA.java:517)
at kmer.KCountArray7MTA.increment(KCountArray7MTA.java:242)
at kmer.KmerCount7MTA$CountThread.addRead_Advanced(KmerCount7MTA.java:717)
at kmer.KmerCount7MTA$CountThread.count(KmerCount7MTA.java:498)
at kmer.KmerCount7MTA$CountThread.run(KmerCount7MTA.java:457)
Made prefilter: hashes = 2 mem = 138.88 MB cells = 582.51M used = 0.000%
Exception in thread "Thread-5" java.lang.IndexOutOfBoundsException: index -3230147
at java.util.concurrent.atomic.AtomicIntegerArray.rawIndex(AtomicIntegerArray.java:59)
at java.util.concurrent.atomic.AtomicIntegerArray.get(AtomicIntegerArray.java:112)
at kmer.KCountArray7MTA.readHashed(KCountArray7MTA.java:201)
at kmer.KCountArray7MTA.read(KCountArray7MTA.java:119)
at kmer.KCountArray7MTA.increment(KCountArray7MTA.java:231)
at kmer.KmerCount7MTA$CountThread.addRead_Advanced(KmerCount7MTA.java:717)
at kmer.KmerCount7MTA$CountThread.count(KmerCount7MTA.java:498)
at kmer.KmerCount7MTA$CountThread.run(KmerCount7MTA.java:457)
Made hash table: hashes = 3 mem = 258.06 MB cells = 67.65M used = 0.000%

Estimated kmers of depth 1-3: 1
Estimated kmers of depth 4+ : 0
Estimated unique kmers: 1

Table creation time: 1.279 seconds.
Exception in thread "Thread-10" java.lang.NoClassDefFoundError: java/util/concurrent/ThreadLocalRandom
at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2807)
Caused by: java.lang.ClassNotFoundException: java.util.concurrent.ThreadLocalRandom
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
... 1 more
Table read time: 0.185 seconds. 0.00 kb/sec
Total reads in: 0 NaN% Kept
Total bases in: 0 NaN% Kept
Error reads in: 0 NaN%
Error type 1: 0 NaN%
Error type 2: 0 NaN%
Error type 3: 0 NaN%

Wrote histogram to histogram.txt
Total kmers counted: 0
Total unique kmer count: 0
Includes forward kmers only.
The unique kmer estimate can be more accurate than the unique count, if the tables are very full.
The most accurate value is the greater of the two.

Percent unique: Infinity%
Depth average: NaN (unique kmers)
Depth median: 0 (unique kmers)
Depth standard deviation: NaN (unique kmers)

Depth average: NaN (all kmers)
Depth median: 0 (all kmers)
Depth standard deviation: 0.00 (all kmers)

Approx. read depth median: NaN

Total time: 2.074 seconds. 0.00 kb/sec
Old 01-29-2015, 08:43 AM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Oh! I think I understand the problem now. Almost all of the programs in BBTools are compatible with Java 6, but BBNorm (ecc.sh, bbnorm.sh, and khist.sh) requires Java 7 or higher. Also, it needs a 64-bit OS and 64-bit version of Java if you want to be able to use all of your memory. If you install 64-bit Java 7, then you can add the flag "-Xmx3g" which will force the program to use 3 GB of ram.
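You can check which Java is active, and whether it is 64-bit, with:

java -version

A 64-bit JVM will report something like "64-Bit Server VM" in the output.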
Old 01-29-2015, 09:37 AM   #9
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default

So I loaded Java 7 and made sure it was the default Java:

Reading state information... Done
default-jre is already the newest version.
default-jre set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 127 not upgraded.

and I still get the following:


./khist.sh -Xmx3g in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
java -ea -Xmx3g -Xms3g -cp /home/horvathdp/myvol/Price/PriceSource130506/bbmap/current/ jgi.KmerNormalize bits=32 ecc=f passes=1 keepall dr=f prefilter hist=stdout minprob=0 minqual=0 mindepth=0 minkmers=1 hashes=3 -Xmx3g in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
Executing jgi.KmerNormalize [bits=32, ecc=f, passes=1, keepall, dr=f, prefilter, hist=stdout, minprob=0, minqual=0, mindepth=0, minkmers=1, hashes=3, -Xmx3g, in=spurge_genome_all_trimmed_R1.fastq, hist=histogram.txt]


Settings:
threads: 1
k: 31
deterministic: false
toss error reads: false
passes: 1
bits per cell: 32
cells: 355.27M
hashes: 3
prefilter bits: 2
prefilter cells: 3060.78M
prefilter hashes: 2
base min quality: 0
kmer min prob: 0.0

target depth: 40
min depth: 0
max depth: 40
min good kmers: 1
depth percentile: 54.0
ignore dupe kmers: true
fix spikes: false
histogram length: 1048576
print zero cov: false

Exception in thread "Thread-1" java.lang.IndexOutOfBoundsException: index -19148990
at java.util.concurrent.atomic.AtomicIntegerArray.checkedByteOffset(AtomicIntegerArray.java:65)
at java.util.concurrent.atomic.AtomicIntegerArray.get(AtomicIntegerArray.java:112)
at kmer.KCountArray7MTA.incrementHashedLocal(KCountArray7MTA.java:517)
at kmer.KCountArray7MTA.increment(KCountArray7MTA.java:242)
at kmer.KmerCount7MTA$CountThread.addRead_Advanced(KmerCount7MTA.java:717)
at kmer.KmerCount7MTA$CountThread.count(KmerCount7MTA.java:498)
at kmer.KmerCount7MTA$CountThread.run(KmerCount7MTA.java:457)
Made prefilter: hashes = 2 mem = 729.14 MB cells = 3058.25M used = 0.000%
Exception in thread "Thread-5" java.lang.IndexOutOfBoundsException: index -19148990
at java.util.concurrent.atomic.AtomicIntegerArray.checkedByteOffset(AtomicIntegerArray.java:65)
at java.util.concurrent.atomic.AtomicIntegerArray.get(AtomicIntegerArray.java:112)
at kmer.KCountArray7MTA.readHashed(KCountArray7MTA.java:201)
at kmer.KCountArray7MTA.read(KCountArray7MTA.java:119)
at kmer.KCountArray7MTA.increment(KCountArray7MTA.java:231)
at kmer.KmerCount7MTA$CountThread.addRead_Advanced(KmerCount7MTA.java:717)
at kmer.KmerCount7MTA$CountThread.count(KmerCount7MTA.java:498)
at kmer.KmerCount7MTA$CountThread.run(KmerCount7MTA.java:457)
Made hash table: hashes = 3 mem = 1.32 GB cells = 355.16M used = 0.000%

Estimated kmers of depth 1-3: 1
Estimated kmers of depth 4+ : 0
Estimated unique kmers: 1

Table creation time: 9.133 seconds.
Exception in thread "Thread-11" java.lang.IndexOutOfBoundsException: index -19148990
at java.util.concurrent.atomic.AtomicIntegerArray.checkedByteOffset(AtomicIntegerArray.java:65)
at java.util.concurrent.atomic.AtomicIntegerArray.get(AtomicIntegerArray.java:112)
at kmer.KCountArray7MTA.readHashed(KCountArray7MTA.java:201)
at kmer.KCountArray7MTA.read(KCountArray7MTA.java:119)
at kmer.KCountArray7MTA.read(KCountArray7MTA.java:109)
at kmer.KCountArray.read(KCountArray.java:133)
at jgi.KmerNormalize.generateCoverage(KmerNormalize.java:2757)
at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:2857)
at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2811)
Table read time: 0.129 seconds. 0.00 kb/sec
Total reads in: 0 NaN% Kept
Total bases in: 0 NaN% Kept
Error reads in: 0 NaN%
Error type 1: 0 NaN%
Error type 2: 0 NaN%
Error type 3: 0 NaN%

Wrote histogram to histogram.txt
Total kmers counted: 0
Total unique kmer count: 0
Includes forward kmers only.
The unique kmer estimate can be more accurate than the unique count, if the tables are very full.
The most accurate value is the greater of the two.

Percent unique: Infinity%
Depth average: NaN (unique kmers)
Depth median: 0 (unique kmers)
Depth standard deviation: NaN (unique kmers)

Depth average: NaN (all kmers)
Depth median: 0 (all kmers)
Depth standard deviation: 0.00 (all kmers)

Approx. read depth median: NaN

Total time: 15.108 seconds. 0.00 kb/sec
Old 01-29-2015, 09:46 AM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Well, sorry for the inconvenience, then... I have never seen that error before and I can't imagine why it is happening. khist.sh works fine for me when I run it with -Xmx679m or -Xmx3g, and 1 thread, which should be exactly the same as what you are doing.

I can only suggest that another similar tool, such as khmer or Trinity, might be able to accomplish the same thing.
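For instance, khmer can apply a similar low-abundance cutoff with its counting scripts, roughly like this (a sketch from memory, so please check the khmer documentation for the exact script names and flags in your version):

load-into-counting.py -k 31 -x 4e9 counts.ct reads.fq
filter-abund.py -C 5 counts.ct reads.fq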
Old 01-29-2015, 10:01 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,046
Default

@horvathdp: Do you have only 4G RAM? Are you running a 32-bit or 64-bit OS?
Old 01-29-2015, 11:40 AM   #12
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default

Thanks for trying. It always amazes me how helpful and patient you developers are with newbies like myself. Yes, just 4 GB of RAM on a 64-bit Linux (Ubuntu 12) instance.
Old 02-04-2015, 11:14 AM   #13
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default

I managed to get the program to run on a different system. Not sure what the trick was, but all is well. I have another question, though. When my data was run through Jellyfish, the histogram I got back had a peak at 15X coverage. However, when I ran khist.sh, my peak is at about 35X coverage. To be fair, based on the predicted size of my genome, I was expecting about 30X coverage. Any idea why Jellyfish indicates about half the coverage that your program gives?
Old 02-04-2015, 12:59 PM   #14
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The most likely cause is that khist.sh stores only one copy of a kmer or its reverse-complement. I am not sure about Jellyfish's behavior, but this is the result I would expect if it stored forward and reverse kmers independently. In this case, you would expect roughly 30x coverage of reads on the genome, or 15x coverage of forward reads. But you'd have to check Jellyfish's documentation to see exactly how it processes kmers and their reverse-complements.
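For what it's worth, Jellyfish collapses a kmer and its reverse-complement into one canonical kmer only when run with the -C flag, something like this (a sketch; check the Jellyfish manual for your version):

jellyfish count -C -m 31 -s 1G -t 4 reads.fq

Without -C, forward and reverse kmers are tallied separately, which would put the histogram peak at roughly half the true depth - consistent with your 15x versus ~30x.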
Old 02-06-2015, 12:01 PM   #15
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default

So I generated a bunch of lovely histograms and figured out my cutoffs for the samples, but I am getting an error saying the hash table is too full when running bbnorm.sh, even when I cut my sample size in half. Are there any run parameters I could add to the command line to allow the program to work? The machine I am working on has a TB of hard drive and 12 GB of RAM, so I am surprised that it is running low.
Old 02-06-2015, 12:22 PM   #16
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

There are several parameters you can use to affect the table fullness, such as enabling the prefilter, reducing the number of bits per cell, or manually specifying the -Xmx parameter. Can you post the command line you used and the standard error log? And, if possible, attach a kmer frequency histogram (from the full dataset). Even when the table gets really full, the accuracy is generally quite high.

Also, quality-trimming and adapter-trimming the data before analyzing it can have a huge impact, because the vast majority of the kmers that get counted are error kmers rather than genomic kmers; so anything that reduces the number of errors and amount of non-genomic sequence will be beneficial.
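For example, adapter- and quality-trimming with BBDuk from the same package might look like this (a typical sketch, not a tuned recipe; it assumes the adapters.fa reference that ships in the bbmap/resources directory):

bbduk.sh in1=r1.fq in2=r2.fq out1=trimmed_r1.fq out2=trimmed_r2.fq ref=resources/adapters.fa ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=10 tpe tbo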

Last edited by Brian Bushnell; 02-06-2015 at 12:24 PM.
Old 02-06-2015, 01:11 PM   #17
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default

The command was:
./bbnorm.sh in=all_spurge_genomic_R1.fastq out=single_copy_R1.fastq target=80 min=5 passes=1

Below is a link to a screenshot and the histogram of the combined R1 and R2 reads (I set the first two kmer counts to 0, since the numbers for these were very large). When I try this with both the R1 and R2 files, I get a hash table that is 100% full.

http://de.iplantcollaborative.org/dl...2/Untitled.jpg
Old 02-06-2015, 02:04 PM   #18
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

OK, it looks like there are some good possibilities here:

First, it does not look like BBNorm is using all available memory, for whatever reason, so that can be configured manually. Try this:

./bbnorm.sh -Xmx11g in=all_spurge_genomic_R1.fastq out=single_copy_R1.fastq target=80 min=5 passes=1 prefilter=t cbits=16 minprob=0.8

-Xmx11g will force it to use 11GB of RAM; prefilter will put low-count kmers in a more efficient table; and cbits=16 will double the number of cells in the hashtable by capping the counts at 65k (rather than 2 billion).

Secondly, BBNorm will keep read 1 and read 2 together if you use them both at the same time (with in1= and in2=), which is generally a good idea. Thirdly, you can further reduce the error rate (and thus the kmer count), as I mentioned, by quality-trimming and adapter-trimming prior to this process. Also (or alternatively), you can ignore kmers that have a high chance of being incorrect by adjusting the "minprob" flag upward from its default of 0.5 to, say, 0.8 (meaning only count kmers with at least an 80% chance of being error-free, according to the quality values).

Normalization and other kmer-counting-based processes use a lot of memory when working with large genomes, and 12GB is not very much in the world of high-throughput sequencing.

Last edited by Brian Bushnell; 02-06-2015 at 03:01 PM.
Old 02-07-2015, 08:01 AM   #19
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default

Awesome! Thanks again - look me up at PAG next year and I'll buy the beer :-). Incidentally, I already quality-trimmed and adapter-trimmed the dataset (and checked the QC); it's the first thing I do with any new set of reads before attempting any other analysis. However, I am sure some primers might make it through the trimming programs, so another pass sounds like a good idea as long as it doesn't add a lot of time. I'll have to wait until Monday to give it a go, unless I can get access to a hotter virtual machine from iPlant. Hopefully you won't hear from me again (until I ask how you want to be cited :-).
Old 02-10-2015, 06:00 AM   #20
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default

Sadly, I am still having an issue (I think). The program has been running for three days now (it does not appear to have quit) and has produced the following error messages:

Exception in thread "Thread-45" java.lang.RuntimeException: java.io.IOException: Protocol error at stream.ReadStreamByteWriter.run (ReadStreamByteWriter.java:31)
Caused by: a series of java.io.IOException, FileOutputStream, ReadStreamByteWriter, and BufferedOutputStream.write entries

Should I kill it or let it ruminate?