SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 02:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 06:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 03:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 01:10 AM

Reply
 
Thread Tools
Old 03-15-2017, 12:57 PM   #441
sebastian_1
Junior Member
 
Location: Czech republic

Join Date: Mar 2017
Posts: 2
Default

Hello,

I am new to this forum so I apologize if I posted in the wrong place.

Currently I am trying to assemble a single cell genome from some Illumina Hiseq 2x250 bp reads.
One of my first steps is to do normalization of the data. Previously I was using bbnorm.sh because it's fast and very good. However, with this data I cannot use bbnorm.sh on our server due to memory issues. (I have around 100 million PE reads).
So I tried to use bbnorm on a cluster where I allocated 16 cores and 700GB of RAM.

I run the software using this command line:
bbnorm.sh in1=st1c_R1.fastq.gz in2=st1c_R2.fastq.gz out1=st1c_R1_norm.fastq.gz out2=st1c_R2_norm.fastq.gz threads=12 target=60 mindepth=2

The software runs fine, but the process will be killed by the scheduler because of high CPU usage.

Quote:
=>> PBS: job killed: ncpus 59.52 exceeded limit 16 (sum)
I tried to increase the number of threads requested to 30, lower the number of threads in the command line to 10, but still I have the same problem.

The cluster has installed BBMap version 36.92. It looks like for some reason the number of threads which I define is not taken into account, even if during the run it tells me that the number of threads was set to 12.

Quote:
Settings:
threads: 12
k: 31
deterministic: true
toss error reads: false
passes: 1
bits per cell: 16
cells: 145.37B
hashes: 3
base min quality: 5
kmer min prob: 0.5

target depth: 240
min depth: 2
max depth: 300
min good kmers: 15
depth percentile: 64.8
ignore dupe kmers: true
fix spikes: false
Any suggestions? I really cannot ask for the entire node just for me.
I understand that due to other components like pigz, it can use more than 12 cores, but here it looks like uses 60 cores, or all available cores.

Thank you,
Sebastian
sebastian_1 is offline   Reply With Quote
Old 03-15-2017, 01:38 PM   #442
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,447
Default

Hi Sebastian,

You can reduce the memory needs of BBNorm by adding the flags "prefilter" and "bits=16", which might allow it to run on a lower-memory computer, depending on the complexity of the input data.

But as for reducing the CPU utilization... that's tricky. You can add "pigz=f unpigz=f" to disable pigz, so you will only be left with java threads. Java thread scheduling is nondeterministic due to garbage collection and various other background stuff, and I'm not sure if it's possible to ensure the process will be capped at a certain number of total threads without binding the process to specific cores. But you can try manually specifying the number of GC threads:

-XX:ParallelGCThreads=2

This command may or may not work, depending on your java version. You'd have to skip (or modify) the shellscript so that the actual java command would look something like this:

java -ea -Xmx200g -Xms200g -XX:ParallelGCThreads=2 -cp /path/to/bbmap/current/ jgi.KmerNormalize in1=st1c_R1.fastq.gz in2=st1c_R2.fastq.gz out1=st1c_R1_norm.fastq.gz out2=st1c_R2_norm.fastq.gz threads=12 target=60 mindepth=2
Brian Bushnell is offline   Reply With Quote
Old 03-16-2017, 02:00 AM   #443
sebastian_1
Junior Member
 
Location: Czech republic

Join Date: Mar 2017
Posts: 2
Default

Hi Brian,

Thank you for your quick reply. I've tried to run BBNorm with prefilter=t and bits=16 on our server, but I still run out of memory. I have a server with 128GB of RAM, and I've tried also on a 256 GB but it didn't work. But I saw that the high CPU usage (more than defined in threads=x) is just at the beginning, for a short time and afterwards java limits itself to the required number of cores.
Luckily, I've found that the PBS system divides the CPU hours to the walltime, and thus gets the CPU usage. The workaround was to ask the job as a STDIN, and be idle for around 30 minutes, which will create a "buffer", and thus even if I start with much higher initial CPU usage, the time buffer is enough to compensate, and thus the job doesn't get killed by the scheduler.

However now I get an error with pigz:

Quote:
Table creation time: 1667.979 seconds.
Started output threads.
pigz: write error code 122
pigz: abort: write error on <stdout>
Exception in thread "Thread-53" java.lang.RuntimeException: java.io.IOException: Stream closed
at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:31)
Caused by: java.io.IOException: Stream closed
at java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:433)
at java.io.OutputStream.write(OutputStream.java:116)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at stream.ReadStreamByteWriter.writeFastq(ReadStreamByteWriter.java:451)
at stream.ReadStreamByteWriter.processJobs(ReadStreamByteWriter.java:96)
at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:41)
at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:27)
Exception in thread "Thread-64" java.lang.RuntimeException: Writing to a terminated thread.
at stream.ConcurrentGenericReadOutputStream.write(ConcurrentGenericReadOutputStream.java:207)
at stream.ConcurrentGenericReadOutputStream.addOrdered(ConcurrentGenericReadOutputStream.java:193)
at stream.ConcurrentGenericReadOutputStream.add(ConcurrentGenericReadOutputStream.java:98)
at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:3129)
at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2801)
Output buffer became full; key 95560 waiting on 95304.
LE: I set the $TMPDIR on the cluster to my current working directory and i got rid of the pigz error:-)

Last edited by sebastian_1; 03-16-2017 at 03:24 AM.
sebastian_1 is offline   Reply With Quote
Old 03-16-2017, 03:26 AM   #444
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,190
Default

While you wait for @Brian to respond (he is on California Daylight Time) you could turn off "pigz=f unpigz=f" and see if that allows the job to move forward.
GenoMax is offline   Reply With Quote
Old 03-16-2017, 06:04 AM   #445
minja
Junior Member
 
Location: Novosibirsk

Join Date: Mar 2017
Posts: 1
Default

Hello Brain,

I'm trying to make use of bbmapskimmer.
Here is a simple fasta example (read.fasta):

Code:
>1
GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA
>2
TAGAAAACATAAAGCTAGTGATAATGTAGTAGTACAAGGGGATCATGTTC
>3
GAACATGATCCCCTTGTACTATTACATTATCACTAGCTTTATGTTTTCTA
Read 1 and read 3 are only different in single "T/C" letter, read 2 is reverse-complement of 1.

I'm aligning to mm10 genome:

~/Distr/bbmap/bbmapskimmer.sh in=~/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=~/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local

Here is test_bbmap1.sam:
Code:
@HD    VN:1.4    SO:unsorted
@SQ    SN:chr10    LN:130694993
....
@SQ    SN:chrM    LN:16299
@SQ    SN:chrX    LN:171031299
@SQ    SN:chrY    LN:91744698
@PG    ID:BBMap    PN:BBMap    VN:37.02    CL:java -Djava.library.path=/mnt/storage/home/vsfishman/Distr/bbmap/jni/ -ea -Xmx98546m align2.BBMapPacBioSkimmer build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambig=all minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/mnt/storage/home/vsfishman/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=/mnt/storage/home/vsfishman/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local
1    0    chr16    39205299    43    50=    *    0    0    GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA    *    NM:i:0    AM:i:43    NH:i:2
1    272    chr10    38707112    39    28=1X21=    *    0    0    *    *    NM:i:1    AM:i:39    NH:i:2
2    16    chr16    39205299    43    50=    *    0    0    GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA    *    NM:i:0    AM:i:43    NH:i:2
2    256    chr10    38707112    39    28=1X21=    *    0    0    *    *    NM:i:1    AM:i:39    NH:i:2
3    16    chr10    38707112    3    50=    *    0    0    TAGAAAACATAAAGCTAGTGATAATGTAATAGTACAAGGGGATCATGTTC    *    XT:A:R    NM:i:0    AM:i:3    NH:i:80
3    272    chr10    72154315    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr10    77806115    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    272    chr10    127797927    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr13    4007635    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr14    17647057    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr14    40019581    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr15    50069753    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    272    chr15    72113945    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr15    75259455    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr17    69579150    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr19    60830365    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr1    95272134    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr1    102465771    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
....
(I reduced some header lines and some alinments of read 3)

And, just to note, without "local" option I got same results.

As you can see, there are many perfect alignmens for read3, but only 2 alignments of reads 1 and 2. As I understand, with minid=0.6 all alignments of read3 should be also reported for reads 1 and 2?

Could you, please, help to clarify this?
And one more question, why reads 1 and 2 do not have "XT:A:" tag?

Best,
Minja

Last edited by minja; 03-16-2017 at 06:07 AM.
minja is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:00 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO