SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
BBMap (aligner for DNA/RNAseq) is now open-source and available for download. Brian Bushnell Bioinformatics 556 09-07-2017 03:42 PM
BBMap for BitSeq dietmar13 Bioinformatics 1 04-30-2015 08:40 AM
Please help my BBMap cannot remove Illumina adapter TofuKaj Bioinformatics 4 04-28-2015 08:53 AM
BBMap Error Phage Hunter Bioinformatics 5 01-14-2015 04:34 AM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM

Reply
 
Thread Tools
Old 09-30-2015, 06:52 PM   #41
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

It is almost possible to do this with Seal, which outputs reads into bins based on kmer matching.

seal.sh in=reads.fq pattern=%.fq k=6 restrictleft=6 mm=f ref=barcodes.fa rcomp=f

That would require a file "barcodes.fa" like this:
>AACTGA
AACTGA
>GGCCTT
GGCCTT

etc., with one fasta entry per barcode, so the output reads would be in file AACTGA.fq and so forth. This is sort of a common request, so maybe I will make it unnecessary to provide a fasta file of the barcodes. Does that matter to you either way?

However, BBDuk has the flags "skipr1" and "skipr2", which allow it to only do kmer operations on one read or the other. Seal currently lacks this, but it's essential for processing inline barcodes. I'll add it for the next release.
Brian Bushnell is offline   Reply With Quote
Old 09-30-2015, 07:04 PM   #42
dkainer
Junior Member
 
Location: Australia

Join Date: May 2015
Posts: 9
Default

i hadn't noticed the Seal command. Thanks for responding so fast!

So i assume that if I were to input paired-end reads to Seal with a barcodes.fa as the ref, it would try and match the barcodes in both the R1 and R2 reads? Hence the need for skipr1 and skipr2...?

Additionally, would seal let you left trim off the barcode bases from the R1 read?
dkainer is offline   Reply With Quote
Old 09-30-2015, 07:21 PM   #43
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Quote:
Originally Posted by dkainer View Post
i hadn't noticed the Seal command. Thanks for responding so fast!

So i assume that if I were to input paired-end reads to Seal with a barcodes.fa as the ref, it would try and match the barcodes in both the R1 and R2 reads? Hence the need for skipr1 and skipr2...?
That's correct.

Quote:
Additionally, would seal let you left trim off the barcode bases from the R1 read?
Yes, it has a flag "ftl" (forcetrimleft) for doing that... "ftl=6" would remove the first 6 bases of all reads. Unfortunately it would do that for both read 1 and read 2. So... if you have reads in 2 files, that's fine; you just process the read1 file with "ftl=6". If they are interleaved it's more complicated - you'd have to split them first (for example, reformat.sh in=reads.fq out=read#.fq). I'll consider adding that the ability to only do all operations on left or right reads... it seems useful.
Brian Bushnell is offline   Reply With Quote
Old 01-24-2016, 05:59 PM   #44
habib
Junior Member
 
Location: indonesia

Join Date: Feb 2012
Posts: 3
Default

Hi Brian,

I produced the following graph using khist.sh for my 100bp PE Illumina reads. Could you help me interpret the graph please? What is the difference between raw_count and unique_kmers?

https://raw.githubusercontent.com/ha...hist.sh_output
__________________
Habib

Life does have an instruction...
habib is offline   Reply With Quote
Old 01-25-2016, 07:58 AM   #45
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Try dividing the raw count by the depth and see that the result equals unique_kmers. That might give you a clue as to what everything means.
westerman is offline   Reply With Quote
Old 01-25-2016, 05:02 PM   #46
habib
Junior Member
 
Location: indonesia

Join Date: Feb 2012
Posts: 3
Default

Thank you westerman.

So, each unique kmers would be present in average, at 'coverage depth' many times.

How about several peaks that I observed after depth 1 (which is error kmers)? There are small peak at depth 23, bigger at 46 and highest at 83. Do they indicate that my genome is heterozygous, diploid?
__________________
Habib

Life does have an instruction...
habib is offline   Reply With Quote
Old 01-25-2016, 07:33 PM   #47
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Quote:
Originally Posted by habib View Post
Thank you westerman.

So, each unique kmers would be present in average, at 'coverage depth' many times.

How about several peaks that I observed after depth 1 (which is error kmers)? There are small peak at depth 23, bigger at 46 and highest at 83. Do they indicate that my genome is heterozygous, diploid?
It's always difficult to determine exactly what this means. With only a primary peak at 46 and smaller at 23, that indicates a heterozygous diploid. However, the peak at 83 could be a lot of things. Looking at the "unique_kmers column, the first two peaks are actually at 21 and 42, so a 3rd peak at around 84 probably indicates a tetraploid. It could also be a contaminant or an organelle such as mitochondria or chloroplast, but organelles would usually have higher coverage. It could also be 2-copy repeats in the genome. But I suspect it's tetraploid.
Brian Bushnell is offline   Reply With Quote
Old 01-25-2016, 10:55 PM   #48
habib
Junior Member
 
Location: indonesia

Join Date: Feb 2012
Posts: 3
Default

Thank Brian,

actually there is also another peak at around 1000 coverage, which, as you suggest could be the organelle genomic sequences (I did not include all the data points in my previous post).

With a possibility of tetraploid, I think I am a bit in trouble in how to get a good assembly of this genome...
__________________
Habib

Life does have an instruction...
habib is offline   Reply With Quote
Old 02-18-2016, 11:10 AM   #49
DNA Sorcerer
Member
 
Location: Canada

Join Date: Mar 2010
Posts: 24
Default

Hi Brian,

I tried to run bbmap but got an error, and before going into debugging wanted to see if you can tell me if I have a general configuration issue in the custer (e.g. java), so I know what to tell the sysadmin. Thanks a lot in advance.

My line is: CLARKSCV1.2.2-b/bbmap/bbmap.sh ref=contigs.fasta nodisk in1=scratch/s_3_1_sequence.fastq in2=scratch/s_3_2_sequence.fastq covstats=omgen.coverage

And the error message is:
Hello from inside a Grid Engine job running on cl339
Job beginning at Thu Feb 18 16:30:54 NST 2016
Job ending at Thu Feb 18 16:30:54 NST 2016
java -Djava.library.path=/home/cslamovi/CLARKSCV1.2.2-b/bbmap/jni/ -ea -Xmx1310m -cp /home/cslamovi/CLARKSCV1.2.2-b/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=contigs.fasta nodisk in1=scratch/s_3_1_sequence.fastq in2=scratch/s_3_2_sequence.fastq covstats=omgen.coverage
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
DNA Sorcerer is offline   Reply With Quote
Old 02-18-2016, 11:25 AM   #50
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

@DNA_sorcerer: @Brian would be along later today with an official answer but first thing to check is what version of java is running on your node/cluster.

Post output of

Code:
$  java -version
If I remember this right, @Brian only validates BBMap suite against java v.1.7 and 1.8.

You also may be missing a leading "/" in you file paths (scratch/s_3_1_sequence.fastq) unless "scratch" directory is in the directory from where you are running your command.
GenoMax is offline   Reply With Quote
Old 02-18-2016, 11:40 AM   #51
DNA Sorcerer
Member
 
Location: Canada

Join Date: Mar 2010
Posts: 24
Default

Ops, I had forgotten to load java module before. I did now, bbmap run for a couple of minutes and stopped. This is the output:

Quote:
Hello from inside a Grid Engine job running on cl157
Job beginning at Thu Feb 18 17:03:05 NST 2016
Job ending at Thu Feb 18 17:03:05 NST 2016
java -Djava.library.path=/home/cslamovi/CLARKSCV1.2.2-b/bbmap/jni/ -ea -Xmx1310m -cp /home/cslamovi/CLARKSCV1.2.2-b/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=contigs.fasta nodisk in1=scratch/s_3_1_sequence.fastq in2=scratch/s_3_2_sequence.fastq covstats=omgen.coverage
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=contigs.fasta, nodisk, in1=scratch/s_3_1_sequence.fastq, in2=scratch/s_3_2_sequence.fastq, covstats=omgen.coverage]

BBMap version 35.82
Retaining first best site only for ambiguous mappings.
No output file.
Executing dna.FastaToChromArrays2 [contigs.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]

Set genScaffoldInfo=true
Set genome to 1

Loaded Reference: 0.335 seconds.
Loading index for chunk 1-1, build 1
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index: 8.592 seconds.
Executing jgi.CoveragePileup [covhist=null, covstats=omgen.coverage, basecov=null, bincov=null, physcov=false, 32bit=false, nzo=false, twocolumn=false, secondary=false, covminscaf=0, ksb=true, binsize=1000, startcov=false, strandedcov=false, rpkm=null, normcov=null, normcovo=null, in1=scratch/s_3_1_sequence.fastq, in2=scratch/s_3_2_sequence.fastq]

Set NONZERO_ONLY to false
Set TWOCOLUMN to false
Set USE_SECONDARY_ALIGNMENTS to false
Set KEEP_SHORT_BINS to true
Set USE_COVERAGE_ARRAYS to false
Set USE_BITSETS to true
Analyzed Index: 6.904 seconds.
Changed from ASCII-33 to ASCII-64 on input quality 97 while prescanning.
Cleared Memory: 1.798 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 1 mapping thread.
__________________
Hi there
DNA Sorcerer is offline   Reply With Quote
Old 02-18-2016, 11:46 AM   #52
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

Has the grid engine job completed?

Most likely not. After printing the message above BBMap starts loading "genome" index into memory before it starts doing alignments, so be patient .. as long as the job is still "running".

If the job completed then look in the grid engine error file and let us know what is there.
GenoMax is offline   Reply With Quote
Old 02-18-2016, 06:52 PM   #53
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Also, note this line:

Code:
Started 1 mapping thread.
This means it is running using only 1 thread. Depending on your input size, it could take a while! If that's intentional, then it's fine. But BBMap is fully multithreaded so normally I request all available slots on a node when scheduling, and tell BBMap to use all of them with the flag "threads=16" (for example). It will normally autodetect the number of available processors and use all of them, but on SGE/UGE (which I think you are using), if the NSLOTS environment variable is set, it will cap the max threads at that value (so for example on a 16-core machine, if you request 1 slot, such that "echo $NSLOTS" returns 1, then unless you manually specify the number of threads it will only use 1, to ensure fairness).

I recommend you ssh into the node it's running on and run "top -c", then look for a java process and ensure it is using CPU resources (should be at 100%). You can also examine the output file; it should gradually be growing.

Last edited by Brian Bushnell; 02-18-2016 at 06:55 PM.
Brian Bushnell is offline   Reply With Quote
Old 02-19-2016, 06:42 AM   #54
DNA Sorcerer
Member
 
Location: Canada

Join Date: Mar 2010
Posts: 24
Default

Thanks Brian,
I got it run on 16 threads, but after a little while I get these sort of errors:

Quote:
Started 16 mapping threads.
Exception in thread "Thread-7" java.lang.AssertionError
at stream.Read.expectedErrors(Read.java:2099)
at stream.Read.avgQualityByProbability(Read.java:1713)
at stream.Read.avgQualityByProbability(Read.java:1706)
at stream.Read.avgQuality(Read.java:1700)
at align2.AbstractMapThread.quickMap(AbstractMapThread.java:688)
at align2.BBMapThread.processReadPair(BBMapThread.java:953)
at align2.AbstractMapThread.run(AbstractMapThread.java:508)
Detecting finished threads: 0Exception in thread "Thread-13" java.lang.AssertionError
at stream.Read.expectedErrors(Read.java:2099)
at stream.Read.avgQualityByProbability(Read.java:1713)
at stream.Read.avgQualityByProbability(Read.java:1706)
at stream.Read.avgQuality(Read.java:1700)
at align2.AbstractMapThread.quickMap(AbstractMapThread.java:688)
at align2.BBMapThread.processReadPair(BBMapThread.java:953)
at align2.AbstractMapThread.run(AbstractMapThread.java:508)
Exception in thread "Thread-12" java.lang.AssertionError
at stream.Read.expectedErrors(Read.java:2099)
at stream.Read.avgQualityByProbability(Read.java:1713)
at stream.Read.avgQualityByProbability(Read.java:1706)
at stream.Read.avgQuality(Read.java:1700)
at align2.AbstractMapThread.quickMap(AbstractMapThread.java:688)
at align2.BBMapThread.processReadPair(BBMapThread.java:953)
at align2.AbstractMapThread.run(AbstractMapThread.java:508)
Before this run the program suggested using ignorebadquality, which I did for this run.
__________________
Hi there
DNA Sorcerer is offline   Reply With Quote
Old 02-19-2016, 07:02 AM   #55
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

Do you know how many cores there are on the node that you are running this on (16 was an example, you may have more or less available on the hardware you have access to)? Are you providing the necessary request for using all available cores on a node in your SGE command line?

Having a multi-threaded job spread out across physical nodes would not be advisable and may lead to the kind of errors you are seeing above.
GenoMax is offline   Reply With Quote
Old 02-19-2016, 07:06 AM   #56
DNA Sorcerer
Member
 
Location: Canada

Join Date: Mar 2010
Posts: 24
Default

Quote:
Originally Posted by GenoMax View Post
Having a multi-threaded job spread out across physical nodes would not be advisable and may lead to the kind of errors you are seeing above.
Thanks, I'll take a closer look at the cluster's manual and try again.
__________________
Hi there
DNA Sorcerer is offline   Reply With Quote
Old 02-19-2016, 07:12 AM   #57
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

You don't have to use all of the cores/threads available on a physical server (some clusters may be setup in a way that require you to reserve exclusive access to a node to do that).

I generally find that threads=6 works plenty fast (BBMap is well written). You may run into some I/O limits unless you have a high performance file system that goes along with your HPC cluster.
GenoMax is offline   Reply With Quote
Old 02-19-2016, 06:04 PM   #58
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Your error message indicates a problem with the quality scores (admittedly, I should improve the error message from that assertion, but I've never seen it happen before!). Specifically, there are some N bases (no-calls) that have a quality of greater than 0. This could mean that you have an ASCII-64 file being processed as ASCII-33, or it could mean... well, something else, like bug in the base-calling software.

If the problem is just Ns with nonzero quality scores, you can fix the files by running them through Reformat (this command assumes the file is ASCII-33/Sanger):

reformat.sh -da ibq qin=33 in=scratch/s_3_#_sequence.fastq out=fixed_#.fq

Do you happen to know what the quality encoding is of these files? For example, what platform are they from, and how old are they?
Brian Bushnell is offline   Reply With Quote
Old 02-22-2016, 08:49 AM   #59
DNA Sorcerer
Member
 
Location: Canada

Join Date: Mar 2010
Posts: 24
Default

Thank Brian,

You are probably right in your suspicion. The read files are old, probably 2011 or so, and were originally .txt and I changed to .fastq. Would this have confused BBmap?
__________________
Hi there
DNA Sorcerer is offline   Reply With Quote
Old 02-22-2016, 09:09 AM   #60
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

Quote:
Originally Posted by DNA Sorcerer View Post
Thank Brian,

You are probably right in your suspicion. The read files are old, probably 2011 or so, and were originally .txt and I changed to .fastq. Would this have confused BBmap?
Historically there has been more than one format for Q-scores for fastq files so you need to let BBMap know which score scale to use (most new data is in Sanger fastq format). More on the encoding here: https://en.wikipedia.org/wiki/FASTQ_format#Encoding

@Brian also has a program to identify the correct encoding for your files. That example is in post # 1 of this thread.

Code:
$ testformat.sh in=seq.fq.gz
GenoMax is offline   Reply With Quote
Reply

Tags
bbmap

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:51 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO