SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   BBMap (aligner for DNA/RNAseq) is now open-source and available for download. (http://seqanswers.com/forums/showthread.php?t=41057)

Brian Bushnell 02-19-2014 01:33 PM

BBMap (aligner for DNA/RNAseq) is now open-source and available for download.
 
BBMap is now available here:

https://sourceforge.net/projects/bbmap/

Links to other BBTools forum threads:
BBSplit (Binning reads based on mapping to multiple references at once)
BBDuk and Seal (Decontamination, filtering, adapter-trimming, kmer-masking, and alignment-free expression quantification)
BBNorm (Normalization, error-correction, kmer frequency histograms, and genome size estimation)
BBMerge (Paired read overlap merging and insert size calculation)
Reformat (Read reformatting, deinterleaving, subsampling, etc)
CalcUniqueness (Library uniqueness/saturation plots)
RemoveHuman (Filtering out reads from human, or some other specific organism, with zero false-positive removals)
Repair (Re-pairing reads that got out of order, based on their names)
CalcTrueQuality (Recalibrating quality scores of reads)
Tadpole (Assembler, error-correction, read extension)
KmerCompressor (Kmer set generation and set operations)
Clumpify (Increases compression of gzipped/bzipped fastq files)

A thread with answers to frequently asked questions about BBTools, collated by GenoMax, is here.

BBMap/BBTools are now open source. Please try it out - it's a 3MB download, and written in pure Java, so installation is trivial - just unzip and run. Handles all sequencing platforms (Illumina, PacBio, 454, Sanger, Nanopore, etc) except SOLiD colorspace, which I removed to simplify the code.

A Powerpoint comparison of performance (speed, memory, sensitivity, specificity) on various genomes, compared to bwa, bowtie2, gsnap, smalt:

https://drive.google.com/file/d/0B3l...it?usp=sharing

...but in summary, BBMap is similar in speed to bwa, with much better sensitivity and specificity than any other aligner I've compared it to. It uses more memory than Burrows-Wheeler-based aligners, but in exchange, the indexing speed is many times faster.

How to use

There is documentation in the docs folder, and each shellscript displays usage information when run with no arguments. For example:

bbmap.sh ref=ecoli.fa
...will build an index and write it to the current directory

bbmap.sh in=reads.fq out=mapped.sam
...will map to the indexed reference

bbmap.sh in1=reads1.fq in2=reads2.fq out=mapped.sam ref=ecoli.fa nodisk
...will build an index in memory and map paired reads to it in a single command

If your OS does not support shellscripts, replace 'bbmap.sh' like this:
java -Xmx23g -cp /path/to/current align2.BBMap in=reads.fq out=mapped.sam

...where /path/to/current is the location of the 'current' directory, and -Xmx23g specifies the amount of memory to use. This should be set to about 85% of physical memory (the symbols 'm' or 'g' specify megs or gigs), or more, depending on your virtual memory configuration. The human reference requires around 21 GB; generally, references need around 7 bytes per base pair, with a minimum of 500 MB at default settings. However, there is a reduced-memory mode ('usemodulo') that only needs half as much memory. The shellscripts are just wrappers that display usage information and set the -Xmx parameter.

Please ask if you encounter any problems or need help! And there are other neat tools too, for error correction, normalization, depth-binning, reference-based binning, contaminant filtering, adapter trimming, optimal quality trimming, reformatting files, paired-read merging, deduplication of assemblies, and histogram generation for things like kmer depth and insert size.

NOTE: The BBMap (and related tools) shellscripts will try to autodetect memory, but may fail (causing the JVM to fail to start, or to run out of memory), depending on the system configuration. This can be overridden by adding an -Xmx flag (e.g. -Xmx30g, adjusted for your computer's physical memory) to the shellscript's parameter list; it will be passed through to java.
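
For example, to cap BBMap at 30 GB on a machine where autodetection fails (file names here are just placeholders):

bbmap.sh -Xmx30g ref=ecoli.fa in=reads.fq out=mapped.sam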

luc 02-23-2014 04:17 PM

Dear Brian,

Thanks a lot for posting your aligner package.
Thanks to the PacBio option it is not strictly a short-read aligner - but up to which maximum read length would you recommend it (e.g. are merged 500 bp MiSeq reads OK for the short-read version, and are PacBio reads longer than the 6 kb mentioned for version 25 allowed in the latest version)?
Would you recommend any specific settings for the alignment of RNA-seq reads?

Brian Bushnell 02-24-2014 10:09 AM

Quote:

Originally Posted by luc (Post 133511)
Dear Brian,

Thanks a lot for posting your aligner package.
Thanks to the PacBio option it is not strictly a short-read aligner - but up to which maximum read length would you recommend it (e.g. are merged 500 bp MiSeq reads OK for the short-read version, and are PacBio reads longer than the 6 kb mentioned for version 25 allowed in the latest version)?
Would you recommend any specific settings for the alignment of RNA-seq reads?

Luc,

Unfortunately, the limit is still 6kb. I am working on increasing that, but it may take a while. For PacBio reads, you should use the script mapPacBio8k.sh (which, despite the name, only goes up to 6k), and the PacBio reads should be in fasta format. Reads in fasta format that are longer than 6000bp will automatically be broken into 6000bp pieces and their names appended with "_1", "_2", and so forth. But the vast majority of PacBio reads will be under 6kbp and thus won't be fragmented.
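
For example, something along these lines should work (file names are just placeholders):

mapPacBio8k.sh ref=reference.fa in=pacbio_reads.fasta out=mapped.sam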

For any Illumina reads, including merged Miseq, use bbmap.sh. It handles reads up to 500bp. The only exception is if you want to map Illumina reads to PacBio reads (for error correction), in which case you should use mapPacBio.

bbmap.sh (which calls BBMap.class) has mutation penalties tuned for Illumina reads, while mapPacBio8k.sh (which calls BBMapPacBio.class) has mutation penalties tuned for PacBio reads and a longer maximum read length, and is slower. You can still use mapPacBio8k.sh for any really long reads, though, like when mapping one assembly to another.

For RNA-seq (using Illumina reads), I would recommend a command line like this:

bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam maxindel=100000 xstag=firststrand intronlen=10 ambig=random

This will look for introns up to 100kbp long, and flag every gap longer than 10bp with cigar symbol 'N' (skipped) rather than 'D' (deletion). Ambiguously-mapped reads will go to a random top-scoring site, which is generally best for quantification. XS tags, needed by Cufflinks, will be produced according to the "first strand" library protocol (if you don't know what protocol is used, then set it to xstag=unstranded).

The 'maxindel' flag can be increased or decreased as needed. Many plants and fungi tend to have very short introns, mostly under 400bp, while the human average is much higher, with a few introns even longer than 100kbp. It should be set above the size of the largest introns you expect, but higher values decrease accuracy, so I don't recommend setting it above 500000 or so. The default is 16000, which is fine for the plants and fungi I've worked with, but if you are working with vertebrates then 200000 would probably be appropriate.

luc 02-24-2014 02:12 PM

Thanks a lot for the advice!
I will give it a try.

Brian Bushnell 02-25-2014 05:28 PM

Quote:

Originally Posted by luc (Post 133595)
Thanks a lot for the advice!
I will give it a try.

You're welcome. By the way, I added a java 6-compiled package to the Sourceforge page (and a newer version of the java 7-compiled package). These may also address some other issues; e.g., some shellscripts had Windows-style line endings (\r\n), and I changed them all to Unix-style newlines (\n). Some shellscript parsers choke on \r\n.
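
If you have a copy where the scripts still carry DOS line endings, something like this should convert them in place (assuming GNU sed; dos2unix would work too):

sed -i 's/\r$//' *.sh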

westerman 03-07-2014 08:48 AM

Brian: Trying out BBMap and I am having problems indexing the reference. My species is Pig which has the following Fasta entry lines in the genome.fa file:

Quote:

grep '>' ../References/GENOME_DIR/genome.fa
>10
>11
>12
>13
>14
>15
>16
>17
>18
>1
>2
>3
>4
>5
>6
>7
>8
>9
>MT
>X
>Y
However it appears that BBMap is only creating 6 chromosomes as follows:

Quote:

Executing align2.BBMap [build=1, overwrite=true, match=long, fastareadlen=500, ref=../References/GENOME_DIR/genome.fa]

BBMap version 31.32
Set OVERWRITE to true
Cigar strings enabled.
Retaining first best site only for ambiguous mappings.
No output file.
Writing reference.
Executing dna.FastaToChromArrays [../References/GENOME_DIR/genome.fa, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gzip=true, chromc=false, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

Set genScaffoldInfo=true
Writing ref/genome/1/chr1.chrom.gz
Writing ref/genome/1/chr2.chrom.gz
Writing ref/genome/1/chr3.chrom.gz
Writing ref/genome/1/chr4.chrom.gz
Writing ref/genome/1/chr5.chrom.gz
Writing ref/genome/1/chr6.chrom.gz
Set genome to 1

Loaded Reference: 0.008 seconds.
Loading index for chrom 1-6, genome 1
No index available; generating from reference genome: /group/gcore/machine/sbs/run00289/131127_M00657_0033_000000000-A6EDC/hr00671_and-yanhua/006356_-945-Flag-1-946-/ref/index/1/chr1-3_index_k13_c2_b1.block
No index available; generating from reference genome: /group/gcore/machine/sbs/run00289/131127_M00657_0033_000000000-A6EDC/hr00671_and-yanhua/006356_-945-Flag-1-946-/ref/index/1/chr4-6_index_k13_c2_b1.block
Indexing threads started for block 0-3
Indexing threads started for block 4-6
Indexing threads finished for block 4-6
Indexing threads finished for block 0-3
Generated Index: 274.860 seconds.
Finished Writing: 29.271 seconds.
No reads to process; quitting.

Total time: 362.577 seconds.
Any ideas? Feel free to send me email westerman at purdue dot edu

Brian Bushnell 03-07-2014 09:15 AM

Westerman,

Sorry for the confusion. The files are called "chrom" for legacy reasons, and they don't correspond to chromosomes anymore. I should rename them "chunks". Originally, BBMap wrote one chromosome per chrom file, which is fine for complete reference genomes. But de-novo assemblies can have thousands or millions of scaffolds, so now I bundle them into chunks.

Everything is there, and when you produce a sam file, you'll see that reads are mapped to all of the different input chromosomes.

westerman 03-07-2014 10:19 AM

Quote:

Originally Posted by Brian Bushnell (Post 134537)
Everything is there, and when you produce a sam file, you'll see that reads are mapped to all of the different input chromosomes.

So they are. The mapping is different, and slightly higher -- in terms of raw matches -- than TopHat. I'll have to do more work to see if the matches are accurate, but so far it is encouraging.

Quote:

samtools idxstats BBMapped.bam | sort -n

* 0 0 174134
MT 16613 121832 3149
X 144288218 11619 559
Y 1637716 16 4
1 315321322 31627 2152
2 162569375 33762 1920
3 144787322 40457 3319
4 143465943 49722 2064
5 111506441 27714 1515
6 157765593 60316 2405
7 134764511 27906 2778
8 148491826 14523 1241
9 153670197 26155 1155
10 79102373 6924 601
11 87690581 13340 650
12 63588571 25561 1007
13 218635234 29541 1744
14 153851969 29737 2588
15 157681621 25974 2666
16 86898991 9360 334
17 69701581 23454 913
18 61220071 6550 332

Tophat accepted_hits.bam

Quote:

MT 16613 85810 0
X 144288218 11197 0
Y 1637716 12 0
1 315321322 35442 0
2 162569375 24092 0
3 144787322 46519 0
4 143465943 43747 0
5 111506441 22186 0
6 157765593 45857 0
7 134764511 18867 0
8 148491826 13938 0
9 153670197 27813 0
10 79102373 4661 0
11 87690581 12493 0
12 63588571 19511 0
13 218635234 33190 0
14 153851969 19774 0
15 157681621 11172 0
16 86898991 14862 0
17 69701581 19776 0
18 61220071 5159 0

Brian Bushnell 03-07-2014 10:37 AM

By the way, BBMap's default mode is "ambig=best", meaning that any ambiguously-mapped reads will go to the first top-scoring location (ambiguous hits will be annotated "XT:A:R" rather than "XT:A:U", so you will know which ones are ambiguous). This is fine for variant-calling, but if you are doing RNA-seq expression quantification, it's usually better to set "ambig=random" or "ambig=all". Otherwise, if there are two almost-identical genes, all the reads will map to the first one.

Also, for vertebrate RNA-seq, I suggest these settings:
xstag=us
Generate XS tags, needed by Cufflinks, according to the "unstranded" protocol (also use this if you don't know the protocol). Alternatives are "ss" (second strand) and "fs" (first strand). If you DO know the protocol, please set it to ss or fs.

intronlen=10
All deletions up to this length will be annotated as 'D' in the cigar string; longer ones will be annotated as 'N' (meaning skipped). This may or may not matter to downstream tools. For DNA mapping you don't need this flag.

maxindel=200000
The longest deletion (i.e., intron) that it will look for. It may still find longer ones. Around 98% of human introns are shorter than 100kbp. If you do not set this, it will default to 16kbp, which is too short for vertebrate RNA-seq (though lots of plants and fungi have tiny introns of around 200bp).
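
Putting those together, a typical vertebrate RNA-seq command would look something like this (file names are just placeholders; use ambig=random or ambig=all for quantification as discussed above):

bbmap.sh ref=genome.fa in1=reads1.fq in2=reads2.fq out=mapped.sam maxindel=200000 intronlen=10 xstag=us ambig=random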

I'll add this recommendation to the readme for the next release.

westerman 03-07-2014 11:35 AM

Thanks for the recommendations, especially the ambig= one. The help for bbmap.sh is ... confusing, at least at first glance.

One thing I think is working, but giving me an improper warning message, is the -Xmx parameter. Starting like so:

Quote:

bbmap.sh -Xmx30G path=../BBMap_ref in1=$R1 in2=$R2 ...
I get:

Quote:

This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool.
java -ea -Xmx30G ....
In other words, a warning, but Java itself is starting up properly with 30G.

Brian Bushnell 03-07-2014 11:45 AM

Hmm, that's interesting, thanks for noting it. It's because I first try to detect memory, then parse the -Xmx override. I will reverse the order :)

I'll try to improve the help information, so any concrete suggestions/complaints would be very helpful.

westerman 03-07-2014 12:44 PM

As for help, the major problem is the formatting on screens with 80 columns. Wrapping is hard to read, plus there is too much information for a quick reference. I suggest limiting the help messages to 80 columns and directing the user to a longer formatted README or extended help. I know that the current help says to look at the README.

Also, parsing '-h' on the command line would be nice. While displaying help when no arguments are given is good, it is also useful to be able to use '-h'.

As an example of the first paragraph, at the moment the help includes:

Code:

maxsites=5          Maximum number of total alignments to print per read.  Only relevant when secondary=t.
quickmatch=f        Generate cigar strings more quickly.  Must be true to generate secondary site cigar strings.
keepnames=f        Keep original names of paired reads, rather than ensuring both reads have the same name.

All useful information, to be sure, but these could be shortened and the extended information put into a formatted README. E.g.

Code:

maxsites=5          Max. number of total alignments to print
quickmatch=f        Generate cigar strings more quickly.
keepnames=f        Keep original names of paired reads


luc 03-07-2014 07:07 PM

Hi,

I get similar warnings and then an exit on a system with more than 200 GB of RAM available (using the Java 6 version).

"This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool."

"Invalid maximum heap size: -Xmx30G
The specified size exceeds the maximum representable size.
Could not create the Java virtual machine."

Brian Bushnell 03-07-2014 09:44 PM

Quote:

Originally Posted by luc (Post 134587)
Hi,

I get similar warnings and then an exit on a system with more than 200 GB of RAM available (using the Java 6 version).

"This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool."

"Invalid maximum heap size: -Xmx30G
The specified size exceeds the maximum representable size.
Could not create the Java virtual machine."

luc,

That means that you have a 32-bit version of Java installed. You need a 64-bit version, which you can get here.

Westerman,

Thanks for your feedback; I'll simplify the help menus for the next version.

-Brian

luc 03-10-2014 11:53 AM

Hi Brian,

Thanks a lot for the advice. I have now changed the java environments several times (via aliases to the latest JREs) and tried both your java6 and java7 versions, and I always end up with the same messages as before. Our OS is old (CentOS 5.9), as is our default java (it is 64-bit nevertheless, build 1.6.0_16-b01). Thus, it is very possible that our configuration is outdated.
I will try to somehow set java "ulimits".


Quote:

Originally Posted by Brian Bushnell (Post 134594)
luc,

That means that you have a 32-bit version of Java installed. You need a 64-bit version, which you can get here.

Westerman,

Thanks for your feedback; I'll simplify the help menus for the next version.

-Brian


Brian Bushnell 03-10-2014 12:08 PM

luc,

That's unfortunate. If you download and unzip a 64-bit version of Java 7, then point directly to that java executable rather than relying on the default java of your environment, it should work. Otherwise, the limit is something like 1600MB (specified by the flag -Xmx1600m) if I recall correctly, which is plenty for small references like microbes and fungi.
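
For example, something along these lines (with the paths adjusted to wherever you unpacked java and BBMap):

/path/to/jre1.7.0/bin/java -ea -Xmx31g -cp /path/to/bbmap/current align2.BBMap ref=ecoli.fa in=reads.fq out=mapped.sam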

-Brian

Thias 03-25-2014 04:08 AM

Has anybody experience with running bbmap on clusters?
 
Hello everybody,

I would like to know if anyone has (good) experience with running bbmap alignments on clusters that require submitting jobs via a scheduler.

Because I can successfully start/finish an alignment on the university's cluster when it is started directly from the terminal (which is prohibited, but who cares on Sundays at 3am?), but the exact same command written into a Portable Batch System (PBS) job script and submitted via qsub to the Torque/Maui scheduler fails:

Code:

Exception in thread "Thread-5" java.lang.NullPointerException
    at align2.Block.<init>(Block.java:33)
    at align2.Block.read(Block.java:156)
    at align2.IndexMaker4$BlockMaker.makeArrays(IndexMaker4.java:139)
    at align2.IndexMaker4$BlockMaker.run(IndexMaker4.java:127)
Generated Index:    3.816 seconds.
Exception in thread "main" java.lang.NullPointerException
    at align2.BBIndex.analyzeIndex(BBIndex.java:116)
    at align2.BBMap.loadIndex(BBMap.java:389)
    at align2.BBMap.main(BBMap.java:30)

I guess bbmap recognizes that there are more cores available on the system than I am allowing for the job via the job script, and tries to use them nevertheless?

Since I would like to extend my alignment times beyond weekend nights :rolleyes:, I would be happy for suggestions.

GenoMax 03-25-2014 04:58 AM

Quote:

Originally Posted by Thias (Post 136022)
I guess bbmap recognizes that there are more cores available on the system, than I am allowing for the job via the job script and tries to use them nevertheless?

Are you explicitly indicating the number of threads for the bbmap command by using the "threads=x" option? (BTW: this option is set to "auto" by default, so unless you specify a number, bbmap will try to use all cores.)

Thias 03-25-2014 07:49 AM

Yes, I set the threads=8 option and also a memory limit of 32g for bbmap.

However, I realised that I had initially written the job script in a way that let the scheduler assign randomly distributed cores to the job. When I limited it to cores of one particular node, everything ran smoothly. So no problem with bbmap, only with me not using the batch system correctly :confused:.
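
For anyone else hitting this, a job script along these lines is what I mean - requesting all cores on a single node and a bit more memory than the -Xmx value (exact resource names vary between sites):

Code:

#PBS -l nodes=1:ppn=8
#PBS -l mem=40gb
cd $PBS_O_WORKDIR
bbmap.sh -Xmx32g threads=8 ref=ref.fa in1=reads1.fq in2=reads2.fq out=mapped.sam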

HGV 04-15-2014 09:30 AM

problems with bbmap switches
 
Hi Brian,
Very decent mapper that you wrote, and super-great that it is finally available.

I was playing around with bbmap - so many cool features and an impressive speed! But I could not figure out a couple of things:
1) How to point to directories with path=
I would like to be able to use a set of indices that I created previously and that are stored in a specific 'databases' path mounted on all nodes of our cluster. But I did not get this to work by setting path= during the mapping process, because it changes where bbmap searches for the reads. My second attempt was to call bbmap in the folder of the references and set path to the folder of the reads, but then I always got an error that the read file was not found. The only thing that worked for me was calling bbmap from a folder which also includes the /ref folder, but this means copying both reads and refs across the filesystem wherever I need them. We were mostly using bowtie2 up to now, and in bowtie2 I can point to absolute paths for references, reads and outputs. It would be cool to be able to handle files in bbmap similarly.
2) using outm
I am mapping shotgun metagenomic Illumina paired-end reads to references that are gene databases. I was expecting to get different output for out= and outm=, but the files produced are identical. I would expect to see some read pairs where only one of the reads maps to the gene database and the other does not; as far as I understood, out= gives me the mapping pairs and outm= gives me the mapping pairs plus the single mapping reads with their mates.
3) how can I get sorted unmapped pairs written to a file?
While outm1=reads.f.fq and outm2=reads.r.fq give me the mapped pairs, outu always writes everything to a single file (no outu2= possible).
4) I was trying to limit the insert sizes allowed in paired-end mapping with e.g. pairlen=1000, but the output still reported exactly the same mappings, with insert sizes way higher, often in the multi-kb range for PE reads. This also greatly affects the average insert size reported... What am I doing wrong? It would also be very cool to get the standard deviation reported, as well as the median. One can calculate these things from the very useful histogram file that inserthistogram=file outputs, but that is not as convenient.

Keep up the good work
Harald

Brian Bushnell 04-15-2014 10:46 AM

Quote:

Originally Posted by HGV (Post 137860)
Hi Brian,
Very decent mapper that you wrote, and supergreat that it is finally
available.

I was playing around with bbmap, sooo many cool features and an impressive
speed!

Thanks! Glad you like it.

Quote:

But I could not figure out a couple of things:
1) How to point to directories with path=
I would like to be able to use a set of indices that I created previously
and that are stored in a specific 'databases' path mounted on all nodes of
our cluster. But I did not get this to work setting path= during the
mapping process because it changes where bbmap searches for the reads. My
second trial was to call bbmap in this folder of the references and set
path to the folder of the reads, but then I always got an error that the
read file was not found. The only thing that worked for me is calling bbmap
from a folder which also includes the /ref folder, but this means copying
both reads and refs accross the filesystem wherever I need them. We were mostly using bowtie2 up to now and in bowtie2 I can point to absolute paths for references, reads and outputs. Would be cool to be able to handle files in
bbmap similarly
BBMap can use absolute paths for all of those, but the syntax is a bit different. Let's say you have 2 builds of the human genome, HG18 and HG19, at /fastas/hg18.fa and /fastas/hg19.fa, reads at /data/reads.fq, and you want to write output to /output/. There are 3 ways you could handle this.

Version 1: Don't write an index to disk. This usually makes the startup slower, but is more I/O efficient. It's the simplest and best if you are only mapping to a reference once (like when evaluating multiple assemblies).
bbmap.sh in=/data/reads.fq ref=/fastas/hg18.fa out=/output/mapped18.sam nodisk
bbmap.sh in=/data/reads.fq ref=/fastas/hg19.fa out=/output/mapped19.sam nodisk


Version 2: Write each index to its own directory.
bbmap.sh ref=/fastas/hg18.fa path=/human/hg18/
bbmap.sh ref=/fastas/hg19.fa path=/human/hg19/

bbmap.sh in=/data/reads.fq path=/human/hg18/ out=/output/mapped18.sam
bbmap.sh in=/data/reads.fq path=/human/hg19/ out=/output/mapped19.sam


Version 3: Write indices to the same directory, but specify a build number.
bbmap.sh ref=/fastas/hg18.fa path=/human/ build=18
bbmap.sh ref=/fastas/hg19.fa path=/human/ build=19

bbmap.sh in=/data/reads.fq path=/human/ build=18 out=/output/mapped18.sam
bbmap.sh in=/data/reads.fq path=/human/ build=19 out=/output/mapped19.sam


If you don't specify a build number, the default is always 1.

Quote:

2) using outm
I am mapping shotgun metagenomic illumina paired end reads to references
that are gene databases. I was expecting to get different output for out=
and outm= but the files produced are identical. I would expect to see some
read pairs where only one of the reads maps to the gene database and the
other not, and as far as I understood out= gives me the mapping pairs and
outm= gives me the mapping pairs and the single mapping reads with their
pairs.
For paired reads, outm captures all reads that are mapped OR have a mate that is mapped. You can change this behavior with the flag "po", which stands for "pairedonly". The default is "po=f". If you set "po=t", then reads will be considered unmapped unless they are mapped as a pair. So, if only one read maps, or if both map but not in a valid pairing configuration, they will go to "outu" instead of "outm". "out" will get both of them no matter what.
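
For example (read and reference names are just placeholders):

bbmap.sh in1=reads1.fq in2=reads2.fq ref=genes.fa outm=pairs_mapped.fq outu=other.fq po=t

With po=t, only pairs mapped in a valid configuration go to outm; everything else goes to outu.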
Quote:

3) how can I get sorted unmapped pairs written to a file?
While outm1=reads.f.fq and outm2=reads.r.fq gives me the mapped pairs, outu always writes everything to a single file (no outu2= possible)
Are you sure? It works for me... though it is not mentioned in the documentation. I'll clarify that.
bbmap.sh in=reads.fq outu1=r1.fq outu2=r2.fq
produces 2 files, r1.fq and r2.fq
You can only specify output streams 1 and 2 for fasta or fastq output, though, not sam. Also - by default, whenever you output paired reads to a single file, they will come out interleaved (read 1, read 2, read 1, read 2, etc). You can split them back into 2 files with reformat:
reformat.sh in=reads.fq out1=read1.fq out2=read2.fq
Quote:

4) I was trying to limit the insert sizes allowed in the paired end mapping with e.g. pairlen=1000, but the output still reported exactly the same mappings with insert sizes way higher, often in the multi kb range for PE-reads. This also greatly affects the average insert size reported... What am I doing wrong? It would also be very cool to get the standard deviation put out, as well as the median. One can calculate these things from the very useful histogram files that inserthistogram=file outputs, but that is not as convenient.
Actually, I only track the average insert size, not the median, because that's the easiest, though I admit the median and SD would be more useful. I'll add that, since I'm tracking the information anyway (if you specify inserthistogram).

As for what you noticed with the pairlen flag, I had not noticed that - sounds like a possible bug. I will look into it.

Quote:

Keep up the good work
Harald
I will. Thanks for the feedback!

Brian Bushnell 04-16-2014 05:02 PM

I have uploaded a new version of BBTools, v32.06. Some improvements:

Shellscripts:

Much better memory detection; should be able to correctly detect free physical and virtual memory in Linux so you won't have to set the -Xmx flag.

All programs:

Added append flag; allows you to append to existing output files rather than overwriting.

BBMap (bbmap.sh):

Now reports insert size median and standard deviation (as long as you have the "ihist" flag set).

Added 'qahist' flag, which outputs the observed quality of bases for each claimed quality value. Works for both indel and substitution error models.

Added 'bhist' flag, which outputs the frequency of ACGTN by base position. Reformat and BBDuk.sh also support 'bhist' and 'qhist' flags.

'pairlen' flag now works correctly (for capping the max allowed insert size).

IUPAC reference codes are now converted to N prior to mapping.

Statistics are now reported both by number of reads and number of bases.

Added BBWrap (bbwrap.sh), a wrapper for BBMap that allows you to map multiple sets of reads while only loading the index once (important for huge indexes and small read sets). The mapped reads can be written all to the same output file, or to one output file per input file.

Slightly increased accuracy.

PacBio reads can now be mapped in fastq format.

BBMerge (bbmerge.sh):

Now supports dual output files for unmerged reads, 'outu1' and 'outu2'.

Improved accuracy.

BBDuk (bbduk.sh):

Much faster when used without any reference (e.g. for quality trimming).

RandomReads:

Much better simulated PacBio reads.

Thanks to everyone at seqanswers who has given me feedback!

GenoMax 04-16-2014 05:33 PM

Quote:

Originally Posted by Brian Bushnell (Post 138027)

Shellscripts:

Much better memory detection; should be able to correctly detect free physical and virtual memory in Linux so you won't have to set the -Xmx flag.

Brian: I am not nitpicking, but many of us run BBMap on clusters via queuing systems, so setting that flag (and threads=n) is critical (to prevent BBMap from overstepping the queue resource limits).

Thanks!

luc 04-16-2014 07:08 PM

Hi Brian,

I can't seem to find any documentation on how to apply specific error models for the read simulation of Illumina or PacBio reads. Perhaps I am misunderstanding something, though.

Quote:

Originally Posted by Brian Bushnell (Post 138027)
I have uploaded a new version of BBTools, v32.06. Some improvements:

RandomReads:

Much better simulated PacBio reads.


Brian Bushnell 04-16-2014 07:14 PM

Quote:

Originally Posted by GenoMax (Post 138032)
Brian: I am not nitpicking but many of us run BBMap on clusters via queuing systems so setting that flag (and the threads=n) is critical (to prevent BBMap from overstepping the queue resource limits).

Thanks!

For clusters, I recommend setting the -Xmx flag to indicate the amount of memory you requested and 'threads' (or 't') to the number of slots you have requested. We (JGI) run on clusters too, but most of the nodes are exclusive-only, so all hardware resources are available to the user.

The new memory-detection system should work on Linux/UGE clusters (it works here). The shellscript calculates the system's free virtual memory, free physical memory, and 'ulimit -v'. Then it will invoke Java with the minimum of those 3 values, and thus, hopefully, will never fail initialization, and never use more RAM than is allowed. Although here, at least, it's impossible to use more RAM than allowed because our cluster will detect that and kill the process.
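
In shell pseudocode, the idea is roughly this (just a sketch of the logic, not the actual wrapper code):

freephys=$(awk '/MemFree|Buffers|^Cached/ {sum+=$2} END {print int(sum/1024)}' /proc/meminfo)  # free physical, MB
freevirt=$(awk '/MemFree|SwapFree/ {sum+=$2} END {print int(sum/1024)}' /proc/meminfo)         # free virtual, MB
ulim=$(ulimit -v)                                                                              # KB, or "unlimited"
[ "$ulim" = "unlimited" ] && ulim=$freevirt || ulim=$((ulim/1024))
mem=$(printf '%s\n' "$freephys" "$freevirt" "$ulim" | sort -n | head -1)                       # minimum of the three
java -ea -Xmx$((mem*85/100))m -cp /path/to/current align2.BBMap "$@"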

Our clusters use UGE (formerly known as SGE) and set 'ulimit' according to the qsub memory request. I don't know if that is universal or just a local policy, but at least here, users should be able to use the shellscripts without specifying -Xmx, and they should work. I'd be happy if someone elsewhere tried it without setting -Xmx and reported what happens.

Threads is a little more difficult. For now, unless you have exclusive access to a node, I recommend that you specify the number of threads allowed (e.g. t=4 if you reserved 4 slots). There is an environment variable "NSLOTS" that apparently gets set by UGE, which I will parse in my next release, and default to using the minimum of the system's reported logical cores and NSLOTS. But it appears to not always be correct, and furthermore, hyperthreading makes the matter more confusing (BBMap does not benefit much from hyperthreading, but many of the other BBTools do, like BBNorm).

Also - there are basically 4 types of programs in BBTools, resource-wise:

1) Super-heavyweight:
Requests all available threads and memory by default, but can be capped with the '-Xmx' and 't' flags. That's because all are both multithreaded and require an amount of memory dependent on the input, which the shellscript can't calculate. This includes bbmap.sh, bbsplit.sh, bbwrap.sh, mapPacBio.sh, mapPacBio8k.sh, bbnorm.sh, ecc.sh, khist.sh, bbduk.sh, countkmersexact.sh, and dedupe.sh.

2) Heavyweight:
Requests all available RAM but only one primary thread. Pipeline-multithreading (input thread, processing thread, output thread) cannot be disabled, so the 't' flag has no effect unless you are doing multithreaded compression using pigz. Includes pileup.sh, randomreads.sh, bbcountunique.sh, and bbmask.sh.

3) Midweight:
Requests all available threads by default, but can be capped with the 't' flag; needs little memory, so the -Xmx flag is hard-coded at an appropriate value that should work for all inputs (ranging from 100m to 400m) and not affected by available memory, though you can still override it. This includes bbmerge.sh.

4) Lightweight:
Needs minimal memory and only one primary thread, so again the -Xmx flag is hard-coded to a low value (at most 200m) rather than dependent on autodetection. Again, 't' flag has no effect unless you are doing multithreaded compression using pigz. This includes reformat.sh, stats.sh, statswrapper.sh, readlength.sh, bbest.sh, bbsplitpairs.sh, gradesam.sh, translate6frames.sh, and samtoroc.sh.

Note that if you have pigz installed, you can accelerate all BBTools performance on gzipped input or output using the "unpigz" and "pigz" flags (which can be set to 't' or 'f'). pigz allows multithreaded .gz compression and decompression, and is both faster and more efficient (even with only a single thread) than Java or gzip. So with pigz installed, even a mostly singlethreaded program like reformat.sh can (and by default, will) use all available threads if you write gzipped output:

reformat.sh in=read1.fq.gz in2=read2.fq.gz out=interleaved.fasta.gz zl=8

...will compress the output to gzip compression level 8. If pigz is installed, it will be used for both compression and decompression instead of Java, resulting in decompression using around 1.5 cores per input file and compression using all allowed cores. You can prevent this with the 't=1' flag (to cap compression at 1 thread) or 'unpigz=f' and 'pigz=f' (to disable decompression/compression in pigz processes).

Due to some bugs in UGE (which can kill processes that fork in certain circumstances), pigz is by default set to false in high-memory processes and true in low-memory processes.

If you have not tried pigz, I highly recommend it! It's great and will become increasingly important as core counts increase. It's a drop-in replacement for gzip - same command line, 100% inter-compatible, but multithreaded and more efficient per thread. The only downside is increased memory usage, but it's not bad - around 12 MB per thread. Compression (for genomic data) is within ~0.1% of gzip at the same compression level. As I mentioned, decompression uses at most 1.5 cores in my tests, while compression can use all cores.
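
For example, to compress or decompress a file directly with pigz (the -p flag sets the number of threads):

pigz -p 16 reads.fq
pigz -d -p 4 reads.fq.gz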

Brian Bushnell 04-16-2014 07:45 PM

Quote:

Originally Posted by luc (Post 138039)
Hi Brian,

I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though.

Luc,

Sorry about that; randomreads is currently not documented. I will try to remedy that tomorrow, and let you know when it's done.

luc 04-16-2014 10:15 PM

Thanks a lot Brian,

BBMap does have a lot of different tools.

GenoMax 04-17-2014 07:51 AM

Quote:

Originally Posted by Brian Bushnell (Post 138040)
Our clusters use UGE (formerly known as SGE) and set 'ulimit' according to the qsub memory request. I don't know if that is universal or just a local policy, but at least here, users should be able to use the shellscripts without specifying -Xmx, and they should work. I'd be happy if someone elsewhere tried it without setting -Xmx and reported what happens.

With LSF, not setting the -Xmx flag leads to a default allocation of 4G (at least on our cluster; things may be set up differently at other locations), which results in Java running out of heap space. I am also not using the exclusive-nodes option so far and it seems to be working fine.

BTW: The latest code for BBMap (32.06) seems to have shell script files in DOS format (this wasn't the case in previous versions).

Brian Bushnell 04-17-2014 09:42 AM

Quote:

Originally Posted by GenoMax (Post 138124)
With LSF, not setting the -Xmx flag is leading to default allocation of 4G (at least on our cluster, things may be set up differently at other locations) which results in Java running out of heap space. I am also not using exclusive nodes option so far and it seems to be working fine.

Hmm, thanks for testing. Sounds like LSF is setting the ulimit too low, probably for 1 slot no matter how many slots you reserve, I'm guessing.

Quote:

BTW: The latest code for BBMap (32.06) seems to have shell script files in DOS format (wasn't the case before on previous versions).
Thanks for noticing that; I've uploaded a new version, 32.07, which fixes it. They work fine for me, but some versions of bash can't process DOS line endings.

By the way - I forgot to mention it, but I added the phiX reference and truseq adapters to the /resources/ directory, for use with BBDuk.

Brian Bushnell 04-24-2014 10:17 AM

Quote:

Originally Posted by luc (Post 138039)
Hi Brian,

I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though.

Luc,

I released a new version (32.14) which now has RandomReads documented (run randomreads.sh with no arguments).

By default it will use an Illumina error model, which assigns qualities in a pattern that looks roughly like the average quality histogram of Illumina reads (peaking at about position 20, and sloping down toward the head and tail). Then some bases will be randomly changed based on the quality score. To generate Illumina-like reads from E. coli:
randomreads.sh ref=ecoli.fa reads=100000 length=150 maxq=40 midq=30 minq=10 out=synth.fq
It can also generate paired reads if desired.

To generate PacBio-like reads:
randomreads.sh ref=ecoli.fa reads=100000 minlength=200 maxlength=4000 pacbio pbminrate=0.13 pbmaxrate=0.17 out=synth.fq

That will add substitutions and single-base-pair indels with per-base probability ranging from 0.13 to 0.17.

There are also a lot of other options for adding specific numbers or lengths of insertions, deletions, substitutions, and Ns, in the help menu.

luc 04-24-2014 03:58 PM

Thanks a lot for the randomreads instructions Brian!

Thias 04-26-2014 09:22 AM

Quote:

Originally Posted by Brian Bushnell (Post 138806)
bbmap.sh in1=read1.fq.gz in2=read2.fq.gz ref=zv9.fa out=mapped.sam maxindel=200000

@ Brian: Is maxindel the only bbmap parameter you recommend to change for mapping reads for subsequent Cufflinks analysis?

Because I interpreted the Cufflinks docs to mean that it will use the XM tag to lower the read weight by the number of sites the read maps to, and therefore decided to set ambig=all.

I therefore chose to set for my RNAseq:

Code:

maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=T
Was I misled?

Brian Bushnell 04-26-2014 10:11 AM

Thias,

For analysis of vertebrate RNA-seq data with Cufflinks, I do recommend all of those parameters:
maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=t

Although "ambig=all" or "ambig=random" would both be OK. The "intronlen", "xstag" and "xmtag" are things I added specifically for Cufflinks compatibility (or the tuxedo package in general) so they are not needed if you do downstream analysis with other tools. The only ones that affect the actual mapping are "ambig" and "maxindel". xstag and xmtag just add optional tags, and intronlen changes the "D" symbol in cigar strings to "N" for deletions longer than the specified length. There's no reason why any of them should be necessary, but for whatever reason, Cufflinks needs them.

Note that if your data is strand-specific and you know which protocol was used, then it's better to set "xstag=firststrand" or "xstag=secondstrand" than unstranded.

Thias 04-26-2014 02:41 PM

Thanks Brian for your response almost at warp speed.

Unfortunately I am still a bit lost - probably also due to my ignorance regarding the details of the SAM format specification. This is how I assumed the settings take effect:
  • ambig=all:
    All best alignments are reported. All but one are flagged with the secondary alignment bit (0x100), and the XM tag (an integer?) additionally reports the number of best alignments for the tuxedo package.
  • ambig=random:
    From the best alignments, one is randomly chosen. This alignment is reported as if there had not been any secondary alignments. Only an XM tag > 1 indicates that there were other best alignments as well, but the coordinates of those alignments are not present in the output.

I admit I could easily have checked whether ambig=random really does not report secondary alignments, but I was pretty confident I had understood the logic - until I read your post this afternoon.

Thanks also for emphasizing the xstag's importance. However, the data is indeed unstranded, unpaired, phred64-age, low-coverage, dusty archaeological heritage from a previous PhD student :rolleyes:.

Brian Bushnell 04-26-2014 06:36 PM

Quote:

Originally Posted by Thias (Post 138856)
Thanks Brian for your response almost at warp speed.

Unfortunately am still a bit lost - probably also due to my ignorance regarding the detail SAM Format Specification. This is, how I assumed the setting takes effect:
  • ambig=all:
    All best alignments are reported. All but one are flagged with the secondary alignment bit (0x100) and the xmtag (an integer ?) additionally reports the number of best alignments for the tuxedo package.
  • ambig=random:
    From the best alignments, one is randomly chosen. This alignment is reported as if there would not have been any secondary alignments. Only the xmtag > 1 indicates, that there would have been other best alignments as well, but the coordinates of those alignments are not present in the output.

I admit, I could easily have checked, whether ambig=random really does not report secondary alignments, but I was pretty confident to have understood the logic - until I read your post this afternoon.

Thanks also for emphasizing the xstag's importance. However the data is indeed unstranded unpaired phred64-age low-coverage dusty archaeological heritage from a previous PhD student :rolleyes:.

That sounds correct. ambig=all reports a randomly selected primary site and all other equal-scoring sites as secondary. ambig=random only reports a primary site, selected at random from the top-scoring sites. If you use "random" then possibly (probably?) you shouldn't include the xmtag, as that might confuse Cufflinks. Also, ambiguously-mapped reads get the "XT:A:R" tag. Note that any tag starting with "X" is not part of the official SAM specification.
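
For example, a quick way to count the ambiguously-mapped reads in a sam output file:

grep -c 'XT:A:R' mapped.sam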

Thias 04-28-2014 01:30 AM

Quote:

Originally Posted by Brian Bushnell (Post 138861)
If you use "random" then possibly (probably?) you shouldn't include the xmtag, as that might confuse cufflinks.

This is in accordance with my gut feeling, since I interpret the Cufflinks docs to mean that the algorithm adjusts the read weight based on the XM tag:

Quote:

By default, Cufflinks will uniformly divide each multi-mapped read to all of the positions it maps to. In other words, a read mapping to 10 positions will count as 10% of a read at each position.
However, Cufflinks at a later stage also needs the secondary alignment positions for the fine-grained abundance estimation:

Quote:

Cufflinks will then re-estimate the abundances dividing each multi-mapped read probabilistically based on the initial abundance estimation of the genes it maps to
But thanks Brian for working this out!

WhatsOEver 05-07-2014 12:57 AM

Dear Brian,

I tested bbmap successfully on a small dataset and so far I'm very happy with the results :)
However, I'm running into some memory problems using my real dataset:


My RAM looks currently as follows:
Quote:

             total      used      free    shared   buffers    cached
Mem:      66066316  42861240  23205076         0      3700  42478768
When I now run bbmap with different -Xmx settings, or without the parameter, it always quits with:

Quote:

Max Memory = 28633 MB (or other, depending on the value I used)
Available Memory = 10830 MB (ranging from 8Gb - 12Gb)
Reducing threads from 16 to 15 due to low system memory.
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
To me it looks like a bug in the memory detection - or am I missing something?

Brian Bushnell 05-07-2014 10:11 AM

Quote:

Originally Posted by WhatsOEver (Post 139707)
Dear Brian,

I tested bbmap successfully on a small dataset and so far I'm very happy with the results :)
However, I'm running into some memory problems using my real dataset:

...

To me it looks like a bug within the memory detection or am I missing something?

WhatsOEver,

Normally, "Reducing threads from 16 to 15 due to low system memory." only occurs if you are right at the edge of the memory limits. It would be helpful to know:
1) The amount of physical memory available (it looks like 64GB, is that correct?)
2) The size of the reference you are using.
3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh
4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space". Actually, the entire standard error output would be helpful.

You can reduce the amount of memory BBMap needs by reducing the number of threads (e.g. "t=8" to drop it to 8 threads). But that normally will only help with pacbio modes which need much more memory per thread (100s of megs), not with standard bbmap which only uses around 25MB per thread.

Assuming you do have 64GB, I would try with the flag -Xmx58g and see if that works. But regardless, it would help if you could post the above details.

Thanks,
Brian

WhatsOEver 05-08-2014 08:29 AM

Quote:

Originally Posted by Brian Bushnell (Post 139755)
WhatsOEver,

Normally, "Reducing threads from 16 to 15 due to low system memory." only occurs if you are right at the edge of the memory limits. It would be helpful to know:
1) The amount of physical memory available (it looks like 64GB, is that correct?)
2) The size of the reference you are using.
3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh
4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space". Actually, the entire standard error output would be helpful.

You can reduce the amount of memory BBMap needs by reducing the number of threads (e.g. "t=8" to drop it to 8 threads). But that normally will only help with pacbio modes which need much more memory per thread (100s of megs), not with standard bbmap which only uses around 25MB per thread.

Assuming you do have 64GB, I would try with the flag -Xmx58g and see if that works. But regardless, it would help if you could post the above details.

Thanks,
Brian

Hi Brian,

It was indeed increasing -Xmx to 58g that did the trick. The threads (t) parameter, however, didn't help at all. So when I run with -Xmx58g, everything works fine (independent of t=8, t=16, or leaving t=auto). When I run with -Xmx30g and t=auto, the program quits with the following error:

Concerning your questions:
1) The amount of physical memory available (it looks like 64GB, is that correct?)
yes
2) The size of the reference you are using.
it's the human genome assembly 19 from UCSC (~3.1Gb)
3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh
mapPacBio8k.sh
4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space"
see quotes below

Quote:

java -ea -Xmx30g -cp /projects/bbmap/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 k=14 in=./121128_1.fastq out=/projects/testData/bbmap_121128_1_mapped.sam xstag=unstranded xmtag=t ambig=all maxindel=200000 qin=33 -Xmx30g
Executing align2.BBMapPacBio [build=1, overwrite=true, minratio=0.40, fastareadlen=6000, ambiguous=best, minscaf=100, startpad=10000, stoppad=10000, midpad=6000, k=14, in=./121128_1.fastq, out=/projects/testData/bbmap_121128_1_mapped.sam, xstag=unstranded, xmtag=t, ambig=all, maxindel=200000, qin=33, maxindel=150000, -Xmx30g]

BBMap version 32.14
Set overwrite to true
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.400
Set fastq input ASCII offset to 33
Retaining all best sites for ambiguous mappings.
Set genome to 1

Loaded Reference: 6.143 seconds.
Loading index for chunk 1-7, build 1
Generated Index: 9.693 seconds.
Analyzed Index: 24.885 seconds.
Started output stream: 0.038 seconds.
Cleared Memory: 0.530 seconds.

Max Memory = 28633 MB
Available Memory = 10826 MB
Reducing threads from 16 to 15 due to low system memory.
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at align2.MultiStateAligner9PacBio.<init>(MultiStateAligner9PacBio.java:73)
at align2.MSA.makeMSA(MSA.java:43)
at align2.AbstractMapThread.<init>(AbstractMapThread.java:130)
at align2.BBMapThreadPacBio.<init>(BBMapThreadPacBio.java:100)
at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:378)
at align2.BBMapPacBio.main(BBMapPacBio.java:33)
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6

**************************************************************************

Warning! 15 mapping threads did not terminate normally.
Please check the error log; the output may be corrupt or incomplete.

**************************************************************************


Exception in thread "main" java.lang.RuntimeException: Aborting due to prior error.
at align2.AbstractMapper.abort(AbstractMapper.java:69)
at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:381)
at align2.BBMapPacBio.main(BBMapPacBio.java:33)
When I run it with -Xmx30g and t=8, I get the same heap space error, although it's not complaining that there is not enough memory available:

Quote:

java -ea -Xmx30g -cp /projects/bbmap/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 k=14 t=8 in=./121128_1.fastq out=/projects/testData/bbmap_121128_1_mapped.sam xstag=unstranded xmtag=t ambig=all maxindel=200000 qin=33 -Xmx30g
Executing align2.BBMapPacBio [build=1, overwrite=true, minratio=0.40, fastareadlen=6000, ambiguous=best, minscaf=100, startpad=10000, stoppad=10000, midpad=6000, k=14, t=8, in=./121128_1.fastq, out=/projects/testData/bbmap_121128_1_mapped.sam, xstag=unstranded, xmtag=t, ambig=all, maxindel=200000, qin=33, maxindel=150000, -Xmx30g]

BBMap version 32.14
Set overwrite to true
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.400
Set threads to 8
Set fastq input ASCII offset to 33
Retaining all best sites for ambiguous mappings.
Set genome to 1

Loaded Reference: 5.773 seconds.
Loading index for chunk 1-7, build 1
Generated Index: 9.148 seconds.
Analyzed Index: 27.428 seconds.
Started output stream: 0.015 seconds.
Cleared Memory: 0.511 seconds.
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at align2.MultiStateAligner9PacBio.<init>(MultiStateAligner9PacBio.java:73)
at align2.MSA.makeMSA(MSA.java:43)
at align2.AbstractMapThread.<init>(AbstractMapThread.java:130)
at align2.BBMapThreadPacBio.<init>(BBMapThreadPacBio.java:100)
at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:378)
at align2.BBMapPacBio.main(BBMapPacBio.java:33)
Detecting finished threads: 0, 1, 2, 3, 4, 5

**************************************************************************

Warning! 8 mapping threads did not terminate normally.
Please check the error log; the output may be corrupt or incomplete.

**************************************************************************


Exception in thread "main" java.lang.RuntimeException: Aborting due to prior error.
at align2.AbstractMapper.abort(AbstractMapper.java:69)
at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:381)
at align2.BBMapPacBio.main(BBMapPacBio.java:33)

Brian Bushnell 05-08-2014 09:44 AM

With bbmap.sh, HG19 needs about 23 GB. I had never tested HG19 with mapPacBio8k.sh but it looks like I need to refine the memory prediction. Thanks for the report!

WhatsOEver 05-23-2014 04:48 AM

Dear Brian,

I have some questions about the optional fields (TAG:TYPE:VALUE stuff) that bbmap can add to the SAM file. Take the following multiply-mapped read as an example:

Quote:

readNameHere 16 chr11 106018977 55 6M3I17M * 0 0 GCTTTGAAAAATTTTCTGCATGCCCA /4122:...00....=<<<=<<4449 XT:A:R NM:i:5 AM:i:55 XM:i:2 NH:i:20
readNameHere 272 chrX 119093161 50 18M5I3M * 0 0 * * NM:i:6 AM:i:50 XM:i:2 NH:i:20
readNameHere 272 chrX 45789727 49 1M3I22M * 0 0 * * NM:i:6 AM:i:49 XM:i:2 NH:i:20
readNameHere 272 chr20 23542459 49 18M5I3M * 0 0 * * NM:i:6 AM:i:49 XM:i:2 NH:i:20
readNameHere 256 chr3 151169846 47 1M3I3M3I16M * 0 0 * * NM:i:6 AM:i:47 XM:i:2 NH:i:20
readNameHere 256 chr4 27414930 47 1M1I5M12N17M1I1M * 0 0 * * NM:i:2 AM:i:47 XM:i:2 XS:A:+ NH:i:20
readNameHere 272 chrX 89494814 46 26M * 0 0 * * NM:i:8 AM:i:46 XM:i:2 NH:i:20
readNameHere 272 chrY 3677157 46 26M * 0 0 * * NM:i:8 AM:i:46 XM:i:2 NH:i:20
readNameHere 272 chr12 126969193 46 4M2D22M * 0 0 * * NM:i:6 AM:i:46 XM:i:2 NH:i:20
readNameHere 256 chr21 31585993 45 5M1I20M * 0 0 * * NM:i:5 AM:i:45 XM:i:2 NH:i:20
readNameHere 256 chr5 80894028 45 7M1I18M * 0 0 * * NM:i:6 AM:i:45 XM:i:2 NH:i:20
readNameHere 272 chr6 85600169 45 26M * 0 0 * * NM:i:7 AM:i:45 XM:i:2 NH:i:20
readNameHere 272 chr9 81596079 44 18M1I7M * 0 0 * * NM:i:6 AM:i:44 XM:i:2 NH:i:20
readNameHere 256 chr3 170762912 43 1M3I22M * 0 0 * * NM:i:7 AM:i:43 XM:i:2 NH:i:20
readNameHere 256 chr13 20507692 42 26M * 0 0 * * NM:i:6 AM:i:42 XM:i:2 NH:i:20
readNameHere 256 chr12 28169470 42 5M2I19M * 0 0 * * NM:i:6 AM:i:42 XM:i:2 NH:i:20
readNameHere 256 chr7 42988607 41 26M * 0 0 * * NM:i:7 AM:i:41 XM:i:2 NH:i:20
readNameHere 256 chr3 103868859 41 26M * 0 0 * * NM:i:8 AM:i:41 XM:i:2 NH:i:20
readNameHere 272 chrX 45152661 40 26M * 0 0 * * NM:i:7 AM:i:40 XM:i:2 NH:i:20
readNameHere 272 chr18 60932877 39 26M * 0 0 * * NM:i:8 AM:i:39 XM:i:2 NH:i:20
Concerning XM: The readme says "Indicates number of best alignments". But what are these exactly? I expected all of the 26M alignments from my example to be best alignments?! Or does it mean that there are mismatches in some of these alignments which I don't see, as I didn't change the samversion parameter to 1.4? (btw: I have samtools 0.1.19 installed. It is able to handle '=' and 'X', isn't it?)

Concerning XS: I ran bbmap using xstag=unstranded, as I read somewhere in this thread, and I want to use Cufflinks afterwards. I thought that the '+' and '-' values in this field refer to the strand, and I indeed have '+' and '-' values set in different reads. If this is true, what is the difference to xstag=true?

Concerning YS: This stores the end position of a read? So it is pos + the alignment length of the read?

Concerning YI: The identity is simply the number of (mismatches + indels) / read length?


Thanks a lot in advance for your help and this amazing amount of possibilities :D!

WhatsOEver 05-23-2014 08:20 AM

Dear Brian,

I have yet another issue ;)

Looking through the readme I saw that the standard setting for stoptag and idtag is true. However, looking at my sam file I couldn't find any YS or YI fields. I therefore tried to set the parameters explicitly in the start command, but this resulted in an error:

Quote:

/projects/bbmap$ ./bbmap ../bbAlign/DNAseq_140306.fastq
java -ea -Xmx56g -cp /projects/bbmap/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 k=14 t=16 in=../bbAlign/DNAseq_140306.fastq out=/projects/bbAlign/bbmap_DNAseq_140306_mapped.sam xstag=unstranded xmtag=t stoptag=t idtag=t ambig=all maxindel=10 qin=33 samversion=1.4 qhist=/projects/bbAlign/bb_qhist.txt mhist=/projects/bbAlign/bb_mhist.txt qahist=/projects/bbAlign/qahist.txt ihist=/projects/bbAlign/ihist.txt bhist=/projects/bbAlign/bhist.txt scafstats=/projects/bbAlign/scafStats refStats=/projects/bbAlign/refStats.txt -Xmx56g
Executing align2.BBMapPacBio [build=1, overwrite=true, minratio=0.40, fastareadlen=6000, ambiguous=best, minscaf=100, startpad=10000, stoppad=10000, midpad=6000, k=14, t=16, in=../bbAlign/DNAseq_140306.fastq, out=/projects/bbAlign/bbmap_DNAseq_140306_mapped.sam, xstag=unstranded, xmtag=t, stoptag=t, idtag=t, ambig=all, maxindel=10, qin=33, samversion=1.4, qhist=/projects/bbAlign/bb_qhist.txt, mhist=/projects/bbAlign/bb_mhist.txt, qahist=/projects/bbAlign/qahist.txt, ihist=/projects/bbAlign/ihist.txt, bhist=/projects/bbAlign/bhist.txt, scafstats=/projects/bbAlign/scafStats, refStats=/projects/bbAlign/refStats.txt, -Xmx56g]

BBMap version 32.14
Set overwrite to true
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.400
Set threads to 16
Set fastq input ASCII offset to 33
Set quality histogram output to /projects/bbAlign/bb_qhist.txt
Set match histogram output to /projects/bbAlign/bb_mhist.txt
Set quality accuracy histogram output to /projects/bbAlign/qahist.txt
Set insert size histogram output to /projects/bbAlign/ihist.txt
Set base content histogram output to /projects/bbAlign/bhist.txt
Scaffold statistics will be written to /projects/bbAlign/scafStats
Reference set statistics will be written to /projects/bbAlign/refStats.txt
Retaining all best sites for ambiguous mappings.
Set genome to 1

Loaded Reference: 6.389 seconds.
Loading index for chunk 1-7, build 1
Generated Index: 8.479 seconds.
Analyzed Index: 23.584 seconds.
Started output stream: 0.014 seconds.
Creating scaffold statistics table: 0.001 seconds.
Cleared Memory: 0.506 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 16 mapping threads.
Exception in thread "Thread-11" java.lang.NullPointerException
at stream.SamLine.<init>(SamLine.java:1130)
at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:154)
at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:19)
Exception in thread "Thread-24" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-21" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-19" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-12" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Detecting finished threads: 0Exception in thread "Thread-25" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-18" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-23" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-27" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-15" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-17" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-20" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-13" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
, 1Exception in thread "Thread-26" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-22" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-14" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
, 2, 3Exception in thread "Thread-16" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

**************************************************************************

Warning! 16 mapping threads did not terminate normally.
Please check the error log; the output may be corrupt or incomplete.

**************************************************************************




------------------ Results ------------------

Genome: 1
Key Length: 14
Max Indel: 10
Minimum Score Ratio: 0.4
Mapping Mode: normal
Reads Used: 850 (150375 bases)

Mapping: 16.779 seconds.
Reads/sec: 50.66
kBases/sec: 8.96


Read 1 data: pct reads num reads pct bases num bases

mapped: 71.8824% 611 84.5406% 127128
unambiguous: 63.6471% 541 75.8251% 114022
ambiguous: 8.2353% 70 8.7155% 13106
low-Q discards: 0.0000% 0 0.0000% 0

perfect best site: 0.0000% 0 0.0000% 0
semiperfect site: 0.0000% 0 0.0000% 0

Match Rate: NA NA 24.8088% 109654
Error Rate: 73.7923% 611 75.1826% 332304
Sub Rate: 72.9469% 604 1.8880% 8345
Del Rate: 59.6618% 494 71.2368% 314864
Ins Rate: 71.9807% 596 2.0577% 9095
N Rate: 4.1063% 34 0.0086% 38
Splice Rate: 14.8551% 123 (splices at least 10 bp)
Exception in thread "main" java.lang.RuntimeException: BBMap terminated in an error state; the output may be corrupt.
at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:407)
at align2.BBMapPacBio.main(BBMapPacBio.java:33)
When I run it without those two parameters, everything works fine. My dataset consists of 1.6M single-end 454 reads, if that is important.

Brian Bushnell 05-23-2014 11:14 AM

WhatSoEver,

For the first problem, try setting the flag "sam=1.4". By default, cigar strings are presented in SAM 1.3 format, which uses the 'M' symbol. That does not mean match - it means match OR mismatch! SAM format 1.4 fixed this bad design decision by using '=' for match and 'X' for mismatch, but by default I output SAM format 1.3 because many downstream tools (such as old versions of Samtools) cannot process format 1.4. The latest Samtools can, though. If you set "sam=1.4" then it will be more obvious how many substitutions the reads have. The NM tag tells you the edit distance, but BBMap does not determine the score directly from edit distance; it uses affine penalties, so 4 consecutive substitutions will be penalized less than 4 non-adjacent substitutions, for example.

Please note that neither XM nor XS is defined in the SAM spec; both are bowtie/tophat tags, and XM in particular is poorly defined. For that tag, I output the number of alignments with scores EXACTLY equal to the top score, even though when you set "ambig=all" I will output sites with scores that are within a certain threshold of the top score. So there can be more alignments printed than the value of XM.

For the XS tag, you can set it to "firststrand", "secondstrand", or "unstranded". But internally, "true" is equivalent to "firststrand", which is equivalent to "unstranded"; all produce identical output. "secondstrand" produces the opposite sign of "firststrand". For firststrand/unstranded, if read 1 maps to the plus strand, it gets "XS:A:+"; for secondstrand, it gets "XS:A:-". Read 2 always gets the opposite of read 1.

The YS tag is the stop position - specifically, the reference location to which the rightmost base maps. So yes, it is pos+alignment length.

The YI tag is a bit more complex. Identity can be calculated various ways; unfortunately, many ways overly penalize long deletions. So actually it is:
(matches)/(matches+substitutions+Ns+insertions+f(deletions)). If f(deletions) was just the number of deletions, then it would be matches/(total number of cigar operations). But then a 100bp read with 50bp match, 400bp deletion, 50bp match would get an identity of 100/(100+400) = 20%. 20% identity implies worse than even the alignment of two completely random sequences (which is expected to yield over 25%), so it's misleading, because a read with two nearby 50bp consecutive matches to another string is actually a very good alignment that could not have arisen by chance. So, I actually take the square root of the length of each individual deletion event. In this case, you would get:
(100)/(100+sqrt(400)) = 100/120= 83.3% identity.
If there were 50 individual single-base-pair deletions, however, the identity would be:
(100)/(100+50*sqrt(1)) = 100/150 = 66.7% identity, which is lower, because it's more likely to have arisen by chance.
This method of identity calculation is not perfect, of course, and incompatible with most other methods, but there is no single standard of which I am aware, and I think the results of this method are more informative than others, which are biased toward edit distance rather than modelling biological events.
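
For illustration only - an untested sketch, not BBMap's actual code - the example above (50bp match, 400bp deletion, 50bp match) could be written as the cigar string 50=400D50= and scored like this, treating '=' as match, 'D'/'N' as deletion events (square-rooted per event) and other operators per base:

Code:

cigar="50=400D50="
echo "$cigar" | awk '{
  n=""
  for(i=1; i<=length($0); i++){
    c=substr($0,i,1)
    if(c ~ /[0-9]/){ n = n c; continue }
    len = n + 0; n = ""
    if(c == "=")                  m   += len        # matches
    else if(c == "D" || c == "N") del += sqrt(len)  # sqrt of each deletion event
    else                          oth += len        # subs, insertions, etc., per base
  }
  printf "identity = %.1f%%\n", 100 * m / (m + oth + del)
}'

This prints 83.3%, matching the hand calculation above.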

For the crash bug, that's my fault for not testing the interaction between "ambig=all" and "stoptag"; they don't work together. I'll fix that. In the meantime, you can set "saa=f" (secondaryalignmentasterisks=false) which will circumvent the problem.

Thanks for finding that bug, and I'm glad you're finding BBMap useful! One suggestion, though - if you are doing RNA-seq on organisms with introns, I suggest you set the "intronlength" and "maxindel" flags. I normally use "intronlength=10". This does not change mapping at all; it just changes cigar strings so that deletions longer than 10bp will be reported using the symbol 'N' instead of 'D'. That may make a difference to some downstream processing tools, if they only interpret 'N' as intron. As for maxindel, introns longer than maxindel will not be searched for. In plants or fungi, I suggest setting this to at least 16000 (that's the default for bbmap, but the pacbio modes have a much shorter default of 100); for vertebrates, I suggest 200000. I'm not really sure about invertebrates. But try to set it at around the 99th percentile of intron sizes in that organism, if you have some idea of what that might be. The lower the better (faster, more accurate), as long as it encompasses the vast majority of introns.
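
For example, a vertebrate RNA-seq run with those two flags might look something like this (file names are placeholders):

Code:

bbmap.sh in=reads.fq out=mapped.sam ref=genome.fa intronlen=10 maxindel=200000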

WhatsOEver 05-26-2014 03:42 AM

Thanks Brian, that helps a lot! Many thanks!! :)

There are just two little things I'd like to clarify:

---------------------------------------------------------------------------------------

Quote:

The YI tag is a bit more complex. Identity can be calculated various ways; unfortunately, many ways overly penalize long deletions. So actually it is:
(matches)/(matches+substitutions+Ns+insertions+f(deletions)). If f(deletions) was just the number of deletions, then it would be matches/(total number of cigar operations). But then a 100bp read with 50bp match, 400bp deletion, 50bp match would get an identity of 100/(100+400) = 20%. 20% identity implies worse than even the alignment of two completely random sequences (which is expected to yield over 25%), so it's misleading, because a read with two nearby 50bp consecutive matches to another string is actually a very good alignment that could not have arisen by chance. So, I actually take the square root of the length of each individual deletion event. In this case, you would get:
(100)/(100+sqrt(400)) = 100/120= 83.3% identity.
If there were 50 individual single-base-pair deletions, however, the identity would be:
(100)/(100+50*sqrt(1)) = 100/150 = 66.7% identity, which is lower, because it's more likely to have arisen by chance.
This method of identity calculation is not perfect, of course, and incompatible with most other methods, but there is no single standard of which I am aware, and I think the results of this method are more informative than others, which are biased toward edit distance rather than modelling biological events.
I'm not sure I understand the meaning of f(deletions). Does it just mean that each run of consecutive deletions is square-rooted and the results are summed up?
Like in the following example from my data:

Quote:

read1
length: 202 (+11 Deletions)
cigar: 5=6X1=1X2=1X1=1I4=10D2=1D2=1X1=1X6=1I9=1I19=1I79=1I56=

Matches: 5+1+2+1+4+2+2+1+6+9+19+79+56=187
Mismatches: 6+1+1+1+1=10
Insertions: 1+1+1+1+1=5
Deletions: 10+1=11

Identity: 187/(187+10+5+11) = 87.8% (this is the "standard" calculation)
Identity: 187/(187+10+5+sqrt(10)+sqrt(1))= 90.7% (this is your calculation?)
If that is how it works, why don't you treat insertions in the same way? Like in:

Quote:

read2
length: 264 (+2 Deletions)
cigar: 1X2=2X1=2X4=3X3=20I8=1D19=1D157=1X28=1X3=1X1=3I3=1X

Matches: 2+1+4+3+8+19+157+28+3+1+3=229
Mismatches: 1+2+2+3+1+1+1+1=12
Insertions: 20+3=23
Deletions: 1+1=2

Identity: 229/(229+12+23+2*sqrt(1)) = 86.1% (This is your calculation?)
Identity: 229/(229+12+sqrt(20)+sqrt(3)+2*sqrt(1)) = 91.9% (Why not this way?)
---------------------------------------------------------------------------------------

Quote:

Thanks for finding that bug, and I'm glad you're finding BBMap useful! One suggestion, though - if you are doing RNA-seq on organisms with introns, I suggest you set the "intronlength" and "maxindel" flags. I normally use "intronlength=10". This does not change mapping at all; it just changes cigar strings so that deletions longer than 10bp will be reported using the symbol 'N' instead of 'D'. That may make a difference to some downstream processing tools, if they only interpret 'N' as intron. As for maxindel, introns longer than maxindel will not be searched for. In plants or fungi, I suggest setting this to at least 16000 (that's the default for bbmap, but the pacbio modes have a much shorter default of 100); for vertebrates, I suggest 200000. I'm not really sure about invertebrates. But try to set it at around the 99th percentile of intron sizes in that organism, if you have some idea of what that might be. The lower the better (faster, more accurate), as long as it encompasses the vast majority of introns.
intronlength=10 is the default value, isn't it? I didn't set it, but I can't find any cigar strings with deletions longer than 10D, nor any N entries shorter than 11N.
And one last question out of interest, as I worked in fungal genomics before I switched to humans: why would you suggest a maxindel of 16000 for fungi? From what I analyzed (comparative genomics of 200+ asco- and basidiomycetes), introns are rarely longer than 100bp, and I never saw one longer than 3kb (though at the moment I'm not sure whether that one was even mitochondrial).

Brian Bushnell 05-26-2014 08:02 AM

WhatSoEver,

If you have a cigar string like this:

20=1I20=1D20=25D20=20I20=

...I would calculate:
(20+20+20+20+20) matches / (20+1+20+root(1)+20+root(25)+20+20+20)
=
(100)/(20+1+20+1+20+5+20+20+20)
=
(100)/(127) = 78.7%

I know that's asymmetric - if there's a deletion, aligning the reference to the read would give a different identity than aligning the read to the reference - but there are a couple of reasons for that.

First, you can get deletions of unbounded length in cigar strings, but insertions can never be longer than the read length, so it is (according to my logic) important to correct for deletions but not really anything else.

Second, insertions occur INSTEAD of matches, while deletions occur IN ADDITION to matches. So if a 100bp read has a 1bp match and a 99bp insertion, does it make sense to call that 1/(1+root(99)) = about 10% identity? Not really. But with a 100bp deletion in the middle, you'd have 100/110 = about 90% identity, which is maybe a bit high, but it seems reasonable for an alignment with two nearby 50bp consecutive matches, which can't arise by chance.

Third, it seems to me that long deletions are more common than long insertions, but that could be discovery bias.

Lastly, decreasing the identity penalty of insertions would cause a particular problem. Let's say you have a chimeric 200bp read, 100bp from somewhere in chromosome 1 and 100bp from chromosome 2. You could map it in two places, with cigar strings of 100=100I and 100I100=, or 100=100X and 100X100=. My current formula would give that read 50% identity for either location, whether it used the X or I symbol for the chimeric bases. But if you take the root of the insertions, then you get 100/(100+10) = 91% identity to two different sequences that have nothing in common, which does not make sense.

It could be that calculating identity this way is not a good idea; I may put in a flag to remove the sqrt from deletions, but it seems to me that it gives a more useful answer in terms of short-read alignment when doing recruitment studies, which is what I actually put the idtag in for.

Quote:

Originally Posted by WhatsOEver (Post 141212)
intronlength=10 is the default value, isn't it? I didn't set it, but I can't find any cigar strings with deletions longer than 10D, nor any N entries shorter than 11N.

By default, it is disabled. However, if you set other flags like "xstag" that are used by Tophat, and you don't explicitly set "intronlen", it will be automatically enabled and set to 10.

Quote:

And one last question out of interest, as I worked in fungal genomics before I switched to humans: why would you suggest a maxindel of 16000 for fungi? From what I analyzed (comparative genomics of 200+ asco- and basidiomycetes), introns are rarely longer than 100bp, and I never saw one longer than 3kb (though at the moment I'm not sure whether that one was even mitochondrial).
Well, I said "for plants and fungi". I work with fungi and the introns I've seen tend to be 300bp or less, and some of the plants I've seen also seem to have short introns of typically 250bp or so. But I don't know if there might be some out there that are different! The main reason is that very high settings of 'maxindel' (over 100kb) decrease speed and accuracy, but low settings like 16000 don't really have a noticeable effect on either yet it COULD let you discover something new. So I generally run with 16k as the default for BBMap even on DNA. The PacBio version has a lower default because PacBio reads are very long and have a high error rate (which exacerbates the effect on speed and accuracy), and we don't do RNA-seq for PacBio right now.

Mchicken 05-27-2014 05:21 AM

Hey Brian,

I just briefly checked BBMap and have some questions.

1) I am working, among other things, with RNA-Seq data from a cyanobacterium. Is it possible that BBMap performs spliced alignments for my RNA-Seq data? In a small run I did not observe spliced alignments, but I was wondering about it nevertheless. If so, can I prevent this by setting 'intronlen' to 0?

2) Is there a way to apply soft- or hard-clipping?
For example, bwa soft-clips a read to 94M6S where BBMap reports 100M, although the last 6 bases are partly mismatches and I therefore want them to be clipped.

3) Is there a way to set the maximum number of mismatches?

4) For CRAC and PAR-CLIP sequencing data I want to allow only 1 or 2 mismatches/indels, respectively. So if I want at most 2 indels, I have to set maxindel=2 and strictmaxindel=true. Am I right?

5) Does BBMap report only the best alignment (or the best alignments, if several with equal quality are present), or is there a way to also report alignments that are not as good as the best one? Would this be the maxsites option?

Thanks for your nice support!

Mchicken

Brian Bushnell 05-27-2014 09:47 AM

Mchicken,

1) 'intronlen' does not actually affect alignment, just the way splices are reported - deletions shorter than 'intronlen' will have a 'D' symbol in the cigar string, while longer ones will have an 'N' symbol. To prevent gapped alignments, you should set 'maxindel=0', which will prevent them from being looked for, though some may still be found (or 'strictmaxindel=0' to absolutely ban all alignments with indels). But prokaryotes do have some self-splicing; these are typically very short, like 20bp or so, I think. So you may want to set 'maxindel=100' or something like that, which will prevent long ones.

2) Soft-clipping is applied by default if a read goes off the end of a scaffold. You can also force it on alignments like the one you describe with the flag 'local=t'. Also, note that if you use the flag 'sam=1.4' then cigar strings will be generated with '=' for match and 'X' for mismatch, instead of 'M' for both.

3) You can use a flag like 'minid=98' to prevent alignments with lower than approximately 98% identity, which would be 2 mismatches for a 100bp read, but substitutions, deletions, and insertions are all scored differently, and whether they are contiguous or scattered also affects the score, so this is not exact. BBMap does not have any way to ensure that the best alignment with at most X mismatches is returned, like Bowtie 1 can do, because the scoring is fundamentally different.

4) "maxindel" actually controls the length of individual indels, not the number of them, which is not controlled. There's also a "maxindel2" flag that controls the sum of the length of all indels, which by default is set to double maxindel. So if you did this:
maxindel=10 maxindel2=15 strictmaxindel
...then individual indels could be up to 10bp; the sum of the length of indels could be up to 15bp; and any alignment with a single indel longer than 10bp would be banned. But there is no way to limit the total number of indel events.

5) By default the best alignment is reported, and if there are multiple equally-scoring alignments, the first one (in genomic coordinates) is reported, but it is marked as ambiguous (XT:A:R tag). This corresponds to the setting 'ambig=best'. You can alternatively set 'ambig=random' or 'ambig=all' (which is what you are looking for). 'maxsites' will limit the total number of alignments displayed in 'ambig=all' mode, but won't have any effect in default mode.
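
For instance, something like this (placeholder file names; the maxsites value is arbitrary) would print all near-best sites but cap how many are written per read:

Code:

bbmap.sh in=reads.fq out=mapped.sam ref=ref.fa ambig=all maxsites=5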

Thanks for using BBMap!

-Brian

simono101 05-28-2014 03:41 AM

Escaping spaces using the shell script
 
Hi Brian,
I can't seem to work out how to pass a filepath with spaces in it to BBMap using the shell script. Nothing that I normally would use seems to work...

e.g.

Code:

~/bbmap/bbmap.sh ref="../Reference\ genome/myref.fasta" build=2 in1=./Reads1.fastq.gz in2=./Reads2.fastq.gz out=./mapped.sam
Yields the error

Code:

Exception in thread "main" java.lang.RuntimeException: Unknown parameter: genome/myref.fasta
Is there a way to pass a string with escaped spaces to bbmap using the shell script or do I need to call bbmap directly?

Thanks!

GenoMax 05-28-2014 04:27 AM

Have you pre-built the index for "myref.fasta"?
Code:

$ bbmap.sh ref=/path_to_your_genome.fa
This step has to be done independently of the mapping step, AFAIK.

GenoMax 05-28-2014 04:32 AM

On second reading of your post, it appears that you have a space in your directory name and you are asking about escaping that? Let me test.

Edit: Standard ways of escaping spaces in the shell script are not working, but directly running the java command (escaping the space) is working.

Brian should be able to provide the final answer.

simono101 05-28-2014 06:38 AM

@GenoMax,
Thanks, yes your second post is exactly what I was referring to. I can't seem to find a way to escape the spaces in a directory path (but I too successfully called it directly rather than through the shell script).

Brian Bushnell 05-28-2014 08:53 AM

simono,

In Windows, you can handle that with quotes, e.g. ref="/a/path with spaces/x.fa"

In Linux... I don't know how to do it easily. Perhaps the easiest way would be to make a symlink that doesn't have any spaces. Alternatively, as in this article, you could export the path as an environment variable.
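
For example (untested, using the paths from your post), the symlink approach would look like this:

Code:

ln -s "../Reference genome" refdir
~/bbmap/bbmap.sh ref=refdir/myref.fasta build=2 in1=./Reads1.fastq.gz in2=./Reads2.fastq.gz out=./mapped.sam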

-Brian

Thias 06-02-2014 03:50 AM

Hello Brian,

I need someone to give me a broad hint. I have already successfully indexed some eukaryotic genomes with BBMAP. But a PhD student in our lab has sequenced a custom shRNA library, and I just wanted to use my usual aligner and simply lower the k-mer size. But it claims that it cannot read the FASTA with the reference sequences, although it is there. Is there a lower limit on scaffold sizes (or a special parameter which I didn't set) which may cause BBMAP to abort with that error, or is the FASTA file really somehow corrupted?

Code:

java -da -Xmx32g -cp /home/z/zepper/software/bbmap/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 ref=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index build=1 threads=4 -Xmx32g midpad=2000 k=5 minscaf=5
Executing align2.BBMap [build=1, overwrite=true, match=long, fastareadlen=500, ref=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa, path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index, build=1, threads=4, -Xmx32g, midpad=2000, k=5, minscaf=5]

BBMap version 31.56
Set OVERWRITE to true
Cigar strings enabled.
Set threads to 4
Retaining first best site only for ambiguous mappings.
No output file.
Exception in thread "main" java.lang.RuntimeException: Cannot read file /mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa
        at align2.RefToIndex.makeIndex(RefToIndex.java:28)
        at align2.BBMap.setup(BBMap.java:238)
        at align2.AbstractMapper.<init>(AbstractMapper.java:56)
        at align2.BBMap.<init>(BBMap.java:39)
        at align2.BBMap.main(BBMap.java:28)
[zepper@zivsmp001 Pia]$ head /mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa
>Acta1_Mmi503488
TGCTGTACAGGTCCTTCCTGATGTCGGTTTTGGCCACTGACTGACCGACATCAAAGGACCTGTA
>Acta1_Mmi503489
TGCTGTCTGCATGCGGTCAGCGATACGTTTTGGCCACTGACTGACGTATCGCTCCGCATGCAGA
>Acta1_Mmi503490
TGCTGAAGAGCGGTGGTCTCGTCTTCGTTTTGGCCACTGACTGACGAAGACGACCACCGCTCTT
>Ahr_Mmi503972
TGCTGATTACAGGGAGCAAAGTTCTGGTTTTGGCCACTGACTGACCAGAACTTCTCCCTGTAAT
>Ahr_Mmi503973
TGCTGTATGGATGAGCTCATATACGCGTTTTGGCCACTGACTGACGCGTATATGCTCATCCATA


GenoMax 06-02-2014 04:07 AM

Quote:

Originally Posted by Thias (Post 141804)
Hello Brian,

I need someone to give me a broad hint. I have already successfully indexed some eukaryotic genomes with BBMAP. But a PhD student in our lab has sequenced a custom shRNA library, and I just wanted to use my usual aligner and simply lower the k-mer size. But it claims that it cannot read the FASTA with the reference sequences, although it is there. Is there a lower limit on scaffold sizes (or a special parameter which I didn't set) which may cause BBMAP to abort with that error, or is the FASTA file really somehow corrupted?

Code:

java -da -Xmx32g -cp /home/z/zepper/software/bbmap/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 ref=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index build=1 threads=4 -Xmx32g midpad=2000 k=5 minscaf=5
Executing align2.BBMap [build=1, overwrite=true, match=long, fastareadlen=500, ref=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa, path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index, build=1, threads=4, -Xmx32g, midpad=2000, k=5, minscaf=5]

Perhaps because you are using both ref= and path= options?

Since you have pre-built the BBMap index you can just use path= to point to the indexes.

Thias 06-02-2014 05:45 AM

@GenoMax: Thanks for your answer, but unfortunately this can't be the problem.

The path= option points to an empty directory, because I do not want to add the shRNA sequences to my regular set of indexed genomes. However, the syntax with ref= and path= is exactly the syntax I used to successfully index the eukaryotic genomes before. (I used the standard k-mer size of 13 and a midpad size of 200000 for the genomes, though.)

Brian Bushnell 06-02-2014 09:40 AM

Thias,

That specific error is triggered by:

File f=new File(reference);
if(!f.exists() || !f.isFile() || !f.canRead()){
    throw new RuntimeException("Cannot read file "+reference);
}


So according to Java, either it does not exist, is not a file, or cannot be read. Perhaps the permissions are wrong, or the metadata on the file is wrong? It triggered before even looking inside the file, so the data is probably not corrupted.

You could technically do this:

cat /mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa |
java -da -Xmx32g -cp /home/z/zepper/software/bbmap/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 ref=stdin.fa path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index build=1 threads=4 -Xmx32g midpad=2000 k=5 minscaf=5

...which would circumvent Java trying to read the file. But I'd like to hear if you discover anything odd about the file permissions or metadata (like if it claims to be a directory, or something).
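
For example, these standard Linux commands (using the path from your log) would show the permissions and what the OS thinks the file is:

Code:

ls -l /mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa
file /mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa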

WhatsOEver 06-03-2014 06:01 AM

Dear Brian,

I'm coming to you with another problem (bug?) that I cannot find an answer to. I ran an RNAseq mapping with bbmap and afterwards got the following results:

Quote:

Genome: 1
Key Length: 14
Max Indel: 200000
Minimum Score Ratio: 0.4
Mapping Mode: normal
Reads Used: 702748 (302588801 bases)

Mapping: 2780.014 seconds.
Reads/sec: 252.79
kBases/sec: 108.84


Read 1 data: pct reads num reads pct bases num bases

mapped: 98.1544% 689778 98.8899% 299229721
unambiguous: 96.8285% 680460 97.9391% 296352687
ambiguous: 1.3259% 9318 0.9508% 2877034
low-Q discards: 0.0011% 8 0.0001% 185

...
As I don't like the samtools flagstat output, I wrote my own samfile-parser in order to compare the output of different mappers and different parameters used for mapping. From the samfile that was produced by bbmap, my parser returned the following

Quote:

Found 762672 entries in the samfile.
Identified 702748 different reads.
663897 of the reads have unique mapping positions.
25881 of the reads have 85805 ambiguous mapping positions.
12970 of the reads did not map to the reference.
Elapsed seconds: 4
I verified these results by counting the individual bit-flags.
As this is quite different from your program's output (except for the unmapped reads), I'm wondering where the mistake is?

Cheers,
W

Brian Bushnell 06-03-2014 10:10 AM

W,

I use the "XT:A:R" tag to indicate that a read is ambiguously mapped and XT:A:U to indicate unambiguously mapped. If you say "ambig=all" then the threshold for printing secondary alignments is more liberal than the threshold for deciding an alignment is ambiguous. So, even "unambiguously" aligned reads will sometimes have secondary sites printed - the best one, and others that are "decent" even though they are sufficiently worse than the best site that I classify the best site as unambiguous.

So if you add up all of the primary alignments with XT:A:U, you should get the same numbers as the program reports. Sorry that it's a custom tag (that is also used by bwa), but there's nothing in the official flag bits or official tags to indicate whether an alignment is ambiguous, and the SAM spec doesn't say anything about the presence of secondary alignments indicating ambiguity. It's really a subjective judgement.
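
For example, a rough way to do that tally from the sam file (an untested sketch; mapped.sam is a placeholder name, and the awk condition just masks out the unmapped (0x4) and secondary (0x100) flag bits) would be:

Code:

awk '!/^@/ && int($2/4)%2==0 && int($2/256)%2==0' mapped.sam > primary.sam
grep -c "XT:A:R" primary.sam     # primary alignments marked ambiguous
grep -vc "XT:A:R" primary.sam    # remaining mapped primary alignments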

WhatsOEver 06-04-2014 12:44 AM

Sorry, but I don't get it completely.

Quote:

Originally Posted by Brian Bushnell (Post 141953)
I use the "XT:A:R" tag to indicate that a read is ambiguously mapped and XT:A:U to indicate unambiguously mapped. If you say "ambig=all" then the threshold for printing secondary alignments is more liberal than the threshold for deciding an alignment is ambiguous. So, even "unambiguously" aligned reads will sometimes have secondary sites printed - the best one, and others that are "decent" even though they are sufficiently worse than the best site that I classify the best site as unambiguous.

What is your definition of an "unambiguously mapped read" if it has secondary alignments? If secondary alignments are printed as a result of a more liberal threshold, the read is no longer unambiguous under the threshold that was used, is it? This should be independent of how bad your secondary alignments are; the chosen threshold shouldn't change the terminology, should it?
So in my case I have ~16K reads which are "unambiguous" according to your definition but have secondary alignments?

Quote:

Originally Posted by Brian Bushnell (Post 141953)
So if you add up all of the primary alignments with XT:A:U, you should get the same numbers as the program reports.

You are right - if I count XT:A:R, I get the same number as the program output (the XT:A:U tag is, however, not visible/set? - but the remaining reads should of course be the unambiguous ones)

Quote:

Originally Posted by Brian Bushnell (Post 141953)
Sorry that it's a custom tag (that is also used by bwa), but there's nothing in the official flag bits or official tags to indicate whether an alignment is ambiguous, and the SAM spec doesn't say anything about the presence of secondary alignments indicating ambiguity. It's really a subjective judgement.

Yes, I know that and I really think that this is a confusing point in the SAM spec. But no, I don't think that it's a subjective judgement just because the word "ambiguity" is not explicitly used in the specs.
Or is it just my German misinterpretation of the strictness of the word ambiguity? :confused: ;)

But being aware of that also helps - thanks again, Brian, for a fast and helpful response. I will now stop my work on mapper evaluation for my projects and focus on the next steps, so I won't bother you with additional things for a while :D

punto_c 06-04-2014 01:54 AM

Hello Brian

I am trying to compare some mapping tools, and I am also using BBMap.
I executed 4 tests using PE reads preprocessed with BBDuk, cutadapt, fastx and Trimmomatic, respectively.
With the reads preprocessed using BBDuk and Trimmomatic, the program is able to finish without any error.

When I run the reads preprocessed with cutadapt or fastx, bbmap crashes...

Do you have any idea why this is happening?

Here are the parameters I used and the stderr output:
overwrite=true fastareadlen=500 -Xmx200g threads=96 outputunmapped=f sam=1.4

BBMap version 32.15
Set overwrite to true
Set threads to 96
Set DONT_OUTPUT_UNMAPPED_READS to true
Retaining first best site only for ambiguous mappings.
Set genome to 1

Loaded Reference: 9.674 seconds.
Loading index for chunk 1-7, build 1
Generated Index: 15.609 seconds.
Analyzed Index: 22.798 seconds.
Started output stream: 0.345 seconds.
Started output stream: 0.014 seconds.
Cleared Memory: 0.146 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 96 mapping threads.
Exception in thread "Thread-110" java.lang.IndexOutOfBoundsException: Index: 172, Size: 172
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:484)
at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:361)
at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:205)
at java.lang.Thread.run(Thread.java:679)

Brian Bushnell 06-04-2014 09:57 AM

Salvatore,

That error is because there are unequal numbers of reads in the read1 and read2 files (I'll add a better error message that explains it). This happens when a read-trimming tool throws away one read in a pair because it's too short, but not the other. This also ruins the pairing order, so mapping paired reads will no longer work correctly. BBDuk and Trimmomatic both keep paired reads together correctly.

This is a known problem with fastx; you basically should not use it with paired reads. I think cutadapt is able to handle pairs correctly when you use the "-p" flag, but I'm not sure.

You can re-pair the broken files with my repair.sh script, in the latest release of BBTools (32.27), like this:
repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq

Brian Bushnell 06-04-2014 10:23 AM

Quote:

Originally Posted by WhatsOEver (Post 141980)
Sorry, but I don't get it completely.

What is your definition of an "unambiguously mapped read" if it has secondary alignments? If secondary alignments are printed as a result of a more liberal threshold, the read is no longer unambiguous under the threshold that was used, is it? This should be independent of how bad your secondary alignments are; the chosen threshold shouldn't change the terminology, should it?
So in my case I have ~16K reads which are "unambiguous" according to your definition but have secondary alignments?

I wrote a function that determines whether or not I classify a read as ambiguous. It takes into account various factors:
1) how high the best score is
2) the difference between the best score and the second-best score
3) how many additional sites are within a certain range of the best score
It's not a closed-form equation, and it's completely subjective (that's what I meant by subjective, not the SAM spec); I optimized it empirically by iteratively tweaking it to minimize false-positives and maximize true-positives. And that alone decides whether or not I classify a read as "ambiguous".

When "ambig=all" is chosen, all sites with a score of at least 95% of the best score are printed, whether or not the best site is considered ambiguous. This is a bit different. I think I'll change it to make this number a parameter, and perhaps only print secondary sites for reads that are considered ambiguous, which would make more sense.

Quote:

But being aware of that also helps - thanks again, Brian, for a fast and helpful response. I will now stop my work on mapper evaluation for my projects and focus on the next steps, so I won't bother you with additional things for a while :D
Feel free to post whatever questions you have; they may be shared by other people, and may indicate that I could be doing something better or more clearly.

punto_c 06-04-2014 07:01 PM

thank you
 
Quote:

Originally Posted by Brian Bushnell (Post 142048)
Salvatore,

That error is because there are unequal numbers of reads in the read1 and read2 files (I'll add a better error message that explains it). This happens when a read-trimming tool throws away one read in a pair because it's too short, but not the other. This also ruins the pairing order, so mapping paired reads will no longer work correctly. BBDuk and Trimmomatic both keep paired reads together correctly.

This is a known problem with fastx; you basically should not use it with paired reads. I think cutadapt is able to handle pairs correctly when you use the "-p" flag, but I'm not sure.

You can re-pair the broken files with my repair.sh script, in the latest release of BBTools (32.27), like this:
repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq

Thank you a lot, Brian.
Your explanation was, as always, very simple and clear.
I could trim in PE mode using Cutadapt and then was able to map the reads.
Nothing could be done for Fastx.

thanks again

Salvatore

coryfunk 06-25-2014 06:17 PM

Brian,

I'm wondering if you can point me to, or provide me with, some recommended parameters for running shRNAseq datasets through BBMap.

My initial efforts using BBMap with the default parameters did yield a sam file, but when I converted it to a sorted bam file and attempted to use it in IGV, it said "does not contain any sequence names which match the current genome."

I'm pretty sure I'm overlooking something. I did create the indices with the miRBase mature.fa file. That seemed to work fine, but I noticed that the output said:

No index available; generating from reference genome:

Any help would be appreciated.

Brian Bushnell 06-25-2014 11:09 PM

coryfunk,

IGV is very sensitive to the names of scaffolds; they must be identical in the fasta used for mapping and the fasta used by IGV. When I encounter that error, "does not contain any sequence names which match the current genome", it's usually because I'm using the wrong reference. So if, for example, you are mapping to a fasta of RNAs and trying to display the results in IGV against the full genome, that won't work.

The methodology would be something like this:

index:
bbmap.sh ref=mature.fa

map:
bbmap.sh in=reads.fq out=mapped.sam bs=script.sh

translate to bam:
sh script.sh
(that creates and runs a shellscript that creates a sorted indexed bam file from the sam output, if samtools is installed, though of course you can do it manually too)

Now, run IGV and choose "load genome" and point it to the same fasta file that you used for indexing, then load the bam file. If it still does not work, look at the header of the sam file, and verify that the names of the scaffolds in IGV are the same as the names in the sam header. Or, actually, that's what you should do first.
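
For example, a quick way to compare the two sets of names (untested; using the file names from above) would be:

Code:

grep "^@SQ" mapped.sam | head    # scaffold names in the sam header written by BBMap
grep "^>" mature.fa | head       # sequence names in the fasta loaded into IGV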

As for specifics about good settings to use for shRNAseq, I have no experience with it. Normally for RNAseq I suggest setting the "maxindel" flag to an appropriate value for the organism's expected intron length (say, 200000 for vertebrates or 8000 for fungi and plants with short introns). But if shRNAseq involves only short, unspliced RNAs, then you can just go with the defaults.

-Brian

HGV 07-09-2014 04:41 AM

rel. 33.04 fails
 
Hi Brian
I constantly update bbmap and have now run into problems with the newest release, 33.04 - it performs the mapping, but then fails during the Results statistics reporting, whereas 32.23 reports perfectly fine for exactly the same command:

Quote:

------------------ Results ------------------

Genome: 1
Key Length: 13
Max Indel: 10
Minimum Score Ratio: 0.65
Mapping Mode: normal
Reads Used: 7525278 (1888844778 bases)

Mapping: 96,483 seconds.
Reads/sec: 38997,80
kBases/sec: 19576,90


Pairing data: pct reads num reads pct bases num bases

mated pairs: 0,0231% 869 0,0231% 436238
bad pairs: 0,0037% 139 0,0037% 69778
insert size avg: 675,02
Exception in thread "main" java.lang.NullPointerException
at align2.ReadStats.mergeAll(ReadStats.java:140)
at align2.AbstractMapper.printOutput(AbstractMapper.java:1473)
at align2.BBMap.testSpeed(BBMap.java:432)
at align2.BBMap.main(BBMap.java:31)
What is going on here?
Cheers
Harald

Brian Bushnell 07-09-2014 09:26 AM

Harald,

Thanks for finding that! I added a new histogram, "qahist" (quality accuracy histogram), which correlates the claimed quality of bases with their actual quality as measured by error rate. But if you use the "qhist" flag and not the "qahist" flag, an array was not getting initialized, which caused that crash. It's fixed now; I'll upload the fixed version later today.

-Brian

Brian Bushnell 07-09-2014 04:37 PM

I've uploaded the fixed version (33.08) to sourceforge.

easoncm 07-30-2014 07:48 AM

Repair.sh
 
1 Attachment(s)
Hello,

I got the same error saying my files had different numbers of reads, and tried the suggestion of running repair.sh per this thread's reply and the documentation within the program. However, when I run repair.sh, it tells me there are different numbers of reads and to run repair.sh.

I don't understand why it is telling me to run the program I am trying to run.

Any ideas?

Brian Bushnell 07-30-2014 09:46 AM

Quote:

Originally Posted by easoncm (Post 146227)
Hello,

I got the same error saying my files had different numbers of reads, and tried the suggestion of running repair.sh per this thread's reply and the documentation within the program. However, when I run repair.sh, it tells me there are different numbers of reads and to run repair.sh.

I don't understand why it is telling me to run the program I am trying to run.

Any ideas?

Heh, that's kind of a funny error message. I wrote repair.sh with the assumption people would use it on interleaved files... I will fix that bug. But for now, you should be able to run it like this:

cat file1.fq file2.fq | repair.sh in=stdin.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq overwrite=t

andrej-gnip 08-12-2014 06:11 AM

Hi Brian,

I am trying to create a tool for identification of viral genetic material in a sample. It's supposed to work for RNA viruses, so the RNA is reverse transcribed and the resulting DNA is sequenced. I end up with fastq files where the majority of reads come from the host, but a significant minority come from the virus. I want to map the reads to a database of known RNA viral sequences, and I'm considering using BBMap for this purpose. I need a mapper which is able to do local as well as end-to-end alignment... I tried bowtie2 at first, but its local mode didn't work properly and I didn't manage to find anyone who knew what was wrong, so I want to try another mapper now. The files with reads can be quite large. Do you think BBMap is a good tool for this type of purpose? Also, I looked through the mapping parameters in the README file, but couldn't find seed length. Is there any reason for that?

Thank you very much.

Brian Bushnell 08-12-2014 10:45 AM

Andrej,

BBMap should work well for this purpose, as it can handle very low identity alignments, which are typical of cross-species mapping. You can adjust the seed length with the "k" parameter. The default is 13, and I don't recommend going below about 7, as it becomes exponentially slower with a shorter seed length (speed is roughly proportional to 4^k). For high sensitivity, you might try something like this:

bbmap.sh in=reads.fq out=mapped.sam ref=virus.fa k=11 minratio=0.1 maxindel=200 slow

...or drop k even lower. But you still may not get any hits to your viral database unless it contains a relative of the virus in question.

If you have an assembly of the host genome, you can also remove reads that map to it like this, to reduce the volume of data you're dealing with:

bbmap.sh in=reads.fq outu=unmapped.fq ref=host.fa

-Brian

WhatsOEver 08-21-2014 08:07 AM

Hi Brian,

I have some questions about the optional histogram tables:

1) mhist: It displays per base matches/sequencing errors, doesn't it?
What is included in the "Other" column?
How are multiple insertions counted? Assume the following:
Code:

12345--6789
ACTAG--CATC
ACTAGGGCATC

Will it count for this read 2 insertions at position 5?
I have different read lengths, from 30bp to 1.2kbp. Is the error rate somehow adjusted for read length?

2) qhist: Assuming that Read_linear is the average base quality, what is Read_log?

3) ihist: This one is always empty?!

4) scafStats: Ambiguous reads are counted multiple times, aren't they? This means the overall ambiguousMB is in fact much smaller, isn't it?

Thanks!

Brian Bushnell 08-21-2014 10:07 AM

Quote:

Originally Posted by WhatsOEver (Post 148173)
Hi Brian,

I have some questions about the optional histogram tables:

1) mhist: It displays per base matches/sequencing errors, doesn't it?

Yes.
Quote:

What is included in the "Other" column?
That's for any other cigar string symbols, which in practice means soft-clipping. If you don't use the 'local' flag, only reads that go off the end of scaffolds will be clipped.
Quote:

How are multiple insertions counted? Assume the following:
Code:

12345--6789
ACTAG--CATC
ACTAGGGCATC

Will it count for this read 2 insertions at position 5?
All counts are relative to the read position. But what I call an insertion is extra bases in the query that are not present in the reference, so I would classify that as a deletion, assuming that the top line (with '--') is the read and the bottom line is the reference. That deletion would be counted exactly once, regardless of the length, at position 5. An insertion of length L will be counted L times, once at each read base it covers. So, for example:

Code:

pos   12345678901
read  ACTAGGGCATC
ref   ACTAG--CATC

That would be classified as an insertion and counted once at position 6 and once at position 7.
Quote:

I have different read lengths, from 30bp to 1.2kbp. Is the error rate somehow adjusted for read length?
Yes, the sum of all columns should always be 1 for any line; it represents the fraction of cigar symbols at that position, so if there is 1 read of length 100 and 2 reads of length 200, then an error at position 50 would be represented as a 0.33 substitution rate, while an error at position 150 would be represented as a 0.5 substitution rate. As a result the graph gets more noisy going toward the right, with variable-length reads. Be sure you are using PacBio mode (mapPacBio8k.sh) for long reads; bbmap.sh has a read-length limit of 600bp, and reads longer than that will be broken into pieces.
Quote:

2) qhist: Assuming that Read_linear is the average base quality, what is Read_log?
The linear average sums the quality values, and divides by the number of quality values. The log average converts the phred quality score into a probability of error, sums the probabilities of error, divides that by the number of quality values, then transforms the average probability of error back into a phred score. As a result, the linear average (though commonly used) is kind of meaningless, while the logarithmic average can actually be used to calculate the expected error rate at that position. For example, if you had 200 reads and the log average at position 5 was 20, then you would expect there to be 2 total errors at position 5. However, if the linear average is 20, that doesn't actually tell you anything useful, but I include it anyway since that is what most programs plot.
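
As a toy illustration of the difference (a sketch with made-up numbers, not the actual qhist code), here are both averages for one column of quality scores:

Code:

echo "40 40 40 2" | awk '{
  for(i=1; i<=NF; i++){
    linear += $i             # plain average of the phred scores
    perr   += 10^(-$i/10)    # convert each score to an error probability
  }
  linear = linear / NF
  logavg = -10 * log(perr / NF) / log(10)   # mean error probability back to phred
  printf "linear = %.1f   log = %.1f\n", linear, logavg
}'

Here the linear average (30.5) hides the one bad base, while the log average (about 8) correctly reflects a mean error probability of roughly 16%.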
Quote:

3) ihist: This one is always empty?!
"ihist" will contain the insert size distribution of properly-paired reads, if the mapping is done paired. It will be empty if no reads were proper pairs, or if reads were mapped single-ended. By default, BBMap expects reads in the Illumina fragment library orientation where reads are on opposite strands, pointing inward. You can override this with the "rcs=f" flag (requirecorrectstrand=false) which will allow any orientation, or with the "rcompmate" flag if both reads are on the same strand. Also, only reads that map to the same scaffold can be considered proper pairs.

You can, alternatively, use the 'lhist' flag to just plot the read lengths, regardless of mapping.
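
For instance, re-mapping the reads as pairs with an insert-size histogram requested (placeholder file names) should populate it:

Code:

bbmap.sh in1=reads1.fq in2=reads2.fq ref=ref.fa out=mapped.sam ihist=ihist.txt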
Quote:

4) scafStats: Ambiguous reads are counted multiple times, aren't they? This means the overall ambiguousMB is in fact much smaller, isn't it?
Yes, unambiguously-mapped reads (and megabases) are counted once and ambiguously-mapped reads are counted at least twice.
Quote:

Thanks!
You're welcome!

HGV 09-17-2014 02:16 PM

pairlen= broken in 33.41
 
Hi Brian

I was using pairlen=1200 to limit insert sizes for paired-end mapping, and with the newest version, 33.41, bbmap reports an unknown option, while in 33.40b it used to work. What has happened?

Cheers
Harald

Brian Bushnell 09-17-2014 02:53 PM

Harald,

Thanks for noting that! I accidentally deleted the line that parsed that command. I've uploaded the fixed version, 33.42.


On another topic, I'd like to announce that a second developer, Jon Rood, has begun porting certain aspects of BBTools over to C using JNI calls. The currently ported classes are BBMerge, BBMap, and Dedupe. This is optional (you can enable it at runtime with the 'usejni' flag), and the output is identical, but there is a substantial speedup:
BBMap: +30%
BBMerge: +60%
Dedupe: up to +200% (when allowing an edit distance)

If you are interested in a free speed increase, instructions for compiling the C code for OS X or Linux are in /bbmap/jni/README.txt

kbseah 09-22-2014 09:00 AM

Hello,

I hope this is the right place to post questions related to BBMap, but since the last reply wasn't too long ago....

I've been using BBMap to map paired-end Fastq reads where the headers have been renamed for downstream analysis ("_1" and "_2" have been added to header names for forward and reverse reads respectively). When I look at the SAM file from the mapping, the forward and reverse reads that map have nonzero POS field, but the PNEXT fields are always zero. Is this caused by my editing the read names? Bowtie2 doesn't have the same problem, and assembling with SPAdes and IDBA-UD worked normally with the edited read names.

Example of the SAM entries for a read pair:
Quote:

HWI-ST863:279:H03F7ADXX:1:1101:7656:2184_1 16 NODE_207_length_5463_cov_10917.5_ID_413 125 44 13=1X137= * 0 0 [...] [...] NM:i:1 AM:i:44
HWI-ST863:279:H03F7ADXX:1:1101:7656:2184_2 0 NODE_207_length_5463_cov_10917.5_ID_413 5275 45 151= * 0 0 [...] [...] NM:i:0 AM:i:45
Thanks a lot in advance for your help.
Brandon

Brian Bushnell 09-22-2014 10:38 AM

Brandon,

Those were not recognized as paired. BBMap recognizes only the normal Illumina naming schemes:

"* /1"
and
"* 1:"

If those reads are interleaved in a single file, use the "int=t" flag, which will force BBMap to recognize them as being interleaved.

HGV 09-22-2014 11:47 AM

bbmap hitstats - unambiguous Hits
 
Hi Brian
I was looking at the hitstats files and I realized that the %unambiguousReads
can add up to more than 100%, and similarly the unambiguousReads count can be higher than the total input reads... How can that be?

Cheers
Harald

Brian Bushnell 09-22-2014 12:26 PM

Harald,

Thanks for noticing that. It works correctly for single-ended reads, but it appears that improper pairs (where one read maps to one scaffold, and the other maps to a different scaffold) are double-incrementing the counts on both scaffolds. I'll fix that in the next release.

-Brian

kbseah 09-22-2014 01:14 PM

Thanks for the quick reply, Brian!

Brian Bushnell 09-22-2014 04:55 PM

Quote:

Originally Posted by kbseah (Post 150422)
Thanks for the quick reply, Brian!

You're welcome!
Quote:

Originally Posted by HGV (Post 150412)
Hi Brian
I was looking at the hitstats files and I realized that the %unambiguousReads
can add up to more than 100%, and similarly the unambiguousReads count can be higher than the total input reads...

Fixed now, as of v33.46.

andrej-gnip 09-24-2014 10:32 AM

Hi,

I'm developing a tool for the analysis of sequence reads from viral genetic material, and mapping to reference viral genomes is part of the process. The first version used bowtie2, but now I'm trying to make a new version with better user flexibility, more options, etc., and I'd like to use bbmap, as it seems to be the best tool for this purpose. The user manual for bbmap seems more straightforward, and people I talked to who have used both mappers told me bbmap is generally better.

Now, I did some tests; e.g., I ran bbmap on made-up sequences to see how it would perform. I created a fastq file with one read, and two fasta files as references. One fasta file had only one sequence similar to the read, while the other fasta file had 4 such sequences (one of which was the same as the one in the other file, and this one was the most similar). Naturally, the read was always mapped to the sequence with the highest similarity. The mapping quality was the same in the 2 cases. From that, it seems that mapping quality is completely independent of the other sequences in the reference; it only depends on the read and the particular reference sequence to which it was mapped. I'm not sure, though, so I wanted to ask whether this assumption is actually true. Thanks a lot.

Brian Bushnell 09-24-2014 10:59 AM

The mapping quality is dependent on the other reference sequences, but only if they are within some threshold of similarity (roughly 6 edits) to the best site. If you copy the same reference sequence twice in the fasta file, the read will map ambiguously and get a score of 3 or less. If you add one or two edits to one copy, the read will map to the unedited one but get a reduced score. But if they differ by, say, 10 edits, then the best mapping location will not get any score penalty. The penalty is also influenced by the number of alternative sites; for example, if there are 5 sites that are each 2 edits worse than the best site, that will give a greater score penalty than if there is only one alternative site.
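
If you want to see this effect for yourself, a quick test along these lines should show the mapping quality drop when the reference contains two identical copies of the target (all file names here are placeholders, and the exact scores will depend on the read):

Code:

# one_copy.fa contains a single target sequence; two_copies.fa contains the
# same sequence twice under different headers (placeholder files)
cat one_copy.fa one_copy_renamed.fa > two_copies.fa
bbmap.sh in=read.fq ref=one_copy.fa out=single.sam nodisk
bbmap.sh in=read.fq ref=two_copies.fa out=duplicated.sam nodisk
# compare the MAPQ column (field 5); the duplicated reference should give 3 or less
grep -v '^@' single.sam | cut -f5
grep -v '^@' duplicated.sam | cut -f5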

kbseah 10-06-2014 08:47 AM

pileup.sh inconsistent with samtools pileup
 
Hi Brian,

I tried using the "fastaorf" function in pileup.sh, to look at the read depth for a bunch of ORFs predicted with Prodigal. The input is in Prodigal's output format as specified. However, I get per-orf coverage results (the depthSum field) that are inconsistent with the output from samtools mpileup.

Briefly: I produced a pileup file with samtools mpileup and, for each ORF, simply summed the read depth (4th column of the pileup file) over every position that falls within the ORF.
I double-checked this by converting the Prodigal output to a BED feature table and using bedtools multicov with the original BAM file to produce a per-feature read depth. The per-feature depths from the two methods are not identical, but they are roughly proportional (i.e., plotting the per-ORF depths from both methods against each other gives an approximately linear relationship).
The output from pileup.sh, on the other hand, doesn't give anything close to a linear relationship.

Is there something different in how pileup.sh calculates the per-orf coverage?

Thanks a lot,
Brandon

Brian Bushnell 10-06-2014 10:11 AM

Brandon,

I did introduce a bug recently when I added support for tracking only read start positions rather than total coverage, which manifested in some situations. It's fixed now and I just uploaded the fixed version (33.57). Would you mind downloading that and confirming whether it works correctly?

Thanks!

kbseah 10-07-2014 01:10 AM

Hi Brian,

Thanks for your reply. Unfortunately the output seems to be the same. Should I send you the output from pileup.sh vs. the output from bedtools so that you can see what I mean?

Best,
Brandon

Brian Bushnell 10-07-2014 09:36 AM

Quote:

Originally Posted by kbseah (Post 151539)
Hi Brian,

Thanks for your reply. Unfortunately the output seems to be the same. Should I send you the output from pileup.sh vs. the output from bedtools so that you can see what I mean?

Best,
Brandon

Yes, please do, as well as the command line and stdout/stderr messages.

holmrenser 10-20-2014 06:38 AM

Hi Brian,

I would be interested to know when you intend to publish BBMap in a paper; can you enlighten us?

Cheers

Brian Bushnell 10-20-2014 11:06 AM

I hope to have it submitted by the end of this year.

-Brian

dbrami 10-22-2014 11:24 AM

Hi Brian,
Thank you for your previous answer by email, which I am posting here:
"Daniel,

Unfortunately, there is not currently a master document, though I certainly want to make one. BBMap is documented in /docs/readme.txt while everything else is just documented in its shellscript. But I do have individual threads on SeqAnswers about the most commonly-used tools:

http://seqanswers.com/forums/showthread.php?t=41057

(first post contains links to the other threads)

For metagenomics, I consider bbmap.sh, bbsplit.sh, bbduk.sh, bbnorm.sh, pileup.sh, dedupe.sh, and stats.sh to be the most directly relevant.

As for your specific questions:

Is there any documentation on the mapping output summary? I.e., how do the numbers for mapped pairs vs. read 1 mapped differ?

To be considered proper pairs, by default, the reads must map to opposite strands, on the same scaffold, with an insert size of under 32kbp. Read 1 will typically have a higher mapping rate than the pairing rate, because sometimes the reads map discordantly, and sometimes only one read maps. This is especially true in a metagenome which may have very short scaffolds.

Can we output the mapping results to .fastq.gz directly?

All tools in the BBMap package can input and output gzipped files, fasta or fastq. It detects this based on filename, so a gzipped fastq file should be named "*.fq.gz" or "*.fastq.gz".

Is it possible to bin out mappings where both reads map vs. either read maps vs. unmapped via the command line, without bitwise comparison of SAM flags?

Yes. BBMap supports a "po" or "pairedonly" flag; you can for example do this:

bbmap.sh in1=read1.fq in2=read2.fq po=t outm=mapped.fq outu=unmapped.fq

...which will split the reads into proper pairs with both mapped (outm stream) and pairs where either one or both didn't map, or they did not map in the proper orientation (outu stream). With "po=f" (pairedonly=false), outm will accept anything where at least one read mapped, and outu will only get pairs in which neither mapped."
I am trying to find a value in the mapping summary that would correspond to bowtie2's '% overall alignment rate'.

Also, which BBMap parameters should I tweak if I want to accurately reproduce bowtie2 results obtained using "-k 1 --fast-local"?

Brian Bushnell 10-22-2014 11:48 AM

Daniel,

Quote:

Originally Posted by dbrami (Post 152682)
I am trying to find a value in the mapping summary that would correspond to bowtie2's '% overall alignment rate'.

Also, which BBMap parameters should I tweak if I want to accurately reproduce bowtie2 results obtained using "-k 1 --fast-local"?

For paired reads, the overall alignment rate is ((percent mapped for read 1) + (percent mapped for read 2))/2, or in other words, half the sum of the two "mapped" percentages shown below (one for Read 1 and one for Read 2).

Code:

Pairing data:          pct reads      num reads      pct bases          num bases

mated pairs:            99.5206%            3114        99.5206%            934200
bad pairs:                0.3196%              10        0.3196%              3000
insert size avg:          321.31


Read 1 data:            pct reads      num reads      pct bases          num bases

mapped:                  99.9361%            3127        99.9361%            469050
unambiguous:            98.6897%            3088        98.6897%            463200
ambiguous:                1.2464%              39        1.2464%              5850
low-Q discards:          0.0000%              0        0.0000%                  0

perfect best site:        1.9175%              60        1.9175%              9000
semiperfect site:        1.9175%              60        1.9175%              9000
rescued:                  0.0000%              0

Match Rate:                  NA              NA        62.0582%            441661
Error Rate:              96.5462%            3019        37.6163%            267711
Sub Rate:                87.4320%            2734        2.3412%              16662
Del Rate:                42.8846%            1341        34.0933%            242638
Ins Rate:                49.8881%            1560        1.1818%              8411
N Rate:                  49.4084%            1545        0.3254%              2316


Read 2 data:            pct reads      num reads      pct bases          num bases

mapped:                  99.9041%            3126        99.9041%            468900
unambiguous:            98.6897%            3088        98.6897%            463200
ambiguous:                1.2144%              38        1.2144%              5700
low-Q discards:          0.0000%              0        0.0000%                  0

perfect best site:        1.6619%              52        1.6619%              7800
semiperfect site:        1.6619%              52        1.6619%              7800
rescued:                  0.0000%              0

Match Rate:                  NA              NA        61.2099%            441930
Error Rate:              96.7370%            3024        38.4685%            277739
Sub Rate:                88.2278%            2758        2.2755%              16429
Del Rate:                43.9859%            1375        35.0546%            253091
Ins Rate:                49.1683%            1537        1.1384%              8219
N Rate:                  50.7358%            1586        0.3216%              2322

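With the numbers above, that works out to (99.9361% + 99.9041%)/2 ≈ 99.92%.
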
Flags equivalent to bowtie2's "-k 1 --fast-local" would be something like:

"fast local maxindel=10"

dbrami 10-22-2014 03:38 PM

de-interleave fastq output
 
Can BBMap de-interleave the fastq output for reads 1 and reads 2?
I.e., if I use "pairedonly=f outu=${B}.noplant.fastq.gz", can BBMap put the unmapped forward reads in ${B}_R1.noplant.fastq.gz and the unmapped reverse reads in ${B}_R2.noplant.fastq.gz?
Or should I just de-interleave the reads as a post-process?

Brian Bushnell 10-22-2014 03:46 PM

Daniel,

You can set "outu1=" and "outu2=" or "outm1=" and "outm2=" for output in paired files.

prashant_singh 11-07-2014 06:46 AM

Issues with BBMap
 
Hi
I am a newbie to NGS data analysis- basically a bench scientist trying to do analysis.
I have sequenced human tissue samples on the PacBio RS II and am trying to align them to the human genome using the bbmap tool.

As mentioned in the forum, I am trying to use the MapPacBio.sh script for the alignment. Previously I was getting an error where it could not find the class align2.MapPacBio, so I hard-coded the path in the file.

Here is the new error I am getting:

java -Djava.library.path=/opt/compsci/bbmap/33.73/jni/ -ea -Xmx51798m -cp BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/data/*****/PacBio/****/consensus_isoforms.fasta out=/data/***/pac88 ref=/data/shared/research_pipelines_reference_data/human/RNA/genome.fa
Error: Could not find or load main class build=1

Java version is "1.7.0_51"
OpenJDK Runtime Environment (rhel-2.4.4.1.el6_5-x86_64 u51-b02)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)


Could you please help?

Thanks
Prashant

westerman 11-07-2014 07:09 AM

@prashant_singh. I am hardly an expert on BBTools. Brian will probably log in soon -- he is on USA West Coast time, as far as I know -- and will give a better answer. But I happened to start up a mapPacBio run just as your post came in. Comparing your Java command to mine reveals:

Yours ...
Code:

java
  -Djava.library.path=/opt/compsci/bbmap/33.73/jni/
  -ea -Xmx51798m
  -cp BBMapPacBio
  build=1
  overwrite=true
  minratio=0.40
  fastareadlen=6000
  ambiguous=best
  minscaf=100
  startpad=10000
  stoppad=10000
  midpad=6000
  in=/data/*****/PacBio/****/consensus_isoforms.fasta
  out=/data/***/pac88 
  ref=/data/shared/research_pipelines_reference_data/human/RNA/genome.fa

Mine ...
Code:

java
  -Djava.library.path=/group/bioinfo/apps/apps/BBMap-33.57/jni/
  -ea -Xmx126058m
  -cp /group/bioinfo/apps/apps/BBMap-33.57/current/
  align2.BBMapPacBio
  build=1
  overwrite=true
  minratio=0.40
  fastareadlen=6000
  ambiguous=best
  minscaf=100
  startpad=10000
  stoppad=10000
  midpad=6000
  ref=kmer60-10.fa
  in=all_spades_2000plus.fa
  out=LR_vs_scaffolds.sam

The major difference is the '-cp' argument, where I have the complete path, plus the line after the '-cp', which names the program to run.

You mentioned:

Quote:

Previously I was getting an error where it could not find the class align2.MapPacBio . I hard coded the path in the file and then
....
Error: Could not find or load main class build=1
I suspect that you mis-coded the path and probably erased the part which states the program to run. I suggest starting anew, and if you really need to hard-code the path, be careful in doing so.

Brian Bushnell 11-07-2014 09:38 AM

That's correct, and yes, I'm on West Coast time :) The command line should probably be:

java -ea -Xmx51798m -cp /opt/compsci/bbmap/33.73/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/data/*****/PacBio/****/consensus_isoforms.fasta out=/data/***/pac88.sam ref=/data/shared/research_pipelines_reference_data/human/RNA/genome.fa

Note the two changed parts: "/opt/compsci/bbmap/33.73/current/" is probably correct, but maybe not, depending on how it was installed. I'm guessing that somewhere under "/opt/compsci/bbmap/33.73/" is a subdirectory named "current" - that is what you want to point it to.

For "out=/data/***/pac88.sam", I just added the ".sam" suffix. All of the programs in BBTools are extension-sensitive, so if you don't include an extension, it will just output the default format (which in this case is sam, but it's always best to specify it).

sdriscoll 11-07-2014 04:21 PM

I apologize if this has been covered already, but I've just downloaded and run bbmap a few times (it seems great so far, and I like the control provided via the available command-line options), and I have a quick question. When building the index, is it possible for bbmap to ignore the parts of the reference names past the first blank space? For example, I'm using the Ensembl human release, and the FASTA reference names look something like this:

>1 dna_sm:chromosome chromosome:GRCh37:1:1:249250621:1 REF

Currently, when I run alignments on this FASTA file, that entire name shows up in the SAM file, whereas I should only see "1". Other aligners (STAR, bowtie, bwa, etc.) throw out everything after the first space when indexing, leaving just the main reference name. This is particularly important when using aligners as part of a gene expression pipeline with something like eXpress, since some transcriptome extraction tools (programs that parse a GTF and extract transcripts from an accompanying genome FASTA file) usually name the transcripts as >TRANSCRIPT_ID GENE_SYMBOL. eXpress will throw out everything after the first space, so if the alignments contain the full reference names, no alignments are matched up to transcripts in the output.

Brian Bushnell 11-07-2014 04:26 PM

Yes, I put in an option for that because some programs require it. The flag is trd=t (or trimreaddescriptions=true), which will truncate all names (both read names and reference names) at the first whitespace symbol.
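
For example (placeholder file names):

bbmap.sh in=reads.fq out=mapped.sam ref=genome.fa trd=t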

sdriscoll 11-07-2014 04:43 PM

rad. thanks for that.

