SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 02:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 06:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 03:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 01:10 AM

Reply
 
Thread Tools
Old 05-23-2014, 03:48 AM   #41
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Dear Brian,

I have some questions about the optional fields (TAG:TYPE:VALUE stuff) that can be added in the SAM file by bbmap. Take the following multiple-mapped read as example

Quote:
readNameHere 16 chr11 106018977 55 6M3I17M * 0 0 GCTTTGAAAAATTTTCTGCATGCCCA /4122:...00....=<<<=<<4449 XT:A:R NM:i:5 AM:i:55 XM:i:2 NH:i:20
readNameHere 272 chrX 119093161 50 18M5I3M * 0 0 * * NM:i:6 AM:i:50 XM:i:2 NH:i:20
readNameHere 272 chrX 45789727 49 1M3I22M * 0 0 * * NM:i:6 AM:i:49 XM:i:2 NH:i:20
readNameHere 272 chr20 23542459 49 18M5I3M * 0 0 * * NM:i:6 AM:i:49 XM:i:2 NH:i:20
readNameHere 256 chr3 151169846 47 1M3I3M3I16M * 0 0 * * NM:i:6 AM:i:47 XM:i:2 NH:i:20
readNameHere 256 chr4 27414930 47 1M1I5M12N17M1I1M * 0 0 * * NM:i:2 AM:i:47 XM:i:2 XS:A:+ NH:i:20
readNameHere 272 chrX 89494814 46 26M * 0 0 * * NM:i:8 AM:i:46 XM:i:2 NH:i:20
readNameHere 272 chrY 3677157 46 26M * 0 0 * * NM:i:8 AM:i:46 XM:i:2 NH:i:20
readNameHere 272 chr12 126969193 46 4M2D22M * 0 0 * * NM:i:6 AM:i:46 XM:i:2 NH:i:20
readNameHere 256 chr21 31585993 45 5M1I20M * 0 0 * * NM:i:5 AM:i:45 XM:i:2 NH:i:20
readNameHere 256 chr5 80894028 45 7M1I18M * 0 0 * * NM:i:6 AM:i:45 XM:i:2 NH:i:20
readNameHere 272 chr6 85600169 45 26M * 0 0 * * NM:i:7 AM:i:45 XM:i:2 NH:i:20
readNameHere 272 chr9 81596079 44 18M1I7M * 0 0 * * NM:i:6 AM:i:44 XM:i:2 NH:i:20
readNameHere 256 chr3 170762912 43 1M3I22M * 0 0 * * NM:i:7 AM:i:43 XM:i:2 NH:i:20
readNameHere 256 chr13 20507692 42 26M * 0 0 * * NM:i:6 AM:i:42 XM:i:2 NH:i:20
readNameHere 256 chr12 28169470 42 5M2I19M * 0 0 * * NM:i:6 AM:i:42 XM:i:2 NH:i:20
readNameHere 256 chr7 42988607 41 26M * 0 0 * * NM:i:7 AM:i:41 XM:i:2 NH:i:20
readNameHere 256 chr3 103868859 41 26M * 0 0 * * NM:i:8 AM:i:41 XM:i:2 NH:i:20
readNameHere 272 chrX 45152661 40 26M * 0 0 * * NM:i:7 AM:i:40 XM:i:2 NH:i:20
readNameHere 272 chr18 60932877 39 26M * 0 0 * * NM:i:8 AM:i:39 XM:i:2 NH:i:20
Concerning XM: From the readme it says "Indicates number of best alignments". But what are these exactly? I expected all 26M from my example to be best alignments?! Or does it mean that there are mismatches in some of these alignments which I don't see as I didn't change the samversion parameter to 1.4 (btw: I have samtools 0.1.19 installed. This is able to handle '=' and 'X', isn't it?)

Concerning XS: I run bbmap using -xstag=unstranded as I read it somewhere in this thread and I want to use cufflinks afterwards. I thought that the '+' and '-' as values refer to the strand in this field and I also have '+' and '-' values set in different reads. If this is true, where is the difference to -xstag=true?

Concerning YS: This stores the end position of a read? So its pos + the alignment length of the read?

Concerning YI: The identity is simple the number of (mismatches + indels) / read-length?


Thanks a lot in advance for your help and this amazing amount of possibilities !
WhatsOEver is offline   Reply With Quote
Old 05-23-2014, 07:20 AM   #42
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Dear Brian,

I have yet another issue

Looking through the readme I saw that the standard state for stoptag and idtag is true. However, looking at my sam file I couldn't find any YS or YI fields. I therefore tried to set the parameter explicitly in the start command, but this resulted in an error:

Quote:
/projects/bbmap$ ./bbmap ../bbAlign/DNAseq_140306.fastq
java -ea -Xmx56g -cp /projects/bbmap/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 k=14 t=16 in=../bbAlign/DNAseq_140306.fastq out=/projects/bbAlign/bbmap_DNAseq_140306_mapped.sam xstag=unstranded xmtag=t stoptag=t idtag=t ambig=all maxindel=10 qin=33 samversion=1.4 qhist=/projects/bbAlign/bb_qhist.txt mhist=/projects/bbAlign/bb_mhist.txt qahist=/projects/bbAlign/qahist.txt ihist=/projects/bbAlign/ihist.txt bhist=/projects/bbAlign/bhist.txt scafstats=/projects/bbAlign/scafStats refStats=/projects/bbAlign/refStats.txt -Xmx56g
Executing align2.BBMapPacBio [build=1, overwrite=true, minratio=0.40, fastareadlen=6000, ambiguous=best, minscaf=100, startpad=10000, stoppad=10000, midpad=6000, k=14, t=16, in=../bbAlign/DNAseq_140306.fastq, out=/projects/bbAlign/bbmap_DNAseq_140306_mapped.sam, xstag=unstranded, xmtag=t, stoptag=t, idtag=t, ambig=all, maxindel=10, qin=33, samversion=1.4, qhist=/projects/bbAlign/bb_qhist.txt, mhist=/projects/bbAlign/bb_mhist.txt, qahist=/projects/bbAlign/qahist.txt, ihist=/projects/bbAlign/ihist.txt, bhist=/projects/bbAlign/bhist.txt, scafstats=/projects/bbAlign/scafStats, refStats=/projects/bbAlign/refStats.txt, -Xmx56g]

BBMap version 32.14
Set overwrite to true
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.400
Set threads to 16
Set fastq input ASCII offset to 33
Set quality histogram output to /projects/bbAlign/bb_qhist.txt
Set match histogram output to /projects/bbAlign/bb_mhist.txt
Set quality accuracy histogram output to /projects/bbAlign/qahist.txt
Set insert size histogram output to /projects/bbAlign/ihist.txt
Set base content histogram output to /projects/bbAlign/bhist.txt
Scaffold statistics will be written to /projects/bbAlign/scafStats
Reference set statistics will be written to /projects/bbAlign/refStats.txt
Retaining all best sites for ambiguous mappings.
Set genome to 1

Loaded Reference: 6.389 seconds.
Loading index for chunk 1-7, build 1
Generated Index: 8.479 seconds.
Analyzed Index: 23.584 seconds.
Started output stream: 0.014 seconds.
Creating scaffold statistics table: 0.001 seconds.
Cleared Memory: 0.506 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 16 mapping threads.
Exception in thread "Thread-11" java.lang.NullPointerException
at stream.SamLine.<init>(SamLine.java:1130)
at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:154)
at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:19)
Exception in thread "Thread-24" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-21" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-19" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-12" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Detecting finished threads: 0Exception in thread "Thread-25" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-18" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-23" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-27" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-15" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-17" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-20" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-13" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
, 1Exception in thread "Thread-26" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-22" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
Exception in thread "Thread-14" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
, 2, 3Exception in thread "Thread-16" java.lang.RuntimeException: Writing to a terminated thread.
at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
at align2.AbstractMapThread.run(AbstractMapThread.java:320)
, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

**************************************************************************

Warning! 16 mapping threads did not terminate normally.
Please check the error log; the output may be corrupt or incomplete.

**************************************************************************




------------------ Results ------------------

Genome: 1
Key Length: 14
Max Indel: 10
Minimum Score Ratio: 0.4
Mapping Mode: normal
Reads Used: 850 (150375 bases)

Mapping: 16.779 seconds.
Reads/sec: 50.66
kBases/sec: 8.96


Read 1 data: pct reads num reads pct bases num bases

mapped: 71.8824% 611 84.5406% 127128
unambiguous: 63.6471% 541 75.8251% 114022
ambiguous: 8.2353% 70 8.7155% 13106
low-Q discards: 0.0000% 0 0.0000% 0

perfect best site: 0.0000% 0 0.0000% 0
semiperfect site: 0.0000% 0 0.0000% 0

Match Rate: NA NA 24.8088% 109654
Error Rate: 73.7923% 611 75.1826% 332304
Sub Rate: 72.9469% 604 1.8880% 8345
Del Rate: 59.6618% 494 71.2368% 314864
Ins Rate: 71.9807% 596 2.0577% 9095
N Rate: 4.1063% 34 0.0086% 38
Splice Rate: 14.8551% 123 (splices at least 10 bp)
Exception in thread "main" java.lang.RuntimeException: BBMap terminated in an error state; the output may be corrupt.
at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:407)
at align2.BBMapPacBio.main(BBMapPacBio.java:33)
When I run it without both parameters, everything works fine. My dataset includes 1.6M single end 454 reads if this is important.
WhatsOEver is offline   Reply With Quote
Old 05-23-2014, 10:14 AM   #43
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,552
Default

WhatSoEver,

For the first problem, try setting the flag "sam=1.4". By default, cigar strings are presented in Sam 1.3 format, which uses the 'M' symbol. The does not mean match - it means match OR mismatch! Sam format 1.4 fixed this bad design decision by using '=' for match and 'X' for mismatch, but by default I output sam format 1.3 because many downstream tools (such as old versions of Samtools) cannot process format 1.4. The lastest Samtools can, though. If you set "sam=1.4" then it will be more obvious how many substitutions the reads have. The NM tag tells you the edit distance, but BBMap does not determine score directly from edit distance; it uses affine transforms. So 4 consecutive substitutions will be penalized less than 4 non-adjacent substitutions, for example.

Please note that neither XM nor XS are defined in the sam spec; both are bowtie/tophat flags, and XM particularly is poorly defined. For that, I output the number of alignments with scores EXACTLY equal to the top score, even though when you set "ambig=all" I will output sites with scores that are within a certain threshold of the top score. So there can be more alignments printed than the value of XM.

For the XS tag, you can set it to "firststrand", "secondstrand", or "unstranded". But internally, "true" is equivalent to "firststand" which is equivalent to "unstranded"; all produce identical output. "secondstrand" produces the opposite sign of "firststrand". For firststrand/unstranded, if read 1 maps to the plus strand, it gets "XS:A:+"; for secondstrand, it gets "XT:A:-". Read 2 always gets the opposite of read 1.

The YS tag is the stop position - specifically, the reference location to which the rightmost base maps. So yes, it is pos+alignment length.

The YI tag is a bit more complex. Identity can be calculated various ways; unfortunately, many ways overly penalize long deletions. So actually it is:
(matches)/(matches+substitutions+Ns+insertions+f(deletions)). If f(deletions) was just the number of deletions, then it would be matches/(total number of cigar operations). But then a 100bp read with 50bp match, 400bp deletion, 50bp match would get an identity of 100/(100+400) = 20%. 20% identity implies worse than even the alignment of two completely random sequences (which is expected to yield over 25%), so it's misleading, because a read with two nearby 50bp consecutive matches to another string is actually a very good alignment that could not have arisen by chance. So, I actually take the square root of the length of each individual deletion event. In this case, you would get:
(100)/(100+sqrt(400)) = 100/120= 83.3% identity.
If there were 50 individual single-base-pair deletions, however, the identity would be:
(100)/(100+50*sqrt(1)) = 100/150 = 66.7% identity, which is lower, because it's more likely to have arisen by chance.
This method of identity calculation is not perfect, of course, and incompatible with most other methods, but there is no single standard of which I am aware, and I think the results of this method are more informative than others, which are biased toward edit distance rather than modelling biological events.

For the crash bug, that's my fault for not testing the interaction between "ambig=all" and "stoptag"; they don't work together. I'll fix that. In the mean time, you can set "saa=f" (secondaryalignmentasterisks=false) which will circumvent the problem.

Thanks for finding that bug, and I'm glad you're finding BBMap useful! One suggestion, though - if you are doing RNA-seq on organisms with introns, I suggest you set the "intronlength" and "maxindel" flags. I normally use "intronlength=10". This does not change mapping at all, it just changes cigar strings so that deletions longer than 10bp will be reported using the symbol 'N' instead of 'D'. That may make a difference to some downstream processing tools, if they only interpret 'N' as intron. As for maxindel, introns shorter than maxindel will not be searched for. In plants or fungus, I suggest setting this to at least 16000 (that's the default for bbmap, but the pacbio modes have a much shorter default of 100); for vertebrates, I suggest 200000. I'm not really sure about invertebrates. But try to set it at around the 99th percentile of intron sizes in that organism, if you have some idea of what that might be. The lower the better (faster, more accurate) as long as it encompasses the vast majority of introns.
Brian Bushnell is offline   Reply With Quote
Old 05-26-2014, 02:42 AM   #44
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Thanks Brian, that helps a lot! Many thanks!!

There are just two little things I'd like to clearify:

---------------------------------------------------------------------------------------

Quote:
The YI tag is a bit more complex. Identity can be calculated various ways; unfortunately, many ways overly penalize long deletions. So actually it is:
(matches)/(matches+substitutions+Ns+insertions+f(deletions)). If f(deletions) was just the number of deletions, then it would be matches/(total number of cigar operations). But then a 100bp read with 50bp match, 400bp deletion, 50bp match would get an identity of 100/(100+400) = 20%. 20% identity implies worse than even the alignment of two completely random sequences (which is expected to yield over 25%), so it's misleading, because a read with two nearby 50bp consecutive matches to another string is actually a very good alignment that could not have arisen by chance. So, I actually take the square root of the length of each individual deletion event. In this case, you would get:
(100)/(100+sqrt(400)) = 100/120= 83.3% identity.
If there were 50 individual single-base-pair deletions, however, the identity would be:
(100)/(100+50*sqrt(1)) = 100/150 = 66.7% identity, which is lower, because it's more likely to have arisen by chance.
This method of identity calculation is not perfect, of course, and incompatible with most other methods, but there is no single standard of which I am aware, and I think the results of this method are more informative than others, which are biased toward edit distance rather than modelling biological events.
I'm not sure if I understand the meaning of f(deletions). Does it just mean, that all consecutive deletions are square rooted and the results summed up?
Like in the following example from my data:

Quote:
read1
length: 202 (+11 Deletions)
cigar: 5=6X1=1X2=1X1=1I4=10D2=1D2=1X1=1X6=1I9=1I19=1I79=1I56=

Matches: 5+1+2+1+4+2+2+1+6+9+19+79+56=187
Mismatches: 6+1+1+1+1=10
Insertions: 1+1+1+1+1=5
Deletions: 10+1=11

Identity: 187/(187+10+5+11) = 87.8% (this is the "standard" calculation)
Identity: 187/(187+10+5+sqrt(10)+sqrt(1))= 90.7% (this is your calculation?)
If the former is true, why don't you treat insertions in the same way? Like in:

Quote:
read2
length: 264 (+2 Deletions)
cigar: 1X2=2X1=2X4=3X3=20I8=1D19=1D157=1X28=1X3=1X1=3I3=1X

Matches: 2+1+4+3+8+19+157+28+3+1+3=229
Mismatches: 1+2+2+3+1+1+1+1=12
Insertions: 20+3=23
Deletions: 1+1=2

Identity: 229/(229+12+23+2*sqrt(1)) = 86.1% (This is your calculation?)
Identity: 229/(229+12+sqrt(20)+sqrt(3)+2*sqrt(1)) = 91.9% (Why not this way?)
---------------------------------------------------------------------------------------

Quote:
Thanks for finding that bug, and I'm glad you're finding BBMap useful! One suggestion, though - if you are doing RNA-seq on organisms with introns, I suggest you set the "intronlength" and "maxindel" flags. I normally use "intronlength=10". This does not change mapping at all, it just changes cigar strings so that deletions longer than 10bp will be reported using the symbol 'N' instead of 'D'. That may make a difference to some downstream processing tools, if they only interpret 'N' as intron. As for maxindel, introns shorter than maxindel will not be searched for. In plants or fungus, I suggest setting this to at least 16000 (that's the default for bbmap, but the pacbio modes have a much shorter default of 100); for vertebrates, I suggest 200000. I'm not really sure about invertebrates. But try to set it at around the 99th percentile of intron sizes in that organism, if you have some idea of what that might be. The lower the better (faster, more accurate) as long as it encompasses the vast majority of introns.
intronlength=10 is the standard value, isn't it? I didn't set it, but can neither find any cigar strings with deletions greater than 10D nor one with N's smaller than 11N.
And one last question out of interest as I worked in fungal genomics before I changed to humans: Why would you suggest a maxindel of 16000 for fungi? From what I analyzed (Comparative genomics of 200+ asco- and basidiomycetes) they are rarely longer than 100bp and I never saw one longer than 3kb (though I'm atm not sure if this was even mitochondrial).
WhatsOEver is offline   Reply With Quote
Old 05-26-2014, 07:02 AM   #45
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,552
Default

WhatSoEver,

If you have a cigar string like ths:

20=1I20=1D20=25D20=20I20=

...I would calculate:
(20+20+20+20+20) matches / (20+1+20+root(1)+20+25+20+root(25)+20)
=
(100)/(20+1+20+1+20+25+20+5+20)
=
(100)/(132)=75.75%

I know that's asymmetric - if there's a deletion, aligning the reference to the read would give a different identity than the read to the reference - but there are a couple reasons for that. First, you can get deletions of unbounded length in cigar strings, but insertions can never be longer than the read length, so it is (according to my logic) important to correct for deletions but not really anything else. Second, insertions occur INSTEAD of matches, while deletions occur IN ADDITION to matches. So if a 100bp read has 1 bp match and 99bp insertion, does it make sense to call that 1/(1+root(99)) = about 10% identity? Not really. But with a 100bp deletion in the middle, you'd have 100/110 = about 90% identity, which is maybe a bit high, but it seems reasonable for an alignment with two nearby 50bp consecutive matches, which can't arise by chance. Third, it seems to me that long deletions are more common than long insertions, but that could be discovery bias. Lastly, decreasing the identity penalty of insertions would cause a particular problem. Let's say you have a chimeric 200bp read, 100bp from somewhere in chromosome 1 and 100bp from chromosome 2. You could map it in two places, with cigars string of 100=100I and 100I100=, or 100=100X and 100X100=. My current formula would give that 50% identity for either location, whether it used the X or I symbol for the chimeric bases. But if you take the root of the insertions, then you get 100/(100+10) = 91% identity to two different sequences that have nothing in common, which does not make sense.

It could be that calculating identity this way is not a good idea; I may put in a flag to remove the sqrt from deletions, but it seems to me that it gives a more useful answer in terms of short-read alignment when doing recruitment studies, which is what I actually put the idtag in for.

Quote:
Originally Posted by WhatsOEver View Post
intronlength=10 is the standard value, isn't it? I didn't set it, but can neither find any cigar strings with deletions greater than 10D nor one with N's smaller than 11N.
By default, it is disabled. However, if you set other flags like "xstag" that are used by Tophat, and you don't explicitly set "intronlen", it will be automatically enabled and set to 10.

Quote:
And one last question out of interest as I worked in fungal genomics before I changed to humans: Why would you suggest a maxindel of 16000 for fungi? From what I analyzed (Comparative genomics of 200+ asco- and basidiomycetes) they are rarely longer than 100bp and I never saw one longer than 3kb (though I'm atm not sure if this was even mitochondrial).
Well, I said "for plants and fungi". I work with fungi and the introns I've seen tend to be 300bp or less, and some of the plants I've seen also seem to have short introns of typically 250bp or so. But I don't know if there might be some out there that are different! The main reason is that very high settings of 'maxindel' (over 100kb) decrease speed and accuracy, but low settings like 16000 don't really have a noticeable effect on either yet it COULD let you discover something new. So I generally run with 16k as the default for BBMap even on DNA. The PacBio version has a lower default because PacBio reads are very long and have a high error rate (which exacerbates the effect on speed and accuracy), and we don't do RNA-seq for PacBio right now.
Brian Bushnell is offline   Reply With Quote
Old 05-27-2014, 04:21 AM   #46
Mchicken
Member
 
Location: Germany

Join Date: Jan 2014
Posts: 38
Default

Hey Brian,

i just briefly checked BBMap and have some questions.

1) I am working, among others, with RNA-Seq data of a cyanobacterium. Is it possible that BBMap performs a spliced alignments for my RNA-Seq data? In a small run i did not observed spliced alignments but nevertheless i was wondering about it. If so, can i prevent this case by setting 'intronlen' to 0?

2) Is there a way to apply soft- or hard-clipping?
For example bwa performs for a read soft-clipping leading to 94M6S where BBMap leads to 100M although the last 6 bases are partially mismatches and therefor i want them to be clipped.

3) Is there a way to determine the maximal number of mismatches?

4) For CRAC and PAR-Clip sequencing data i want only 1 respectively 2 mismatches/indels. So when i want at most 2 indels i have to set maxindel= 2 and strictmaxindel = true. Am i right?

5) Does BBMap reports only the best alignment (or best alignments if alignments with equal quality are present) or is there a way to report also alignments that are not as good as the best alignment? Would this be the maxsites option?

Thanks for your nice support!

Mchicken
Mchicken is offline   Reply With Quote
Old 05-27-2014, 08:47 AM   #47
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,552
Default

Mchicken,

1) 'intronlen' does not actually affect alignment, just the way splices are reported - deletions shorter than 'intronlen' will have a 'D' symbol in the cigar string, while longer ones will have an 'N' symbol. To prevent gapped alignments, you should set 'maxindel=0', which will prevent them from being looked for, though some may still be found (or 'strictmaxindel=0' to absolutely ban all alignments with indels). But prokaryotes do have some self-splicing; these are typically very short, like 20bp or so, I think. So you may want to set 'maxindel=100' or something like that, which will prevent long ones.

2) Soft-clipping is applied by default if a read goes off the end of a scaffold. You can also force it on alignments like the one you describe with the flag 'local=t'. Also, note that if you use the flag 'sam=1.4' then cigar strings will be generated with '= for match and 'X' for mismatch, instead of 'M' for both.

3) You can use a flag like 'minid=98' to prevent alignments with lower than approximately 98% identity, which would be 2 mismatches for a 100bp read, but substitutions, deletions, and insertions are all scored differently, and whether they are contiguous or scattered also affects the score, so this is not exact. BBMap does not have any way to ensure that the best alignment with at most X mismatches is returned, like Bowtie 1 can do, because the scoring is fundamentally different.

4) "maxindel" actually controls the length of individual indels, not the number of them, which is not controlled. There's also a "maxindel2" flag that controls the sum of the length of all indels, which by default is set to double maxindel. So if you did this:
maxindel=10 maxindel2=15 strictmaxindel
...then individual indels could be up to 10bp; the sum of the length of indels could be up to 15bp; and any alignment with a single indel longer than 10bp would be banned. But there is no way to limit the total number of indel events.

5) By default the best alignment is reported, and if there are multiple equally-scoring alignments, the first one (in genomic coordinates) is reported, but it is marked as ambiguous (XT:A:R tag). This corresponds to the setting 'ambig=best'. You can alternatively set 'ambig=random' or 'ambig=all' (which is what you are looking for). 'maxsites' will limit the total number of alignments displayed in 'ambig=all' mode, but won't have any effect in default mode.

Thanks for using BBMap!

-Brian
Brian Bushnell is offline   Reply With Quote
Old 05-28-2014, 02:41 AM   #48
simono101
Junior Member
 
Location: London

Join Date: Feb 2014
Posts: 3
Default Escaping spaces using the shell script

Hi Brian,
I can't seem to work out how to pass a filepath with spaces in it to BBMap using the shell script. Nothing that I normally would use seems to work...

e.g.

Code:
 ~/bbmap/bbmap.sh ref="../Reference\ genome/myref.fasta" build=2 in1=./Reads1.fastq.gz in2=./Reads2.fastq.gz out=./mapped.sam
Yields the error

Code:
Exception in thread "main" java.lang.RuntimeException: Unknown parameter: genome/myref.fasta
Is there a way to pass a string with escaped spaces to bbmap using the shell script or do I need to call bbmap directly?

Thanks!
simono101 is offline   Reply With Quote
Old 05-28-2014, 03:27 AM   #49
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,283
Default

Have you pre-built the index for "myref.fasta"?
Code:
$ bbmap.sh ref=/path_to_your_genome.fa
This step has to be done independently of the mapping step, AFAIK.

Last edited by GenoMax; 05-28-2014 at 03:34 AM. Reason: Deleted a part not relevant to original question
GenoMax is offline   Reply With Quote
Old 05-28-2014, 03:32 AM   #50
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,283
Default

On second reading of your post, it appears that you have a space in your directory name and you are asking about escaping that? Let me test.

Edit: Standard ways of escpaing spaces in shell script are not working but directly running the java command (escaping the space) is working.

Brian should be able to provide the final answer.

Last edited by GenoMax; 05-28-2014 at 04:02 AM.
GenoMax is offline   Reply With Quote
Old 05-28-2014, 05:38 AM   #51
simono101
Junior Member
 
Location: London

Join Date: Feb 2014
Posts: 3
Default

@GenoMax,
Thanks, yes your second post is exactly what I was referring to. I can't seem to find a way to escape the spaces in a directory path (but I too successfully called it directly rather than through the shell script).
simono101 is offline   Reply With Quote
Old 05-28-2014, 07:53 AM   #52
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,552
Default

simono,

In Windows, you can handle that with quotes, e.g. ref="/a/path with spaces/x.fa"

In Linux... I don't know how to do it easily. Perhaps the easiest way would be to make a symlink that doesn't have any spaces. Alternatively, as in this article, you could export the path as an environment variable.

-Brian
Brian Bushnell is offline   Reply With Quote
Old 06-02-2014, 02:50 AM   #53
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 40
Default

Hello Brian,

I need someone to give me a broad hint. I have already successfully indexed some eukaryotic genomes with BBMAP. But one PhD student in our lab has sequenced a custom shRNA library and I just wanted to use my usual aligner and just lower k-mer size. But it claims that it cannot read the FASTA with the reference sequences, although it is there. It there a lower limit of scaffold sizes (or a special parameter which I didn't set) which may cause BBMAP to abort with that error or is really the FASTA file somehow corrupted?

Code:
java -da -Xmx32g -cp /home/z/zepper/software/bbmap/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 ref=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index build=1 threads=4 -Xmx32g midpad=2000 k=5 minscaf=5
Executing align2.BBMap [build=1, overwrite=true, match=long, fastareadlen=500, ref=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa, path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index, build=1, threads=4, -Xmx32g, midpad=2000, k=5, minscaf=5]

BBMap version 31.56
Set OVERWRITE to true
Cigar strings enabled.
Set threads to 4
Retaining first best site only for ambiguous mappings.
No output file.
Exception in thread "main" java.lang.RuntimeException: Cannot read file /mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa
        at align2.RefToIndex.makeIndex(RefToIndex.java:28)
        at align2.BBMap.setup(BBMap.java:238)
        at align2.AbstractMapper.<init>(AbstractMapper.java:56)
        at align2.BBMap.<init>(BBMap.java:39)
        at align2.BBMap.main(BBMap.java:28)
[[email protected] Pia]$ head /mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa
>Acta1_Mmi503488
TGCTGTACAGGTCCTTCCTGATGTCGGTTTTGGCCACTGACTGACCGACATCAAAGGACCTGTA
>Acta1_Mmi503489
TGCTGTCTGCATGCGGTCAGCGATACGTTTTGGCCACTGACTGACGTATCGCTCCGCATGCAGA
>Acta1_Mmi503490
TGCTGAAGAGCGGTGGTCTCGTCTTCGTTTTGGCCACTGACTGACGAAGACGACCACCGCTCTT
>Ahr_Mmi503972
TGCTGATTACAGGGAGCAAAGTTCTGGTTTTGGCCACTGACTGACCAGAACTTCTCCCTGTAAT
>Ahr_Mmi503973
TGCTGTATGGATGAGCTCATATACGCGTTTTGGCCACTGACTGACGCGTATATGCTCATCCATA
Thias is offline   Reply With Quote
Old 06-02-2014, 03:07 AM   #54
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,283
Default

Quote:
Originally Posted by Thias View Post
Hello Brian,

I need someone to give me a broad hint. I have already successfully indexed some eukaryotic genomes with BBMAP. But one PhD student in our lab has sequenced a custom shRNA library and I just wanted to use my usual aligner and just lower k-mer size. But it claims that it cannot read the FASTA with the reference sequences, although it is there. It there a lower limit of scaffold sizes (or a special parameter which I didn't set) which may cause BBMAP to abort with that error or is really the FASTA file somehow corrupted?

[CODE]
java -da -Xmx32g -cp /home/z/zepper/software/bbmap/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 ref=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index build=1 threads=4 -Xmx32g midpad=2000 k=5 minscaf=5
Executing align2.BBMap [build=1, overwrite=true, match=long, fastareadlen=500, ref=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa, path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index, build=1, threads=4, -Xmx32g, midpad=2000, k=5, minscaf=5]
Perhaps because you are using both ref= and path= options?

Since you have pre-built the BBMap index you can just use path= to point to the indexes.
GenoMax is offline   Reply With Quote
Old 06-02-2014, 04:45 AM   #55
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 40
Default

@GenoMax: Thanks for your answer, but unfortunately this can't be the problem.

The folder in ref points to an empty directory, because I do not want to include the shRNA sequences to my regular set of indexed genomes. However the syntax with ref= and path= is exactly the syntax I used to successfully index the eukaryotic genomes before. (I used the standard k-mer size 13 and a midpad size of 200000 for the genomes though).
Thias is offline   Reply With Quote
Old 06-02-2014, 08:40 AM   #56
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,552
Default

Thias,

That specific error is triggered by:

File f=new File(reference);
if(!f.exists() || !f.isFile() || !f.canRead()){


So according to Java, either it does not exist, is not a file, or cannot be read. Perhaps the permissions are wrong, or the metadata on the file is wrong? It triggered before even looking inside the file, so the data is probably not corrupted.

You could technically do this:

cat /mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa |
java -da -Xmx32g -cp /home/z/zepper/software/bbmap/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 ref=stdin.fa path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index build=1 threads=4 -Xmx32g midpad=2000 k=5 minscaf=5

...which would circumvent Java trying to read the file. But I'd like to hear if you discover anything odd about the file permissions or metadata (like if it claims to be a directory, or something).
Brian Bushnell is offline   Reply With Quote
Old 06-03-2014, 05:01 AM   #57
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Dear Brian,

I'm coming with another problem (bug?), I cannot find an answer to. I run a RNAseq mapping with bbmap and got afterwards the following results:

Quote:
Genome: 1
Key Length: 14
Max Indel: 200000
Minimum Score Ratio: 0.4
Mapping Mode: normal
Reads Used: 702748 (302588801 bases)

Mapping: 2780.014 seconds.
Reads/sec: 252.79
kBases/sec: 108.84


Read 1 data: pct reads num reads pct bases num bases

mapped: 98.1544% 689778 98.8899% 299229721
unambiguous: 96.8285% 680460 97.9391% 296352687
ambiguous: 1.3259% 9318 0.9508% 2877034
low-Q discards: 0.0011% 8 0.0001% 185

...
As I don't like the samtools flagstat output, I wrote my own samfile-parser in order to compare the output of different mappers and different parameters used for mapping. From the samfile that was produced by bbmap, my parser returned the following

Quote:
Found 762672 entries in the samfile.
Identified 702748 different reads.
663897 of the reads have unique mapping positions.
25881 of the reads have 85805 ambiguous mapping positions.
12970 of the reads did not map to the reference.
Elapsed seconds: 4
I verified these results by counting the individual bit-flags.
As this is quite different to your program output (except for the unmapped reads), I'm wondering were the mistake is?

Cheers,
W

Last edited by WhatsOEver; 06-03-2014 at 06:03 AM.
WhatsOEver is offline   Reply With Quote
Old 06-03-2014, 09:10 AM   #58
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,552
Default

W,

I use the "XT:A:R" tag to indicate that a read is ambiguously mapped and XT:A:U to indicate unambiguously mapped. If you say "ambig=all" then the threshold for printing secondary alignments is more liberal than the threshold for deciding an alignment is ambiguous. So, even "unambiguously" aligned reads will sometimes have secondary sites printed - the best one, and others that are "decent" even though they are sufficiently worse than the best site that I classify the best site as unambiguous.

So if you add up all of the primary alignments with XT:A:U, you should get the same numbers as the program reports. Sorry that it's a custom tag (that is also used by bwa), but there's nothing in the official flag bits or official tags to indicate whether an alignment is ambiguous, and the SAM spec doesn't say anything about the presence of secondary alignments indicating ambiguity. It's really a subjective judgement.
Brian Bushnell is offline   Reply With Quote
Old 06-03-2014, 11:44 PM   #59
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Sorry, but I don't get it completely.

Quote:
Originally Posted by Brian Bushnell View Post
I use the "XT:A:R" tag to indicate that a read is ambiguously mapped and XT:A:U to indicate unambiguously mapped. If you say "ambig=all" then the threshold for printing secondary alignments is more liberal than the threshold for deciding an alignment is ambiguous. So, even "unambiguously" aligned reads will sometimes have secondary sites printed - the best one, and others that are "decent" even though they are sufficiently worse than the best site that I classify the best site as unambiguous.
What is your definition of an "unambiguously mapped read" if it has secondary alignments? If secondary alignments are printed as a result of a more liberal threshold, the read is no longer unambiguous under the used threshold, is it? This should be independent of how bad your sec alignments are, the chosen threshold shouldn't change the terminology, should it?
So in my case I have ~16K reads which are "unambiguous" according to your definition but have seconday alignments?

Quote:
Originally Posted by Brian Bushnell View Post
So if you add up all of the primary alignments with XT:A:U, you should get the same numbers as the program reports.
You are right - if I count XT:A:R, I get the same number as the program output (XT:A:U are however not visible/set?, but should of course be the remaining)

Quote:
Originally Posted by Brian Bushnell View Post
Sorry that it's a custom tag (that is also used by bwa), but there's nothing in the official flag bits or official tags to indicate whether an alignment is ambiguous, and the SAM spec doesn't say anything about the presence of secondary alignments indicating ambiguity. It's really a subjective judgement.
Yes, I know that and I really think that this is a confusing point in the SAM spec. But no, I don't think that it's a subjective judgement just because the word "ambiguity" is not explicitly used in the specs.
Or is it just my german misinterpretation of the strictness in the word ambiguity?

But being aware of that also helps, thanks again Brian for a fast and helpful response. I will now stop my work on mapper-evaluation for my projects and focus on the next steps - so I won't bother you with additional things in the next time
WhatsOEver is offline   Reply With Quote
Old 06-04-2014, 12:54 AM   #60
punto_c
Member
 
Location: Japan, Osaka

Join Date: Oct 2013
Posts: 11
Default

Hello Brian

I am trying to compare some mapping tools, and I am also using BBMap
I execute 4 tests using PE readed preprocessed with bbduck, cutadapt, fastx and trimmomatic respectively
With the reads preprocessed using bbduck and trimmomatic the program is able to finish without any error

When I run the reads preprocessed with cutadapt or fastx bbmap crashes...

do you have any idea why is this happening?

Here is the parameters I used and the stde output:
overwrite=true fastareadlen=500 -Xmx200g threads=96 outputunmapped=f sam=1.4

BBMap version 32.15
Set overwrite to true
Set threads to 96
Set DONT_OUTPUT_UNMAPPED_READS to true
Retaining first best site only for ambiguous mappings.
Set genome to 1

Loaded Reference: 9.674 seconds.
Loading index for chunk 1-7, build 1
Generated Index: 15.609 seconds.
Analyzed Index: 22.798 seconds.
Started output stream: 0.345 seconds.
Started output stream: 0.014 seconds.
Cleared Memory: 0.146 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 96 mapping threads.
Exception in thread "Thread-110" java.lang.IndexOutOfBoundsException: Index: 172, Size: 172
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:484)
at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:361)
at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:205)
at java.lang.Thread.run(Thread.java:679)
punto_c is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:16 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO