![]() |
Dear Brian,
I have some questions about the optional fields (TAG:TYPE:VALUE stuff) that can be added in the SAM file by bbmap. Take the following multiple-mapped read as example Quote:
Concerning XS: I run bbmap using -xstag=unstranded as I read it somewhere in this thread and I want to use cufflinks afterwards. I thought that the '+' and '-' as values refer to the strand in this field and I also have '+' and '-' values set in different reads. If this is true, where is the difference to -xstag=true? Concerning YS: This stores the end position of a read? So its pos + the alignment length of the read? Concerning YI: The identity is simple the number of (mismatches + indels) / read-length? Thanks a lot in advance for your help and this amazing amount of possibilities :D! |
Dear Brian,
I have yet another issue ;) Looking through the readme I saw that the standard state for stoptag and idtag is true. However, looking at my sam file I couldn't find any YS or YI fields. I therefore tried to set the parameter explicitly in the start command, but this resulted in an error: Quote:
|
WhatSoEver,
For the first problem, try setting the flag "sam=1.4". By default, cigar strings are presented in Sam 1.3 format, which uses the 'M' symbol. The does not mean match - it means match OR mismatch! Sam format 1.4 fixed this bad design decision by using '=' for match and 'X' for mismatch, but by default I output sam format 1.3 because many downstream tools (such as old versions of Samtools) cannot process format 1.4. The lastest Samtools can, though. If you set "sam=1.4" then it will be more obvious how many substitutions the reads have. The NM tag tells you the edit distance, but BBMap does not determine score directly from edit distance; it uses affine transforms. So 4 consecutive substitutions will be penalized less than 4 non-adjacent substitutions, for example. Please note that neither XM nor XS are defined in the sam spec; both are bowtie/tophat flags, and XM particularly is poorly defined. For that, I output the number of alignments with scores EXACTLY equal to the top score, even though when you set "ambig=all" I will output sites with scores that are within a certain threshold of the top score. So there can be more alignments printed than the value of XM. For the XS tag, you can set it to "firststrand", "secondstrand", or "unstranded". But internally, "true" is equivalent to "firststand" which is equivalent to "unstranded"; all produce identical output. "secondstrand" produces the opposite sign of "firststrand". For firststrand/unstranded, if read 1 maps to the plus strand, it gets "XS:A:+"; for secondstrand, it gets "XT:A:-". Read 2 always gets the opposite of read 1. The YS tag is the stop position - specifically, the reference location to which the rightmost base maps. So yes, it is pos+alignment length. The YI tag is a bit more complex. Identity can be calculated various ways; unfortunately, many ways overly penalize long deletions. So actually it is: (matches)/(matches+substitutions+Ns+insertions+f(deletions)). If f(deletions) was just the number of deletions, then it would be matches/(total number of cigar operations). But then a 100bp read with 50bp match, 400bp deletion, 50bp match would get an identity of 100/(100+400) = 20%. 20% identity implies worse than even the alignment of two completely random sequences (which is expected to yield over 25%), so it's misleading, because a read with two nearby 50bp consecutive matches to another string is actually a very good alignment that could not have arisen by chance. So, I actually take the square root of the length of each individual deletion event. In this case, you would get: (100)/(100+sqrt(400)) = 100/120= 83.3% identity. If there were 50 individual single-base-pair deletions, however, the identity would be: (100)/(100+50*sqrt(1)) = 100/150 = 66.7% identity, which is lower, because it's more likely to have arisen by chance. This method of identity calculation is not perfect, of course, and incompatible with most other methods, but there is no single standard of which I am aware, and I think the results of this method are more informative than others, which are biased toward edit distance rather than modelling biological events. For the crash bug, that's my fault for not testing the interaction between "ambig=all" and "stoptag"; they don't work together. I'll fix that. In the mean time, you can set "saa=f" (secondaryalignmentasterisks=false) which will circumvent the problem. Thanks for finding that bug, and I'm glad you're finding BBMap useful! One suggestion, though - if you are doing RNA-seq on organisms with introns, I suggest you set the "intronlength" and "maxindel" flags. I normally use "intronlength=10". This does not change mapping at all, it just changes cigar strings so that deletions longer than 10bp will be reported using the symbol 'N' instead of 'D'. That may make a difference to some downstream processing tools, if they only interpret 'N' as intron. As for maxindel, introns shorter than maxindel will not be searched for. In plants or fungus, I suggest setting this to at least 16000 (that's the default for bbmap, but the pacbio modes have a much shorter default of 100); for vertebrates, I suggest 200000. I'm not really sure about invertebrates. But try to set it at around the 99th percentile of intron sizes in that organism, if you have some idea of what that might be. The lower the better (faster, more accurate) as long as it encompasses the vast majority of introns. |
Thanks Brian, that helps a lot! Many thanks!! :)
There are just two little things I'd like to clearify: --------------------------------------------------------------------------------------- Quote:
Like in the following example from my data: Quote:
Quote:
Quote:
And one last question out of interest as I worked in fungal genomics before I changed to humans: Why would you suggest a maxindel of 16000 for fungi? From what I analyzed (Comparative genomics of 200+ asco- and basidiomycetes) they are rarely longer than 100bp and I never saw one longer than 3kb (though I'm atm not sure if this was even mitochondrial). |
WhatSoEver,
If you have a cigar string like ths: 20=1I20=1D20=25D20=20I20= ...I would calculate: (20+20+20+20+20) matches / (20+1+20+root(1)+20+25+20+root(25)+20) = (100)/(20+1+20+1+20+25+20+5+20) = (100)/(132)=75.75% I know that's asymmetric - if there's a deletion, aligning the reference to the read would give a different identity than the read to the reference - but there are a couple reasons for that. First, you can get deletions of unbounded length in cigar strings, but insertions can never be longer than the read length, so it is (according to my logic) important to correct for deletions but not really anything else. Second, insertions occur INSTEAD of matches, while deletions occur IN ADDITION to matches. So if a 100bp read has 1 bp match and 99bp insertion, does it make sense to call that 1/(1+root(99)) = about 10% identity? Not really. But with a 100bp deletion in the middle, you'd have 100/110 = about 90% identity, which is maybe a bit high, but it seems reasonable for an alignment with two nearby 50bp consecutive matches, which can't arise by chance. Third, it seems to me that long deletions are more common than long insertions, but that could be discovery bias. Lastly, decreasing the identity penalty of insertions would cause a particular problem. Let's say you have a chimeric 200bp read, 100bp from somewhere in chromosome 1 and 100bp from chromosome 2. You could map it in two places, with cigars string of 100=100I and 100I100=, or 100=100X and 100X100=. My current formula would give that 50% identity for either location, whether it used the X or I symbol for the chimeric bases. But if you take the root of the insertions, then you get 100/(100+10) = 91% identity to two different sequences that have nothing in common, which does not make sense. It could be that calculating identity this way is not a good idea; I may put in a flag to remove the sqrt from deletions, but it seems to me that it gives a more useful answer in terms of short-read alignment when doing recruitment studies, which is what I actually put the idtag in for. Quote:
Quote:
|
Hey Brian,
i just briefly checked BBMap and have some questions. 1) I am working, among others, with RNA-Seq data of a cyanobacterium. Is it possible that BBMap performs a spliced alignments for my RNA-Seq data? In a small run i did not observed spliced alignments but nevertheless i was wondering about it. If so, can i prevent this case by setting 'intronlen' to 0? 2) Is there a way to apply soft- or hard-clipping? For example bwa performs for a read soft-clipping leading to 94M6S where BBMap leads to 100M although the last 6 bases are partially mismatches and therefor i want them to be clipped. 3) Is there a way to determine the maximal number of mismatches? 4) For CRAC and PAR-Clip sequencing data i want only 1 respectively 2 mismatches/indels. So when i want at most 2 indels i have to set maxindel= 2 and strictmaxindel = true. Am i right? 5) Does BBMap reports only the best alignment (or best alignments if alignments with equal quality are present) or is there a way to report also alignments that are not as good as the best alignment? Would this be the maxsites option? Thanks for your nice support! Mchicken |
Mchicken,
1) 'intronlen' does not actually affect alignment, just the way splices are reported - deletions shorter than 'intronlen' will have a 'D' symbol in the cigar string, while longer ones will have an 'N' symbol. To prevent gapped alignments, you should set 'maxindel=0', which will prevent them from being looked for, though some may still be found (or 'strictmaxindel=0' to absolutely ban all alignments with indels). But prokaryotes do have some self-splicing; these are typically very short, like 20bp or so, I think. So you may want to set 'maxindel=100' or something like that, which will prevent long ones. 2) Soft-clipping is applied by default if a read goes off the end of a scaffold. You can also force it on alignments like the one you describe with the flag 'local=t'. Also, note that if you use the flag 'sam=1.4' then cigar strings will be generated with '= for match and 'X' for mismatch, instead of 'M' for both. 3) You can use a flag like 'minid=98' to prevent alignments with lower than approximately 98% identity, which would be 2 mismatches for a 100bp read, but substitutions, deletions, and insertions are all scored differently, and whether they are contiguous or scattered also affects the score, so this is not exact. BBMap does not have any way to ensure that the best alignment with at most X mismatches is returned, like Bowtie 1 can do, because the scoring is fundamentally different. 4) "maxindel" actually controls the length of individual indels, not the number of them, which is not controlled. There's also a "maxindel2" flag that controls the sum of the length of all indels, which by default is set to double maxindel. So if you did this: maxindel=10 maxindel2=15 strictmaxindel ...then individual indels could be up to 10bp; the sum of the length of indels could be up to 15bp; and any alignment with a single indel longer than 10bp would be banned. But there is no way to limit the total number of indel events. 5) By default the best alignment is reported, and if there are multiple equally-scoring alignments, the first one (in genomic coordinates) is reported, but it is marked as ambiguous (XT:A:R tag). This corresponds to the setting 'ambig=best'. You can alternatively set 'ambig=random' or 'ambig=all' (which is what you are looking for). 'maxsites' will limit the total number of alignments displayed in 'ambig=all' mode, but won't have any effect in default mode. Thanks for using BBMap! -Brian |
Escaping spaces using the shell script
Hi Brian,
I can't seem to work out how to pass a filepath with spaces in it to BBMap using the shell script. Nothing that I normally would use seems to work... e.g. Code:
~/bbmap/bbmap.sh ref="../Reference\ genome/myref.fasta" build=2 in1=./Reads1.fastq.gz in2=./Reads2.fastq.gz out=./mapped.sam Code:
Exception in thread "main" java.lang.RuntimeException: Unknown parameter: genome/myref.fasta Thanks! |
Have you pre-built the index for "myref.fasta"?
Code:
$ bbmap.sh ref=/path_to_your_genome.fa |
On second reading of your post, it appears that you have a space in your directory name and you are asking about escaping that? Let me test.
Edit: Standard ways of escpaing spaces in shell script are not working but directly running the java command (escaping the space) is working. Brian should be able to provide the final answer. |
@GenoMax,
Thanks, yes your second post is exactly what I was referring to. I can't seem to find a way to escape the spaces in a directory path (but I too successfully called it directly rather than through the shell script). |
simono,
In Windows, you can handle that with quotes, e.g. ref="/a/path with spaces/x.fa" In Linux... I don't know how to do it easily. Perhaps the easiest way would be to make a symlink that doesn't have any spaces. Alternatively, as in this article, you could export the path as an environment variable. -Brian |
Hello Brian,
I need someone to give me a broad hint. I have already successfully indexed some eukaryotic genomes with BBMAP. But one PhD student in our lab has sequenced a custom shRNA library and I just wanted to use my usual aligner and just lower k-mer size. But it claims that it cannot read the FASTA with the reference sequences, although it is there. It there a lower limit of scaffold sizes (or a special parameter which I didn't set) which may cause BBMAP to abort with that error or is really the FASTA file somehow corrupted? Code:
java -da -Xmx32g -cp /home/z/zepper/software/bbmap/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 ref=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index build=1 threads=4 -Xmx32g midpad=2000 k=5 minscaf=5 |
Quote:
Since you have pre-built the BBMap index you can just use path= to point to the indexes. |
@GenoMax: Thanks for your answer, but unfortunately this can't be the problem.
The folder in ref points to an empty directory, because I do not want to include the shRNA sequences to my regular set of indexed genomes. However the syntax with ref= and path= is exactly the syntax I used to successfully index the eukaryotic genomes before. (I used the standard k-mer size 13 and a midpad size of 200000 for the genomes though). |
Thias,
That specific error is triggered by: File f=new File(reference); if(!f.exists() || !f.isFile() || !f.canRead()){ So according to Java, either it does not exist, is not a file, or cannot be read. Perhaps the permissions are wrong, or the metadata on the file is wrong? It triggered before even looking inside the file, so the data is probably not corrupted. You could technically do this: cat /mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa | java -da -Xmx32g -cp /home/z/zepper/software/bbmap/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 ref=stdin.fa path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index build=1 threads=4 -Xmx32g midpad=2000 k=5 minscaf=5 ...which would circumvent Java trying to read the file. But I'd like to hear if you discover anything odd about the file permissions or metadata (like if it claims to be a directory, or something). |
Dear Brian,
I'm coming with another problem (bug?), I cannot find an answer to. I run a RNAseq mapping with bbmap and got afterwards the following results: Quote:
Quote:
As this is quite different to your program output (except for the unmapped reads), I'm wondering were the mistake is? Cheers, W |
W,
I use the "XT:A:R" tag to indicate that a read is ambiguously mapped and XT:A:U to indicate unambiguously mapped. If you say "ambig=all" then the threshold for printing secondary alignments is more liberal than the threshold for deciding an alignment is ambiguous. So, even "unambiguously" aligned reads will sometimes have secondary sites printed - the best one, and others that are "decent" even though they are sufficiently worse than the best site that I classify the best site as unambiguous. So if you add up all of the primary alignments with XT:A:U, you should get the same numbers as the program reports. Sorry that it's a custom tag (that is also used by bwa), but there's nothing in the official flag bits or official tags to indicate whether an alignment is ambiguous, and the SAM spec doesn't say anything about the presence of secondary alignments indicating ambiguity. It's really a subjective judgement. |
Sorry, but I don't get it completely.
Quote:
So in my case I have ~16K reads which are "unambiguous" according to your definition but have seconday alignments? Quote:
Quote:
Or is it just my german misinterpretation of the strictness in the word ambiguity? :confused: ;) But being aware of that also helps, thanks again Brian for a fast and helpful response. I will now stop my work on mapper-evaluation for my projects and focus on the next steps - so I won't bother you with additional things in the next time :D |
Hello Brian
I am trying to compare some mapping tools, and I am also using BBMap I execute 4 tests using PE readed preprocessed with bbduck, cutadapt, fastx and trimmomatic respectively With the reads preprocessed using bbduck and trimmomatic the program is able to finish without any error When I run the reads preprocessed with cutadapt or fastx bbmap crashes... do you have any idea why is this happening? Here is the parameters I used and the stde output: overwrite=true fastareadlen=500 -Xmx200g threads=96 outputunmapped=f sam=1.4 BBMap version 32.15 Set overwrite to true Set threads to 96 Set DONT_OUTPUT_UNMAPPED_READS to true Retaining first best site only for ambiguous mappings. Set genome to 1 Loaded Reference: 9.674 seconds. Loading index for chunk 1-7, build 1 Generated Index: 15.609 seconds. Analyzed Index: 22.798 seconds. Started output stream: 0.345 seconds. Started output stream: 0.014 seconds. Cleared Memory: 0.146 seconds. Processing reads in paired-ended mode. Started read stream. Started 96 mapping threads. Exception in thread "Thread-110" java.lang.IndexOutOfBoundsException: Index: 172, Size: 172 at java.util.ArrayList.rangeCheck(ArrayList.java:571) at java.util.ArrayList.get(ArrayList.java:349) at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:484) at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:361) at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:205) at java.lang.Thread.run(Thread.java:679) |
All times are GMT -8. The time now is 07:50 PM. |
Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.