Originally posted by sdriscoll
View Post
Unconfigured Ad
Collapse
X
-
Hey Brian,
I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.
So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.
To give you an idea of some handy filtering options have a look at the help text from my perl script.
This gives me some flexibility so I can filter out high substitution unspliced alignments and keep those low substitution spliced alignments. Also on occasion I like to filter alignments by only allowing a single edit (whether it is a single base indel or a single substitution). Anyways, if you're bored, it wouldn't be a bad addition to the suite of tools to have some kind of filter like this.Code:usage: bbmap-filter.pl [options] <alignments.bam> Parses alignments from bbmap and modifies them based on options. Options: -p INT Number of threads for processing (default: 2) -N INT Maximum number of mismatches allowed (default: any) -d INT Maximum number of deletions per alignment (default: any) -D INT Maximum length of deletion allowed (default: any) -i INT Maximum number of insertions per alignment (default: any) -I INT Maximum length of insertion allowed (default: any) -e INT Maximum number of INDELS total per alignment (default: any) -E INT Maximum total edits (INDELS+mismatches) per alignment (default: any) -c Set reads not passing filters to qc-failed (default: unaligned) -q INT Minimum MAPQ (default: any) -a Output alignments passing filter only (default: all) -h Show this message and exit
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Comment
-
-
Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.
*edit*
My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Comment
-
-
Those sound useful, and won't be very difficult, so I will go ahead and add them. Thanks for the suggestion!Originally posted by sdriscoll View PostHey Brian,
I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.
So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.
To give you an idea of some handy filtering options have a look at the help text from my perl script.
This gives me some flexibility so I can filter out high substitution unspliced alignments and keep those low substitution spliced alignments. Also on occasion I like to filter alignments by only allowing a single edit (whether it is a single base indel or a single substitution). Anyways, if you're bored, it wouldn't be a bad addition to the suite of tools to have some kind of filter like this.Code:usage: bbmap-filter.pl [options] <alignments.bam> Parses alignments from bbmap and modifies them based on options. Options: -p INT Number of threads for processing (default: 2) -N INT Maximum number of mismatches allowed (default: any) -d INT Maximum number of deletions per alignment (default: any) -D INT Maximum length of deletion allowed (default: any) -i INT Maximum number of insertions per alignment (default: any) -I INT Maximum length of insertion allowed (default: any) -e INT Maximum number of INDELS total per alignment (default: any) -E INT Maximum total edits (INDELS+mismatches) per alignment (default: any) -c Set reads not passing filters to qc-failed (default: unaligned) -q INT Minimum MAPQ (default: any) -a Output alignments passing filter only (default: all) -h Show this message and exit
OK, I will add that.Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.
*edit*
My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!
Incidentally, in my testing, JNI speeds up BBMap by around 25-30%, which is nice but not huge, probably because it is memory-bandwidth limited or too branchy. However, it speeds up Dedupe (when edits are allowed) and BBMerge by closer to 200%; both are compute-limited.
Comment
-
-
Awesome. Yeah for all of the sam/bam utilities out there I haven't found one that could do elaborate filtering based on the details of the alignment. Probably because those tools can't guarantee that aligners are writing the alignment files with the necessary flags. Within the ecosystem of BBTools, however, you can control all...so that would be fantastic.
I'm testing JNI with bbmap now and will report back with the speedup.
Speedup: ~ 2%
This was an alignment that I'd feed into eXpress for gene expression quantification. Options used:
Code:t=8 ambig=all minid=0.95 usejni=t saa=t secondary=t maxsites=20 sam=1.3 trd=t
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Comment
-
-
bbmap and multimapper MAPQ
Hello Brian and others,
Usually I run bbmap with the ambig=best parameter and am good to go, but for one of our datasets, it may prove favourable to give multi-mapping reads some special treatment. Therefore I now have to familiarize myself with the SAM format and find optimal settings for high quality mapping of spliced stranded reads.
I ran bbmap for an initial test with those parameters
and am now messing around with the alignments. The current idea is to use bbmaps real MAPQ and modify it according to our additional criteria. Then I will sort the file based on the read ID and secondarily descending on the final MAPQ and neglect ID duplicates to resolve multi-mappers.Code:ambig=all minid=0.9 padding=6 tipsearch=200 maxindel=200000 intronlen=5 secondary=T quickmatch=T sssr=0.97 local=T saa=F xstag=T xmtag=T nhtag=T
However when I check the original MAPQ, I only see qualities between 1-3 despite I have chosen local=T option. Shouldn't there be some higher MAPQ reads also within the multi-mappers? Or did I misunderstand the local option?
Thanks a lotCode:# All MAPQ awk -F "\t" '!/^@/ {print $5}' $SAMFILE | sort | uniq # Only multimapping MAPQ grep -Fvw "NH:i:1" $SAMFILE | awk -F "\t" '!/^@/ {print $5}' | sort | uniq
Matthias
PS: Support for something like sdriscoll's bbmap-filter.pl becoming a part of bbmap suite.
Comment
-
-
Hmm... all multi-mapping reads get their quality penalized, according to how many sites there are and how close the scores are. Because the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.
You can also use the flag "idtag", though, which will print each read's percent identity to the reference in a custom field. This value is unaffected by multimapping, so you could use that for unbiased filtering by mapping quality.
Comment
-
-
Thanks a lot for your kind offer, but your hint with the "idtag" is perfectly fine for our purpose, so no need to adjust your code. I just misunderstood the doings of the local=T flag in exactly the proposed way. Since some downstream tools may depend on the real MAPQ, on second thought it now anyway seems wiser to introduce an extra column for that custom score, resolve the multi-mappers according to it and kick that column out again for the final SAM file.Originally posted by Brian Bushnell View PostBecause the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.
Thanks a lot!
Matthias
Comment
-
-
sdriscoll and Matthias,
I've added the following flags:
subfilter=-1 Ban alignments with more than this many substitutions.
insfilter=-1 Ban alignments with more than this many insertions.
delfilter=-1 Ban alignments with more than this many deletions.
indelfilter=-1 Ban alignments with more than this many indels.
editfilter=-1 Ban alignments with more than this many edits.
inslenfilter=-1 Ban alignments with an insertion longer than this.
dellenfilter=-1 Ban alignments with a deletion longer than this.
(all of those have no effect if the value is negative)
penalizeambiguous=t (pambig) Penalize the mapping score of reads that multimap.
Comment
-
-
Hello Brian,
Kudos to you for doing all of that at no time. This will certainly make the bbmap suite even more useful than it already is.
Thanks again
Matthias
BTW: I solved my problem by using the idtag and the scoretag:
I tweaked the scores a bit for our purpose and pasted them as first column to the sam. Then I sorted the file on the read ID and the scores and removed the duplicates.Code:grep -o 'YR:i:[0-9]*' $MULTIMAPSAMFILE | cut -f3- -d':'
sort -k2,2 -r -k1,1 $SCOREDMULTIMAPSAMFILE | awk -F "\t" '!x[$2]++' | cut -f2-
Comment
-
-
Hi Brian,
For the mismatch filter, is there (or could there be) an option for number of mismatches accepted based on the length of the read? My reads have variable lengths and some are much longer than others. So, I'm wondering if setting an absolute number of mismatches (eg. 2) would be too stringent on the long reads...
Also, my long reads tend to drop off in quality towards the right end, so I want to trim the right end off long reads (eg. keep only the first 200bp) without affecting the shorter reads. Is there good a way to do this? I can't trim a set number of bp because that would trim the short reads; and for maxlen option, I could break long reads (from the left side?), but I don't want to align the remaining right ends of long reads.
Thanks!
Comment
-
-
You can use "idfilter" to allow a variable number of edits depending on the read length. e.g. idfilter=0.98 would allow up to 4 edits in a 200bp read but only 2 in a 100bp read. For quality-trimming, you have a couple of options. You can use BBDuk to forcibly trim all reads to at most 200bp like this:
bbduk.sh in=reads.fq out=trimmed.fq forcetrimright=199
Alternately, you can use "qtrim=r trimq=10" for example to trim the right side of the read to Q10, so only low-quality bases will be removed (you can set that higher if you want). This flag works in BBDuk or BBMap. BBMap also allows you to trim before mapping, then restore the read to the original length afterward, using the "untrim" flag, for example:
bbmap.sh in=reads.fq out=mapped.sam qtrim=r trimq=10 untrim
This is useful when you only want to allow high-identity alignments, but you don't want to discard the low-quality bases at the end of the read.
Comment
-
-
Hello Brian,
I aligned fastq paired end reads using bbmap. I then tried to sort the SAM file using Picard and I am getting the following error:
It appears that for this read, each end is aligning to a different contig. But in the SAM record we have an '=' for the mate's contig name, which might be causing it. Is my reasoning correct, and how do I fix it ? Thanks a lot for your help.Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Mate Alignment start (146342323) must be <= reference seque
nce length (90354753) on reference 16; Line 1211
Line: R0266248:292:C5C0TACXX:4:1101:4088:2519 161 16 33530580 44 98= = 146342323 112811852 ATAGAAAATTATTCA
GCTATATTCACTGCCTCACCACCTTTGTTTTTTTGTACACAAAAAATAACATTATCATTATTTGATTGCTCTCATGAAGCACT CCCFFFFFGHHHHJJJJJJJJIJJJJJJJJJJIJJJJFJJJJIIJJJIJIJIIJJJIJJIHHH
FFFFFFEEEEEEEFEFEDDDDDDDDDEDDDDDDDD NM:i:0 AM:i:44 RG:Z:AS_N#1#4#AS_N_1
Comment
-
Latest Articles
Collapse
-
by SEQadmin2
Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.
The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
...-
Channel: Articles
06-02-2026, 10:05 AM -
-
by SEQadmin2
With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.
Introduction
Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...-
Channel: Articles
05-22-2026, 06:42 AM -
-
by SEQadmin2
Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.
Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...-
Channel: Articles
05-06-2026, 09:04 AM -
ad_right_rmr
Collapse
News
Collapse
| Topics | Statistics | Last Post | ||
|---|---|---|---|---|
|
Started by SEQadmin2, 06-02-2026, 12:03 PM
|
0 responses
19 views
0 reactions
|
Last Post
by SEQadmin2
06-02-2026, 12:03 PM
|
||
|
Started by SEQadmin2, 06-02-2026, 11:40 AM
|
0 responses
14 views
0 reactions
|
Last Post
by SEQadmin2
06-02-2026, 11:40 AM
|
||
|
Started by SEQadmin2, 05-28-2026, 11:40 AM
|
0 responses
29 views
0 reactions
|
Last Post
by SEQadmin2
05-28-2026, 11:40 AM
|
||
|
Started by SEQadmin2, 05-26-2026, 10:12 AM
|
0 responses
31 views
0 reactions
|
Last Post
by SEQadmin2
05-26-2026, 10:12 AM
|
Comment