Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    Originally posted by sdriscoll View Post
    Brian, do you mind explaining how the 50bp match, 20k intron, 50bp match ends up with a rough score of 8841? And in terms of the above examples that means it would have a final score ratio of about 0.887, right? thanks.
    Oh, right - yes, the score ratio would correspond to 8841/9970 = 0.887, if 8841 was in fact the read's raw score.

    Comment

    • sdriscoll
      I like code
      • Sep 2009
      • 436

      Hey Brian,

      I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.

      So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.

      To give you an idea of some handy filtering options have a look at the help text from my perl script.

      Code:
      usage: bbmap-filter.pl [options] <alignments.bam>
      
      Parses alignments from bbmap and modifies them based on options.
      
      Options:
        -p INT           Number of threads for processing (default: 2)
        -N INT           Maximum number of mismatches allowed (default: any)
        -d INT           Maximum number of deletions per alignment (default: any)
        -D INT           Maximum length of deletion allowed (default: any)
        -i INT           Maximum number of insertions per alignment (default: any)
        -I INT           Maximum length of insertion allowed (default: any)
        -e INT           Maximum number of INDELS total per alignment (default: any)
        -E INT           Maximum total edits (INDELS+mismatches) per alignment (default: any)
        -c               Set reads not passing filters to qc-failed (default: unaligned)
        -q INT           Minimum MAPQ (default: any)
        -a               Output alignments passing filter only (default: all)
        -h               Show this message and exit
      This gives me some flexibility so I can filter out high substitution unspliced alignments and keep those low substitution spliced alignments. Also on occasion I like to filter alignments by only allowing a single edit (whether it is a single base indel or a single substitution). Anyways, if you're bored, it wouldn't be a bad addition to the suite of tools to have some kind of filter like this.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */

      Comment

      • sdriscoll
        I like code
        • Sep 2009
        • 436

        Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.

        *edit*

        My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!
        Last edited by sdriscoll; 12-04-2014, 01:17 PM. Reason: figured it out
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          Originally posted by sdriscoll View Post
          Hey Brian,

          I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.

          So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.

          To give you an idea of some handy filtering options have a look at the help text from my perl script.

          Code:
          usage: bbmap-filter.pl [options] <alignments.bam>
          
          Parses alignments from bbmap and modifies them based on options.
          
          Options:
            -p INT           Number of threads for processing (default: 2)
            -N INT           Maximum number of mismatches allowed (default: any)
            -d INT           Maximum number of deletions per alignment (default: any)
            -D INT           Maximum length of deletion allowed (default: any)
            -i INT           Maximum number of insertions per alignment (default: any)
            -I INT           Maximum length of insertion allowed (default: any)
            -e INT           Maximum number of INDELS total per alignment (default: any)
            -E INT           Maximum total edits (INDELS+mismatches) per alignment (default: any)
            -c               Set reads not passing filters to qc-failed (default: unaligned)
            -q INT           Minimum MAPQ (default: any)
            -a               Output alignments passing filter only (default: all)
            -h               Show this message and exit
          This gives me some flexibility so I can filter out high substitution unspliced alignments and keep those low substitution spliced alignments. Also on occasion I like to filter alignments by only allowing a single edit (whether it is a single base indel or a single substitution). Anyways, if you're bored, it wouldn't be a bad addition to the suite of tools to have some kind of filter like this.
          Those sound useful, and won't be very difficult, so I will go ahead and add them. Thanks for the suggestion!

          Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.

          *edit*

          My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!
          OK, I will add that.

          Incidentally, in my testing, JNI speeds up BBMap by around 25-30%, which is nice but not huge, probably because it is memory-bandwidth limited or too branchy. However, it speeds up Dedupe (when edits are allowed) and BBMerge by closer to 200%; both are compute-limited.

          Comment

          • sdriscoll
            I like code
            • Sep 2009
            • 436

            Awesome. Yeah for all of the sam/bam utilities out there I haven't found one that could do elaborate filtering based on the details of the alignment. Probably because those tools can't guarantee that aligners are writing the alignment files with the necessary flags. Within the ecosystem of BBTools, however, you can control all...so that would be fantastic.

            I'm testing JNI with bbmap now and will report back with the speedup.

            Speedup: ~ 2%

            This was an alignment that I'd feed into eXpress for gene expression quantification. Options used:

            Code:
            t=8 ambig=all minid=0.95 usejni=t saa=t secondary=t maxsites=20 sam=1.3 trd=t
            Last edited by sdriscoll; 12-04-2014, 01:36 PM. Reason: added speedup result
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */

            Comment

            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              That's a little disappointing, but thanks for the report! The speedup should be greater the worse the reads match the genome, as the JNI portion only affects the affine-transform alignment, which is not used unless reads have more than 1 mismatch to the reference.

              Comment

              • Thias
                Member
                • Mar 2013
                • 45

                bbmap and multimapper MAPQ

                Hello Brian and others,

                Usually I run bbmap with the ambig=best parameter and am good to go, but for one of our datasets, it may prove favourable to give multi-mapping reads some special treatment. Therefore I now have to familiarize myself with the SAM format and find optimal settings for high quality mapping of spliced stranded reads.

                I ran bbmap for an initial test with those parameters
                Code:
                ambig=all minid=0.9 padding=6 tipsearch=200 maxindel=200000 intronlen=5 secondary=T quickmatch=T sssr=0.97 local=T saa=F xstag=T xmtag=T nhtag=T
                and am now messing around with the alignments. The current idea is to use bbmaps real MAPQ and modify it according to our additional criteria. Then I will sort the file based on the read ID and secondarily descending on the final MAPQ and neglect ID duplicates to resolve multi-mappers.

                However when I check the original MAPQ, I only see qualities between 1-3 despite I have chosen local=T option. Shouldn't there be some higher MAPQ reads also within the multi-mappers? Or did I misunderstand the local option?

                Code:
                # All MAPQ
                awk -F "\t" '!/^@/ {print $5}' $SAMFILE | sort | uniq
                # Only multimapping MAPQ
                grep -Fvw "NH:i:1" $SAMFILE | awk -F "\t" '!/^@/ {print $5}' | sort | uniq
                Thanks a lot
                Matthias

                PS: Support for something like sdriscoll's bbmap-filter.pl becoming a part of bbmap suite.

                Comment

                • Brian Bushnell
                  Super Moderator
                  • Jan 2014
                  • 2709

                  Hmm... all multi-mapping reads get their quality penalized, according to how many sites there are and how close the scores are. Because the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.

                  You can also use the flag "idtag", though, which will print each read's percent identity to the reference in a custom field. This value is unaffected by multimapping, so you could use that for unbiased filtering by mapping quality.

                  Comment

                  • Thias
                    Member
                    • Mar 2013
                    • 45

                    Originally posted by Brian Bushnell View Post
                    Because the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.
                    Thanks a lot for your kind offer, but your hint with the "idtag" is perfectly fine for our purpose, so no need to adjust your code. I just misunderstood the doings of the local=T flag in exactly the proposed way. Since some downstream tools may depend on the real MAPQ, on second thought it now anyway seems wiser to introduce an extra column for that custom score, resolve the multi-mappers according to it and kick that column out again for the final SAM file.

                    Thanks a lot!
                    Matthias

                    Comment

                    • Brian Bushnell
                      Super Moderator
                      • Jan 2014
                      • 2709

                      sdriscoll and Matthias,

                      I've added the following flags:

                      subfilter=-1 Ban alignments with more than this many substitutions.
                      insfilter=-1 Ban alignments with more than this many insertions.
                      delfilter=-1 Ban alignments with more than this many deletions.
                      indelfilter=-1 Ban alignments with more than this many indels.
                      editfilter=-1 Ban alignments with more than this many edits.
                      inslenfilter=-1 Ban alignments with an insertion longer than this.
                      dellenfilter=-1 Ban alignments with a deletion longer than this.

                      (all of those have no effect if the value is negative)

                      penalizeambiguous=t (pambig) Penalize the mapping score of reads that multimap.

                      Comment

                      • Thias
                        Member
                        • Mar 2013
                        • 45

                        Hello Brian,

                        Kudos to you for doing all of that at no time. This will certainly make the bbmap suite even more useful than it already is.

                        Thanks again
                        Matthias

                        BTW: I solved my problem by using the idtag and the scoretag:

                        Code:
                         grep -o 'YR:i:[0-9]*' $MULTIMAPSAMFILE | cut -f3- -d':'
                        I tweaked the scores a bit for our purpose and pasted them as first column to the sam. Then I sorted the file on the read ID and the scores and removed the duplicates.
                        sort -k2,2 -r -k1,1 $SCOREDMULTIMAPSAMFILE | awk -F "\t" '!x[$2]++' | cut -f2-

                        Comment

                        • tkc.yang
                          Junior Member
                          • Dec 2014
                          • 3

                          Hi Brian,

                          For the mismatch filter, is there (or could there be) an option for number of mismatches accepted based on the length of the read? My reads have variable lengths and some are much longer than others. So, I'm wondering if setting an absolute number of mismatches (eg. 2) would be too stringent on the long reads...

                          Also, my long reads tend to drop off in quality towards the right end, so I want to trim the right end off long reads (eg. keep only the first 200bp) without affecting the shorter reads. Is there good a way to do this? I can't trim a set number of bp because that would trim the short reads; and for maxlen option, I could break long reads (from the left side?), but I don't want to align the remaining right ends of long reads.

                          Thanks!

                          Comment

                          • Brian Bushnell
                            Super Moderator
                            • Jan 2014
                            • 2709

                            You can use "idfilter" to allow a variable number of edits depending on the read length. e.g. idfilter=0.98 would allow up to 4 edits in a 200bp read but only 2 in a 100bp read. For quality-trimming, you have a couple of options. You can use BBDuk to forcibly trim all reads to at most 200bp like this:

                            bbduk.sh in=reads.fq out=trimmed.fq forcetrimright=199

                            Alternately, you can use "qtrim=r trimq=10" for example to trim the right side of the read to Q10, so only low-quality bases will be removed (you can set that higher if you want). This flag works in BBDuk or BBMap. BBMap also allows you to trim before mapping, then restore the read to the original length afterward, using the "untrim" flag, for example:

                            bbmap.sh in=reads.fq out=mapped.sam qtrim=r trimq=10 untrim

                            This is useful when you only want to allow high-identity alignments, but you don't want to discard the low-quality bases at the end of the read.

                            Comment

                            • rkanwar
                              Junior Member
                              • Feb 2011
                              • 8

                              Hello Brian,

                              I aligned fastq paired end reads using bbmap. I then tried to sort the SAM file using Picard and I am getting the following error:

                              Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Mate Alignment start (146342323) must be <= reference seque
                              nce length (90354753) on reference 16; Line 1211
                              Line: R0266248:292:C5C0TACXX:4:1101:4088:2519 161 16 33530580 44 98= = 146342323 112811852 ATAGAAAATTATTCA
                              GCTATATTCACTGCCTCACCACCTTTGTTTTTTTGTACACAAAAAATAACATTATCATTATTTGATTGCTCTCATGAAGCACT CCCFFFFFGHHHHJJJJJJJJIJJJJJJJJJJIJJJJFJJJJIIJJJIJIJIIJJJIJJIHHH
                              FFFFFFEEEEEEEFEFEDDDDDDDDDEDDDDDDDD NM:i:0 AM:i:44 RG:Z:AS_N#1#4#AS_N_1
                              It appears that for this read, each end is aligning to a different contig. But in the SAM record we have an '=' for the mate's contig name, which might be causing it. Is my reasoning correct, and how do I fix it ? Thanks a lot for your help.

                              Comment

                              • Brian Bushnell
                                Super Moderator
                                • Jan 2014
                                • 2709

                                Sounds like a bug, and I agree, it seems as though the mate is mapped to a different (longer) contig. What version of BBMap are you using? Also, can you give me the complete command line and, if possible, the text of the second read? Thanks!

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  06-02-2026, 10:05 AM
                                • SEQadmin2
                                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                  by SEQadmin2


                                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                  Introduction

                                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                  05-22-2026, 06:42 AM
                                • SEQadmin2
                                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                  by SEQadmin2

                                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                  05-06-2026, 09:04 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, 06-02-2026, 12:03 PM
                                0 responses
                                19 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 11:40 AM
                                0 responses
                                14 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-28-2026, 11:40 AM
                                0 responses
                                29 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-26-2026, 10:12 AM
                                0 responses
                                31 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...