Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • SNPsaurus
    Registered Vendor
    • May 2013
    • 525

    I have a question about the bbmap summary output. Here is an example:

    Code:
    Reads Used:             1061591 (151284618 bases)
    
    Mapping:                15.250 seconds.
    Reads/sec:              69611.57
    kBases/sec:             9920.17
    
    
    Read 1 data:            pct reads       num reads       pct bases          num bases
    
    mapped:                  76.2432%          809391        77.6086%          117409909
    unambiguous:             75.3885%          800318        76.8084%          116199316
    ambiguous:                0.8547%            9073         0.8002%            1210593
    low-Q discards:           0.0000%               0         0.0000%                  0
    
    perfect best site:       11.4049%          121073         9.7282%           14717263
    semiperfect site:        58.9222%          625513        60.0540%           90852423
    
    Match Rate:                   NA               NA        99.1429%          116413381
    Error Rate:              20.6634%          183594         0.2527%             296692
    Sub Rate:                19.9279%          177059         0.2369%             278137
    Del Rate:                 0.6244%            5548         0.0084%               9819
    Ins Rate:                 0.6302%            5599         0.0074%               8736
    N Rate:                  72.0298%          639985         0.6044%             709655
    What does an N rate of 72% mean in this context? The reference has 0 bases that are N. There are 20,000 out of the 1 million reads that contain an N. The N Rate percent bases of 0.6% seems high compared to 20,000 reads with 1 or 2 Ns, but closer to what I think I see than the 72% N for reads.
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

    Comment

    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      This means 72 percent of the reads mapped with an "N" symbol in the match string, an internal data structure similar to a cigar string. The "N" symbol denotes either an N in the read or an N in the reference, which can include sequences that go off the end of scaffolds (the ends are internally padded with Ns). Additionally, degenerate IUPAC codes are counted as Ns.

      Comment

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        Originally posted by kimng View Post
        Hello, I've been using different parts of the BBMap suite and mixing them with other softwares for WGS of bacterial samples (primarily for the purpose of automated QC). I'm a fan of callvariants.sh as it works quickly and doesn't require me to sort .sam files before hand and it's simple to clearfilters to return all values. Unfortunately when using callvariants.sh with .sam files created from minimap2 (https://github.com/lh3/minimap2) using illumina nextseq reads. I receive the error seen below. I was curious if this was a minor bug that could be corrected?

        I'm running callvariants.sh from bbmap v37.90 obtained from conda (https://anaconda.org/bioconda/bbmap). Appreciate any assistance given.

        -Kim

        Error:
        java -ea -Xmx48779m -Xms48779m -cp /home/ssi.ad/kimn/.conda/envs/env_kim/opt/bbmap-37.90/current/ var2.CallVariants in=aln.sam ref=contigs.fasta vcf=minimap.vcf ploidy=1 clearfilters
        Executing var2.CallVariants [in=aln.sam, ref=contigs.fasta, vcf=minimap.vcf, ploidy=1, clearfilters]

        Loading reference.
        Time: 0.304 seconds.
        Processing input files.
        Exception in thread "Thread-3" Exception in thread "Thread-4" Exception in thread "Thread-5" java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
        at stream.SamLine.toShortMatch(SamLine.java:1200)
        at stream.SamLine.toRead(SamLine.java:1972)
        at stream.SamLine.toRead(SamLine.java:1832)
        at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
        at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
        java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
        at stream.SamLine.toShortMatch(SamLine.java:1200)
        at stream.SamLine.toRead(SamLine.java:1972)
        at stream.SamLine.toRead(SamLine.java:1832)
        at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
        at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
        java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
        at stream.SamLine.toShortMatch(SamLine.java:1200)
        at stream.SamLine.toRead(SamLine.java:1972)
        at stream.SamLine.toRead(SamLine.java:1832)
        at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
        at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
        The problem here is that minimap uses old-style cigar strings (M symbol instead of = and X) and also does not produce MD tags. I've added the ability to handle reads in that situation and it will be released in v37.95 (soon).

        Prior to that you could probably add MD tags with "samtools calmd".

        Comment

        • SNPsaurus
          Registered Vendor
          • May 2013
          • 525

          The reference in this case is a list of RAD loci sequences, each at 150 bp. The reads are 150 bp and should match one of the RAD loci. Since there are no N (or degenerate nucleotides) in the reference, and only a small number of reads with an N, it must have to do with the padding. A trimmed read would not have N padding, would it? This sample was degraded DNA, so plenty of reads were trimmed to less than 150 nucleotides. I doubt that 72% of the loci from this sample had an insertion that made the read extend past the reference (generated from consensus of many samples). Odd, but I guess it is not critical for me to fully figure this out.
          Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

          Comment

          • tabwalsh
            Junior Member
            • Mar 2018
            • 2

            Bias towards forward strand in msa.sh

            Hello Brian,

            Thanks for the lovely tool. I've been using msa.sh with cutprimers.sh (as described here) to extract sequences between pairs of primers.

            However, I got some unexpectedly large sequences as output, and when I took a look at the SAM alignments output by msa.sh, none of them are mapping to the reverse strand.

            For example, with the following template sequence…
            >template
            CTCGAGGGGCCTAGACATTGCCCTCCAGAGAGAGCACCCAACACCCTCCAGGCTTGACCGGCCAGGGTGTCCCCTTGACACCCGTCAAGCCTCACGA

            …it should be possible to use the following commands to extract the sequence between the given primers:
            msa.sh in=template.fa primer=GGGGCCTAGACATTGCCCTCCA out=fwd.sam
            msa.sh in=template.fa primer=GACACCCTGGCCGGTCAAGCCT out=rev.sam
            cutprimers.sh in=template.fa sam1=fwd.sam sam2=rev.sam include=t out=amplicon.fa

            This outputs the following (primer alignments shown in uppercase):
            GGGGCCTAGACATTGCCCTCCAgagagagcacccaacaccctccaggcttgaccggccagggtgtccccttGACACCCGTCAAGCCT

            The sequence GACACCCGTCAAGCCT is the best match for the reverse primer on the forward strand, but not the best match overall.

            That is on the reverse strand, which would result in the following (primer alignments again in uppercase):
            GGGGCCTAGACATTGCCCTCCAgagagagcacccaacaccctccAGGCTTGACCGGCCAGGGTGTC

            For the latest version of BBMap (BBMap_37.95.tar.gz), it looks like this issue might be fixed by changing line 183 of file 'bbmap/current/jgi/FindPrimers.java' to:
            score2=ss.score=ss.quickScore;
            Last edited by tabwalsh; 03-29-2018, 06:31 AM.

            Comment

            • GenoMax
              Senior Member
              • Feb 2008
              • 7142

              Originally posted by tabwalsh View Post
              For the latest version of BBMap (BBMap_37.95.tar.gz), it looks like this issue might be fixed by changing line 183 of file 'bbmap/current/jgi/FindPrimers.java' to:
              score2=ss.score=ss.quickScore;
              I suggest that you create a ticket on BBMap SF site since you have specifically identified a fix.

              Comment

              • tabwalsh
                Junior Member
                • Mar 2018
                • 2

                Thanks, I've submitted a ticket here: https://sourceforge.net/p/bbmap/tickets/7/

                Comment

                • bwubb
                  Member
                  • Jan 2012
                  • 61

                  Greetings,

                  First time bbmap user and Im having some odd compatibility issues with certain tools using a bbmap generated bam. I do not think I did anything particularly odd in my bam workflow:

                  Left trim and right trim using bbduk | bbmerge | bbmap | samtools addreplacerg

                  From this I was able to successfully make variant calls and generate some metrics using GATK/Picard. However bedtools was reporting zero coverage at all intervals and I am unable to view reads restricting to regions using samtools.

                  Code:
                  [bwubb@node063 GRCh37]$ samtools view 4721.ready.bam 1
                  [main_samview] region "1" specifies an unknown reference name. Continue anyway.
                  
                  
                  [bwubb@node063 GRCh37]$ samtools view -H 4721.ready.bam | head -5
                  @HD     VN:1.4  SO:coordinate
                  @SQ     SN:1 dna:chromosome chromosome:GRCh37:1:1:249250621:1   LN:249250621
                  @SQ     SN:2 dna:chromosome chromosome:GRCh37:2:1:243199373:1   LN:243199373
                  @SQ     SN:3 dna:chromosome chromosome:GRCh37:3:1:198022430:1   LN:198022430
                  @SQ     SN:4 dna:chromosome chromosome:GRCh37:4:1:191154276:1   LN:191154276
                  Are they SN incorrect? Nothing looks really unusual. I guess the first line of my fastq does look like

                  Code:
                  >1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
                  If this is the issue can I "correct" the bbmap index or should I just modify the fasta? Thanks.

                  -bwubb

                  Comment

                  • luc
                    Senior Member
                    • Dec 2010
                    • 469

                    Convenient way to avoid re-indexing references?

                    When working with big genomes one would really like to avoid re-indexing reference sequences. BBmap is always looking for the reference indices in the working directory and not in the directory where the reference resides. Is there any way to point BBMap to the index? When using "path=" it seems to automatically re-index i this location.
                    Thanks in advance.

                    Comment

                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      Originally posted by bwubb View Post
                      Greetings,

                      First time bbmap user and Im having some odd compatibility issues with certain tools using a bbmap generated bam. I do not think I did anything particularly odd in my bam workflow:

                      Left trim and right trim using bbduk | bbmerge | bbmap | samtools addreplacerg

                      From this I was able to successfully make variant calls and generate some metrics using GATK/Picard. However bedtools was reporting zero coverage at all intervals and I am unable to view reads restricting to regions using samtools.

                      Code:
                      [bwubb@node063 GRCh37]$ samtools view 4721.ready.bam 1
                      [main_samview] region "1" specifies an unknown reference name. Continue anyway.
                      
                      
                      [bwubb@node063 GRCh37]$ samtools view -H 4721.ready.bam | head -5
                      @HD     VN:1.4  SO:coordinate
                      @SQ     SN:1 dna:chromosome chromosome:GRCh37:1:1:249250621:1   LN:249250621
                      @SQ     SN:2 dna:chromosome chromosome:GRCh37:2:1:243199373:1   LN:243199373
                      @SQ     SN:3 dna:chromosome chromosome:GRCh37:3:1:198022430:1   LN:198022430
                      @SQ     SN:4 dna:chromosome chromosome:GRCh37:4:1:191154276:1   LN:191154276
                      Are they SN incorrect? Nothing looks really unusual. I guess the first line of my fastq does look like

                      Code:
                      >1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
                      If this is the issue can I "correct" the bbmap index or should I just modify the fasta? Thanks.

                      -bwubb
                      BBMap is one of the aligners that uses the full header present in your fasta file when creating the index and passes it along to alignment file. If there are spaces in the header name they are written to alignment. Some downstream programs have a problem with this.

                      You can use the option "trd=t" to truncate the fasta header names after the first space in the name. There is an option for reformat.sh that can do this after the fact for aligned data. I can look this up later if you don't find it.

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        Originally posted by luc View Post
                        Convenient way to avoid re-indexing references?

                        When working with big genomes one would really like to avoid re-indexing reference sequences. BBmap is always looking for the reference indices in the working directory and not in the directory where the reference resides. Is there any way to point BBMap to the index? When using "path=" it seems to automatically re-index i this location.
                        Thanks in advance.
                        Index the genome independently by doing
                        Code:
                        bbmap.sh ref=your_genome.fa
                        This will produce the index in the local directory. There will be a top level "ref" directory with several things in it. Don't worry about the exact organization/names of files in it. This is the pre-made index.

                        When you want to do the alignments instead of "ref=" you pass "path=/path_to_dir_containing_ref_dir". That should use the pre-made index.
                        Last edited by GenoMax; 04-07-2018, 09:37 AM.

                        Comment

                        • kevin199011
                          Junior Member
                          • May 2014
                          • 4

                          Hi Brian, can BBduk deal with trimming adapter sequence only but retain other bases? Forexample:
                          SEQUENCEADAPTERSEQUENCE; trimming adapter leaves: SEQUENCESEQUENCE?

                          Thank you!

                          Kevin

                          Comment

                          • Coyk
                            Member
                            • Jun 2011
                            • 15

                            basic java windows error help

                            Hello,
                            I've been trying to get any basic command to work on my windows 7 machine. I've installed bbmap in the C:\Users\me directory. My .fastq.gz files are in the bbmap folder. When I type the command
                            C:\Users\me>java -ea -Xmx1g -cp \bbmap\current\jgi.BBDukF in1=inputR1.fastq.gz in2=inputR2.fastq.gz out1=test1.fastq out2=test2.fastq ktrim=r mink=8 k=15

                            I get "could not find or load main class in1=inputR1.fastq.gz"

                            I've messed with the type of slash, putting the infiles in the current directory, and taking away parts of the directory names. I even truncated the argument after -cp to just jgi.BBDukF, which I know is wrong, and it still wants the main class of in1. I don't know a thing about java and when I read instructions about paths, classpaths, path environments, etc, I get confused quickly. When I type "echo %CLASSPATH% into the command prompt, I get CLASSPATH back so that's not set. I tried using Git Bash but when I typed "bbduk.sh" I got "command not found" so I'm stuck, but I would really like to use this tool on my windows machine. Anyone willing to give me very basic instructions on the exact commands that get bbduk to run on a windows machine?

                            Comment

                            • JenBarb
                              Member
                              • Oct 2010
                              • 47

                              filterbyname.sh is not pulling out reads

                              Hello,
                              Any suggestions as to why filterbyname is not pulling out reads that I know are there.

                              Help??

                              Jen

                              Comment

                              • GenoMax
                                Senior Member
                                • Feb 2008
                                • 7142

                                Originally posted by JenBarb View Post
                                Hello,
                                Any suggestions as to why filterbyname is not pulling out reads that I know are there.

                                Help??

                                Jen
                                Are you including the entire fastq/fasta header in your retrieval command?

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  06-02-2026, 10:05 AM
                                • SEQadmin2
                                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                  by SEQadmin2


                                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                  Introduction

                                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                  05-22-2026, 06:42 AM
                                • SEQadmin2
                                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                  by SEQadmin2

                                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                  05-06-2026, 09:04 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, 06-02-2026, 12:03 PM
                                0 responses
                                19 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 11:40 AM
                                0 responses
                                14 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-28-2026, 11:40 AM
                                0 responses
                                29 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-26-2026, 10:12 AM
                                0 responses
                                31 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...