SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   Yes .. BBMap can do that! (http://seqanswers.com/forums/showthread.php?t=58221)

sdriscoll 01-22-2018 12:57 PM

Hello-

I'm trying to understand this example in the original post:

Code:

bbmap.sh in=reads.fq outu=unmapped.fq int=f
repair.sh in=unmapped.fq out=paired.fq fint outs=singletons.fq

It is provided as an example of how to deal with the non-deterministic behavior of bbmap when aligning paired-end reads. However what is described (map reads as single-end and then re-pair them using repair.sh) doesn't seem to be what these code lines show. If the goal is to get bbmap to produce deterministic paired-end alignments wouldn't the final outcome of the method be an alignment file and not just a FASTQ? Why would I be using the un-aligned reads as part of the final product?

milan.molbio 01-23-2018 12:46 AM

Hi guys,

I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong. Here's the test scenario:

Code:

# file: sample.genome.fa
> 10 GRCz10
TTTCCTGAGGATGTGAGATTACTTCCGGGTTTTACAACAACATGGCAGCGCCCTTAGGTCTGCATCTGGAGTTTGGGTTTGTATTTATTGTTTCCTTTATCATTAATCTATTTTTTTTTTCTTTCTCGTGTTTTTTAAGAACATGGCAGTTTTGTTTTACATCAACTTGGCAGTAAT

# file sample.reads.fa:
> r1
GTTTTACAACAACATGGCAG
GTTTTTTAAGAACATGGCAG

# run.bash
bbmap.sh ref=sample.genome.fa k=3
bbmap.sh minid=0.7 k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all  in=sample.read.fa out=mapped.sam vslow

# results: mapped.sam
mapped.sam
@HD        VN:1.4        SO:unsorted
@SQ        SN: 10 GRCz10        LN:177
@PG        ID:BBMap        PN:BBMap        VN:37.85        CL:java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
 r1        0        10 GRCz10        29        40        20=        *        0        0        GTTTTACAACAACATGGCAG        *        NM:i:0        AM:i:40        NH:i:1

expected output:
  1. 1 perfect hit (position 28)
  2. 1 hit with 2 substitutions (position 153, 8:T>A,13:T>A - GTTTTACATCAACTTGGCAG)
  3. 1 hit with 3 substitutions (position 129, 5:T>A,6:T>C,9:G>C - GTTTTTTAAGAACATGGCAG)


Code:

# stdout:

Max memory cannot be determined.  Attempting to use 3200 MB.
If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m -cp /Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
Executing align2.BBMap [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
Version 37.85 [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]

Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.250
Retaining all best sites for ambiguous mappings.
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.449
NOTE:        Ignoring reference file because it already appears to have been processed.
NOTE:        If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
Set genome to 1

Loaded Reference:        0.075 seconds.
Loading index for chunk 1-1, build 1
Generated Index:        0.004 seconds.
Analyzed Index:          0.002 seconds.
Started output stream:        0.021 seconds.
Cleared Memory:            0.116 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 8 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7

  ------------------  Results  ------------------

Genome:                        1
Key Length:                    3
Max Indel:                    0
Minimum Score Ratio:          0.4486
Mapping Mode:                normal
Reads Used:                  1        (20 bases)

Mapping:                  0.208 seconds.
Reads/sec:              4.81
kBases/sec:              0.10


Read 1 data:              pct reads        num reads        pct bases          num bases

mapped:                  100.0000%                1        100.0000%                  20
unambiguous:            100.0000%                1        100.0000%                  20
ambiguous:                0.0000%                0          0.0000%                    0
low-Q discards:            0.0000%                0          0.0000%                    0

perfect best site:        100.0000%                1        100.0000%                  20
semiperfect site:        100.0000%                1        100.0000%                  20

Match Rate:                    NA                NA        100.0000%                  20
Error Rate:                0.0000%                0          0.0000%                    0
Sub Rate:                  0.0000%                0          0.0000%                    0
Del Rate:                  0.0000%                0          0.0000%                    0
Ins Rate:                  0.0000%                0          0.0000%                    0
N Rate:                    0.0000%                0          0.0000%                    0

Total time:            0.571 seconds.


sdriscoll 01-23-2018 11:46 AM

@milan.molbio-

bbmap looks for the best alignment and reports only that one by default. If there are multiple "best" then you can get them all by setting 'ambig=all'.

What you're looking for should be possible by enabling 'secondary=t' and then setting both 'sssr' and 'maxsites' to your liking. "secondary" alignments are those with lower mapping scores relative to the best alignment. "sssr" controls the ratio of the best score that is acceptable (default is 0.95 to allow secondary alignments with mapping scores as low as 95% of the best score). Finally "maxsites" sets a cap on the number of secondary alignment sites to report. You can probably just set that to a very high value to be sure you have them "all" - or at least all that bbmap can find.

milan.molbio 01-24-2018 05:03 AM

@sdriscoll thanks for the tip! with these two options bbmap does report one 100% semiperfect site read but it's not in the output file. Anything else I'm missing?

the exact command was:
Code:

bbmap.sh secondary=t sssr=0.80  k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all  maxsites=10  in=sample.read.fa out=mapped.sam vslow

GenoMax 01-24-2018 05:41 AM

I am not sure why BBMap is not reporting the two sites in the output SAM. Perhaps that is by design.

Unfortunately @Brian has been missing from online forums for last couple of months. I pinged him again yesterday with ref to this question. We will need him to chime in on these questions.

sdriscoll 01-24-2018 11:31 AM

Yeah I've come to terms with the fact that BBMap wasn't really designed for this type of use. When I want to work with an alignment strategy where the goal is to find "all" alignments of a read up to some maximum error threshold I turn to bowtie (1 or 2) or BWA.

milan.molbio 01-25-2018 12:13 AM

Quote:

Originally Posted by sdriscoll (Post 214346)
Yeah I've come to terms with the fact that BBMap wasn't really designed for this type of use. When I want to work with an alignment strategy where the goal is to find "all" alignments of a read up to some maximum error threshold I turn to bowtie (1 or 2) or BWA.

i did give them both a try. bowtie seems to reliably find and report them but it's got a limit of maximum 3 mismatches (and I need 4 :) bwa is also struggling to find all the matches, and i've filed an issue: https://github.com/lh3/bwa/issues/176

Alex Lee 03-07-2018 02:35 PM

Hi Brian, I have a problem whereby I have a about 12 fastq files and I'm not completely convinced that the barcode for each of the samples are as named in the key. If I have 12 unique 6-mer barcodes eg GTCCGC, etc is there a way to definitively say which samples have the following barcode with bbmerge? thanks!

GenoMax 03-08-2018 03:20 AM

I assume you are referring to a standard barcode/index from Illumina tech. If so those are never part of actual read. If they are present in the fastq header then you could use "demuxbyname.sh" to separate the reads you are interested in.

If your index is "inline" (and at a specific location in the read e.g. beginning) then you could use bbduk.sh to match/separate those reads.

I am not sure bbmerge.sh is going to help in your case.

Brian Bushnell 03-21-2018 03:18 PM

Quote:

Originally Posted by Alex Lee (Post 215436)
Hi Brian, I have a problem whereby I have a about 12 fastq files and I'm not completely convinced that the barcode for each of the samples are as named in the key. If I have 12 unique 6-mer barcodes eg GTCCGC, etc is there a way to definitively say which samples have the following barcode with bbmerge? thanks!

BBMerge might be able to help in this case, if you have paired reads with a sufficient number of short inserts. You can run it like this:

Code:

bbmerge.sh in1=read1.fq in2=read2.fq outa=adapters.fa
This will indicate the adapter sequence for the library, based on read-through from short-insert reads. The barcode is typically part of the adapter, so if you compare the adapters from 2 multiplexed libraries, the location where they differ is the barcode region.

Brian Bushnell 03-21-2018 03:21 PM

Quote:

Originally Posted by milan.molbio (Post 214280)
Hi guys,

I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong.

While BBMap is not originally designed for this purpose; I made a version that does a much better job at finding all mappings above some identity threshold, bbmapskimmer.sh. The usage is the same as BBMap; just add the ambig=all flag.

GenoMax 03-21-2018 04:59 PM

Quote:

Originally Posted by Brian Bushnell (Post 215791)
While BBMap is not originally designed for this purpose; I made a version that does a much better job at finding all mappings above some identity threshold, bbmapskimmer.sh. The usage is the same as BBMap; just add the ambig=all flag.

Does this mean that bbmap functions identically but does not report all the alignments in the output file but includes them in the statistics for alignment?

darthsequencer 06-01-2018 12:14 PM

Is there a multi-sample pileup? I have a bunch of samples mapped to the same reference. It would be awesome to have that as a table with the coverage for each sample per reference sequence.

SDH 10-16-2018 11:06 PM

Hello

My sequencing output largely consists of Illumina (NextSeq and Miseq) NON-overlapping paired end reads (the occasional reads will overlap, as the DNA is not all fragmented evenly). For ease of storage and to simplify input data at the start of a workflow, I would like to combine the files.

If I understand correctly, bbmerge is specifically for merging overlapping reads? In which case I am best off simply using the bbmap reformat.sh to combine the reads in to one interleaved file?

Code:

reformat.sh in1=read1.fq in2=read2.fq out=reads.fq

Grammatonotus 02-21-2019 01:06 PM

shredding and mapping long reads to reference error
 
Hello,

I have a reference and I have an additional genome sequenced via 10x that contains 150k contigs from 1k to 7.5Mb. I am attempting to align the second to the first with BBMap by using the flag maxlen=2000 as there are extensive rearrangements. This morning I found this error message. Can anyone tell me what might have happened? Thanks.
Code:

'BBMap failed with error code 1
Executing align2.BBMapPacBio [maxlen=2000, maxindel=4000, ambiguous=random, k=15, saa=f, maxlen=6000, minratio=0.40, minscaf=100, startpad=10000, stoppad=10000, midpad=6000, ref=ref.fasta, out=bbmap.sam, in=reads1.fasta]

BBMap version 37.64
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.400
Choosing a site randomly for ambiguous mappings.
Writing reference.
Executing dna.FastaToChromArrays2 [ref.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=100, midpad=6000, startpad=10000, stoppad=10000, nodisk=false]

Set genScaffoldInfo=true
Writing chunk 1
Writing chunk 2
Writing chunk 3
Writing chunk 4
Writing chunk 5
Writing chunk 6
Writing chunk 7
Set genome to 1

Loaded Reference: 0.007 seconds.
Loading index for chunk 1-7, build 1
No index available; generating from reference genome: /Users/wilfordbrimley/Geneious Prime/temp/17/ref/index/1/chr1-3_index_k15_c2_b1.block
No index available; generating from reference genome: /Users/wilfordbrimley/Geneious Prime/temp/17/ref/index/1/chr4-7_index_k15_c2_b1.block
Indexing threads started for block 0-3
Indexing threads started for block 4-7
Indexing threads finished for block 0-3
Indexing threads finished for block 4-7
Generated Index: 249.271 seconds.
Analyzed Index: 71.551 seconds.
Started output stream: 0.096 seconds.
Cleared Memory: 0.195 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 24 mapping threads.
Exception in thread "Thread-29" java.lang.AssertionError: -30, -30
89764, 2,0,149134394,149142098,0,00,971,475770,460506,0,389847,149134394~149139756~149140058~149142098,mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmSmmmmmmmmmmmSmmmSmmmmmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmSmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmSmmmSmmmmmmmmmmmmmmmSm
...
GCATTCAGACAAGGGTAGCAATATCAATCAACTCTATCTGCTTACAGGAATTGAGCTAGCATAAAGTGCTAAACTCATACTATCAGGCATAACCTAGTCAGGCTATCCTACTTACTGTCTACTCAATATAGGGGTGAGCATTTGGACCGGTGGACCGGACCGGACCGGTGAACCGGACCGGACCGAACCGGATCGGACCGTTTTTTTGGACCCCCGGGCCGGTTCTCGGTTCTAAGATTTCTCGCCGGCCCGGTCTGGTTTTTTTCCCGGTCCGGTCCGGTTTTTACCGGGTTTAGAACCGGTTGGACATGATTAAGATTTTTGTGACGTTTTTTAAACGCACATAACTCATTTGTTTTAAGTTGGATTGACCTGCGGTTTTTCCAACATTTAGTTTACAATTTGTAGATGAATTTAGACTATAATTTTGTTTATTTTGGTATTATATTCACCGTGTTACAATCCTTCAAAGTAGACCTAACAAAGCTGAAAAAATTCAGTTTTTCAGCAACATTTTGTAAGCTGTGACGTTTTTTAAACATCCATAACTCATTCGTTTTAATTCGGATTGACCTCCGATTTTTTCCAACATTTAGTTTACAGTTTGTAGATGAGTTTAGACTATAATTTTGTTTATTTTGGTATTATATTCACTGAGCTACAATCCTTTAAAGTAGAGCTATCAAAGTTGAAAAAATTTAGCTTTTTAGCAAAATTTTATAATCTGTGACGTTTTTTAAACACCCATAACTCATTCGTTTTAAGTTGGATTGACCTACGGTTTTTTCCAACATTTAGTTTACAATTTGTAGATGAATTTAGACTATAATTTTGTTTATTTTGGTATTATATGCATTGAGTTACAATCCTTCAAAGTAGAGCTATCAAAGCTGAAATATTTCAGCTTTTTAGTATTATTTTATACTATAATCCCGAGTCCGCGTTTAGCACTTGTGGTAGAGTTGTAAGTGAGTATCAAACTAGTTTATCGACATTGATAGTTGAAGCTTTGTTATGCACCCAAGATTGGGTGATAAAGTTTACAAATCCGATAGTTGACAACGTTGGTGACATCTTGAACGATGATGATATAACTTTGGGTAAAAAAATTACAATCTTTATTCTTTTTTTATACAATTATTACTTTTTATTAAATACTAACAACATTTTCATTTGTGTTTATGTAGAGATTGCACAAGCTTTAAATATGTTACATATGGATGACAATGATATGGGGAAGAGGCCAATAGAAGATTAGATTATGAGTTTATGAAATTTGTTAGTTTGATTGATTTTAAAGTTTAAACTATGATTTTTGTTGTTACGGTAATACTTTTGTATGTTTGTCTTAAAATTGTTTAACTTTTGTTGGTGAACTTGATTTATATGTTAGTTGCTTTTATTTGGAAAATTAATTTTTGCAAATTGCAAGTTATAATATTATACATCAATGTAGATTAGTTAGGCATGAAACGGAAACATATTAATTTTGCAAGTTATAATATTATACATCAATGTAGATTACTTATTATACTCTATGCAATTAAAGTTGTACCGGTCCAAACGAGTCCAAAAACCGGAAAACCCGGACCTAGACCCGGACACGGACACGGACACGGACACGACTAATCCGGTCCGGTCCATGGTCCACAAAATTCTCCATAGCCGGTCCGGTCCGATTCGGTCCGGGCCTGTCCAAATGCTCACCCCTAACTTAATATCTAGCATGCAATTCTTATACAACTGAAACACATAAAGCAGGCACATAAGGCATCACTCCTAGATCCTTAGTCCTATTCTAGCATGCTGTTCTACAAAAGCTGATAAACATAACATAAACTTGTAAGGGTACTTTGGGGAATACTTACTTGAGCTCGGCCGGTCGCGCACATCACACACTTCGTTCTTTCTCGAAACTCTTTTCTTAGTTTTTTAGAAAACATTTTCTTTTTAGAAAATCTTTTAATCCCTTGATTTGAGTTCAGACACCCCCGAAGGTGTGTCCGAATCCCTCAAACCAGGGCTCTGATACCAACTTGTAACGACCCAAATTTCACGTTCAAAAATTTCGTTTTAAAATGTTACTTTAGAAAACATTAATAATTAAAACATTGTTTGATCAAACCATAACACAACGTAAACCATGTCTAAAAGCCAAATCACACAACCAATCAAACATCAGAGTATAAAACCCAAAGAATCTCG
at align2.MultiStateAligner9PacBio.makeGref(MultiStateAligner9PacBio.java:1388)
at align2.MultiStateAligner9PacBio.fillLimited(MultiStateAligner9PacBio.java:96)
at align2.TranslateColorspaceRead.realign_new(TranslateColorspaceRead.java:565)
at align2.TranslateColorspaceRead.realign_new(TranslateColorspaceRead.java:641)
at align2.AbstractMapThread.genMatchStringForSite(AbstractMapThread.java:1006)
at align2.AbstractMapThread.genMatchString(AbstractMapThread.java:902)
at align2.BBMapThreadPacBio.processRead(BBMapThreadPacBio.java:563)
at align2.AbstractMapThread.run(AbstractMapThread.java:502)
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23

**************************************************************************
Warning! 2 mapping threads did not terminate normally.
Check the error log; the output may be corrupt or incomplete.
Please submit the full stderr output as a bug report, not just this message.
**************************************************************************




------------------ Results ------------------

Genome: 1
Key Length: 15
Max Indel: 4000
Minimum Score Ratio: 0.4
Mapping Mode: normal
Reads Used: 581616 (2969452913 bases)

Mapping: 177826.639 seconds.
Reads/sec: 3.27
kBases/sec: 16.70


Read 1 data: pct reads num reads pct bases num bases

mapped: 81.2921% 472808 78.5245% 2331747518
unambiguous: 76.0459% 442295 75.7328% 2248849209
ambiguous: 5.2462% 30513 2.7917% 82898309
low-Q discards: 12.7684% 74263 15.0015% 445463533

perfect best site: 4.9337% 28695 4.2346% 125743838
semiperfect site: 5.0009% 29086 4.2700% 126794871

Match Rate: NA NA 84.9205% 2173341879
Error Rate: 87.8894% 442044 13.2182% 338289025
Sub Rate: 84.8676% 426846 1.8940% 48472068
Del Rate: 67.4695% 339341 8.8901% 227520440
Ins Rate: 68.5198% 344624 2.4342% 62296517
N Rate: 30.3789% 152792 1.8614% 47637054
Exception in thread "main" java.lang.AssertionError:
The number of reads out does not add up to the number of reads in.
This may indicate that a mapping thread crashed.
If you submit a bug report, include the entire console output, not just this error message.
238640+234168+0+34543+74263 = 581614 != 581616
at align2.AbstractMapper.printOutputStats(AbstractMapper.java:1892)
at align2.AbstractMapper.printOutput(AbstractMapper.java:1022)
at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:462)
at align2.BBMapPacBio.main(BBMapPacBio.java:37)
'


Grammatonotus 02-21-2019 01:40 PM

A little bit more information: the DNA sequence in the error message maps to the middle of a contig about 2.8 Mb, starts 402,000 bases in. I tried running this both locally and on a cluster and got the same error message with almost exactly the same sequence in each-- 2045 bases in one, 2205 bases in the other, with ~99% overlap. Any insights appreciated.

GenoMax 02-21-2019 02:45 PM

@grammtonotus: How much memory are you allocating for this job? With that large sized fragments you are likely going to need plenty.

I also suggest that you look into using `minimap2` from Heng Li as an alternate option to BBMap.

Grammatonotus 02-21-2019 03:47 PM

When I run it locally about 52 Gb (I ran out of system memory when I had it set to 58), on the cluster 439 Gb. I hope that's not the problem!

Grammatonotus 02-21-2019 05:26 PM

a colleague suggests that the mapper may have come to a read that hits too many places in the reference. I have it set to place this at Random, would changing to First Best, None, or All be likely to give a different result?

Also wondering if setting maxlen=4000 might change anything. I don't know how diverged the two are so kind of just guessing really. Thanks.

GenoMax 02-22-2019 06:21 AM

Let us know if using more memory made a difference.

You should really be using a different tool designed for long read alignments. Even LASTZ may be appropriate since you have query sequences as long as 7.5Mb.

raquelnb 05-16-2019 12:58 PM

Hello Brian, it is just a question that pop up when using bbmap. For the outm, what type of flags do you use to both take pairs where both reads mappedand those where just one read of the pair mapped? Or how do you get read of both 77 and 141 flags?

Thank you and best

GenoMax 05-17-2019 04:03 AM

I think you are looking for the following options that are accessible from "reformat.sh".

Code:

Sam and bam processing options:

mappedonly=f            Toss unmapped reads.
unmappedonly=f          Toss mapped reads.
pairedonly=f            Toss reads that are not mapped as proper pairs.
unpairedonly=f          Toss reads that are mapped as proper pairs.
primaryonly=f          Toss secondary alignments.  Set this to true for sam to fastq conversion.
minmapq=-1              If non-negative, toss reads with mapq under this.
maxmapq=-1              If non-negative, toss reads with mapq over this.
requiredbits=0          (rbits) Toss sam lines with any of these flag bits unset.  Similar to samtools -f.
filterbits=0            (fbits) Toss sam lines with any of these flag bits set.  Similar to samtools -F.


bwubb 06-11-2019 04:01 AM

Greetings,

I have a seemingly simple problem, that Im sure bbmap can help with, but Im unclear what steps must be taken for success.

I have several bams from paired end sequencing where some number of mates have been removed. The remaining read still says it is paired so every tool seems to throw an error. Samtools or picard fixmate does not correct it, at best they yell the read names in question.

I am trying to use reformat.sh to remove those mates that remain, but I also seem to get a java exception, even when just trying to convert to fastq.

Code:

[bwubb@node107 disambres]$ reformat.sh -Xmx5g in=qsortA.bam out1=reads1.fq out2=reads2.fq ow
java -ea -Xmx5g -cp /home/bwubb/software/bbmap/current/ jgi.ReformatReads -Xmx5g in=qsortA.bam out1=reads1.fq out2=reads2.fq ow
Executing jgi.ReformatReads [-Xmx5g, in=qsortA.bam, out1=reads1.fq, out2=reads2.fq, ow]

Set INTERLEAVED to true
Found samtools 1.9
Input is being processed as paired
Exception in thread "Thread-2" java.lang.AssertionError: K00315:202:H577WBBXY:7:1101:1438:38767 13      -1      -      57328513        57328662        1000000100000000011    1      0  60      TCCAGAAAGAGAGTTCTTGAAAGAAAGAGGCGCTATCATTTTGACACAGATGGNAAGGGCTCGATTCACGATCAAAAAGGCTCCAAAAANAAAAAAAATTCTGTTTGTTNTNCTTTTTCCTCCNGATTCTAGTTTTTTTNTTATNTTTTG  AAFFFFFAJJJJJ7FFFJJ7JJ7FJJJJJFFJJJJJFJJJJJF-A-AAFFJFA!AF---AAJJ-AJF<FF-AAFAF<AAAJJJJFA-7A!--77-<A<JJ<A--7A<FJ!J!<FJJJJA<FF7!)7AJF---7AFJJFJ!JF--!7-7<F      .      30      C68m28Nm53      .

K00315:202:H577WBBXY:7:1101:1438:38767  113    19      57328582        60      68S82M  1      176081441      0      CAAAANATAANAAAAAAACTAGAATCNGGAGGAAAAAGNANAACAAACAGAATTTTTTTTNTTTTTGGAGCCTTTTTGATCGTGAATCGAGCCCTTNCCATCTGTGTCAAAATGATAGCGCCTCTTTCTTTCAAGAACTCTCTTTCTGGA      F<7-7!--FJ!JFJJFA7---FJA7)!7FF<AJJJJF<!J!JF<A7--A<JJ<A<-77--!A7-AFJJJJAAA<FAFAA-FF<FJA-JJAA---FA!AFJFFAA-A-FJJJJJFJJJJJFFJJJJJF7JJ7JJFFF7JJJJJAFFFFFAA      NM:i:1  MD:Z:28C53      MC:Z:150M      AS:i:80 XS:i:21 SA:Z:1,176081357,+,81S62M7S,54,7;      RG:Z:FGC2033.7
flag=1110001

        at stream.SamReadInputStream.toReadList(SamReadInputStream.java:136)
        at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:89)
        at stream.SamReadInputStream.hasMore(SamReadInputStream.java:53)
        at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:643)
        at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)


I thought I could make read1.fq reads2.fq and use repair.sh Clearly it finds the two problem reads. I have tried other arguments with the reformat.sh, and a few other tools, but none want to correct the MATE_NOT_FOUND problem. It would seem impossible that I am the first person to run into this issue. Thank you very much for any advice/help.

-bwubb

EDIT: Sorry if this post shows up multiple times. I literally tried making it three times and nothing showed up for a full day. Tried a quick reply and it showed up immediately.

GenoMax 06-12-2019 07:05 AM

@bwbubb: Option for reformat.sh that you are looking for is "pairedonly=f Toss reads that are not mapped as proper pairs". Set pairedonly=t and process your BAM files.

bwubb 06-13-2019 07:33 AM

Quote:

Originally Posted by GenoMax (Post 226803)
@bwbubb: Option for reformat.sh that you are looking for is "pairedonly=f Toss reads that are not mapped as proper pairs". Set pairedonly=t and process your BAM files.

EDIT:

Thank you @GenoMax. This actually does not work for for this case.

Code:

[bwubb@node061 disambres]$ reformat.sh pairedonly=t in=input.bam out=fix.bam
java -ea -Xmx200m -cp /home/bwubb/software/bbmap/current/ jgi.ReformatReads pairedonly=t in=input.bam out=fix.bam
Executing jgi.ReformatReads [pairedonly=t, in=input.bam, out=fix.bam]

Found samtools 1.9
Input is being processed as unpaired
Input:                          13726088 reads                  2041390865 bases
Output:                        12583399 reads (91.68%)        1880228501 bases (92.11%)

Time:                          349.142 seconds.
Reads Processed:      13726k    39.31k reads/sec
Bases Processed:      2041m    5.85m bases/sec

[bwubb@node061 disambres]$ samtools index fix.bam
[bwubb@node061 disambres]$ java -jar ~/software/picard/2.20.2/picard.jar ValidateSamFile I=fix.bam MODE=SUMMARY
INFO    2019-06-13 08:00:16    ValidateSamFile

...

## HISTOGRAM    java.lang.String
Error Type      Count
ERROR:MATE_NOT_FOUND    2

[Thu Jun 13 08:02:01 EDT 2019] picard.sam.ValidateSamFile done. Elapsed time: 1.74 minutes.
Runtime.totalMemory()=2075918336
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

Not sure how things work internally, but from my understanding the "problem" reads think they are paired.


I toiled with this more yesterday and I believe I can use reformat.sh to make an interleaved fastq file, and then use repair.sh to make read1.fq and read2.fq and take it back to alignment. All attempts at correcting the bam directly seem to fail. I have yet to have total success, which in my case would be passing picardtools ValidateSamFile.

I am aligning with bwa mem -M, having some other issue in the validation that maybe is unrelated or will be fixed with the right reformat/repair arguments.

GenoMax 06-13-2019 08:16 AM

Have you tried adding "VALIDATION_STRINGENCY=LENIENT" to your picard command? Default value is STRICT and sometime it leads to picard throwing errors like this.

It does looks like reformat.sh took out some of the reads so the problem reads have been hopefully addressed.

bwubb 06-13-2019 09:50 AM

Quote:

Originally Posted by GenoMax (Post 226828)
Have you tried adding "VALIDATION_STRINGENCY=LENIENT" to your picard command? Default value is STRICT and sometime it leads to picard throwing errors like this.
...

Yes indeed, but it will still be an ERROR.

However, after more tinkering I was able to come up with a combination that worked.

Code:

reformat.sh -Xmx5g in=qSortA.bam out=qSortA_reads.fq
reformat.sh -Xmx5g int=t addcolon=t uniquenames=t in=qSortA_reads.fq out=out_reads.fq
repair.sh -Xmx5g in=out_reads.fq out1=reads1.fq out2=reads2.fq

Taking those reads into alignment and so forth will pass ValidateSamFile. Im guessing these actions are allowing me to completely rewrite the pairing status, which I am perfectly ok with.

Thank you for the comments and help!

noobie 06-26-2019 05:02 AM

I want to subsample 10 million reads from both an interleaved reads file and a singletons reads file at the same time. Can reformat.sh do this?

GenoMax 06-26-2019 06:32 AM

@Noobie: I don't think so. You will have to do them independently. What is the use case here, if I may ask?

bartn 07-08-2019 11:02 PM

randomreads.sh
 
Hi!

I want to created a (small) RNAseq test set from a small bacterial plasmid. So I used :
Code:

randomreads.sh reads=30000 paired=t metagenome=t  ref=transcripts.fasta out1=test_RNA_1.fq out2=test_RNA_2.fq
(Then I get 30k reads for each pair. No problem, I guess that's because the default is single end)

Then I mapped back the reads with BBmap
Code:

bbmap.sh ref=Rhizobium_Leguminosarum_plasmid_NZ_CP025014.1_transcripts.fasta  in1=Unlock_test_RNA_1.fq in2=Unlock_test_RNA_2.fq covstats=RNA.cov
Results:
Code:

Reads:                                      60000
Mapped reads:                                54225
Mapped bases:                                8063372
Ref scaffolds:                              562
Ref bases:                                  521537

Percent mapped:                              90.375
Percent proper pairs:                        64.822
Average coverage:                            15.461
Standard deviation:                            51.768
Percent scaffolds with any coverage:        59.96
Percent of reference bases covered:          45.87

The mapping percentages seem very low to me. Mapping to the genome gives similar results.
I am using randomreads.sh in the wrong way for this?

GenoMax 07-09-2019 03:25 AM

Since you asked for 30K reads you got 30K paired-end reads.

I would sugegst trying without the metagenome=t option for a single small input fasta. If you do have the entire genome then use that in your randomread generation step along with the plasmid.

bartn 07-09-2019 07:31 AM

Quote:

Originally Posted by GenoMax (Post 227250)
Since you asked for 30K reads you got 30K paired-end reads.

I would sugegst trying without the metagenome=t option for a single small input fasta. If you do have the entire genome then use that in your randomread generation step along with the plasmid.

Yes, that does seem to increase the mapping percentages. Any idea why it is not working with meta option? Changes read length and insert size does not seem to do much..
The plasmid is about 0.5MB with 562 transcripts. Median transcript size is 830bp, so that could be a bit short, (but then I would expect the same with the meta option off)

I will also look for another suitable (small) genome. I don't want to add the complete genome because it should finish quick, since it's for testing purposes of tools and pipelines.

(btw, one other small thing. Adding any statistic file in the arguments will give the small summary in the stdout (like I posted before). Otherwise it will not be printed. Not sure if that is expected behavior)

GenoMax 07-09-2019 09:52 AM

Metagenome option is supposed to take in multiple (genome) sequences to generate an artificial metagenomic dataset.

BBMap writes stats to STDERR by default. So you can capture that.

bartn 07-09-2019 11:13 PM

From the description of the option I understood it can be used for RNA as well:

metagenome=f Assign scaffolds a random exponential coverage level, to simulate a metagenomic or RNA coverage distribution

Which makes sense in my view since I just give multiple transcripts instead of genomes (albeit short compared to a genome)

So why are so many reads not mapping back even thought almost all of it with an even distributed coverage does map.


What I meant with the summary part, if I run just bbmap in its simplest form:
Code:

bbmap.sh ref=transcripts.fasta  in1=read_1.fq  in2=read_2.fq
the stderr does not contain this part:
Code:

Reads:                                      120000
Mapped reads:                                111248
Mapped bases:                                8357095
Ref scaffolds:                              1
Ref bases:                                  592529

Percent mapped:                              92.707
Percent proper pairs:                        56.370
Average coverage:                            14.104
Standard deviation:                            46.411
Percent scaffolds with any coverage:        100.00
Percent of reference bases covered:          41.67

But it does with adding a covstats=cov.stats for example.

GenoMax 07-10-2019 03:40 AM

Unfortunately Brian would be the only person who can provide an authoritative answer and he no longer participates on this forum. You may want to try emailing him with your question directly.

BBMap provides a wide range of coverage/histogram/stats outputs. Default output (while voluminous does not contain coverage stats) but mainly contains % alignments which most users want to see by default. In-line help (run bbmap.sh without any options/inputs) details different options available for coverage/histogram/stats outputs.

bartn 07-10-2019 05:09 AM

That's indeed unfortunate. I might try sending an email. A previous bug report was solved very quickly!

One of the reasons I like about BBMap is indeed the wide range of statistics it can generate. Just noticed that final summary percentages like I showed, are to me the most useful thing to look at on first check was missing by default.

And thanks a lot for your answers and your time GenoMax! :)

Illusive Man 08-05-2019 09:42 AM

Dear BBMap users or Brian,

I have a database of about 800 proteins and all I would like to do is create an abundance table of "hits" to these 800 proteins for each one of my 30 samples. Samples as columns, genes as rows, cells as counts. Can BBMap help me create this abundance table?

Thanks

GenoMax 08-05-2019 10:28 AM

BBMap is not designed to work with protein data. You will have to find a different tool.

Take a look at blat from UCSC's Jim Kent. That should produce the stats in parse-able tab-delimited columner format. You will need to do post-processing to get the table you are looking for.

bartn 08-05-2019 11:07 PM

another program that is often used for these cases is DIAMOND

https://github.com/bbuchfink/diamond

But like GenoMax said you have to do some post processing of the output to get to the format you want..

darthsequencer 08-21-2019 03:43 PM

Is there a command to output the kmers of each sequence in a multifasta file?

GenoMax 08-22-2019 01:34 AM

@darthsequencer: You can use kmercountexact.sh from BBMap suite.


All times are GMT -8. The time now is 10:45 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.