SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 02:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 06:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 03:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 01:10 AM

Reply
 
Thread Tools
Old 07-06-2016, 10:47 AM   #341
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Thanks, nlubock; you've allowed me to resolve 2 unexpected corner-cases with very short kmer lengths. I've released a new version (36.14) fixing both of them.
Brian Bushnell is offline   Reply With Quote
Old 07-06-2016, 06:02 PM   #342
nlubock
Junior Member
 
Location: Los Angeles

Join Date: Jun 2016
Posts: 3
Default

Thank you! Love the product.

I'm not sure if this is a bug as well, but I noticed that some of the MD Tags in my output files are missing 0's between adjacent mismatches. However, if there is a deletion and then a mismatch, bbmap.sh will add a 0.

For example, if I align the following fasta file:
Code:
>perf
CATCAGGACCAAGACTGGGGAGATCATCTCTGCTGTCCACATTGAAGCCT
>mm-at-25-26
CATCAGGACCAAGACTGGGGAGATGGTCTCTGCTGTCCACATTGAAGCCT
>del-24-mm-at-25-26
CATCAGGACCAAGACTGGGGAGAGGTCTCTGCTGTCCACATTGAAGCCT
>del-24-mm-at-25
CATCAGGACCAAGACTGGGGAGAGATCTCTGCTGTCCACATTGAAGCCT
back onto the perfect sequence, and then run the resulting sam file through samtools calmd, I get:
Code:
50
24CA24    -> 24C0A24
23^T0CA24 -> 23^T0C0A24
23^T0C25
Maybe this is a samtools convention? The official spec was not very clear...
nlubock is offline   Reply With Quote
Old 07-07-2016, 02:58 PM   #343
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

There is no official spec for MD tags as far as I know; it's basically defined by usage. The sam PDF just provides vague guidance. I think 0's are optional for adjacent mismatches and there is no indication otherwise in the sam spec. For a deletion followed by a mismatch, the 0 allows you to visually determine how long the deletion is without looking at the cigar string (otherwise it would be hard to visually distinguish between ^AC meaning "deletion of A followed by mismatch of C" versus "deletion of AC". But when combined with the cigar string there is no ambiguity, so they are both correct. Especially considering MD tags have no formal specification For machine-processing, there's no reason to have 0's and they just waste space.
Brian Bushnell is offline   Reply With Quote
Old 08-10-2016, 10:42 AM   #344
parulagwl
Junior Member
 
Location: Canada

Join Date: Jul 2014
Posts: 6
Default

We have been liking our first experiences with BBmap for our wheat data and are impressed. And it works well when we use parts of a wheat chromosome for both indexing and mapping. However, when we move from using parts of a chromosome to a whole chromosome (e.g. chr 3B approx. 850 million bases) we get the following error during indexing:

bbmap.sh k=14 ref=3B.fa

java -Djava.library.path=/software/bbmap/bbmap/jni/ -ea -Xmx119912m -cp /software/bbmap/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 k=14 ref=3B.fa

Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, k=14, ref=3B.fa]

BBMap version 35.74

Retaining first best site only for ambiguous mappings.
No output file.

NOTE: Ignoring reference file because it already appears to have been processed.
NOTE: If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt

Set genome to 1

Loaded Reference: 0.008 seconds.

Exception in thread "main" java.lang.AssertionError: 1, 0
at align2.BBIndex.loadIndex(BBIndex.java:94)
at align2.BBMap.loadIndex(BBMap.java:361)
at align2.BBMap.main(BBMap.java:31)

If we split Chr 3B into two fasta sequences and run it works fine and dandy. Have also tried BBmap v36.20 and get the same error. I'm guessing there is a limitation on Chr size? Is there a way to adjust this?
Many Thanks.
parulagwl is offline   Reply With Quote
Old 08-10-2016, 10:47 AM   #345
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Oh, that's kind of irritating. BBMap as currently structured has a maximum reference sequence length of 500Mbp. I designed it that way because I was unaware of any chromosomes longer than that, and I believed the reason to be that 500Mbp was above the maximum stable length of an individual chromosome... looks like I may have been wrong!

I'll have to think about how to resolve this; there's no simple setting for it. Thanks for bringing it to my attention.
Brian Bushnell is offline   Reply With Quote
Old 08-10-2016, 11:24 AM   #346
parulagwl
Junior Member
 
Location: Canada

Join Date: Jul 2014
Posts: 6
Default

Thank you Brian for the quick response.
We would really appreciate your thoughts/inputs on how to work around our issue.
parulagwl is offline   Reply With Quote
Old 08-10-2016, 11:27 AM   #347
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,975
Default

Purely speculating. Don't know where the centromere is in this chromosome but you could split it in a region where there are long stretches of N's (and the pieces remain smaller than 500 mb) that way chances of reads needing to map across this break would be small.
GenoMax is offline   Reply With Quote
Old 08-11-2016, 02:28 AM   #348
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 42
Default

Just because I sometimes stumble over that issue in tutorials (which don't seem to bother) and also saw it again in the recent question....

I once was thaugt (and got a deduction of points in a test for not knowing it) that using even k-mer sizes is frowned upon. The comprehensible rationale behind is, that only odd k-mer sizes ensure a kmer can never be its own reverse complement in the de Bruijn Graph. Such ambiguity created by palindromic k-mers in the de Bruijn graph supposedly make its resolution difficult.

So to settle that question once and for good: Does it really have an impact on mapping efficiency, if I chose an even or its neighboring odd k-mer?
Thias is offline   Reply With Quote
Old 08-11-2016, 09:01 AM   #349
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

No. The longer the kmer, the greater the speed (and memory consumption); even versus odd is not important.

Additionally, I don't see that even-length kmers cause problems in assembly, either. Genomic palindromes of kmer length or longer cause problems whether you are using an even or odd kmer length. These palindromes always have an even length, but - say you have a genomic palindrome of length 22. Using K=22, you will not (trivially) be able to resolve it. Nor will you with K=21. You will with K=23, and you will with K=24. It's not clear to me in this situation why K=23 would be preferable of K=24 with regards to palindromes, but K=24 can resolve longer repeats than K=23.
Brian Bushnell is offline   Reply With Quote
Old 08-11-2016, 10:28 AM   #350
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503
Default

Actually, an odd k-mer ensures that the strand orientation can be determined, since the central nucleotide cannot be identical due to complementarity (an even k-mer can be a perfect palindrome in both orientations).

But the point about longer k-mers is spot-on.
HESmith is online now   Reply With Quote
Old 08-12-2016, 01:37 AM   #351
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 42
Default

Thanks a lot for your answers! Your exemplified replies were really helpful for some more insight.
Thias is offline   Reply With Quote
Old 08-14-2016, 12:14 PM   #352
darthsequencer
Member
 
Location: california

Join Date: Feb 2012
Posts: 35
Default

Hi I have a couple questions on the terminology used for retaining ambiguous sites using bbmap.

If "ambiguous=best" this means that if there are a bunch of reads all the with the sam score only the first match will be retained? Or does it mean that of all the reads mapping above a score cutoff the first one will be picked?

Along the same lines - for "ambiguous=all" does this mean that if say 5 locations all share the same highest score that they will be reported or does it mean that all locations above the score cutoff will be retained?
darthsequencer is offline   Reply With Quote
Old 08-15-2016, 10:48 AM   #353
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

"ambiguous=best" is a bit misleading, but it means the genomically first location with a maxmimum score will be used. "ambiguous=all" will report all locations within the ambiguity threshold of the first. This does not mean they need exactly the same score; it means that they are very close, so much so that none can be confidently determined to be the correct mapping location. Normally they're identical, but if for example one mapping had a single 1bp deletion and another mapping had two 1bp substitutions, the scores would be different, but would be close enough to be both reported. But if there was a third potential mapping with, say, 5 substitutions, that would be excluded. This can be controlled with the "secondarysitescoreratio" flag; if you set it to 1.0, only mappings with identical scores to the best score will be reported.
Brian Bushnell is offline   Reply With Quote
Old 08-26-2016, 08:58 AM   #354
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default

Hi, Brian

We recently increased our PacBio amplicon size from ~1100 to 3kb. With the smaller amplicon size we were able to map reads to our allele reference sequence library of non-full length allele sequences using "semiperfectmode" to allow for soft-clipping. Im now looking to map ~3kb read sequences obtained from gDNA sequencing to exon reference sequences of ~270 bases a piece and not able to tune the settings to get any mapping results. Is there a way to tune mapPacBio.sh to get hits for regions within long reads to short exon sequences that perfectly match?
lankage is offline   Reply With Quote
Old 08-26-2016, 04:42 PM   #355
susanklein
Senior Member
 
Location: oceania

Join Date: Feb 2014
Posts: 115
Default

Hi,

couldn't you just do it the other way around, Have the pacbio as ref and may your short refs to it?

Although I don't understand why you refs are so short.


S.
susanklein is offline   Reply With Quote
Old 08-26-2016, 05:01 PM   #356
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I agree with Susan. BBMap is a global aligner, and not really designed to map reads to substantially shorter reference sequences. But you could try with the flags "minid=0 local", which might work. Note that "semiperfectmode" will not allow a single mismatch or indel, so it's really only useful in special situations; "local" is more appropriate in this situation.
Brian Bushnell is offline   Reply With Quote
Old 08-27-2016, 05:49 AM   #357
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,975
Default

@lankage: You don't have to align to the short amplicon regions. You could align to the genome (and find out if you have any non-specific amplification along the way).
GenoMax is offline   Reply With Quote
Old 08-30-2016, 02:37 AM   #358
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

Hi,

Do you have a tool to compute insert size from a .bam file ?
I'm looking for statistics about orientation of my paired end or mate pair (RF, FR,...) and insert size corresponding to it.
moistplus is offline   Reply With Quote
Old 08-30-2016, 04:41 AM   #359
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,975
Default

@moistplus: If you were to use bbmap.sh to do the alignments then you would get that information in the alignment report along with the bam file (as long as you have samtools available in $PATH).
GenoMax is offline   Reply With Quote
Old 09-22-2016, 02:51 AM   #360
Shini Sunagawa
Junior Member
 
Location: Germany

Join Date: Jan 2016
Posts: 8
Default

Hi Brian,

Since I saw increased activity lately again, I was wondering if you might have thought about the issue we discussed back in January (~post #300). It was about dedupe not writing out exact matched and contained sequence identifiers.

As mentioned before, solving this would make this tool very competitive to existing ones, due to the immense speed-up.

Thanks for your consideration!

Best wishes,
Shini
Shini Sunagawa is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:27 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO