![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
BBMap (aligner for DNA/RNAseq) is now open-source and available for download. | Brian Bushnell | Bioinformatics | 678 | 12-12-2019 09:36 AM |
BBMap for BitSeq | dietmar13 | Bioinformatics | 1 | 04-30-2015 09:40 AM |
Please help my BBMap cannot remove Illumina adapter | TofuKaj | Bioinformatics | 4 | 04-28-2015 09:53 AM |
BBMap Error | Phage Hunter | Bioinformatics | 5 | 01-14-2015 05:34 AM |
Introducing BBMap, a new short-read aligner for DNA and RNA | Brian Bushnell | Bioinformatics | 24 | 07-07-2014 10:37 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,083
|
![]()
BBMap suite has become an essential part of the bioinformatics tool arsenal for many ...
Thank you Brian! ---------------------------------------------------------------- Over the past year Brian has doled out specific advice on how to use a component from BBMap suite in standard (and not-so-standard) ways in various threads on SeqAnswers. I thought it would be useful to try and collect some "special" use cases in one thread so they could be found in one location. Feel free to add to this thread things I have missed. I have tried to categorize the list based on the BBMap programs. That classification is imperfect. There are common options for most BBMap suite programs and depending on the file extension the input/output format is automatically chosen/set. Note: For most programs in BBTools, you can add the parameter "config=foo.txt". Every line in "foo.txt" will be added as if it were a command-line parameter, regardless of whitespace, or the length of the file. This may be convenient if you use a certain set of parameters for different set of analyses. -------------------------------------------------------------------- BBMap.sh
BBMap.sh has a length cap of 6kbp. Reads longer than this will be broken into 6kbp pieces and mapped independently. Code:
$ mapPacBio.sh -Xmx20g k=7 in=reads.fastq ref=reference.fa maxlen=1000 minlen=200 idtag ow int=f qin=33 out=mapped1.sam minratio=0.15 ignorequality slow ordered maxindel1=40 maxindel2=400 The "maxlen" flag shreds them to a max length of 1000; you can set that up to 6000. But I found 1000 gave a higher mapping rate.
BBMap itself can only run single-ended or paired-ended in a single run, but it has a wrapper that can accomplish it, like this: Code:
$ bbwrap.sh in1=read1.fq,singletons.fq in2=read2.fq,null out=mapped.sam append Note about alignment stats: For paired reads, you can find the total percent mapped by adding the read 1 percent (where it says "mapped: N%") and read 2 percent, then dividing by 2. The different columns tell you the count/percent of each event. Considering the cigar strings from alignment, "Match Rate" is the number of symbols indicating a reference match (=) and error rate is the number indicating substitution, insertion, or deletion (X, I, D).
When mapping small RNA's with BBMap use the following flags to report only perfect matches. Code:
ambig=all vslow perfectmode maxsites=1000
BBMap is always nondeterministic when run in paired-end mode with multiple threads, because the insert-size average is calculated on a per-thread basis, which affects mapping; and which reads are assigned to which thread is nondeterministic. The only way to avoid that would be to restrict it to a single thread (threads=1), or map the reads as single-ended and then fix pairing afterward: Code:
bbmap.sh in=reads.fq outu=unmapped.fq int=f repair.sh in=unmapped.fq out=paired.fq fint outs=singletons.fq BBSplit is based on BBMap, so it is also nondeterministic in paired mode with multiple threads. BBDuk and Seal (which can be used similarly to BBSplit) are always deterministic. -------------------------------------------------------- Reformat.sh
Code:
$ reformat.sh in=reads.fq out=trimmed.fq ftr=19 Code:
$ kmercountexact.sh in=trimmed.fq out=counts.txt fastadump=f mincount=10 k=20 rcomp=f Code:
ACCGTTACCGTTACCGTTAC 100 AAATTTTTTTCCCCCCCCCC 85
Code:
$ reformat.sh in=reads.sam out=out.sam sam=1.3
You can use BBDuk or Reformat with "qtrim=rl trimq=1". That will only trim trailing and leading bases with Q-score below 1, which means Q0, which means N (in either fasta or fastq format). The BBMap package automatically changes q-scores of Ns that are above 0 to 0 and called bases with q-scores below 2 to 2, since occasionally some Illumina software versions produces odd things like a handful of Q0 called bases or Ns with Q>0, neither of which make any sense in the Phred scale.
Code:
$ reformat.sh in=reads.fq out=sampled.fq sample=3000 Code:
To sample 10% of the reads: reformat.sh in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1 or more concisely: reformat.sh in=reads#.fq out=sampled#.fq samplerate=0.1 and for exact sampling: reformat.sh in=reads#.fq out=sampled#.fq samplereadstarget=100k
Remove anything after the first space in fasta header. Code:
reformat.sh in=sequences.fasta out=renamed.fasta trd
Code:
$ reformat.sh in=reads.sam out=reads.fastq
Code:
$ reformat.sh in=reads.fastq verifypairing
Code:
$ reformat.sh in1=r1.fq in2=r2.fq vpair Code:
$ reformat.sh in=reads.fastq out1=r1.fastq out2=r2.fastq
Code:
$ reformat.sh in=reads.fq qchist=qchist.txt
Code:
$ reformat.sh in=x.sam out=y.sam minlength=50 maxlength=200
Code:
$ reformat.sh in=mapped.bam out=filtered.bam maxdellen=50 ------------------------------------------------------------- Repair.sh
Code:
$ repair.sh in1=r1.fq.gz in2=r2.fq.gz out1=fixed1.fq.gz out2=fixed2.fq.gz outsingle=singletons.fq.gz BBMerge.sh BBMerge now has a new flag - "outa" or "outadapter". This allows you to automatically detect the adapter sequence of reads with short insert sizes, in case you don't know what adapters were used. It works like this: Code:
$ bbmerge.sh in=reads.fq outa=adapters.fa reads=1m Code:
>Read1_adapter GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG >Read2_adapter GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG GATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG Note: For BBMerge with micro-RNA, you need to add the flag mininsert=17. The default is 35, which is too long for micro-RNA libraries.
Code:
$ bbmerge.sh in1=r1.fq in2=r2.fq outa=adapters.fa ----------------------------------------------------------------- BBDuk.sh Note: BBDuk is strictly deterministic on a per-read basis, however it does by default reorder the reads when run multithreaded. You can add the flag "ordered" to keep output reads in the same order as input reads
Ref thread: http://seqanswers.com/forums/showpos...7&postcount=43 filter by minAvgQuality force-trim kmer-analyze (trim, filter, or mask) trim by overlap quality-trim
Code:
$ bbduk.sh -Xmx1g in=reads.fq outm=matched.fq outu=unmatched.fq restrictleft=25 k=25 literal=AAAAACCCCCTTTTTGGGGGAAAAA So you could bin all the reads with your known sequence, then look at the remaining reads to see what they have in common. You can do the same thing with the tail of the read using "restrictright" instead, though you can't use both restrictions at the same time. Code:
$ bbduk.sh in=reads.fq outm=matched.fq literal=NNNNNNCCCCGGGGGTTTTTAAAAA k=25 copyundefined
If you're not sure which adapters are used, you can add "ref=truseq.fa.gz,truseq_rna.fa.gz,nextera.fa.gz" and get them all (this will increase the amount of overtrimming, though it should still be negligible).
Code:
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality
Code:
$ bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq literal=ACGTACGTACGTACGTAC k=18 mm=f hdist=2
Code:
$ bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31
Code:
bbduk.sh in=reads.fq out=readsWithoutNs.fq outm=readsWithNs.fq maxns=0 General notes for BBDuk.sh BBDuk can operate in one of 4 kmer-matching modes: Right-trimming (ktrim=r), left-trimming (ktrim=l), masking (ktrim=n), and filtering (default). But it can only do one at a time because all kmers are stored in a single table. It can still do non-kmer-based operations such as quality trimming at the same time. BBDuk2 can do all 4 kmer operations at once and is designed for integration into automated pipelines where you do contaminant removal and adapter-trimming in a single pass to minimize filesystem I/O. Personally, I never use BBDuk2 from the command line. Both have identical capabilities and functionality otherwise, but the syntax is different. ------------------------------------------------------------------ Randomreads.sh
Code:
$ randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000 Bear in mind that 50% of the reads are going to be generated from the plus strand and 50% from the minus strand. So, either a read will match the reference perfectly, OR its reverse-complement will match perfectly. You can generate the same set of reads with and without SNPs by fixing the seed to a positive number, like this: Code:
$ randomreads.sh maxsnps=0 adderrors=false out=perfect.fastq reads=1000 minlength=18 maxlength=55 seed=5 $ randomreads.sh maxsnps=2 snprate=1 adderrors=false out=2snps.fastq reads=1000 minlength=18 maxlength=55 seed=5 coverage=X will automatically set "reads" to a level that will give X average coverage (decimal point is allowed). metagenome will assign each scaffold a random exponential variable, which decides the probability that a read be generated from that scaffold. So, if you concatenate together 20 bacterial genomes, you can run randomreads and get a metagenomic-like distribution. It could also be used for RNA-seq when using a transcriptome reference. The coverage is decided on a per-reference-sequence level, so if a bacterial assembly has more than one contig, you may want to glue them together first with fuse.sh before concatenating them with the other references.
You can simulate a 4000bp jump library from your existing data like this. Code:
$ cat assembly1.fa assembly2.fa > combined.fa $ bbmap.sh ref=combined.fa $ randomreads.sh reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect=1 q=35 out=jump.fq.gz Shred.sh Code:
$ shred.sh in=ref.fasta out=reads.fastq length=200 --------------------------------------------------------------- Demuxbyname.sh
Code:
$ demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,... outu=filename In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming. ---------------------------------------------------------------- Readlength.sh
Code:
$ readlength.sh in=file out=histogram.txt bin=10 max=80000 Alternatively, if these are assemblies and you're interested in continuity information (L50, N50, etc), you can run stats on each or statswrapper on all of them: Code:
stats.sh in=file Code:
statswrapper.sh in=file,file,file,file… Filterbyname.sh By default, "filterbyname" discards reads with names in your name list, and keeps the rest. To include them and discard the others, do this: Code:
$ filterbyname.sh in=003.fastq out=filter003.fq names=names003.txt include=t getreads.sh If you only know the number(s) of the fasta/fastq record(s) in a file (records start at 0) then you can use the following command to extract those reads in a new file. Code:
$ getreads.sh in=<file> id=<number,number,number...> out=<file> Parameters: in=<file> Specify the input file, or stdin. out=<file> Specify the output file, or stdout. id= Comma delimited list of numbers or ranges, in any order. For example: id=5,93,17-31,8,0,12-13 ---------------------------------------------------------------- Splitsam.sh
Code:
splitsam.sh mapped.sam plus.sam minus.sam unmapped.sam reformat.sh in=plus.sam out=plus.fq reformat.sh in=minus.sam out=minus.fq rcomp BBSplit.sh BBSplit now has the ability to output paired reads in dual files using the # symbol. For example: Code:
$ bbsplit.sh ref=x.fa,y.fa in1=read1.fq in2=read2.fq basename=o%_#.fq You can use the # symbol for input also, like "in=read#.fq", and it will get expanded into 1 and 2. Added feature: One can specify a directory for the "ref=" argument. If anything in the list is a directory, it will use all fasta files in that directory. They need a fasta extension, like .fa or .fasta, but can be compressed with an additional .gz after that. Reason this is useful is to use BBSplit is to have it split input into one output file per reference file. NOTE: 1 By default BBSplit uses fairly strict mapping parameters; you can get the same sensitivity as BBMap by adding the flags "minid=0.76 maxindel=16k minhits=1". With those parameters it is extremely sensitive. NOTE: 2 BBSplit has different ambiguity settings for dealing with reads that map to multiple genomes. In any case, if the alignment score is higher to one genome than another, it will be associated with that genome only (this considers the combined scores of read pairs - pairs are always kept together). But when a read or pair has two identically-scoring mapping locations, on different genomes, the behavior is controlled by the "ambig2" flag - "ambig2=toss" will discard the read, "all" will send it to all output files, and "split" will send it to a separate file for ambiguously-mapped reads (one per genome to which it maps). NOTE: 3 Zero-count lines are suppressed by default, but they should be printed if you include the flag "nzo=f" (nonzeroonly=false). NOTE: 4 BBSplit needs multiple reference files as input; one per organism, or one for target and another for everything else. It only outputs one file per reference file. Seal.sh, on the other hand, which is similar, can use a single concatenated file, as it (by default) will output one file per reference sequence within a concatenated set of references. -------------------------------------------------------------- Pileup.sh
Code:
$ pileup.sh in=mapped.sam normcov=normcoverage.txt normb=20 stats=stats.txt
Code:
$ pileup.sh in=mapped.sam covstats=coverage.txt
Program will take sam or bam, sorted or unsorted. Code:
$ pileup.sh in=mapped.sam out=stats.txt hist=histogram.txt It's also possible to generate coverage directly from BBMap, without an intermediate sam file, like this: Code:
$ bbmap.sh in=reads.fq ref=reference.fasta nodisk covstats=stats.txt covhist=histogram.txt
Code:
$ pileup.sh in=mapped.sam out=stats.txt bincov=coverage.txt binsize=1000 -------------------------------------------------------------- Dedupe.sh Dedupe ensures that there is at most one copy of any input sequence, optionally allowing contaminants (substrings) to be removed, and a variable hamming or edit distance to be specified. Usage: Code:
$ dedupe.sh in=assembly1.fa,assembly2.fa out=merged.fa Dedupe will merge assemblies, but it will not produce consensus sequences or join overlapping reads; it only removes sequences that are fully contained within other sequences (allowing the specified number of mismatches or edits). Dedupe can remove duplicate reads from multiple files simultaneously, if they are comma-delimited (e.g. in=file1.fastq,file2.fastq,file3.fastq). And if you set the flag "uniqueonly=t" then ALL copies of duplicate reads will be removed, as opposed to the default behavior of leaving one copy of duplicate reads. However, it does not care which file a read came from; in other words, it can't remove only reads that are duplicates across multiple files but leave the ones that are duplicates within a file. That can still be accomplished, though, like this: 1) Run dedupe on each sample individually, so now there are at most 1 copy of a read per sample. 2) Run dedupe again on all of the samples together, with "uniqueonly=t". The only remaining duplicate reads will be the ones duplicated between samples, so that's all that will be removed. --------------------------------------------------------------
[*]index the reference Code:
$ bbmap.sh ref=reference.fasta [*]Generate random reads Code:
$ randomreads.sh reads=100000 length=100 out=synth.fastq maxq=35 midq=25 minq=15 ...substitute this command with the appropriate one from your aligner of choice Code:
$ bbmap.sh in=synth.fq out=mapped.sam Code:
$ samtoroc.sh in=mapped.sam reads=100000
Code:
$ kmercountexact.sh in=reads.fq khist=histogram.txt peaks=peaks.txt -----------------------------------------------------------------
To see how many mapped reads (can be mapped concordant or discordant, doesn't matter) are shared between the two alignment files and how many mapped reads are unique to one file or the other. Code:
$ reformat.sh in=file1.sam out=mapped1.sam mappedonly $ reformat.sh in=file2.sam out=mapped2.sam mappedonly Code:
$ filterbyname.sh in=mapped1.sam names=mapped2.sam out=shared.sam include=t Code:
$ filterbyname.sh in=mapped1.sam names=mapped2.sam out=only1.sam include=f $ filterbyname.sh in=mapped2.sam names=mapped1.sam out=only2.sam include=f -------------------------------------------------------------- BBrename.sh Code:
$ bbrename.sh in=old.fasta out=new.fasta You can also give a custom prefix if you want. The input has to be text format, not .doc. --------------------------------------------------------------------- BBfakereads.sh
Code:
$ bfakereads.sh in=reads.fastq out1=r1.fastq out2=r2.fastq length=100 ------------------------------------------------------------------ Randomreads.sh
Code:
$ randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000 You can specify paired reads, an insert size distribution, read lengths (or length ranges), and so forth. But because I developed it to benchmark mapping algorithms, it is specifically designed to give excellent control over mutations. You can specify the number of snps, insertions, deletions, and Ns per read, either exactly or probabilistically; the lengths of these events is individually customizable, the quality values can alternately be set to allow errors to be generated on the basis of quality; there's a PacBio error model; and all of the reads are annotated with their genomic origin, so you will know the correct answer when mapping. --------------------------------------------------------------------
Code:
$ bbcountunique.sh in=reads.fq out=histogram.txt The bottom line, "first", tracks whether the first kmer of the read has been seen before (independent of whether it is read 1 or read 2). The top line, "pair", indicates whether a combined kmer from both read 1 and read 2 has been seen before. The other lines are generally safe to ignore but they track other things, like read1- or read2-specific data, and random kmers versus the first kmer. It plots a point every X reads (configurable, default 25000). In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful. In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads. ----------------------------------------------------------------- CalcTrueQuality.sh http://seqanswers.com/forums/showthread.php?p=170904 In light of the quality-score issues with the NextSeq platform, and the possibility of future Illumina platforms (HiSeq 3000 and 4000) also using quantized quality scores, I developed it for recalibrating the scores to ensure accuracy and restore the full range of values. ----------------------------------------------------------------- BBMapskimmer.sh BBMap is designed to find the best mapping, and heuristics will cause it to ignore mappings that are valid but substantially worse. Therefore, I made a different version of it, BBMapSkimmer, which is designed to find all of the mappings above a certain threshold. The shellscript is bbmapskimmer.sh and the usage is similar to bbmap.sh or mapPacBio.sh. For primers, which I assume will be short, you may wish to use a lower than default K of, say, 10 or 11, and add the "slow" flag. -------------------------------------------------------------- msa.sh and curprimers.sh Quoted from Brian's response directly. I also wrote another pair of programs specifically for working with primer pairs, msa.sh and cutprimers.sh. msa.sh will forcibly align a primer sequence (or a set of primer sequences) against a set of reference sequences to find the single best matching location per reference sequence - in other words, if you have 3 primers and 100 ref sequences, it will output a sam file with exactly 100 alignments - one per ref sequence, using the primer sequence that matched best. Of course you can also just run it with 1 primer sequence. So you run msa twice - once for the left primer, and once for the right primer - and generate 2 sam files. Then you feed those into cutprimers.sh, which will create a new fasta file containing the sequence between the primers, for each reference sequence. We used these programs to synthetically cut V4 out of full-length 16S sequences. I should say, though, that the primer sites identified are based on the normal BBMap scoring, which is not necessarily the same as where the primers would bind naturally, though with highly conserved regions there should be no difference. ------------------------------------------------------ testformat.sh Identify type of Q-score encoding in sequence files Code:
$ testformat.sh in=seq.fq.gz sanger fastq gz interleaved 150bp kcompress.sh Newest member of BBTools. Identify constituent k-mers. http://seqanswers.com/forums/showthread.php?t=63258 ---------------------------------------------------- commonkmers.sh Find all k-mers for a given sequence. Code:
$ commonkmers.sh in=reads.fq out=kmers.txt k=4 count=t display=999 Code:
MISEQ05:239:000000000-A74HF:1:2110:14788:23085 ATGA=8 ATGC=6 GTCA=6 AAAT=5 AAGC=5 AATG=5 AGCA=5 ATAA=5 ATTA=5 CAAA=5 CATA=5 CATC=5 CTGC=5 AACC=4 AACG=4 AAGA=4 ACAT=4 ACCA=4 AGAA=4 ATCA=4 ATGG=4 CAAG=4 CCAA=4 CCTC=4 CTCA=4 CTGA=4 CTTC=4 GAGC=4 GGTA=4 GTAA=4 GTTA=4 AAAA=3 AAAC=3 AAGT=3 ACCG=3 ACGG=3 ACTG=3 AGAT=3 AGCT=3 AGGA=3 AGTA=3 AGTC=3 CAGC=3 CATG=3 CGAG=3 CGGA=3 CGTC=3 CTAA=3 CTCC=3 CTTA=3 GAAA=3 GACA=3 GACC=3 GAGA=3 GCAA=3 GGAC=3 TCAA=3 TGCA=3 AAAG=2 AACA=2 AATA=2 AATC=2 ACAA=2 ACCC=2 ACCT=2 ACGA=2 ACGC=2 AGAC=2 AGCG=2 AGGC=2 CAAC=2 CAGG=2 CCGC=2 GCCA=2 GCTA=2 GGAA=2 GGCA=2 TAAA=2 TAGA=2 TCCA=2 TGAA=2 AAGG=1 AATT=1 ACGT=1 AGAG=1 AGCC=1 AGGG=1 ATAC=1 ATAG=1 ATTG=1 CACA=1 CACG=1 CAGA=1 CCAC=1 CCCA=1 CCGA=1 CCTA=1 CGAC=1 CGCA=1 CGCC=1 CGCG=1 CGTA=1 CTAC=1 GAAC=1 GCGA=1 GCGC=1 GTAC=1 GTGA=1 TTAA=1 Mutate.sh Simulate multiple mutants from a known reference (e.g. E. coli). Code:
$ mutate.sh in=e_coli.fasta out=mutant.fasta id=99 $ randomreads.sh ref=mutant.fasta out=reads.fq.gz reads=5m length=150 paired adderrors ------------------------------------ Partition.sh One can partition a large dataset with partition.sh into smaller subsets (example below splits data into 8 chunks). Code:
partition.sh in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8 clumpify.sh If you are concerned about file size and want the files to be as small as possible, give Clumpify a try. It can reduce filesize by around 30% losslessly by reordering the reads. I've found that this also typically accelerates subsequent analysis pipelines by a similar factor (up to 30%). Usage: Code:
clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz Code:
clumpify.sh in1=reads_R1.fastq.gz in2=reads_R2.fastq.gz out1=clumped_R1.fastq.gz out2=clumped_R2.fastq.gz
This does NOT require alignments so it should prove more useful compared to Picard MarkDuplicates. Relevant options for clumpify.sh command are listed below. Code:
dedupe=f optical=f (default) Nothing happens with regards to duplicates. dedupe=t optical=f All duplicates are detected, whether optical or not. All copies except one are removed for each duplicate. dedupe=f optical=t Nothing happens. dedupe=t optical=t Only optical duplicates (those with an X or Y coordinate within dist) are detected. All copies except one are removed for each duplicate. The allduplicates flag makes all copies of duplicates removed, rather than leaving a single copy. But like optical, it has no effect unless dedupe=t. Note: If you set "dupedist" to anything greater than 0, "optical" gets enabled automatically. fuse.sh Fuse will automatically reverse-complement read 2. Pad (N) amount can be adjusted as necessary. This will for example create a full size amplicon that can be used for alignments. Code:
fuse.sh in1=r1.fq in2=r2.fq pad=130 out=fused.fq fusepairs partition.sh Following command will produce 10 output files with an equal number of sequences and no duplication. Code:
partition.sh in=X.fa out=X%.fa ways=10 Last edited by GenoMax; 08-24-2020 at 11:03 AM. |
![]() |
![]() |
![]() |
#2 |
Registered Vendor
Location: Eugene, OR Join Date: May 2013
Posts: 521
|
![]()
I feel that the best strategy to solve some very difficult problems out there in the world, like proving P = NP or creating a true artificial intelligence, would be to couch the problem in terms of read mapping and then ask Brian to add that feature to BBMap.
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com |
![]() |
![]() |
![]() |
#3 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Much thanks, GenoMax! I have added a link to this on the Sourceforge page.
@SNPsaurus - I have no immediate plans to add artificial intelligence to BBMap, but that's a good suggestion; I'll add it to my todo list. |
![]() |
![]() |
![]() |
#4 |
Rick Westerman
Location: Purdue University, Indiana, USA Join Date: Jun 2008
Posts: 1,104
|
![]()
Perhaps Brian is an 'artificial intelligence' or an 'alien intelligence' who takes pity on us mere mortals to provide excellent tools for us to work with?
|
![]() |
![]() |
![]() |
#5 |
Senior Member
Location: bethesda Join Date: Feb 2009
Posts: 700
|
![]()
The BB map home page ( http://sourceforge.net/projects/bbmap/ )says ...
Short read aligner for DNA and RNA-seq data. Is there a special mode or feature that makes it good for RNA reads? Does it handle long splices especially well ? |
![]() |
![]() |
![]() |
#6 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Yes, that's correct. BBMap handles very long deletions or splices, making it useful for both RNA-seq or situations like fast-neutron bombardment to delete entire genes. It finds splices denovo (does not use a gene annotation file) and allows multiple introns/deletions per read. The defaults are set for plant and fungal introns, which are short; when mapping vertebrate RNA-seq data, I normally use the flag maxindel=200000 (default is 16000). It also supports the necessary flags for Cufflinks custom tags (xstag=firststrand, etc).
|
![]() |
![]() |
![]() |
#7 |
Senior Member
Location: Germany Join Date: Oct 2008
Posts: 415
|
![]()
Hi Brian,
I wouldn't say plant introns are as short as 16000. We have observed reliable introns as large as 50kbp in one species (confirmed by spliced alignments of multiple RNA-seq datasets) and 150kbp in another (maize). Great to hear BBmap works well on RNA-seq reads and auto detects splice sites though! |
![]() |
![]() |
![]() |
#8 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Oh, that's good to know. My exposure to plant RNA-seq is mainly Chlamy and Arabidopsis - and you know what they say about assumptions.
|
![]() |
![]() |
![]() |
#9 |
I like code
Location: San Diego, CA, USA Join Date: Sep 2009
Posts: 438
|
![]()
Hey Brian,
A feature many of the bench scientists I work with would love is an automated grant proposal generator. Any chance of something like that in future updates? The input could be some images/figures and a basic subject...
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff Salk Institute for Biological Studies, La Jolla, CA, USA */ |
![]() |
![]() |
![]() |
#10 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
I was thinking about an automatic paper generator, which would be more useful to me. But I guess grant proposals are even more standardized. I'll see if I can roll that into Reformat.
|
![]() |
![]() |
![]() |
#11 |
I like code
Location: San Diego, CA, USA Join Date: Sep 2009
Posts: 438
|
![]()
Excellent. My PI will be stoked.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff Salk Institute for Biological Studies, La Jolla, CA, USA */ |
![]() |
![]() |
![]() |
#12 |
Member
Location: NJ Join Date: Oct 2012
Posts: 97
|
![]()
Hi Brian,
Thank you for creating these awesome applications. I was wondering if you have any plans to create an OLC aligner or something that improves Dedupe so that it can merge multiple assemblies together while taking care of any duplicate sequences based on a certain or user specified percentage. Then finally it outputs a consensus sequence based on a user specified percent similarity and some other variables. I believe this will be very helpful with working with viral or bacterial de-novo genome (DNA-Based) Assemblies. Some ideas. ![]() |
![]() |
![]() |
![]() |
#13 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
I would also really like such a tool! Yes, I DO have plans to do something like that, but I am not sure about the timing, since I have had these plans for almost a year now. But hopefully I will make it this year, as Dedupe is really close to being an assembler, just not quite there.
|
![]() |
![]() |
![]() |
#14 |
Junior Member
Location: DE Join Date: May 2015
Posts: 4
|
![]()
Do you recommend quality trimming before mapping using BBMap? I am talking illumina 100bp PE.
I have been searching the forum for a while but could not find a hint. Thanks |
![]() |
![]() |
![]() |
#15 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,083
|
![]()
Unless you have a lot of data that has Q scores of 10 or less you can get away without quality trimming.
|
![]() |
![]() |
![]() |
#16 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
To expand upon that, BBMap has the options for internally quality-trimming (and untrimming) reads, but I only use them in special cases. For example, BBMap is used a lot for filtering out contamination. To ensure high specificity, I only want to remove reads that map to contaminant sequences with high identity; but I don't want the identity calculation to include low-quality bases. So, I map with "qtrim=rl trimq=15 untrim" or similar, which will trim before mapping but then output the untrimmed read.
For normal cases such as variant calling or coverage calculation, don't quality-trim unless you have really low quality data. I would not expect that to be the case for a 100bp library. |
![]() |
![]() |
![]() |
#17 |
Junior Member
Location: DE Join Date: May 2015
Posts: 4
|
![]()
Aah, I see. Thanks to both of you. Would these flags work with BBDuk then?
Data quality is decent. I will map to a de novo metagenome assembly generated from the same data. I am kind of hopeful to be able to do a little variant calling after metagenome binning and mapping. Given this situation, would you still rather not quality trim? Thanks |
![]() |
![]() |
![]() |
#18 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Those flags work in BBDuk as well, except for "untrim", which is unique to BBMap.
A good variant caller will take the quality into account when weighing possible variations. Quality-trimming before mapping gives the mapper less information, and thus a higher probability of ambiguously-mapped or incorrectly-mapped reads (though adapter-trimming before mapping is always good). Even if the trimmed bases are low quality - say, 50 bases at Q13 - that's still an expected 45 matches and 5 mismatches, which can greatly increase mapping confidence in a repetitive area, and thus improve the ability to call the variations from the high-quality part of the read. On the other hand, quality-trimming before assembly is often a good idea, though the exact threshold depends on the assembler and dataset. |
![]() |
![]() |
![]() |
#19 |
Junior Member
Location: New York, NY Join Date: Jun 2012
Posts: 5
|
![]()
Is there a way of efficiently using the BBMap tools to cut variable length and variable sequence primer pairs from amplicon sequencing data either pre-alignment (fastq) or post-alignment (BAM)? I am thinking in regards to RainDance and Fluidigm data where there are hundreds of amplicons, many of which overlap, each with different forward/reverse primers.
|
![]() |
![]() |
![]() |
#20 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
BBTools has is a pair of programs for cutting (and outputting to a file) the sequence between primer pairs, but not for cutting out primers themselves. Is that what you are after?
The usage is like this: Assume your set of primers for one end is AAAAAA and AAAAAT, and for the other end is GGGGGG and GGGCCC: msa.sh in=amplicons.fq out=primer1.sam literal=AAAAAA,AAAAAT msa.sh in=amplicons.fq out=primer2.sam literal=GGGGGG,GGGCCC That will find the optimal alignment for the optimal primer, and output one line per amplicon (twice). Then run this: cutprimers.sh in=amplicons.fq out=middle.fq sam1=primer1.sam sam2=primer2.sam "middle.fq" will contain the sequence between the primers, one per amplicon, using the best alignment. I designed this to cut V4 regions from full-length 16s. Notes: msa.sh stands for MultiStateAligner, not for Multiple Sequence Alignment. All sequences can be in any orientation (forward or reverse complement). Please let me know if that doesn't address your question; I may have misunderstood it. |
![]() |
![]() |
![]() |
Tags |
bbmap |
Thread Tools | |
|
|