SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 514 03-13-2020 03:57 AM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 06:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 03:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 01:10 AM

Reply
 
Thread Tools
Old 11-26-2016, 07:48 AM   #381
nicklisbon
Junior Member
 
Location: Lisbon

Join Date: Sep 2015
Posts: 5
Default

So, the first alignment is run in global alignment mode, while the second one is run in local mode allowing indels up to 400 kb, isn't it?
Sounds like a good solution thanks.
We will try this option and check whether Lumpy likes it.
nicklisbon is offline   Reply With Quote
Old 11-28-2016, 06:29 AM   #382
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

what is the default number of mismatch and how can I change it ?
moistplus is offline   Reply With Quote
Old 11-28-2016, 09:32 AM   #383
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The default is roughly 76% identity. You can adjust this with the "minid" flag (e.g. "minid=0.80" for 80% identity.) If you want to restrict alignments to a maximum number of substitutions, you can use "subfilter"; e.g., "subfilter=5" will discard alignments with more than 5 substitutions.

That said, what exactly are you trying to do?

Last edited by Brian Bushnell; 11-28-2016 at 09:35 AM.
Brian Bushnell is offline   Reply With Quote
Old 11-28-2016, 12:09 PM   #384
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

Quote:
Originally Posted by Brian Bushnell View Post
The default is roughly 76% identity. You can adjust this with the "minid" flag (e.g. "minid=0.80" for 80% identity.) If you want to restrict alignments to a maximum number of substitutions, you can use "subfilter"; e.g., "subfilter=5" will discard alignments with more than 5 substitutions.

That said, what exactly are you trying to do?
I would like to map my reads to my draft assembly to compute the coverage.
moistplus is offline   Reply With Quote
Old 11-28-2016, 12:31 PM   #385
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBMap can output coverage, by the way. Just add the flag "covstats=covstats.txt". You may also want to add the flag "delcov=f" for this purpose.
Brian Bushnell is offline   Reply With Quote
Old 11-28-2016, 01:25 PM   #386
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

In the case where I have 2 libraries (or more); for instance one paired end and one single; so the commands would be:

Quote:
bbmap.sh in1=pe_lib1_R1.fq in2=pe_lib1_R2.fq outm=lib1.bam
and

Quote:
bbmap.sh in1=pe_lib2.fq outm=lib2.bam
Then I merge the 2 bam files with
Quote:
samtools merge
And finally I compute the coverage with :

Quote:
pileup.sh in=merge.sam out=stats.txt hist=histogram.txt
Is there a way to do that in one command with bbtools ? (at least avoid merging 2 lib)
moistplus is offline   Reply With Quote
Old 11-28-2016, 01:34 PM   #387
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Pileup won't accept multiple input files, but it will accept standard in. So if the files were sam rather than bam, you could do this:

Code:
cat lib1.sam lib2.sam | pileup.sh in=stdin.sam out=stats.txt hist=histogram.txt
or for gzipped files:

Code:
cat lib1.sam.gz lib2.sam.gz | pileup.sh in=stdin.sam.gz out=stats.txt hist=histogram.txt
However, I don't see a way to do that with bam files. BBMap will produce sam files instead of bam files if you name the output *.sam instead of *.bam.
Brian Bushnell is offline   Reply With Quote
Old 11-29-2016, 02:13 AM   #388
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

cat is working ? because there are the headers (@).

pileup does not take account of that ?
moistplus is offline   Reply With Quote
Old 11-29-2016, 09:22 AM   #389
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Pileup uses the header lines to determine the names and lengths of the sequences in the reference. So, they must be present. However, it doesn't really care whether there are extra ones randomly in the file; they just get ignored. So, cat works fine.
Brian Bushnell is offline   Reply With Quote
Old 12-11-2016, 04:15 AM   #390
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

Quote:
Originally Posted by Brian Bushnell View Post
Pileup uses the header lines to determine the names and lengths of the sequences in the reference. So, they must be present. However, it doesn't really care whether there are extra ones randomly in the file; they just get ignored. So, cat works fine.
Can you confirm that pileup doesn't need sam files to be sorted ?
moistplus is offline   Reply With Quote
Old 12-11-2016, 08:36 AM   #391
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by moistplus View Post
Can you confirm that pileup doesn't need sam files to be sorted ?
Yes, that's correct.
Brian Bushnell is offline   Reply With Quote
Old 12-14-2016, 05:21 AM   #392
skbrimer
Member
 
Location: OP Kansas

Join Date: Mar 2014
Posts: 53
Default

Hi Brian,

I like to use freebayes for the vcf creation because all of our stuff is haploid, however I do not know how to work with the ref file that BBMap makes during a mapping. Freebayes accepts a fasta for the ref. The ref folder contains a compressed chromosome file which freebayes doesn't like and when I use the fasta that I feed to BBmap it doesn't like the header mismatch.

Can you tell me how to overcome this issue?
skbrimer is offline   Reply With Quote
Old 12-14-2016, 08:31 AM   #393
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Unfortunately, it sounds like Freebayes is trimming the names of the sequences after the first whitespace when loading them from the fasta file, but it's not doing that when looking at the sam file. To fix that problem, unfortunately you'd need to rerun the mapping, adding the "trd" flag (trim read description). Additionally, Freebayes is not compliant with the current Sam specification and does not understand cigar strings with "=" and "X" symbols, so you need to output in legacy Sam format with M symbols. Therefore, for Freebayes, you need to add two extra flags when mapping:

Code:
trd sam=1.3
Brian Bushnell is offline   Reply With Quote
Old 12-14-2016, 08:37 AM   #394
skbrimer
Member
 
Location: OP Kansas

Join Date: Mar 2014
Posts: 53
Default

Thank you Brain!

Is there a variant calling pipeline that works best with BBmap?
skbrimer is offline   Reply With Quote
Old 12-14-2016, 09:00 AM   #395
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

If you are going to re-do mapping then clean up the fasta ID's (and regenerate BBMap index) in your reference genome, if there are spaces in the words there.

BBMap suite now has a variant caller (callvariants.sh) :-)
GenoMax is offline   Reply With Quote
Old 12-14-2016, 09:09 AM   #396
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by skbrimer View Post
Thank you Brain!

Is there a variant calling pipeline that works best with BBmap?
Funny you should ask! I just finished developing a variant-caller, because I was unhappy with the speed, usability, and some other aspects of Samtools mpileup and GATK (I've never used Freebayes, though). Also, it has a default ploidy of 1 which better reflects projects at JGI, though the ploidy can be changed to an arbitrary number.

I just uploaded a new version of BBTools (36.71) with the variant caller (older versions do not support vcf output). You can run it like this:

Code:
callvariants.sh in=mapped.sam vcf=vars.vcf ref=ref.fasta ploidy=1
CallVariants works on sam or bam, sorted or unsorted, and it's really fast. Also, it does not care what format the cigar strings are in or whether you trimmed sequence names after the first whitespace. I'd be eager to hear how it compares to FreeBayes, if you'd like to try it out. So far, other people at JGI have indicated that it does a much better job of identifying low-depth variants in high-ploidy organisms than Samtools or GATK (if you adjust the ploidy and minallelefraction flags).

That said, no, I don't have a recommended variant-calling workflow for BBMap, because I have not done rigorous benchmarking of all the possibilities yet. Oh, and to modify my previous post, for variant calling I suggest you also add the "mdtag" flag, which some variant-callers like to have. So, for Freebayes:

Code:
bbmap.sh in=reads.fq ref=ref.fa out=mapped.sam mdtag trd sam=1.3

Quote:
Originally Posted by GenoMax View Post
BBMap suite now has a variant caller (callvariants.sh) :-)
Indeed But if you use it, I highly recommend downloading BBMap 36.71, as prior versions lack proper VCF output.

Last edited by Brian Bushnell; 12-14-2016 at 09:12 AM.
Brian Bushnell is offline   Reply With Quote
Old 12-14-2016, 09:55 AM   #397
skbrimer
Member
 
Location: OP Kansas

Join Date: Mar 2014
Posts: 53
Default

Thanks for the input guys! I will start playing with the updated BBmap as well!
skbrimer is offline   Reply With Quote
Old 12-14-2016, 09:11 PM   #398
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Hi Brian and other members,

I'm trying to map the Illumina reads generated from 19 inbred and hybrid mazie cultivars under two conditions for downstream differential expression analysis. Considering the SNPs and indels in the mixed genetic background, my original plan was to generate a 'corrected refernece' using tools like Quiver or Pilon. Then I came across to BBMap and was happy to find its capability to handle SNPs and indels. I used the default setting and the mapped rate so far was over 97%. Meanwhile, I did notice a huge variaion of ambiguous rate, ranging from 6% to >50%. Therefore, I was wondering if it is possible to evaluate which method (directly map to a single ref vs map to 19 SNP corrected ref) would be more accurate for the DE analysis.

The third option would be to re-assemble de-novo assembly for each genetic background. However, I wasn's sure if it's possible to come up with a consensus contig conrrespondance (contigX_background1, contigX_background2..contigX_background19) so that I could monitor each contig/gene of interest in multiple backgrounds.

Any input/thoughts you may have on this would be much appreciated. Thank you in advance.

Last edited by chiayi; 12-15-2016 at 04:46 AM.
chiayi is offline   Reply With Quote
Old 12-16-2016, 05:30 PM   #399
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Maize is the kind of organism where de-novo assembly is extremely difficult. I recommend avoiding that; it will make things much more complicated. I don't think there's any reason to do that, either, unless you find that structural variations are causing problems.

BBMap is very tolerant of SNPs and indels so generally you don't need to do any kind of correction, but aligning to a SNP-corrected reference (assuming the SNPs are homozygous, or at least a majority) will be more accurate than aligning to the base reference.

I'm not really sure where the high ambig rate is coming from. Is this WGS, or are you doing some kind of enrichment, RNA-seq, etc? And are you mapping to the genome or transcriptome?
Brian Bushnell is offline   Reply With Quote
Old 12-16-2016, 07:22 PM   #400
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Quote:
BBMap is very tolerant of SNPs and indels so generally you don't need to do any kind of correction, but aligning to a SNP-corrected reference (assuming the SNPs are homozygous, or at least a majority) will be more accurate than aligning to the base reference.
I share your thoughts so I will at least try a few background to see how it goes. On the other hand, I'm not sure how to compare the accuracy between the results generated from base reference vs SNP corrected reference. Do you have any suggestion?

Quote:
I'm not really sure where the high ambig rate is coming from. Is this WGS, or are you doing some kind of enrichment, RNA-seq, etc? And are you mapping to the genome or transcriptome?
The reads were generated by NEBNext Ultra RNA library prep kit for Illumina (no enrichment) and sequenced on NextSeq. I mapped the reads to the latest genome annotation (B73 RefGen_v4 reference genome). I should also mention that these are mixed background (hybrid of B73, reference genome, and other cultivar). For B73 alone, the ambig rate was lower at about ~10%.
Is there a 'standard' for ambiguous rate when you look at the stats from BBMap?
chiayi is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:35 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO