SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 03:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 10:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 07:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 04:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 02:10 AM

Reply
 
Thread Tools
Old 11-19-2014, 05:30 PM   #141
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You can do that with the redirect operator. BBTools print status info to standard err, not standard out, so you redirect it like this:

bbmap.sh arguments 2>log.txt

For reference, 1>x.txt 2>y.txt will capture both stdout and stderr to different files in case you use the 'out=stdout.fa' flag.
Brian Bushnell is offline   Reply With Quote
Old 11-23-2014, 02:09 PM   #142
susanklein
Senior Member
 
Location: oceania

Join Date: Feb 2014
Posts: 115
Default

ahhh. Thanks.. I was trying to just use '>'.

S.
susanklein is offline   Reply With Quote
Old 11-24-2014, 12:30 AM   #143
susanklein
Senior Member
 
Location: oceania

Join Date: Feb 2014
Posts: 115
Default basecov problem

Hi,

I am trying to use the basecov option to output per base coverage so I can graph the results. However, the input file I am using is a pseudomolecule with genes separated by 150 Ns. When the output bam files is viewed, the coverage looks fine, but it does not match up with the basecov file and seems 'shifted' 3' by roughly 150bp.

Any ideas on what I am dong wrong?

Thanks,

S.
susanklein is offline   Reply With Quote
Old 11-24-2014, 09:55 AM   #144
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Susan,

How big are these files? Would it be possible for you to send them to me? I am unaware of this ever happening and it sounds more like a bug than user error, but it's hard to know for sure without seeing the data...
Brian Bushnell is offline   Reply With Quote
Old 11-24-2014, 02:43 PM   #145
susanklein
Senior Member
 
Location: oceania

Join Date: Feb 2014
Posts: 115
Default

Brian, nevermind, my mistake, I have realised its just the coverage file includes the local alignment overlap into the 'Ns' whereas my bam viewer (Geneious) trims the overlap and doesn't show it as coverage. Problem solved. Thanks anyway.

S.
susanklein is offline   Reply With Quote
Old 11-24-2014, 02:55 PM   #146
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

OK, good to hear!
Brian Bushnell is offline   Reply With Quote
Old 12-02-2014, 04:29 PM   #147
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Quote:
Originally Posted by Brian Bushnell View Post
Now, as for scoring - the affine score matrix is deeply embedded in BBMap. A 100bp read that maps perfectly with only matches has a 100% identity and BBMap score of 9970 (100 points per match, but only 70 points for the first match). If a 100bp read can map somewhere with 1 mismatch (99% identity, BBMap score of 9713: 9970+1*(-127-100-100+70)), that site will always be chosen over a site with 50bp match, 20k intron, 50bp match (4.7% identity, BBMap score of roughly 8841). But if the 100bp read can map somewhere with 5 mismatches (95% identity, BBMap score of 8685: 9970 +5*(-127-100-100+70)), the spliced alignment will be chosen, because 8841>8685. However they are very close so it would get a low mapping score.
Brian, do you mind explaining how the 50bp match, 20k intron, 50bp match ends up with a rough score of 8841? And in terms of the above examples that means it would have a final score ratio of about 0.887, right? thanks.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-02-2014, 04:55 PM   #148
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The scoring uses affine transforms, so consecutive cigar string symbols are scored differently according to how many there are. These are the scores:

Code:
POINTS_NOREF=0;
POINTS_NOCALL=0;
POINTS_MATCH=70;
POINTS_MATCH2=100;
POINTS_SUB=-127;
POINTS_SUB2=-51;
POINTS_SUB3=-25;
POINTS_INS=-395;
POINTS_INS2=-39;
POINTS_INS3=-23;
POINTS_INS4=-8;
POINTS_DEL=-472;
POINTS_DEL2=-33;
POINTS_DEL3=-9;
POINTS_DEL4=-1;
POINTS_DEL5=-1;

LIMIT_FOR_COST_2=1;
LIMIT_FOR_COST_3=5;
LIMIT_FOR_COST_4=20;
LIMIT_FOR_COST_5=80;
So, the first consecutive match is 70 points, and subsequent ones are 100 points. N is 0 points (whether in the reference or in the read). The first consecutive substitution is -127 points; subsequent ones are -51 points; and after 5 in a row, it drops to -25 points each.

The "LIMIT_FOR_COST_X" terms indicate how many consecutive symbols are needed before the cost drops. So, after 80 consecutive symbols (LIMIT_FOR_COST_5), the cost will drop to the lowest possible level. For deletions, this is POINTS_DEL5=-1; other types don't have a level-5 cost so just use the lowest available (e.g. POINTS_INS4 for insertions).

However, these costs are still not low enough to allow jumbo deletions. So I compress the deleted portions of the reference; every 128 bases are replaced by a single symbol, allowing effectively -1/128 point cost per deleted base, once the deletion is sufficiently long (though with present settings it costs -2/128).

Since it's pretty confusing to calculate the cost of a gap, you can currently do that with the main() function of one of the classes:

java -cp /path/to/current/ align2.MultiStateAligner11ts 20000

...will print out "-1129", which is the cost of a 20kbp gap. For a 100bp read, that would be added to the score from matching reads (2*70 + 98*100 = 9940) for a total of 8811.

Before I had said 8841, because I forgot about the additional 30-point penalty due to 2 bases being in the MATCH1 state, whereas in a perfect read, only the first base is in that state.
Brian Bushnell is offline   Reply With Quote
Old 12-02-2014, 04:56 PM   #149
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Holy smokes. And thanks!
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */

Last edited by sdriscoll; 12-02-2014 at 05:00 PM. Reason: forgot to say thanks.
sdriscoll is offline   Reply With Quote
Old 12-02-2014, 05:05 PM   #150
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by sdriscoll View Post
Brian, do you mind explaining how the 50bp match, 20k intron, 50bp match ends up with a rough score of 8841? And in terms of the above examples that means it would have a final score ratio of about 0.887, right? thanks.
Oh, right - yes, the score ratio would correspond to 8841/9970 = 0.887, if 8841 was in fact the read's raw score.
Brian Bushnell is offline   Reply With Quote
Old 12-04-2014, 12:14 PM   #151
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Hey Brian,

I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.

So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.

To give you an idea of some handy filtering options have a look at the help text from my perl script.

Code:
usage: bbmap-filter.pl [options] <alignments.bam>

Parses alignments from bbmap and modifies them based on options.

Options:
  -p INT           Number of threads for processing (default: 2)
  -N INT           Maximum number of mismatches allowed (default: any)
  -d INT           Maximum number of deletions per alignment (default: any)
  -D INT           Maximum length of deletion allowed (default: any)
  -i INT           Maximum number of insertions per alignment (default: any)
  -I INT           Maximum length of insertion allowed (default: any)
  -e INT           Maximum number of INDELS total per alignment (default: any)
  -E INT           Maximum total edits (INDELS+mismatches) per alignment (default: any)
  -c               Set reads not passing filters to qc-failed (default: unaligned)
  -q INT           Minimum MAPQ (default: any)
  -a               Output alignments passing filter only (default: all)
  -h               Show this message and exit
This gives me some flexibility so I can filter out high substitution unspliced alignments and keep those low substitution spliced alignments. Also on occasion I like to filter alignments by only allowing a single edit (whether it is a single base indel or a single substitution). Anyways, if you're bored, it wouldn't be a bad addition to the suite of tools to have some kind of filter like this.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-04-2014, 01:05 PM   #152
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.

*edit*

My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */

Last edited by sdriscoll; 12-04-2014 at 01:17 PM. Reason: figured it out
sdriscoll is offline   Reply With Quote
Old 12-04-2014, 01:22 PM   #153
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by sdriscoll View Post
Hey Brian,

I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.

So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.

To give you an idea of some handy filtering options have a look at the help text from my perl script.

Code:
usage: bbmap-filter.pl [options] <alignments.bam>

Parses alignments from bbmap and modifies them based on options.

Options:
  -p INT           Number of threads for processing (default: 2)
  -N INT           Maximum number of mismatches allowed (default: any)
  -d INT           Maximum number of deletions per alignment (default: any)
  -D INT           Maximum length of deletion allowed (default: any)
  -i INT           Maximum number of insertions per alignment (default: any)
  -I INT           Maximum length of insertion allowed (default: any)
  -e INT           Maximum number of INDELS total per alignment (default: any)
  -E INT           Maximum total edits (INDELS+mismatches) per alignment (default: any)
  -c               Set reads not passing filters to qc-failed (default: unaligned)
  -q INT           Minimum MAPQ (default: any)
  -a               Output alignments passing filter only (default: all)
  -h               Show this message and exit
This gives me some flexibility so I can filter out high substitution unspliced alignments and keep those low substitution spliced alignments. Also on occasion I like to filter alignments by only allowing a single edit (whether it is a single base indel or a single substitution). Anyways, if you're bored, it wouldn't be a bad addition to the suite of tools to have some kind of filter like this.
Those sound useful, and won't be very difficult, so I will go ahead and add them. Thanks for the suggestion!

Quote:
Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.

*edit*

My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!
OK, I will add that.

Incidentally, in my testing, JNI speeds up BBMap by around 25-30%, which is nice but not huge, probably because it is memory-bandwidth limited or too branchy. However, it speeds up Dedupe (when edits are allowed) and BBMerge by closer to 200%; both are compute-limited.
Brian Bushnell is offline   Reply With Quote
Old 12-04-2014, 01:31 PM   #154
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Awesome. Yeah for all of the sam/bam utilities out there I haven't found one that could do elaborate filtering based on the details of the alignment. Probably because those tools can't guarantee that aligners are writing the alignment files with the necessary flags. Within the ecosystem of BBTools, however, you can control all...so that would be fantastic.

I'm testing JNI with bbmap now and will report back with the speedup.

Speedup: ~ 2%

This was an alignment that I'd feed into eXpress for gene expression quantification. Options used:

Code:
t=8 ambig=all minid=0.95 usejni=t saa=t secondary=t maxsites=20 sam=1.3 trd=t
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */

Last edited by sdriscoll; 12-04-2014 at 01:36 PM. Reason: added speedup result
sdriscoll is offline   Reply With Quote
Old 12-04-2014, 05:09 PM   #155
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

That's a little disappointing, but thanks for the report! The speedup should be greater the worse the reads match the genome, as the JNI portion only affects the affine-transform alignment, which is not used unless reads have more than 1 mismatch to the reference.
Brian Bushnell is offline   Reply With Quote
Old 12-05-2014, 09:01 AM   #156
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 42
Default bbmap and multimapper MAPQ

Hello Brian and others,

Usually I run bbmap with the ambig=best parameter and am good to go, but for one of our datasets, it may prove favourable to give multi-mapping reads some special treatment. Therefore I now have to familiarize myself with the SAM format and find optimal settings for high quality mapping of spliced stranded reads.

I ran bbmap for an initial test with those parameters
Code:
ambig=all minid=0.9 padding=6 tipsearch=200 maxindel=200000 intronlen=5 secondary=T quickmatch=T sssr=0.97 local=T saa=F xstag=T xmtag=T nhtag=T
and am now messing around with the alignments. The current idea is to use bbmaps real MAPQ and modify it according to our additional criteria. Then I will sort the file based on the read ID and secondarily descending on the final MAPQ and neglect ID duplicates to resolve multi-mappers.

However when I check the original MAPQ, I only see qualities between 1-3 despite I have chosen local=T option. Shouldn't there be some higher MAPQ reads also within the multi-mappers? Or did I misunderstand the local option?

Code:
# All MAPQ
awk -F "\t" '!/^@/ {print $5}' $SAMFILE | sort | uniq
# Only multimapping MAPQ
grep -Fvw "NH:i:1" $SAMFILE | awk -F "\t" '!/^@/ {print $5}' | sort | uniq
Thanks a lot
Matthias

PS: Support for something like sdriscoll's bbmap-filter.pl becoming a part of bbmap suite.
Thias is offline   Reply With Quote
Old 12-05-2014, 10:43 AM   #157
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hmm... all multi-mapping reads get their quality penalized, according to how many sites there are and how close the scores are. Because the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.

You can also use the flag "idtag", though, which will print each read's percent identity to the reference in a custom field. This value is unaffected by multimapping, so you could use that for unbiased filtering by mapping quality.
Brian Bushnell is offline   Reply With Quote
Old 12-07-2014, 08:13 AM   #158
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 42
Default

Quote:
Originally Posted by Brian Bushnell View Post
Because the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.
Thanks a lot for your kind offer, but your hint with the "idtag" is perfectly fine for our purpose, so no need to adjust your code. I just misunderstood the doings of the local=T flag in exactly the proposed way. Since some downstream tools may depend on the real MAPQ, on second thought it now anyway seems wiser to introduce an extra column for that custom score, resolve the multi-mappers according to it and kick that column out again for the final SAM file.

Thanks a lot!
Matthias
Thias is offline   Reply With Quote
Old 12-08-2014, 04:25 PM   #159
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

sdriscoll and Matthias,

I've added the following flags:

subfilter=-1 Ban alignments with more than this many substitutions.
insfilter=-1 Ban alignments with more than this many insertions.
delfilter=-1 Ban alignments with more than this many deletions.
indelfilter=-1 Ban alignments with more than this many indels.
editfilter=-1 Ban alignments with more than this many edits.
inslenfilter=-1 Ban alignments with an insertion longer than this.
dellenfilter=-1 Ban alignments with a deletion longer than this.

(all of those have no effect if the value is negative)

penalizeambiguous=t (pambig) Penalize the mapping score of reads that multimap.
Brian Bushnell is offline   Reply With Quote
Old 12-09-2014, 03:10 AM   #160
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 42
Default

Hello Brian,

Kudos to you for doing all of that at no time. This will certainly make the bbmap suite even more useful than it already is.

Thanks again
Matthias

BTW: I solved my problem by using the idtag and the scoretag:

Code:
 grep -o 'YR:i:[0-9]*' $MULTIMAPSAMFILE | cut -f3- -d':'
I tweaked the scores a bit for our purpose and pasted them as first column to the sam. Then I sorted the file on the read ID and the scores and removed the duplicates.
Quote:
sort -k2,2 -r -k1,1 $SCOREDMULTIMAPSAMFILE | awk -F "\t" '!x[$2]++' | cut -f2-
Thias is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:25 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO