SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
BBMap (aligner for DNA/RNAseq) is now open-source and available for download. Brian Bushnell Bioinformatics 657 10-29-2018 06:28 PM
BBMap for BitSeq dietmar13 Bioinformatics 1 04-30-2015 09:40 AM
Please help my BBMap cannot remove Illumina adapter TofuKaj Bioinformatics 4 04-28-2015 09:53 AM
BBMap Error Phage Hunter Bioinformatics 5 01-14-2015 05:34 AM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 10:37 AM

Reply
 
Thread Tools
Old 03-12-2016, 02:59 PM   #81
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

@mcauchy: Consider @Brian's reply above when choosing a memory setting.

If you know that your files fall in the second category then you may want to re-do the trimming with a paired-end aware trimmer (so the reads don't get out of sync in first place) e.g. use bbduk.sh from BBMap. That option will not require a lot of RAM and would prove convenient.
GenoMax is online now   Reply With Quote
Old 03-14-2016, 07:49 AM   #82
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default Kmer Counting

Hi Brian, tried to send you a private message, but I guess you're popular:

"Brian Bushnell has exceeded their stored private messages quota and cannot accept further messages until they clear some space."

Anyhow!

New odd question for you Brian: I'm doing some kmer counting with kmercountexact.sh. I'm running a kmer series on 100bp DNA-Seq reads from k=21 to k=55. Two things I noticed:

1) After k=31, the program only calculates kmer counts for even numbers. The output reads (for the example of k=55): "K was changed from 55 to 54", and it did that for every odd value of k from 33 to 55

2) The unique kmer rises as the value of k rises, but only to a point. In my two samples, it seems like after k=31 that as k increases, the number of unique kmers decreases. How can this be? Am I missing something conceptual about how kmer counts should be expected?

Best,
Bob

EDIT: Added kmer graph


Last edited by jazz710; 03-14-2016 at 07:57 AM. Reason: Added Image
jazz710 is offline   Reply With Quote
Old 03-14-2016, 08:01 AM   #83
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

@jazz710: kmercountexact.sh is only designed to accept a max of k=31. I am not sure what it is doing once you go past that point.

On a different note: @Brian now works at google and continues to support BBTools in spare time (don't worry, BBTools are going to remain supported/be developed, just at a slower pace than what we were used to last year). So look for an official answer from him late tonight.

Note: People can still contact Brian via his JGI/LBL email address. That email address remains valid and can be found in the help menu for any BBTools program. Same constraint about time applies to this method as well.

Last edited by GenoMax; 03-14-2016 at 08:26 AM.
GenoMax is online now   Reply With Quote
Old 03-14-2016, 08:05 AM   #84
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default

Thanks GenoMax!
jazz710 is offline   Reply With Quote
Old 03-16-2016, 10:14 PM   #85
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Let me clarify this:

As GenoMax stated, KmerCountExact originally only supported K=1 to 31. It now supports unlimited kmer lengths, with the provision that K=32 to 62 must be multiples of 2, 63 to 93 must be multiples of 3, and generally (N-1)*31+1 to (N)*31 must be multiples of N, which makes reverse-complementing much faster.

It is expected that the kmer counts form an arc like that with a peak at some specific K. As K increases, the number of unique kmers increases in general as longer repeats are spanned. But as K approaches read length, the number of unique kmers will drop to zero because reads will contain fewer kmers. So there is some peak between the zeros on each end, and thus, the graph basically looks as expected.

What should never happen is the discontinuity between K=31 and K=32, which is clearly an artifact of the program. I will investigate that this weekend; in the meantime, could you tell me the exact command you used, and version of BBTools? It's possible that this is due to the way kmers are filtered to exclude error kmers, which can be set in the command line, but I'll find out.

Sorry about my inbox; I've cleared some space!
Brian Bushnell is offline   Reply With Quote
Old 03-18-2016, 07:34 AM   #86
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default

They were just basic kmercountexact.sh commands:

kmercountexact.sh -k <21-55> in=<File1> in2=<File2> mincount=2 prefilter=2

The graph included in my post above is the output from the Unique kmers for each value of K. I should note that while both of my species showed the same 'jump' it wasn't at k=31 each time. I trashed most of that output because I went another way with that part of the analysis, but if it would help, I could always re-run it quickly to get you the specific output.
jazz710 is offline   Reply With Quote
Old 03-24-2016, 08:38 AM   #87
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default Cross Species Mapping

Hello Geno and Brian,

I have seen BBMap referenced on a handful of forums as a good option for mapping short DNA-seq reads onto a closely related species. Bowtie et al. can't seem to handle very many SNPs.

Looking at the BBMap documentation, I'm not exactly sure what parameters I should tweak to loosen the mapping conditions. I was hoping there would simply be an hdist/edist option that I could set to 4 or 5 or so but I don't see that directly stated.

Are there a set of parameters you would recommend tweaking to get optimal mapping across species (given the obvious biological limitations)?

Best,
Bob

EDIT: Would it be editfilter=4 or 5? That would allow for 4 or 5 SNPs or indels, correct?

Last edited by jazz710; 03-24-2016 at 09:13 AM.
jazz710 is offline   Reply With Quote
Old 03-25-2016, 11:47 AM   #88
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Bob,

BBMap does not have an hdist/edist option because it evaluates alignments using a scoring system based on evolutionary probability. For example, two adjacent deletions are much more likely than two non-adjacent deletions. So these are considered as "events", and BBMap makes alignments optimizing the most probable layout of events that would lead to the resulting read. For example, a 100bp read containing a 100bp deletion event could be aligned with 70 matches, and 30 bases in which 1/4 of them are matches and 3/4 of them are mismatches. Or, it could be aligned as 70 matches and 30 bases clipped (which is typical). Or, it could be aligned as 70 matches, a 100bp deletion, and 30 matches. Which is what BBMap does by default.

Every alignment in BBMap has a score, and scores below the minimum are discarded. You can specify that with, for example, "minid=70", which will discard alignments with identity below roughly 70% (it is "roughly" because BBMap discards things based on probability rather than percent identity, which is an inferior metric. It still offers filters like "idfilter" for situations in which percent identity is an important metric, but I don't recommend using them for real research; they are just for publications in which people need to state "I excluded any alignment with more than X of property Y". It also offers specific filters to discard reads with more than X SNPs, or Y insertions, or Z deletions, or whatever. But again, I do not think these are useful in your case because you want cross-species alignments, which are best modeled by evolutionary distance, which BBMap was designed for.

You can increase BBMap's sensitivity using various means.

For preprocessing:

1) Adapter-trim your reads prior to alignment, using BBDuk.
2) Quality-trim your reads to ~Q10, if they have low-quality tails, also with BBDuk.
3) Error-correct reads (if you have sufficient coverage, typically >10x) using Tadpole.

For mapping:

1) Set the "slow" flag, which greatly increases sensitivity: bbmap.sh in=x.fq out=y.fq slow or the "vslow" flag, which increases it even more.
2) Set sensitivity very high: "vslow minid=0 k=11 maxindel=200"

Each of those flags is a bit different. Maxindel is default 16000 but for inter-species mapping reducing it increases sensitivity. You can further reduce k down to, say, 9, but it greatly affects speed and values below 10 start to cause issues in repetitive areas. The default is 13. k is basically the seed length - no alignments will be considered unless the read shares k consecutive exact matches with the reference. For cross-species mapping, using "slow" or "vslow", and varying the parameters maxindel, k, and minid are the best ways of adjusting sensitivity.
Brian Bushnell is offline   Reply With Quote
Old 03-30-2016, 06:32 AM   #89
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default

Thank you, as always for your thorough response! New question time:

I'm running a bbduk contamination removal and with normal settings, and the run takes ~30m-45m (it's a big dataset). However, when I set hdist=1 (rather than 0) now the job has been running for 14h and is still only to this point:

BBDuk version 35.82
Set threads to 40
Initial:
Memory: max=823202m, free=788842m, used=34360m

I know hdist affects the time of the run, but is that amount what you would expect?

Best and many thanks,

Bob
jazz710 is offline   Reply With Quote
Old 03-30-2016, 06:47 AM   #90
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

That sounds unusually long (especially if the original job took only 45 min). Are the trimmed output files still increasing in size? Are you using more than one thread for this job?
GenoMax is online now   Reply With Quote
Old 03-30-2016, 06:51 AM   #91
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default

There is no output yet...

40 threads, 800G memory.
jazz710 is offline   Reply With Quote
Old 03-30-2016, 06:59 AM   #92
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

I have a feeling the job is hung. Is this under a job scheduler? In my experience, bbduk starts writing output even when running under a scheduler right away.
GenoMax is online now   Reply With Quote
Old 03-30-2016, 07:11 AM   #93
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default

It was submitted to a server with a scheduler, yes.

It's just odd that it's the exact same code and dataset, but the only difference is hdist=1...same set up with hdist=0 runs fine.
jazz710 is offline   Reply With Quote
Old 03-30-2016, 07:22 AM   #94
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

Are you able to ssh to the node this is job is on and see if the bbduk is running (something like top -H)?
GenoMax is online now   Reply With Quote
Old 03-30-2016, 07:25 AM   #95
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default

jazz710 is offline   Reply With Quote
Old 03-30-2016, 07:34 AM   #96
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

The java process (should correspond to bbduk) appears to be sleeping (at least in this screenshot). Does it change back to R periodically or is it staying in S?

@Brian had talked about memory requirements with hdist=1 before. It may not be applicable in your case.
GenoMax is online now   Reply With Quote
Old 03-30-2016, 07:37 AM   #97
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default

Based on a minute of observation, it appears to be sleeping, however the %CPU and %MEM change as I re-issue 'top'.
jazz710 is offline   Reply With Quote
Old 03-30-2016, 07:43 AM   #98
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default

Based on Brian's comments, I bet I have run out of memory. My reference is very large, and a 3*K increase in memory would put me over my server capacity. I will try to re-run with a smaller reference.
jazz710 is offline   Reply With Quote
Old 03-30-2016, 07:45 AM   #99
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

Unless your scheduler is buffering the entire output in some temp location you should have seen trimmed files right away (have you ever looked before if a job does that)? Did you try top -H to see all ~40 java threads?

Based on your last comment that must be the case.
GenoMax is online now   Reply With Quote
Old 04-11-2016, 08:40 PM   #100
susanklein
Senior Member
 
Location: oceania

Join Date: Feb 2014
Posts: 115
Default reads mapped per gene

Hi,

can you suggest how to get raw reads mapped per gene from bbmap?

Thanks,

S.
susanklein is offline   Reply With Quote
Reply

Tags
bbmap

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:44 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO