SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM
tophat/cufflinks with different read lengths libraries linsson RNA Sequencing 0 07-03-2012 10:21 AM
Mate pairs contaminated with paired ends - impact on assembly? reithme Bioinformatics 2 12-13-2009 11:35 PM

Reply
 
Thread Tools
Old 10-02-2017, 06:06 PM   #41
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,495
Default

@Brian: Reads are single end 100 bp. Commands used are in post #37
GenoMax is offline   Reply With Quote
Old 10-04-2017, 09:24 AM   #42
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

The numbers of mapped reads do not add up to the same record in each case... it seems like the program is crashing and not processing all the reads, or else the input is different. The output is deterministic for single-ended reads so I can't see any other explanation. Does the program print an error message?
Brian Bushnell is offline   Reply With Quote
Old 10-05-2017, 12:31 AM   #43
wlokhorst
Junior Member
 
Location: The Netherlands

Join Date: Oct 2017
Posts: 5
Default

No error messages are produced, but you might be right about it crashing. (It also crashes when I put ./ in front of the input file.) Maybe you can see what goes wrong from the output text, so here it is:
Code:
BBMap version 37.53
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
Retaining first best site only for ambiguous mappings.
NOTE:   Ignoring reference file because it already appears to have been processed.
NOTE:   If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
Set genome to 1

Loaded Reference:       1.506 seconds.
Loading index for chunk 1-1, build 1
Generated Index:        1.482 seconds.
Analyzed Index:         3.715 seconds.
Cleared Memory:         6.933 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 48 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47

   ------------------   Results   ------------------

Genome:                 1
Key Length:             13
Max Indel:              20
Minimum Score Ratio:    0.56
Mapping Mode:           normal
Reads Used:             40428   (3919820 bases)

Mapping:                13.209 seconds.
Reads/sec:              3060.74
kBases/sec:             296.76


Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                   4.5464%            1838         4.3524%             170607
unambiguous:              1.2120%             490         1.2217%              47890
ambiguous:                3.3343%            1348         3.1307%             122717
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:        0.7173%             290         0.6776%              26561
semiperfect site:         0.7297%             295         0.6905%              27066

Match Rate:                   NA               NA        91.4514%             156445
Error Rate:              42.4134%            1543         8.4773%              14502
Sub Rate:                42.1935%            1535         7.7799%              13309
Del Rate:                 2.2265%              81         0.2701%                462
Ins Rate:                 4.1506%             151         0.4273%                731
N Rate:                   0.2474%               9         0.0713%                122

Total time:             27.078 seconds.
wlokhorst is offline   Reply With Quote
Old 10-05-2017, 03:09 AM   #44
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,495
Default

Do you only have 40428 reads in the file? 48 threads seems like an overkill for this amount of reads. Can you use 4 and see if that helps? You can also delete the previous bbmap index files and force the program to create new ones using ref=.

Adding relative paths before the file names should not cause the program to crash. It never has for me.
GenoMax is offline   Reply With Quote
Old 10-05-2017, 03:29 AM   #45
wlokhorst
Junior Member
 
Location: The Netherlands

Join Date: Oct 2017
Posts: 5
Default

40428 reads indeed. The number of threads shouldn't matter for the succes or failure of the process. However, I did run it on 4 threads for you, so here are the results:
Code:
BBMap version 37.53
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
Set threads to 4
Retaining first best site only for ambiguous mappings.
NOTE:   Ignoring reference file because it already appears to have been processed.
NOTE:   If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
Set genome to 1

Loaded Reference:       1.647 seconds.
Loading index for chunk 1-1, build 1
Generated Index:        1.542 seconds.
Analyzed Index:         3.989 seconds.
Cleared Memory:         6.978 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 4 mapping threads.
Detecting finished threads: 0, 1, 2, 3

   ------------------   Results   ------------------

Genome:                 1
Key Length:             13
Max Indel:              20
Minimum Score Ratio:    0.56
Mapping Mode:           normal
Reads Used:             40428   (3919820 bases)

Mapping:                10.603 seconds.
Reads/sec:              3812.77
kBases/sec:             369.68


Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                   4.5464%            1838         4.3524%             170607
unambiguous:              1.2120%             490         1.2217%              47890
ambiguous:                3.3343%            1348         3.1307%             122717
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:        0.7173%             290         0.6776%              26561
semiperfect site:         0.7297%             295         0.6905%              27066

Match Rate:                   NA               NA        91.4514%             156445
Error Rate:              42.4134%            1543         8.4773%              14502
Sub Rate:                42.1935%            1535         7.7799%              13309
Del Rate:                 2.2265%              81         0.2701%                462
Ins Rate:                 4.1506%             151         0.4273%                731
N Rate:                   0.2474%               9         0.0713%                122

Total time:             25.038 seconds.
This also includes removal of the ref folder (which I do every time before testing the next run) and re-creation with the first command I've given you before.
wlokhorst is offline   Reply With Quote
Old 10-05-2017, 04:39 AM   #46
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,495
Default

For some other programs in BBMap suite there has been a "race condition" which produces a nasty (though according to @Brian a harmless) error related to threads.

At least in this case the result appears to be identical. In the last table you had given us the results all looked different.

Not sure why you still have this in the output, if you had deleted the ref folder:
Quote:
NOTE: Ignoring reference file because it already appears to have been processed.
NOTE: If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
You can set nodisk=t to prevent the index from being written to disk (memory will always be used).

Last edited by GenoMax; 10-05-2017 at 04:43 AM.
GenoMax is offline   Reply With Quote
Old 10-11-2017, 01:04 PM   #47
sk8bro
Member
 
Location: boston

Join Date: Feb 2012
Posts: 29
Default

Hi Brian,

I am wondering for BBsplit (with ambiguous2=split) if it is expected behavior that if I provide 2 references:

Ref 1 has 1 sequence that is a perfect match to PE sequencing reads
Ref 2 has many unique sequences that are at least 1 MM different from sequencing reads

For BBsplit to think the reads are ambiguously mapped? And for this to be the case for >90% of read pairs but not for every read pair?

Is there a way to force BBsplit to select Ref1 (perfect) over the options in Ref2?

I am trying to remove a nearly-identical putative contaminant sequence from my data...

(FYI I have ambiguous is set to best, and I've also tried setting both ambiguous and ambiguous2 to best, but the reads are almost all considered to be ambiguous in that scenario too)

Thx! Kate

Last edited by sk8bro; 10-11-2017 at 01:08 PM.
sk8bro is offline   Reply With Quote
Old 10-11-2017, 01:43 PM   #48
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by GenoMax View Post
You can set nodisk=t to prevent the index from being written to disk (memory will always be used).
Actually, "nodisk" does not work with BBSplit... sorry! I'll clarify that in the documentation. It's not like it's impossible to make it work, but it would be pretty complicated; one of those things that I would do if I could clone myself.

Quote:
Hi Brian,

I am wondering for BBsplit (with ambiguous2=split) if it is expected behavior that if I provide 2 references:

Ref 1 has 1 sequence that is a perfect match to PE sequencing reads
Ref 2 has many unique sequences that are at least 1 MM different from sequencing reads

For BBsplit to think the reads are ambiguously mapped? And for this to be the case for >90% of read pairs but not for every read pair?

Is there a way to force BBsplit to select Ref1 (perfect) over the options in Ref2?

I am trying to remove a nearly-identical putative contaminant sequence from my data...

(FYI I have ambiguous is set to best, and I've also tried setting both ambiguous and ambiguous2 to best, but the reads are almost all considered to be ambiguous in that scenario too)

Thx! Kate
Hi Kate,

You can certainly set "perfectmode" to only allow mappings that are 100% identity to the reference. For example, you could run once in perfectmode, and then run the remaining unmapped reads normally. But generally, yes, this is intended behavior. BBSplit is intended to do things like separate mouse reads from human reads when a mouse is used as the vector for some human DNA study. Also, it is designed to separate reads belonging to various bacteria in a metagenome, and chloroplast/plant reads in a plant. In any of these cases, if there is a single bp mismatch, you can't unambiguously assign a read to one reference or the other, since it could be a sequencing error or (more importantly) an actual variation.

You might find Seal (seal.sh) useful. It has functionality similar to BBSplit, but is alignment free (meaning it is way faster, but uses more memory). It allows you to specify cutoffs, so you can for example send all of the perfectly-matching reads to one file, regardless of whether they almost match another genome. It decides which file to assign a read to based on the number of kmers (by default, 31-mers) matching that reference. When dealing with bacteria, I always prefer Seal over BBSplit because it is so much faster and easier to use, and bacteria have tiny genomes that are under high evolutionary pressure to avoid low-complexity sequence. Euks have huge genomes with a lot of low-complexity regions so BBSplit is better in that case, since it is more precise.

If you want to remove a particular contaminant from your data, I suggest trying BBDuk. BBDuk does not allow kmers longer than 31. However, it emulates longer kmers. For example, if you set "k=90", then it will consider reads as matching (by default, they get discarded) if there is a 90bp stretch in which all 31-mers match 31-mers in the reference. In practice this is very similar to matching 90-mers.

So, for example:

bbduk.sh in=reads.fq out=filtered.fq ref=contaminant.fa k=90 mm=f

That will remove all reads containing at least 90bp that shares all kmers with your reference (which should be the unwanted sequence). The exact values depend on the length of the sequence in question. Note that read length is very important here; if a read only overlaps 80bp of the 90bp sequence in question, it would not be removed.
Brian Bushnell is offline   Reply With Quote
Old 10-11-2017, 01:48 PM   #49
sk8bro
Member
 
Location: boston

Join Date: Feb 2012
Posts: 29
Default

Thanks Brian, I will look into the options you suggested to handle this use case!

Kate
sk8bro is offline   Reply With Quote
Old 10-12-2017, 03:53 AM   #50
Microalgues
Junior Member
 
Location: Belgium

Join Date: Oct 2017
Posts: 2
Default Biological question about contaminants DB

Dear Brian,

I have created a de novo transcriptome of a non-model organism. After BLASTing the contigs I found some homology with possible contaminants. My question is if I should use DNA genomes (un/masked) or the CDS gene predictions from ENSMBL database.

In the first case, do you recommend the unmasked, the masked version of genomes (ensmbl) or should I use BBMask?
And in the second case, could BBsplit handle with CDS mapping?

I am a bit lost, I think that contigs from de novo transcriptome have to be mapped against CDS sequences.

With thanks,
Xavier
__________________
Xavier
Microalgues is offline   Reply With Quote
Old 10-12-2017, 04:40 AM   #51
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,495
Default

While Brian will have a more detailed insight it should be ok to use the unmasked genome with BBSplit. Any mapping issues you may have with short reads should be more or less same with genome or just CDS sequences.
GenoMax is offline   Reply With Quote
Old 10-12-2017, 06:28 AM   #52
Microalgues
Junior Member
 
Location: Belgium

Join Date: Oct 2017
Posts: 2
Default

Quote:
Originally Posted by GenoMax View Post
While Brian will have a more detailed insight it should be ok to use the unmasked genome with BBSplit. Any mapping issues you may have with short reads should be more or less same with genome or just CDS sequences.
Thanks GenoMax,

I see from other thread that masked genomes are used in order to prevent false positives when removing contamination.

At the same time in the thread you suggest simply use BBSplit, which should be enough.

Quote:
Originally Posted by GenoMax View Post
You can create them yourself using bbmask.sh. Not sure if you would need to if you are just looking to remove reads mapping to mito and chloroplast.

I assume you have seen BBsplit, which can be used for this purpose.
With thanks,
Xavier
__________________
Xavier
Microalgues is offline   Reply With Quote
Old 10-12-2017, 07:17 AM   #53
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,495
Default

@Brian is doing something very specific to remove human contamination in JGI's non-human samples.

Re-reading your original question it would be good to know if your contaminants are "close" relatives or can be considered a distant species. Success or failure of bbsplit is going to largely depend on that. No tool is going to be able to separate reads from a very closely related species based on sequence alone.

Last edited by GenoMax; 10-12-2017 at 07:38 AM.
GenoMax is offline   Reply With Quote
Old 10-13-2017, 02:12 PM   #54
sk8bro
Member
 
Location: boston

Join Date: Feb 2012
Posts: 29
Default

Hi Brian,

So I've considered your 3 suggestions and Seal seems the path of least resistance. To separate sequences with 1 SNP of importance, in presence of other sequencing error SNPs the references need to be considered simultaneously or else which SNP is the important one gets lost.

So... i took approach of using BBsplit (to ensure the .95 minid/idfilter) that I want, and then took everything mapped as input to Seal (default except k=8, ambiguous=toss).

There are a few read pairs that are coming out as "Unmapped" in the Seal step, but they aligned with BBsplit. I looked at one and saw it has 1 SNP in each of F and R ~150 bp reads. Which Processing parameter do I need to change in order to "Map" this read pair?

Thx, Kate
sk8bro is offline   Reply With Quote
Reply

Tags
aligners, bbsplit, binning, contaminant, metagenome

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:58 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO