SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM
tophat/cufflinks with different read lengths libraries linsson RNA Sequencing 0 07-03-2012 10:21 AM
Mate pairs contaminated with paired ends - impact on assembly? reithme Bioinformatics 2 12-13-2009 11:35 PM

Reply
 
Thread Tools
Old 03-13-2015, 02:44 PM   #21
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,490
Default

@Manuel: Are these 20000 sequences related (contigs)?

@Brian: Thanks for the additional explanation.
GenoMax is offline   Reply With Quote
Old 03-13-2015, 04:41 PM   #22
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Manuel,

I released a new version of BBTools, 34.65, in which you can now specify a directory for the "ref=" argument. If anything in the list is a directory, it will use all fasta files in that directory. They need a fasta extension, like .fa or .fasta, but can be compressed with an additional .gz after that. Have a nice weekend!
Brian Bushnell is offline   Reply With Quote
Old 03-16-2015, 02:09 PM   #23
manuelkleiner
Junior Member
 
Location: Calgary

Join Date: Jun 2014
Posts: 5
Default

Wow that was fast! Thank you Brian!
I will test it tomorrow and let you know how it works out for me.

@GenoMax: the 20,000 Mulit-fasta files are the product of a rather strict binning of multiple metagenomes. So some files contain multiple sequences, some only contain one. Concatenating would thus not be an option and the scaffold/chromosome statistics function would also not work to get the statistics per file.
manuelkleiner is offline   Reply With Quote
Old 09-08-2015, 11:36 PM   #24
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 300
Default

Hi Brian,

when using BBsplit (also latest version) with long reads (Moleculo data) the program shreds these long reads into 500 bp pieces. It seems to me that the distributing of the reads according to reference is not working in this case. For me about 0.5% of the genomic reads are mapping to mitochondria according to the BBsplit report, but the out-files contain about 99% of the reads. Perhaps I am misinterpreting anything?

Thanks!
luc is offline   Reply With Quote
Old 09-08-2015, 11:54 PM   #25
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Luc,

It's quite possible that you are not misinterpreting anything, but the results are incorrect. Would you mind giving the exact parameters and version (I assume 35.43) you used? I will probably be able to replicate the bug without your specific data.

Thanks,
Brian

P.S. By the way, "mapPacBio.sh" will use approximately the same algorithm but only shred them into 6000bp shreds.

Last edited by Brian Bushnell; 09-08-2015 at 11:57 PM.
Brian Bushnell is offline   Reply With Quote
Old 09-09-2015, 11:45 PM   #26
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 300
Default

Hi Brian,

yes I am using version 35.43.

I ran for example:
bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta build=1 maxindel=100 minratio=0.8 refstats=mitorefstats.txt out=MolToMitoreads.fasta &

The results look like this:
------------------ Results ------------------

Genome: 1
Key Length: 13
Max Indel: 100
Minimum Score Ratio: 0.8
Mapping Mode: normal
Reads Used: 887008 (419139365 bases)

Mapping: 18.567 seconds.
Reads/sec: 47772.92
kBases/sec: 22574.22


Read 1 data: pct reads num reads pct bases num bases

mapped: 0.3272% 2902 0.3307% 1386018
unambiguous: 0.2006% 1779 0.2065% 865395
ambiguous: 0.1266% 1123 0.1242% 520623
low-Q discards: 0.0000% 0 0.0000% 0

perfect best site: 0.0046% 41 0.0024% 9939
semiperfect site: 0.0046% 41 0.0024% 9939

Match Rate: NA NA 94.2618% 1332227
Error Rate: 43.3616% 2861 5.7382% 81100
Sub Rate: 43.2707% 2855 2.4932% 35237
Del Rate: 29.7818% 1965 1.9322% 27309
Ins Rate: 28.8118% 1901 1.3128% 18554
N Rate: 0.0000% 0 0.0000% 0

The problems is that the out-file contains almost all of the Moleculo data - in 500bp pieces - instead of the expected 0.33 %.

Last edited by luc; 09-09-2015 at 11:48 PM.
luc is offline   Reply With Quote
Old 09-10-2015, 12:18 AM   #27
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Luc!

The syntax for BBSplit is different from BBMap. It's like this:

bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta basename=out_%.fa outu=unmapped.fa

That will give you 3 output files, "out_mitochondrion1.fa", "out_mitochondrion2.fa", and "unmapped.fa".

Unfortunately, BBSplit by default breaks reads longer than 500bp into 500bp pieces. It's technically possible to change this to 6kbp pieces with the use of a couple extra parameters (specifically "msa=MultiStateAligner9PacBio fastareadlen=6000"), but I suggest you try Seal instead. Seal does not do alignment, just kmer-matching, but I tend to use it in this kind of situation as it is much faster and can handle reads of unlimited length without breaking them up. The default is a hamming distance of zero and considering something a match if even a single kmer is shared (and by default k=31).

Syntax:

seal.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta pattern=out_%.fa outu=unmatched.fa mkf=0.05 hdist=1

The last 2 parameters are optional, but specify a hamming distance of 1 (which is fine in the case of mitochondria, which are tiny) and require a read to share 5% of its kmers with a reference sequence to be considered matching. You can increase that to a substantially higher fraction (like 0.5), and eliminate the hamming distance, if you expect a very close match between the reads and the reference.

Last edited by Brian Bushnell; 09-10-2015 at 12:23 AM.
Brian Bushnell is offline   Reply With Quote
Old 03-05-2017, 02:51 PM   #28
RubenD
Junior Member
 
Location: Boston

Join Date: Mar 2017
Posts: 3
Default how to change default bbsplit parameters?

Hi all,

I would like to use bbsplit to separate my fastq files.

First I generated an index and now I am running a for loop for all my fastq's.

However when I run the command it seems that the default parameter values are not changed, see for example parameters minhits and maxindel in bold below. Which parameter will it take?

Am I doing something wrong here?


Quote:
Merged reference file /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz already exists; skipping merge.
Ref merge time: 0.028 seconds.
Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, minhits=2, maxindel=200000, t=10, ambiguous=best, ambiguous2=best, build=2, in=../Fastq_files/20161223_CDX10_CC3737_S7_R1_001.fastq.gz, out_human=20161223_CDX10_CC3737_S7_R1_001_human.fastq, out_mouse=20161223_CDX10_CC3737_S7_R1_001_mouse.fastq, scafstats=scafstats_20161223_CDX10_CC3737_S7_R1_001.txt, refstats=refstats_20161223_CDX10_CC3737_S7_R1_001.txt, ref=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz]

BBMap version 37.00
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
Set threads to 10
Scaffold statistics will be written to scafstats_20161223_CDX10_CC3737_S7_R1_001.txt
Reference set statistics will be written to refstats_20161223_CDX10_CC3737_S7_R1_001.txt
Retaining first best site only for ambiguous mappings.
NOTE: Ignoring reference file because it already appears to have been processed.
NOTE: If you wish to regenerate the index, please manually delete /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/summary.txt
Set genome to 2
RubenD is offline   Reply With Quote
Old 03-05-2017, 04:34 PM   #29
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,490
Default

@RubenD: Please provide the exact BBSplit command you are using. You need to provide more than one reference for BBSplit to work right.
GenoMax is offline   Reply With Quote
Old 03-05-2017, 04:47 PM   #30
RubenD
Junior Member
 
Location: Boston

Join Date: Mar 2017
Posts: 3
Default

@GenoMax: it seems to be running correctly and produces sensible results, however I'm not sure whether it uses the default or my custom parameter settings.

I have first generated an index for mouse and human:

Quote:
bbsplit.sh build=2 maxindel=200000 minhits=2 ambiguous=best ambiguous2=best \
ref_human=/gcdata/gcproj/Ruben/GENOMES/Human/Sequences/Ensembl/GRCh37/Release75/Genome/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
ref_mouse=/gcdata/gcproj/Ruben/GENOMES/Mouse/Sequences/NCBIM37.67/Mus_musculus.NCBIM37.67.dna.toplevel.fa


Then I split my fastq files, again this works but I'm not sure whether it actually used the parameters that I provide...

Quote:
bbsplit.sh minhits=2 maxindel=200000 t=10 ambiguous=best ambiguous2=best path=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ build=2 in=sample.fastq.gz out_human=sample_human.fastq out_mouse=sample_mouse.fastq scafstats=scafstats.txt refstats=refstats.txt

Thanks for your help,
Ruben
RubenD is offline   Reply With Quote
Old 03-05-2017, 06:57 PM   #31
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Ruben,

When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.
Brian Bushnell is offline   Reply With Quote
Old 03-05-2017, 07:32 PM   #32
RubenD
Junior Member
 
Location: Boston

Join Date: Mar 2017
Posts: 3
Default

Quote:
Originally Posted by Brian Bushnell View Post
Hi Ruben,

When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.
Perfect! Thanks a lot Brian.
RubenD is offline   Reply With Quote
Old 07-16-2017, 05:55 AM   #33
xc611
Junior Member
 
Location: us

Join Date: Mar 2011
Posts: 2
Default

Hi, Brian:

I have run bbsplit to number of samples for binning the rRNA reads. Most of them (total of 172 samples) run fine but some of them gave me strange java errors. I rerun couple of times and every time different sample had this error, so I know its not the fastq file error.

Any suggestions?

Thanks

Jack
Exception in thread "main" java.lang.RuntimeException: java.io.EOFException: Unexpected end of ZLIB input stream
at fileIO.ReadWrite.readObject(ReadWrite.java:788)
at fileIO.ReadWrite.read(ReadWrite.java:1154)
at dna.ChromosomeArray.read(ChromosomeArray.java:65)
at align2.ChromLoadThread.loadAll(ChromLoadThread.java:50)
at dna.Data.loadChromosomes(Data.java:272)
at align2.BBMap.loadIndex(BBMap.java:354)
at align2.BBMap.main(BBMap.java:33)
at align2.BBSplitter.main(BBSplitter.java:49)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2313)
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2326)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2797)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
at fileIO.ReadWrite.readObject(ReadWrite.java:783)
... 7 more
xc611 is offline   Reply With Quote
Old 07-17-2017, 08:52 AM   #34
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Jack,

This means that you were running multiple different indexing processes in the same directory at the same time. Unless you use a different directory for each process, or specify a different index location with "path=", or specify a different build number, the indexes can overwrite each other leading to corrupt zip files (which, fortunately, normally get detected, as in this case).

If you want to do all of these mapping operations to the same references, just index once, wait for it to finish, and then run all the mapping operations without specifying "ref=". E.g.

Code:
bbsplit.sh ref=ecoli.fa,salmonella.fa

(wait for finish)

bbsplit.sh a.fq basename=out_a_%.fq
bbsplit.sh b.fq basename=out_b_%.fq
bbsplit.sh c.fq basename=out_c_%.fq
...etc
If each one needs different references, then either run them serially, or use a different directory/build each time.
Brian Bushnell is offline   Reply With Quote
Old 10-02-2017, 03:01 AM   #35
wlokhorst
Junior Member
 
Location: The Netherlands

Join Date: Oct 2017
Posts: 5
Default

Hello Brian,

Your tool is great and on first sight it seems to work perfectly, but after running several batches of samples I've noticed some quirky behaviour.

To investigate this, I've run the BBSplit tool several times on the same sample (and deleting the index every time) and it doesn't give me the same results every time. I've pasted the summarised results below and, as you can see, they differ with every run. This certainly isn't what I expected. Mapping the same reference sequences to a sample multiple times should always give me the same results, so something is wrong here.

The numbers are the the number of reads mapped to a certain reference sequence.
Code:
		run1	run2	run3	run4	run5	run6	run7	run8	run9	run10	
ref_seq_1	4	8	8	8	4	8	8	8	4	8
ref_seq_2	4	4	4	4	4	4	4	4	4	4
ref_seq_3	8	4	12	40	16	4	4	4	16	4
ref_seq_4	24	24	8	4	8	4	24	24	8	4
ref_seq_5	4	8	12	4	4	40	16	12	4	12
ref_seq_6	12	8	8	8	4	16	4	32	4	4
ref_seq_7	4	4	4	4	56	4	56	4	56	4
ref_seq_8	16	4	4	4	16	4	16	4	16	4
ref_seq_9	4	4	4	4	4	4	4	4	4	4
ref_seq_10	20	20	20	20	20	20	20	20	20	20
Hopefully you're able to tell me what causes this and (even better) solve it easily.

Kind regards,

Wouter Lokhorst

Last edited by GenoMax; 10-02-2017 at 06:05 AM.
wlokhorst is offline   Reply With Quote
Old 10-02-2017, 04:03 AM   #36
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,490
Default

@Wouter: @Brian would be along with an official answer but this may be because most NGS aligners are non-determinstic by default i.e. they may not produce exactly the same answer for each run.

bbmap.sh provides a "determinisitc" flag that you can set to "true" to get exactly identical results each time. I don't see bbsplit listed as having that flag but it may support it since it is based on bbmap. @Brian will confirm.

You would also need to consider how similar your references are and how you are choosing to handle multi-mappers (reads which map across genomes). Are you leaving the ambiguous2= option default. It may help to post the entire bbsplit command you are using for this analysis.
GenoMax is offline   Reply With Quote
Old 10-02-2017, 04:22 AM   #37
wlokhorst
Junior Member
 
Location: The Netherlands

Join Date: Oct 2017
Posts: 5
Default

@GenoMax: I did not consider the non-deterministic behaviour of mappers, however BBMap should still be deterministic in my case, since I have single-end reads. See copied text from the BBMap help section below. I could still try to explicitly call this though.

Quote:
Run in deterministic mode. In this case it is good to set averagepairdist. BBMap is deterministic without this flag if using single-ended reads, or run singlethreaded.
My references are very similar in some cases (e.g. different strains of the same species) and I'm leaving the ambiguous parameters untouched.

Commands:
bash ~/bbmap/bbsplit.sh ref={}
# ref= a folder containing my reference sequences
bash ~/bbmap/bbsplit.sh in={} basename={}/out_%.fq
# in= sample, basename= output folder and output filename
wlokhorst is offline   Reply With Quote
Old 10-02-2017, 06:00 AM   #38
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,490
Default

You could try changing the ambiguous2= parameter and see how that changes the result. "toss/all" may generate results that look more similar. With very similar genomes you are going to have trouble assigning reads reproducibly. How long are these reads?
GenoMax is offline   Reply With Quote
Old 10-02-2017, 08:34 AM   #39
wlokhorst
Junior Member
 
Location: The Netherlands

Join Date: Oct 2017
Posts: 5
Default

Neither deterministic=t nor ambiguous2=all helped change the results. Unless you or Brian know a way of solving this, I'm sorry to say that I'll try using another tool (or attempt writing something of my own).

Reads are 100bp.
wlokhorst is offline   Reply With Quote
Old 10-02-2017, 03:26 PM   #40
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

That's odd, if you are not intentionally producing random assignments for sequences. Can you please post your full command line? Also, are the reads paired or unpaired?
Brian Bushnell is offline   Reply With Quote
Reply

Tags
aligners, bbsplit, binning, contaminant, metagenome

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:56 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO