SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM
tophat/cufflinks with different read lengths libraries linsson RNA Sequencing 0 07-03-2012 10:21 AM
Mate pairs contaminated with paired ends - impact on assembly? reithme Bioinformatics 2 12-13-2009 11:35 PM

Reply
 
Thread Tools
Old 03-13-2015, 02:44 PM   #21
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,384
Default

@Manuel: Are these 20000 sequences related (contigs)?

@Brian: Thanks for the additional explanation.
GenoMax is offline   Reply With Quote
Old 03-13-2015, 04:41 PM   #22
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,617
Default

Manuel,

I released a new version of BBTools, 34.65, in which you can now specify a directory for the "ref=" argument. If anything in the list is a directory, it will use all fasta files in that directory. They need a fasta extension, like .fa or .fasta, but can be compressed with an additional .gz after that. Have a nice weekend!
Brian Bushnell is offline   Reply With Quote
Old 03-16-2015, 02:09 PM   #23
manuelkleiner
Junior Member
 
Location: Calgary

Join Date: Jun 2014
Posts: 5
Default

Wow that was fast! Thank you Brian!
I will test it tomorrow and let you know how it works out for me.

@GenoMax: the 20,000 Mulit-fasta files are the product of a rather strict binning of multiple metagenomes. So some files contain multiple sequences, some only contain one. Concatenating would thus not be an option and the scaffold/chromosome statistics function would also not work to get the statistics per file.
manuelkleiner is offline   Reply With Quote
Old 09-08-2015, 11:36 PM   #24
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 281
Default

Hi Brian,

when using BBsplit (also latest version) with long reads (Moleculo data) the program shreds these long reads into 500 bp pieces. It seems to me that the distributing of the reads according to reference is not working in this case. For me about 0.5% of the genomic reads are mapping to mitochondria according to the BBsplit report, but the out-files contain about 99% of the reads. Perhaps I am misinterpreting anything?

Thanks!
luc is offline   Reply With Quote
Old 09-08-2015, 11:54 PM   #25
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,617
Default

Hi Luc,

It's quite possible that you are not misinterpreting anything, but the results are incorrect. Would you mind giving the exact parameters and version (I assume 35.43) you used? I will probably be able to replicate the bug without your specific data.

Thanks,
Brian

P.S. By the way, "mapPacBio.sh" will use approximately the same algorithm but only shred them into 6000bp shreds.

Last edited by Brian Bushnell; 09-08-2015 at 11:57 PM.
Brian Bushnell is offline   Reply With Quote
Old 09-09-2015, 11:45 PM   #26
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 281
Default

Hi Brian,

yes I am using version 35.43.

I ran for example:
bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta build=1 maxindel=100 minratio=0.8 refstats=mitorefstats.txt out=MolToMitoreads.fasta &

The results look like this:
------------------ Results ------------------

Genome: 1
Key Length: 13
Max Indel: 100
Minimum Score Ratio: 0.8
Mapping Mode: normal
Reads Used: 887008 (419139365 bases)

Mapping: 18.567 seconds.
Reads/sec: 47772.92
kBases/sec: 22574.22


Read 1 data: pct reads num reads pct bases num bases

mapped: 0.3272% 2902 0.3307% 1386018
unambiguous: 0.2006% 1779 0.2065% 865395
ambiguous: 0.1266% 1123 0.1242% 520623
low-Q discards: 0.0000% 0 0.0000% 0

perfect best site: 0.0046% 41 0.0024% 9939
semiperfect site: 0.0046% 41 0.0024% 9939

Match Rate: NA NA 94.2618% 1332227
Error Rate: 43.3616% 2861 5.7382% 81100
Sub Rate: 43.2707% 2855 2.4932% 35237
Del Rate: 29.7818% 1965 1.9322% 27309
Ins Rate: 28.8118% 1901 1.3128% 18554
N Rate: 0.0000% 0 0.0000% 0

The problems is that the out-file contains almost all of the Moleculo data - in 500bp pieces - instead of the expected 0.33 %.

Last edited by luc; 09-09-2015 at 11:48 PM.
luc is offline   Reply With Quote
Old 09-10-2015, 12:18 AM   #27
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,617
Default

Hi Luc!

The syntax for BBSplit is different from BBMap. It's like this:

bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta basename=out_%.fa outu=unmapped.fa

That will give you 3 output files, "out_mitochondrion1.fa", "out_mitochondrion2.fa", and "unmapped.fa".

Unfortunately, BBSplit by default breaks reads longer than 500bp into 500bp pieces. It's technically possible to change this to 6kbp pieces with the use of a couple extra parameters (specifically "msa=MultiStateAligner9PacBio fastareadlen=6000"), but I suggest you try Seal instead. Seal does not do alignment, just kmer-matching, but I tend to use it in this kind of situation as it is much faster and can handle reads of unlimited length without breaking them up. The default is a hamming distance of zero and considering something a match if even a single kmer is shared (and by default k=31).

Syntax:

seal.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta pattern=out_%.fa outu=unmatched.fa mkf=0.05 hdist=1

The last 2 parameters are optional, but specify a hamming distance of 1 (which is fine in the case of mitochondria, which are tiny) and require a read to share 5% of its kmers with a reference sequence to be considered matching. You can increase that to a substantially higher fraction (like 0.5), and eliminate the hamming distance, if you expect a very close match between the reads and the reference.

Last edited by Brian Bushnell; 09-10-2015 at 12:23 AM.
Brian Bushnell is offline   Reply With Quote
Old 03-05-2017, 02:51 PM   #28
RubenD
Junior Member
 
Location: Boston

Join Date: Mar 2017
Posts: 3
Default how to change default bbsplit parameters?

Hi all,

I would like to use bbsplit to separate my fastq files.

First I generated an index and now I am running a for loop for all my fastq's.

However when I run the command it seems that the default parameter values are not changed, see for example parameters minhits and maxindel in bold below. Which parameter will it take?

Am I doing something wrong here?


Quote:
Merged reference file /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz already exists; skipping merge.
Ref merge time: 0.028 seconds.
Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, minhits=2, maxindel=200000, t=10, ambiguous=best, ambiguous2=best, build=2, in=../Fastq_files/20161223_CDX10_CC3737_S7_R1_001.fastq.gz, out_human=20161223_CDX10_CC3737_S7_R1_001_human.fastq, out_mouse=20161223_CDX10_CC3737_S7_R1_001_mouse.fastq, scafstats=scafstats_20161223_CDX10_CC3737_S7_R1_001.txt, refstats=refstats_20161223_CDX10_CC3737_S7_R1_001.txt, ref=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz]

BBMap version 37.00
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
Set threads to 10
Scaffold statistics will be written to scafstats_20161223_CDX10_CC3737_S7_R1_001.txt
Reference set statistics will be written to refstats_20161223_CDX10_CC3737_S7_R1_001.txt
Retaining first best site only for ambiguous mappings.
NOTE: Ignoring reference file because it already appears to have been processed.
NOTE: If you wish to regenerate the index, please manually delete /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/summary.txt
Set genome to 2
RubenD is offline   Reply With Quote
Old 03-05-2017, 04:34 PM   #29
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,384
Default

@RubenD: Please provide the exact BBSplit command you are using. You need to provide more than one reference for BBSplit to work right.
GenoMax is offline   Reply With Quote
Old 03-05-2017, 04:47 PM   #30
RubenD
Junior Member
 
Location: Boston

Join Date: Mar 2017
Posts: 3
Default

@GenoMax: it seems to be running correctly and produces sensible results, however I'm not sure whether it uses the default or my custom parameter settings.

I have first generated an index for mouse and human:

Quote:
bbsplit.sh build=2 maxindel=200000 minhits=2 ambiguous=best ambiguous2=best \
ref_human=/gcdata/gcproj/Ruben/GENOMES/Human/Sequences/Ensembl/GRCh37/Release75/Genome/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
ref_mouse=/gcdata/gcproj/Ruben/GENOMES/Mouse/Sequences/NCBIM37.67/Mus_musculus.NCBIM37.67.dna.toplevel.fa


Then I split my fastq files, again this works but I'm not sure whether it actually used the parameters that I provide...

Quote:
bbsplit.sh minhits=2 maxindel=200000 t=10 ambiguous=best ambiguous2=best path=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ build=2 in=sample.fastq.gz out_human=sample_human.fastq out_mouse=sample_mouse.fastq scafstats=scafstats.txt refstats=refstats.txt

Thanks for your help,
Ruben
RubenD is offline   Reply With Quote
Old 03-05-2017, 06:57 PM   #31
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,617
Default

Hi Ruben,

When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.
Brian Bushnell is offline   Reply With Quote
Old 03-05-2017, 07:32 PM   #32
RubenD
Junior Member
 
Location: Boston

Join Date: Mar 2017
Posts: 3
Default

Quote:
Originally Posted by Brian Bushnell View Post
Hi Ruben,

When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.
Perfect! Thanks a lot Brian.
RubenD is offline   Reply With Quote
Old 07-16-2017, 05:55 AM   #33
xc611
Junior Member
 
Location: us

Join Date: Mar 2011
Posts: 2
Default

Hi, Brian:

I have run bbsplit to number of samples for binning the rRNA reads. Most of them (total of 172 samples) run fine but some of them gave me strange java errors. I rerun couple of times and every time different sample had this error, so I know its not the fastq file error.

Any suggestions?

Thanks

Jack
Exception in thread "main" java.lang.RuntimeException: java.io.EOFException: Unexpected end of ZLIB input stream
at fileIO.ReadWrite.readObject(ReadWrite.java:788)
at fileIO.ReadWrite.read(ReadWrite.java:1154)
at dna.ChromosomeArray.read(ChromosomeArray.java:65)
at align2.ChromLoadThread.loadAll(ChromLoadThread.java:50)
at dna.Data.loadChromosomes(Data.java:272)
at align2.BBMap.loadIndex(BBMap.java:354)
at align2.BBMap.main(BBMap.java:33)
at align2.BBSplitter.main(BBSplitter.java:49)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2313)
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2326)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2797)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
at fileIO.ReadWrite.readObject(ReadWrite.java:783)
... 7 more
xc611 is offline   Reply With Quote
Old 07-17-2017, 08:52 AM   #34
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,617
Default

Hi Jack,

This means that you were running multiple different indexing processes in the same directory at the same time. Unless you use a different directory for each process, or specify a different index location with "path=", or specify a different build number, the indexes can overwrite each other leading to corrupt zip files (which, fortunately, normally get detected, as in this case).

If you want to do all of these mapping operations to the same references, just index once, wait for it to finish, and then run all the mapping operations without specifying "ref=". E.g.

Code:
bbsplit.sh ref=ecoli.fa,salmonella.fa

(wait for finish)

bbsplit.sh a.fq basename=out_a_%.fq
bbsplit.sh b.fq basename=out_b_%.fq
bbsplit.sh c.fq basename=out_c_%.fq
...etc
If each one needs different references, then either run them serially, or use a different directory/build each time.
Brian Bushnell is offline   Reply With Quote
Reply

Tags
aligners, bbsplit, binning, contaminant, metagenome

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:33 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO