SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 02:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 06:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 03:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 01:10 AM

Reply
 
Thread Tools
Old 12-16-2016, 07:44 PM   #401
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,491
Default

Normally... I see ambig rates of 3% or lower in diploids like human, and 0.1% or lower in haploid bacteria. I have little experience in mapping RNA to plant genomes, aside from feedback from co-workers. And they mainly map to plant transcriptomes rather than genomes.

Can you describe your mapping protocol in more detail? For example, are you concatenating all genomes and mapping to all simultaneously, or are you experiencing a high claimed ambig rate when mapping to a single reference alone?
Brian Bushnell is offline   Reply With Quote
Old 12-17-2016, 07:41 AM   #402
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 16
Default

For maaping diploid maize data, I concatenated all the chromosomes for a given species (e.g. 10 chr in maize) and used it as the mapping reference. I then mapped the BBDuk-trimmed reads to this reference (genome, not transcriptome) using the default setting of BBMap.
For the maize data generated from B73 (B73 is the sequeced ref genome), the ambig rate rages from 3% to 15%. I saw the variation occurred between biologica rep, not within a pair. For example, the ambig rates of Read 1 and Read 2 of a replicate are close to each other, but the ambig rates of different biologica replica could be various, (12%, 3%, 3%; 11%, 11%, 3%). The variatoin between replica almost made me feel like it was due to the technical issue of the libraries and/or biological nature of maize. I could try to map it to the transcriptome and see if there's any improvement. Other than that I wasn't sure how to trace down, and/or if this should be a concern.
For a different set of maize data with mixed background, as for now I still used the same B73 base reference genome for mapping. The ambig ranges went up to 6% - 46%. The increase was expected given the SNPs/indels present between different cultivars. I also observed variation between biological replicates similar to described above.
On a related note, I also have diploid Arabidopsis data which I also mapped to its own genome (TAIR9 genome fasta). I added a -maxindel=2000 to accomodate the compact size of the genome. The ambig rate on average was lower (mostly below 4%). I did not see that radical variation between replicates either. This is consistent with my guess about the ambig rates in maize was partly due to the nature of maize.
Please let me know if any furtehr details would be helpful for you to diagnose. Thank you as always for your input.
chiayi is offline   Reply With Quote
Old 12-17-2016, 04:21 PM   #403
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,491
Default

It would be useful to know what the ambiguous reads are hitting. It's likely that it's something with many copies, such as ribosomal elements. Ribo "contamination" is common in libraries even when some kind of ribo-depletion is used. You can catch the ambiguous reads with a second mapping pass using "ambig=toss outu=unmapped.fq" if you start with just the mapped reads. Then, you can BLAST them, or map them again and look at an annotated version of the reference to see what they're hitting. But it's likely ribosomal.
Brian Bushnell is offline   Reply With Quote
Old 01-05-2017, 12:05 PM   #404
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 36
Default Save Reads for Tadpole

Hi Brian, Geno,

I'm wondering if it's possible to save the reads that BBMap uses in its assembly, for subsequent use with Tadpole?

I have some junk reads that I think might be interfering with de novo assembly. I have ample coverage, and won't miss the crappy reads. I thought a clever way of eliminating them would be to restrict the reads used during de novo assembly to those that had previously mapped to the reference with BBMap. I'm getting an error at the moment when I try to use Tadpole with the reads used by BBMap that says it cannot take a mixture of paired and unpaired reads as input (working in Geneious). Do you think what I'm trying to do is possible?

I've already quality trimmed to Q20, and have confirmed that the junk reads are indeed high quality (>Q35). They are internal, repetitive strings of a single nucleotide. Not sure where they're coming from, but such a nuisance.

Thanks for any help.

P.S. Any idea where strings of a single nucleotide might be originating? I'm using 2-color chemistry on a MiniSeq, but the strings can be any nucleotide, not just G. Samples are prep'd with Nextera. My samples are PCR amplicons, however, if I Sanger sequence, I don't get these strings PolyN's
JVGen is offline   Reply With Quote
Old 01-05-2017, 12:57 PM   #405
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,491
Default

I'm not really sure what the problem is in this case. Tadpole really doesn't care what the input reads look like, whether they are paired, or what format they are in. Can you post the complete error message?

What you are planning to do should work fine. On the command line, it would be something like this:

Code:
bbmap.sh ref=ref.fa in=reads.fq outm=mapped.fq outu=junk.fq
tadpole.sh in=mapped.fq out=contigs.fa k=62
I'm not sure how that would translate to Geneious. Note, by the way, that you can remove homopolymer reads with BBDuk like this:

Code:
bbduk.sh in=reads.fq out=filtered.fq entropy=0.01
That will remove polymonomer reads only (AAAA...). At entropy=0.19 it will also remove polydimer reads (ACACAC...). At entropy=0.29 it will remove polytrimers (ACTACT...), and at 0.34 it will remove polytetramers. E.coli reads seem to have an average entropy of around 0.985 by my calculation method.
Brian Bushnell is offline   Reply With Quote
Old 01-05-2017, 01:13 PM   #406
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 36
Default

Quote:
Originally Posted by Brian Bushnell View Post
I'm not really sure what the problem is in this case. Tadpole really doesn't care what the input reads look like, whether they are paired, or what format they are in. Can you post the complete error message?

What you are planning to do should work fine. On the command line, it would be something like this:

Code:
bbmap.sh ref=ref.fa in=reads.fq outm=mapped.fq outu=junk.fq
tadpole.sh in=mapped.fq out=contigs.fa k=62
I'm not sure how that would translate to Geneious. Note, by the way, that you can remove homopolymer reads with BBDuk like this:

Code:
bbduk.sh in=reads.fq out=filtered.fq entropy=0.01
That will remove polymonomer reads only (AAAA...). At entropy=0.19 it will also remove polydimer reads (ACACAC...). At entropy=0.29 it will remove polytrimers (ACTACT...), and at 0.34 it will remove polytetramers. E.coli reads seem to have an average entropy of around 0.985 by my calculation method.
Hi Brian,

That's unfortunately about as much as the error messages says - that Tadpole cannot use a mixture of paired and unpaired reads. It might be the read-name format that is throwing it off? For instance, my reads are named in the following format after BBMap (In Geneious):

MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/2
MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/1

I didn't know about that feature with BBDuk! Will entropy of 0.01 remove any string of a mononucleotide? Or, how many must be present in a string to flag it? Is this with a window size of 50 and kmer size of 5?

Thanks,
Jake
JVGen is offline   Reply With Quote
Old 01-05-2017, 01:42 PM   #407
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,491
Default

Quote:
Originally Posted by JVGen View Post
That's unfortunately about as much as the error messages says - that Tadpole cannot use a mixture of paired and unpaired reads.
How many files do you have after BBMap, and what are they named?

Quote:
It might be the read-name format that is throwing it off? For instance, my reads are named in the following format after BBMap (In Geneious):

MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/2
MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/1
I doubt it - BBTools should be able to handle reads named like that.

Quote:
I didn't know about that feature with BBDuk! Will entropy of 0.01 remove any string of a mononucleotide? Or, how many must be present in a string to flag it? Is this with a window size of 50 and kmer size of 5?
For the default window=50 entropyk=5, reads must be at least 50bp long to be processed by the entropy filter (you can reduce that by making the window smaller). And entropy=0.01 will remove any sequence that is a singly mononucleotide, as long as it's at least 50bp long. Note that if there are some errors so that it is no longer a pure mononucleotide you'd need a higher value for entropy. Something like "AAAAAAAAAAGGGGGGGGGGGGGGGG" would also need a higher value (50 A's and 50 G's appears to need entropy=0.21). Don't set it too high, though, or you'll lose the low complexity parts of your genome.
Brian Bushnell is offline   Reply With Quote
Old 01-06-2017, 06:28 AM   #408
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 36
Default

Quote:
Originally Posted by Brian Bushnell View Post
How many files do you have after BBMap, and what are they named?

The file name is: "1-JL08-P1-A3 assembled to HXB2 Nested Amplified Region extraction". Within the file, there are thousands of reads with the naming convention that I shared in the previous post.


I doubt it - BBTools should be able to handle reads named like that.



For the default window=50 entropyk=5, reads must be at least 50bp long to be processed by the entropy filter (you can reduce that by making the window smaller). And entropy=0.01 will remove any sequence that is a singly mononucleotide, as long as it's at least 50bp long. Note that if there are some errors so that it is no longer a pure mononucleotide you'd need a higher value for entropy. Something like "AAAAAAAAAAGGGGGGGGGGGGGGGG" would also need a higher value (50 A's and 50 G's appears to need entropy=0.21). Don't set it too high, though, or you'll lose the low complexity parts of your genome.
I think I might try something like window = 15 and entropy 0.01. That should get rid of the mononucleotide strings. The HIV genome generally doesn't have anything like that, so it should work.


***Update. Brian, I contacted Geneious and they seem to be aware of the problem. They gave me a macro/workflow that extracts the reads from the BBMap'd contig file, and now they are feeding into Tadpole without a problem. Thanks for your help on this, you're getting all the gold stars!

Last edited by JVGen; 01-06-2017 at 07:51 AM.
JVGen is offline   Reply With Quote
Old 01-25-2017, 04:21 AM   #409
boulund
Junior Member
 
Location: Sweden

Join Date: Jan 2017
Posts: 5
Default

How does BBMap make use of an index on disk? I'm on a shared cluster system and I'm essentially wondering if BBMap performs a nice single pass of the index to read it into memory, or if it performs a lot of random access to the index on disk?

If it's the latter, I'll just copy it to node-local disks, so no worries. Just interested in how it works.
boulund is offline   Reply With Quote
Old 01-25-2017, 05:08 AM   #410
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,244
Default

Quote:
Originally Posted by boulund View Post
How does BBMap make use of an index on disk? I'm on a shared cluster system and I'm essentially wondering if BBMap performs a nice single pass of the index to read it into memory, or if it performs a lot of random access to the index on disk?

If it's the latter, I'll just copy it to node-local disks, so no worries. Just interested in how it works.
@Brian will confirm later but I think BBMap loads pre-made indexes on disk in memory if you provide path= option. Note: "nodisk" option builds indexes in memory from fasta files but am not sure if it can (or needs to) be used with path= option.
GenoMax is offline   Reply With Quote
Old 01-25-2017, 05:45 AM   #411
boulund
Junior Member
 
Location: Sweden

Join Date: Jan 2017
Posts: 5
Default

Quote:
Originally Posted by GenoMax View Post
@Brian will confirm later but I think BBMap loads pre-made indexes on disk in memory if you provide path= option. Note: "nodisk" option builds indexes in memory from fasta files but am not sure if it can (or needs to) be used with path= option.
I think so too, it feels natural considering the memory usage profile of BBMap .
boulund is offline   Reply With Quote
Old 01-25-2017, 09:06 AM   #412
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,491
Default

Quote:
Originally Posted by GenoMax View Post
@Brian will confirm later but I think BBMap loads pre-made indexes on disk in memory if you provide path= option. Note: "nodisk" option builds indexes in memory from fasta files but am not sure if it can (or needs to) be used with path= option.
That's correct. If the index is present on disk, it gets loaded at startup; no random-access is ever performed. So, the only difference is a faster startup when the index is already on disk.
Brian Bushnell is offline   Reply With Quote
Old 01-26-2017, 01:21 AM   #413
boulund
Junior Member
 
Location: Sweden

Join Date: Jan 2017
Posts: 5
Default

Quote:
Originally Posted by Brian Bushnell View Post
That's correct. If the index is present on disk, it gets loaded at startup; no random-access is ever performed. So, the only difference is a faster startup when the index is already on disk.
Perfect, thank you!
boulund is offline   Reply With Quote
Old 02-06-2017, 11:15 PM   #414
Mat29
Junior Member
 
Location: Russia

Join Date: Mar 2015
Posts: 4
Default

Thank you for developing BBMap. Could you please advise how to preprocess the output VCF to make it compatible with VCFAnnotator, SNPEff. I recieve "No gene feature" and "Chromososme is missing" errors on annotation. Tried Pilon also VCFs and errors persist.
Mat29 is offline   Reply With Quote
Old 02-07-2017, 07:13 AM   #415
YeastGuy
Junior Member
 
Location: here

Join Date: Feb 2017
Posts: 4
Default

Hi Brian,

I have a discontinuous reference genome in a single fasta file (6000 coding sequences) and I would like to use bbmap to align my paired reads to the coding sequence only. I want to know how many reads align to each ORF.

Now I realised that the order of the reference genome affects the alignment because bbmap seems to ignore the ORF borderes in the reference.
Do you have a suggestion how to constrain the alignment within each ORF?

Thank you!
YeastGuy is offline   Reply With Quote
Old 02-07-2017, 07:52 AM   #416
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,244
Default

@YeastGuy: Are those sequences in multi-fasta format?
GenoMax is offline   Reply With Quote
Old 02-07-2017, 12:18 PM   #417
YeastGuy
Junior Member
 
Location: here

Join Date: Feb 2017
Posts: 4
Default

@GenoMax: Yes, they are.
YeastGuy is offline   Reply With Quote
Old 02-07-2017, 12:19 PM   #418
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,491
Default

Quote:
Originally Posted by Mat29 View Post
Thank you for developing BBMap. Could you please advise how to preprocess the output VCF to make it compatible with VCFAnnotator, SNPEff. I recieve "No gene feature" and "Chromososme is missing" errors on annotation. Tried Pilon also VCFs and errors persist.
Please ensure that your chromosome names do not have spaces in them. If they do, use the "trd" flag when mapping (trim read description), or for simplicity, just process the fasta initially like this:

reformat.sh in=ref.fa out=fixed.fa trd


...then use fixed.fa for mapping and everything else.
Brian Bushnell is offline   Reply With Quote
Old 02-07-2017, 12:23 PM   #419
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,491
Default

Quote:
Originally Posted by YeastGuy View Post
Hi Brian,

I have a discontinuous reference genome in a single fasta file (6000 coding sequences) and I would like to use bbmap to align my paired reads to the coding sequence only. I want to know how many reads align to each ORF.

Now I realised that the order of the reference genome affects the alignment because bbmap seems to ignore the ORF borderes in the reference.
Do you have a suggestion how to constrain the alignment within each ORF?

Thank you!
How are the ORFs indicated? Do you have them in a separate file (gff/gtf/bed)? BBMap doesn't use any of those, though you could probably use a different tool to mask the reference with them. Generally, I don't recommend constraining alignment to a specific subset of the genome, which incurs confirmation bias.
Brian Bushnell is offline   Reply With Quote
Old 02-07-2017, 12:39 PM   #420
YeastGuy
Junior Member
 
Location: here

Join Date: Feb 2017
Posts: 4
Default

Quote:
Originally Posted by Brian Bushnell View Post
How are the ORFs indicated? Do you have them in a separate file (gff/gtf/bed)? BBMap doesn't use any of those, though you could probably use a different tool to mask the reference with them. Generally, I don't recommend constraining alignment to a specific subset of the genome, which incurs confirmation bias.
I tried to simply use the multi-fasta format with the hope that the sequences would be seen as seperate and not connected:
>lcl||1|YAL001C|TFC3|1|145
TAGTTACTATGGTCGTTAACGAAATAATATTTCATCCAGGGA
>lcl||2|YAL002W|VPS8|1|145
CTGGTCTGGACCCATTACTTTTTCTAGCTTGGGAAAATGTACAG

But after randomising the order withing the ref file the coverage changed.
YeastGuy is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:24 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO