Thread | Thread Starter | Forum | Replies | Last Post |
Bowtie, an ultrafast, memory-efficient, open source short read aligner | Ben Langmead | Bioinformatics | 513 | 05-14-2015 02:29 PM |
Introducing BBMap, a new short-read aligner for DNA and RNA | Brian Bushnell | Bioinformatics | 24 | 07-07-2014 09:37 AM |
Miso's open source | joyce kang | Bioinformatics | 1 | 01-25-2012 06:25 AM |
Targeted resequencing - open source | stanford_genome_tech | Genomic Resequencing | 3 | 09-27-2011 03:27 PM |
EKOPath 4 going open source | dnusol | Bioinformatics | 0 | 06-15-2011 01:10 AM |
#401 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]()
Normally... I see ambig rates of 3% or lower in diploids like human, and 0.1% or lower in haploid bacteria. I have little experience in mapping RNA to plant genomes, aside from feedback from co-workers. And they mainly map to plant transcriptomes rather than genomes.
Can you describe your mapping protocol in more detail? For example, are you concatenating all genomes and mapping to all simultaneously, or are you experiencing a high claimed ambig rate when mapping to a single reference alone? |
#402 |
Member
Location: New York Join Date: Dec 2016
Posts: 22
|
![]()
For mapping the diploid maize data, I concatenated all the chromosomes for a given species (e.g. 10 chr in maize) and used that as the mapping reference. I then mapped the BBDuk-trimmed reads to this reference (genome, not transcriptome) using the default settings of BBMap.
For the maize data generated from B73 (B73 is the sequenced reference genome), the ambig rate ranges from 3% to 15%. The variation occurs between biological replicates, not within a pair. For example, the ambig rates of Read 1 and Read 2 of a replicate are close to each other, but the ambig rates of different biological replicates can vary widely (12%, 3%, 3%; 11%, 11%, 3%). The variation between replicates almost made me feel it was due to a technical issue with the libraries and/or the biological nature of maize. I could try mapping to the transcriptome to see if there's any improvement. Other than that, I wasn't sure how to track it down, or whether this should be a concern.
For a different set of maize data with mixed background, for now I still used the same B73 reference genome for mapping. The ambig rates went up to 6% - 46%. The increase was expected given the SNPs/indels present between different cultivars. I also observed variation between biological replicates similar to that described above.
On a related note, I also have diploid Arabidopsis data, which I mapped to its own genome (TAIR9 genome fasta). I added -maxindel=2000 to accommodate the compact size of the genome. The ambig rate on average was lower (mostly below 4%). I did not see such radical variation between replicates either. This is consistent with my guess that the ambig rates in maize are partly due to the nature of maize.
Please let me know if any further details would help you diagnose this. Thank you as always for your input. |
#403 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]()
It would be useful to know what the ambiguous reads are hitting. It's likely that it's something with many copies, such as ribosomal elements. Ribo "contamination" is common in libraries even when some kind of ribo-depletion is used. You can catch the ambiguous reads with a second mapping pass using "ambig=toss outu=unmapped.fq" if you start with just the mapped reads. Then, you can BLAST them, or map them again and look at an annotated version of the reference to see what they're hitting. But it's likely ribosomal.
|
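Once the ambiguous reads have been mapped again, one way to see what they are hitting is to tally the reference names in the resulting SAM file. The following is a minimal Python sketch, assuming a plain-text SAM; the function name `count_hits_per_reference` and the example records are hypothetical and not part of BBMap.

```python
from collections import Counter

def count_hits_per_reference(sam_lines):
    """Tally mapped reads per reference name (SAM column 3, RNAME)."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        rname = fields[2]
        # Flag bit 0x4 means the read is unmapped; '*' means no reference.
        if flag & 0x4 or rname == "*":
            continue
        counts[rname] += 1
    return counts

def _record(qname, flag, rname):
    """Build a made-up SAM record, purely for illustration."""
    return "\t".join([qname, str(flag), rname, "10", "3", "4M",
                      "*", "0", "0", "ACGT", "IIII"])

example_sam = [
    "@HD\tVN:1.6",
    _record("r1", 0, "rRNA_25S"),
    _record("r2", 16, "rRNA_25S"),
    _record("r3", 0, "chr1"),
    _record("r4", 4, "*"),
]

for name, n in count_hits_per_reference(example_sam).most_common():
    print(name, n)
```

Sorting the tally by count puts the most frequently hit sequences first, which can then be checked against an annotation to see whether they are ribosomal.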
#404 |
Member
Location: East Coast Join Date: Jul 2016
Posts: 37
|
![]()
Hi Brian, Geno,
I'm wondering if it's possible to save the reads that BBMap uses in its assembly, for subsequent use with Tadpole? I have some junk reads that I think might be interfering with de novo assembly. I have ample coverage, and won't miss the crappy reads. I thought a clever way of eliminating them would be to restrict the reads used during de novo assembly to those that had previously mapped to the reference with BBMap.
I'm getting an error at the moment when I try to use Tadpole with the reads used by BBMap; it says it cannot take a mixture of paired and unpaired reads as input (working in Geneious). Do you think what I'm trying to do is possible?
I've already quality trimmed to Q20, and have confirmed that the junk reads are indeed high quality (>Q35). They contain internal, repetitive strings of a single nucleotide. Not sure where they're coming from, but they're such a nuisance. Thanks for any help.
P.S. Any idea where strings of a single nucleotide might be originating? I'm using 2-color chemistry on a MiniSeq, but the strings can be any nucleotide, not just G. Samples are prepped with Nextera. My samples are PCR amplicons; however, if I Sanger sequence them, I don't get these single-nucleotide strings. |
#405 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]()
I'm not really sure what the problem is in this case. Tadpole really doesn't care what the input reads look like, whether they are paired, or what format they are in. Can you post the complete error message?
What you are planning to do should work fine. On the command line, it would be something like this:
Code:
bbmap.sh ref=ref.fa in=reads.fq outm=mapped.fq outu=junk.fq
tadpole.sh in=mapped.fq out=contigs.fa k=62
Code:
bbduk.sh in=reads.fq out=filtered.fq entropy=0.01 |
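For intuition about what an entropy threshold measures, here is a rough Python sketch of Shannon entropy over the k-mers in a sliding window, normalized to the range 0 to 1. It is not BBDuk's actual implementation. The window of 50 and k of 5 are taken from the follow-up question in this thread, and the function names, the normalization, and the choice to take the minimum over windows are all assumptions made for illustration.

```python
import math
from collections import Counter

def window_entropy(seq, k=5):
    """Normalized Shannon entropy (0..1) of the k-mers in seq."""
    n = len(seq) - k + 1  # number of k-mers in the sequence
    if n <= 1:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n))
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Entropy is highest when every k-mer is distinct, capped by the
    # number of possible DNA k-mers (4**k).
    h_max = math.log2(min(n, 4 ** k))
    return h / h_max

def min_entropy(read, window=50, k=5):
    """Lowest windowed entropy found anywhere in the read."""
    if len(read) <= window:
        return window_entropy(read, k)
    return min(window_entropy(read[i:i + window], k)
               for i in range(len(read) - window + 1))

def is_low_complexity(read, threshold=0.01, window=50, k=5):
    """True if some window of the read falls below the threshold."""
    return min_entropy(read, window, k) < threshold
```

Under this sketch, a window made entirely of one nucleotide contains a single distinct k-mer and scores 0, so a very low threshold such as 0.01 flags only windows that are almost pure homopolymer. How BBDuk combines windows and which defaults it uses should be checked against its own documentation.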
#406 | |
Member
Location: East Coast Join Date: Jul 2016
Posts: 37
|
![]() Quote:
That's unfortunately about as much as the error message says: that Tadpole cannot use a mixture of paired and unpaired reads. Might the read-name format be throwing it off? For instance, my reads are named in the following format after BBMap (in Geneious):
MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/2
MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/1
I didn't know about that feature with BBDuk! Will an entropy of 0.01 remove any string of a mononucleotide? Or, how many must be present in a string to flag it? Is this with a window size of 50 and kmer size of 5?
Thanks,
Jake |
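One way to check whether a read set really does mix paired and unpaired reads is to group the read names by their /1 and /2 suffixes and look for names missing a mate. The sketch below is hypothetical helper code that assumes the naming format shown above; it is not how Tadpole or Geneious decide pairing.

```python
from collections import defaultdict

def split_mate_suffix(name):
    """Return (base, mate) for names ending in /1 or /2, else (name, None)."""
    if len(name) > 2 and name[-2] == "/" and name[-1] in ("1", "2"):
        return name[:-2], name[-1]
    return name, None

def find_unpaired(names):
    """Return the sorted base names that lack a complete /1 + /2 pair."""
    mates = defaultdict(set)
    for name in names:
        base, mate = split_mate_suffix(name)
        mates[base].add(mate)
    return sorted(base for base, seen in mates.items()
                  if seen != {"1", "2"})

names = [
    "MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/2",
    "MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/1",
    "orphan_read/1",  # made-up name whose mate is missing
]
print(find_unpaired(names))
```

An empty result would suggest that every read has its mate, in which case the error more likely comes from how the reads are being passed along than from the reads themselves.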
|
#407 | |||
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]() Quote:
Quote:
Quote:
|
|||
#408 | |
Member
Location: East Coast Join Date: Jul 2016
Posts: 37
|
![]() Quote:
***Update. Brian, I contacted Geneious and they seem to be aware of the problem. They gave me a macro/workflow that extracts the reads from the BBMap'd contig file, and now they are feeding into Tadpole without a problem. Thanks for your help on this, you're getting all the gold stars!
Last edited by JVGen; 01-06-2017 at 07:51 AM. |
|
#409 |
Member
Location: Sweden Join Date: Jan 2017
Posts: 18
|
![]()
How does BBMap make use of an index on disk? I'm on a shared cluster system and I'm essentially wondering if BBMap performs a nice single pass of the index to read it into memory, or if it performs a lot of random access to the index on disk?
If it's the latter, I'll just copy it to node-local disks, so no worries. Just interested in how it works. |
#410 | |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 6,695
|
![]() Quote:
|
|
#411 | |
Member
Location: Sweden Join Date: Jan 2017
Posts: 18
|
![]() Quote:
![]() |
|
#412 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]()
That's correct. If the index is present on disk, it gets loaded at startup; no random-access is ever performed. So, the only difference is a faster startup when the index is already on disk.
|
#413 |
Member
Location: Sweden Join Date: Jan 2017
Posts: 18
|
![]() |
#414 |
Junior Member
Location: Russia Join Date: Mar 2015
Posts: 4
|
![]()
Thank you for developing BBMap. Could you please advise how to preprocess the output VCF to make it compatible with VCFAnnotator and SNPEff? I receive "No gene feature" and "Chromosome is missing" errors on annotation. I also tried Pilon VCFs, and the errors persist.
|
#415 |
Junior Member
Location: here Join Date: Feb 2017
Posts: 4
|
![]()
Hi Brian,
I have a discontinuous reference genome in a single fasta file (6000 coding sequences) and I would like to use BBMap to align my paired reads to the coding sequences only. I want to know how many reads align to each ORF. I have now realised that the order of the sequences in the reference affects the alignment, because BBMap seems to ignore the ORF borders in the reference. Do you have a suggestion for how to constrain the alignment within each ORF? Thank you! |
#416 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 6,695
|
![]()
@YeastGuy: Are those sequences in multi-fasta format?
|
#417 |
Junior Member
Location: here Join Date: Feb 2017
Posts: 4
|
![]()
@GenoMax: Yes, they are.
|
#418 | |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]() Quote:
reformat.sh in=ref.fa out=fixed.fa trd
...then use fixed.fa for mapping and everything else. |
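As far as I understand it, the `trd` flag (short for trimreaddescription) cuts each sequence name at the first whitespace. The Python sketch below is a rough equivalent, offered only to illustrate the effect on FASTA headers; the function name is hypothetical and the behavior is an assumption about what the flag does, not a copy of reformat.sh.

```python
def trim_fasta_headers(lines):
    """Yield FASTA lines with each header cut at the first whitespace."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the first whitespace-delimited token of the name.
            parts = line[1:].split(None, 1)
            yield ">" + (parts[0] if parts else "")
        else:
            yield line

# Made-up records purely for illustration.
example_fasta = [
    ">seq1 some long description\n",
    "ACGTACGT\n",
    ">seq2\n",
    "GGCC\n",
]

for out_line in trim_fasta_headers(example_fasta):
    print(out_line)
```

Shortened names of this kind stay identical between the reference and any downstream output that keeps only the first token of a header, which is presumably why the fixed file is recommended for everything that follows.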
|
#419 | |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]() Quote:
|
|
#420 | |
Junior Member
Location: here Join Date: Feb 2017
Posts: 4
|
![]() Quote:
>lcl||1|YAL001C|TFC3|1|145
TAGTTACTATGGTCGTTAACGAAATAATATTTCATCCAGGGA
>lcl||2|YAL002W|VPS8|1|145
CTGGTCTGGACCCATTACTTTTTCTAGCTTGGGAAAATGTACAG
But after randomising the order within the ref file, the coverage changed. |
|
Tags |
bbmap, metagenomics, rna-seq aligners, short read alignment |