Seqanswers Leaderboard Ad

**skbrimer** · 12-14-2016, 10:55 AM

Thanks for the input guys! I will start playing with the updated BBmap as well!

**chiayi** · 12-14-2016, 10:11 PM

Hi Brian and other members,

I'm trying to map the Illumina reads generated from 19 inbred and hybrid mazie cultivars under two conditions for downstream differential expression analysis. Considering the SNPs and indels in the mixed genetic background, my original plan was to generate a 'corrected refernece' using tools like Quiver or Pilon. Then I came across to BBMap and was happy to find its capability to handle SNPs and indels. I used the default setting and the mapped rate so far was over 97%. Meanwhile, I did notice a huge variaion of ambiguous rate, ranging from 6% to >50%. Therefore, I was wondering if it is possible to evaluate which method (directly map to a single ref vs map to 19 SNP corrected ref) would be more accurate for the DE analysis.

The third option would be to re-assemble de-novo assembly for each genetic background. However, I wasn's sure if it's possible to come up with a consensus contig conrrespondance (contigX_background1, contigX_background2..contigX_background19) so that I could monitor each contig/gene of interest in multiple backgrounds.

Any input/thoughts you may have on this would be much appreciated. Thank you in advance.

**Brian Bushnell** · 12-16-2016, 06:30 PM

Maize is the kind of organism where de-novo assembly is extremely difficult. I recommend avoiding that; it will make things much more complicated. I don't think there's any reason to do that, either, unless you find that structural variations are causing problems.

BBMap is very tolerant of SNPs and indels so generally you don't need to do any kind of correction, but aligning to a SNP-corrected reference (assuming the SNPs are homozygous, or at least a majority) will be more accurate than aligning to the base reference.

I'm not really sure where the high ambig rate is coming from. Is this WGS, or are you doing some kind of enrichment, RNA-seq, etc? And are you mapping to the genome or transcriptome?

**chiayi** · 12-16-2016, 08:22 PM

BBMap is very tolerant of SNPs and indels so generally you don't need to do any kind of correction, but aligning to a SNP-corrected reference (assuming the SNPs are homozygous, or at least a majority) will be more accurate than aligning to the base reference.

I share your thoughts so I will at least try a few background to see how it goes. On the other hand, I'm not sure how to compare the accuracy between the results generated from base reference vs SNP corrected reference. Do you have any suggestion?

I'm not really sure where the high ambig rate is coming from. Is this WGS, or are you doing some kind of enrichment, RNA-seq, etc? And are you mapping to the genome or transcriptome?

The reads were generated by NEBNext Ultra RNA library prep kit for Illumina (no enrichment) and sequenced on NextSeq. I mapped the reads to the latest genome annotation (B73 RefGen_v4 reference genome). I should also mention that these are mixed background (hybrid of B73, reference genome, and other cultivar). For B73 alone, the ambig rate was lower at about ~10%.
Is there a 'standard' for ambiguous rate when you look at the stats from BBMap?

**Brian Bushnell** · 12-16-2016, 08:44 PM

Normally... I see ambig rates of 3% or lower in diploids like human, and 0.1% or lower in haploid bacteria. I have little experience in mapping RNA to plant genomes, aside from feedback from co-workers. And they mainly map to plant transcriptomes rather than genomes.

Can you describe your mapping protocol in more detail? For example, are you concatenating all genomes and mapping to all simultaneously, or are you experiencing a high claimed ambig rate when mapping to a single reference alone?

**chiayi** · 12-17-2016, 08:41 AM

For maaping diploid maize data, I concatenated all the chromosomes for a given species (e.g. 10 chr in maize) and used it as the mapping reference. I then mapped the BBDuk-trimmed reads to this reference (genome, not transcriptome) using the default setting of BBMap.
For the maize data generated from B73 (B73 is the sequeced ref genome), the ambig rate rages from 3% to 15%. I saw the variation occurred between biologica rep, not within a pair. For example, the ambig rates of Read 1 and Read 2 of a replicate are close to each other, but the ambig rates of different biologica replica could be various, (12%, 3%, 3%; 11%, 11%, 3%). The variatoin between replica almost made me feel like it was due to the technical issue of the libraries and/or biological nature of maize. I could try to map it to the transcriptome and see if there's any improvement. Other than that I wasn't sure how to trace down, and/or if this should be a concern.
For a different set of maize data with mixed background, as for now I still used the same B73 base reference genome for mapping. The ambig ranges went up to 6% - 46%. The increase was expected given the SNPs/indels present between different cultivars. I also observed variation between biological replicates similar to described above.
On a related note, I also have diploid Arabidopsis data which I also mapped to its own genome (TAIR9 genome fasta). I added a -maxindel=2000 to accomodate the compact size of the genome. The ambig rate on average was lower (mostly below 4%). I did not see that radical variation between replicates either. This is consistent with my guess about the ambig rates in maize was partly due to the nature of maize.
Please let me know if any furtehr details would be helpful for you to diagnose. Thank you as always for your input.

**Brian Bushnell** · 12-17-2016, 05:21 PM

It would be useful to know what the ambiguous reads are hitting. It's likely that it's something with many copies, such as ribosomal elements. Ribo "contamination" is common in libraries even when some kind of ribo-depletion is used. You can catch the ambiguous reads with a second mapping pass using "ambig=toss outu=unmapped.fq" if you start with just the mapped reads. Then, you can BLAST them, or map them again and look at an annotated version of the reference to see what they're hitting. But it's likely ribosomal.

**JVGen** · 01-05-2017, 01:05 PM

Save Reads for Tadpole

Hi Brian, Geno,

I'm wondering if it's possible to save the reads that BBMap uses in its assembly, for subsequent use with Tadpole?

I have some junk reads that I think might be interfering with de novo assembly. I have ample coverage, and won't miss the crappy reads. I thought a clever way of eliminating them would be to restrict the reads used during de novo assembly to those that had previously mapped to the reference with BBMap. I'm getting an error at the moment when I try to use Tadpole with the reads used by BBMap that says it cannot take a mixture of paired and unpaired reads as input (working in Geneious). Do you think what I'm trying to do is possible?

I've already quality trimmed to Q20, and have confirmed that the junk reads are indeed high quality (>Q35). They are internal, repetitive strings of a single nucleotide. Not sure where they're coming from, but such a nuisance.

Thanks for any help.

P.S. Any idea where strings of a single nucleotide might be originating? I'm using 2-color chemistry on a MiniSeq, but the strings can be any nucleotide, not just G. Samples are prep'd with Nextera. My samples are PCR amplicons, however, if I Sanger sequence, I don't get these strings PolyN's

**Brian Bushnell** · 01-05-2017, 01:57 PM

I'm not really sure what the problem is in this case. Tadpole really doesn't care what the input reads look like, whether they are paired, or what format they are in. Can you post the complete error message?

What you are planning to do should work fine. On the command line, it would be something like this:

Code:

bbmap.sh ref=ref.fa in=reads.fq outm=mapped.fq outu=junk.fq
tadpole.sh in=mapped.fq out=contigs.fa k=62

I'm not sure how that would translate to Geneious. Note, by the way, that you can remove homopolymer reads with BBDuk like this:

Code:

bbduk.sh in=reads.fq out=filtered.fq entropy=0.01

That will remove polymonomer reads only (AAAA...). At entropy=0.19 it will also remove polydimer reads (ACACAC...). At entropy=0.29 it will remove polytrimers (ACTACT...), and at 0.34 it will remove polytetramers. E.coli reads seem to have an average entropy of around 0.985 by my calculation method.

**JVGen** · 01-05-2017, 02:13 PM

Originally posted by Brian Bushnell View Post

I'm not really sure what the problem is in this case. Tadpole really doesn't care what the input reads look like, whether they are paired, or what format they are in. Can you post the complete error message?

What you are planning to do should work fine. On the command line, it would be something like this:

Code:

bbmap.sh ref=ref.fa in=reads.fq outm=mapped.fq outu=junk.fq
tadpole.sh in=mapped.fq out=contigs.fa k=62

I'm not sure how that would translate to Geneious. Note, by the way, that you can remove homopolymer reads with BBDuk like this:

Code:

bbduk.sh in=reads.fq out=filtered.fq entropy=0.01

That will remove polymonomer reads only (AAAA...). At entropy=0.19 it will also remove polydimer reads (ACACAC...). At entropy=0.29 it will remove polytrimers (ACTACT...), and at 0.34 it will remove polytetramers. E.coli reads seem to have an average entropy of around 0.985 by my calculation method.

Hi Brian,

That's unfortunately about as much as the error messages says - that Tadpole cannot use a mixture of paired and unpaired reads. It might be the read-name format that is throwing it off? For instance, my reads are named in the following format after BBMap (In Geneious):

MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/2
MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/1

I didn't know about that feature with BBDuk! Will entropy of 0.01 remove any string of a mononucleotide? Or, how many must be present in a string to flag it? Is this with a window size of 50 and kmer size of 5?

Thanks,
Jake

**Brian Bushnell** · 01-05-2017, 02:42 PM

Originally posted by JVGen View Post

That's unfortunately about as much as the error messages says - that Tadpole cannot use a mixture of paired and unpaired reads.

How many files do you have after BBMap, and what are they named?

It might be the read-name format that is throwing it off? For instance, my reads are named in the following format after BBMap (In Geneious):

MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/2
MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/1

I doubt it - BBTools should be able to handle reads named like that.

I didn't know about that feature with BBDuk! Will entropy of 0.01 remove any string of a mononucleotide? Or, how many must be present in a string to flag it? Is this with a window size of 50 and kmer size of 5?

For the default window=50 entropyk=5, reads must be at least 50bp long to be processed by the entropy filter (you can reduce that by making the window smaller). And entropy=0.01 will remove any sequence that is a singly mononucleotide, as long as it's at least 50bp long. Note that if there are some errors so that it is no longer a pure mononucleotide you'd need a higher value for entropy. Something like "AAAAAAAAAAGGGGGGGGGGGGGGGG" would also need a higher value (50 A's and 50 G's appears to need entropy=0.21). Don't set it too high, though, or you'll lose the low complexity parts of your genome.

**JVGen** · 01-06-2017, 07:28 AM

Originally posted by Brian Bushnell View Post

How many files do you have after BBMap, and what are they named?

The file name is: "1-JL08-P1-A3 assembled to HXB2 Nested Amplified Region extraction". Within the file, there are thousands of reads with the naming convention that I shared in the previous post.

I doubt it - BBTools should be able to handle reads named like that.

For the default window=50 entropyk=5, reads must be at least 50bp long to be processed by the entropy filter (you can reduce that by making the window smaller). And entropy=0.01 will remove any sequence that is a singly mononucleotide, as long as it's at least 50bp long. Note that if there are some errors so that it is no longer a pure mononucleotide you'd need a higher value for entropy. Something like "AAAAAAAAAAGGGGGGGGGGGGGGGG" would also need a higher value (50 A's and 50 G's appears to need entropy=0.21). Don't set it too high, though, or you'll lose the low complexity parts of your genome.

I think I might try something like window = 15 and entropy 0.01. That should get rid of the mononucleotide strings. The HIV genome generally doesn't have anything like that, so it should work.

***Update. Brian, I contacted Geneious and they seem to be aware of the problem. They gave me a macro/workflow that extracts the reads from the BBMap'd contig file, and now they are feeding into Tadpole without a problem. Thanks for your help on this, you're getting all the gold stars!

**boulund** · 01-25-2017, 05:21 AM

How does BBMap make use of an index on disk? I'm on a shared cluster system and I'm essentially wondering if BBMap performs a nice single pass of the index to read it into memory, or if it performs a lot of random access to the index on disk?

If it's the latter, I'll just copy it to node-local disks, so no worries. Just interested in how it works.

**GenoMax** · 01-25-2017, 06:08 AM

Originally posted by boulund View Post

How does BBMap make use of an index on disk? I'm on a shared cluster system and I'm essentially wondering if BBMap performs a nice single pass of the index to read it into memory, or if it performs a lot of random access to the index on disk?

If it's the latter, I'll just copy it to node-local disks, so no worries. Just interested in how it works.

@Brian will confirm later but I think BBMap loads pre-made indexes on disk in memory if you provide path= option. Note: "nodisk" option builds indexes in memory from fasta files but am not sure if it can (or needs to) be used with path= option.

**boulund** · 01-25-2017, 06:45 AM

Originally posted by GenoMax View Post

@Brian will confirm later but I think BBMap loads pre-made indexes on disk in memory if you provide path= option. Note: "nodisk" option builds indexes in memory from fasta files but am not sure if it can (or needs to) be used with path= option.

I think so too, it feels natural considering the memory usage profile of BBMap

.

Topics	Statistics	Last Post
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, Yesterday, 05:31 AM	0 responses 10 views 0 likes	Last Post by seqadmin Yesterday, 05:31 AM
Small Blood Stem Cell Subset Linked to Immune System Aging by seqadmin Started by seqadmin, 10-24-2024, 06:58 AM	0 responses 20 views 0 likes	Last Post by seqadmin 10-24-2024, 06:58 AM
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types by seqadmin Started by seqadmin, 10-23-2024, 08:43 AM	0 responses 50 views 0 likes	Last Post by seqadmin 10-23-2024, 08:43 AM
Microbes in Urban Spaces Adapt to Disinfectants and Scarce Resources by seqadmin Started by seqadmin, 10-17-2024, 07:29 AM	0 responses 58 views 0 likes	Last Post by seqadmin 10-17-2024, 07:29 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News