
**Brian Bushnell** · 10-22-2015, 09:01 AM

Mapping is the approach normally used for this purpose. Kmer-based assemblers operate in kmer-space and do not (usually) associate kmers with the reads they came from. So, the information about which contig came from which read is lost.

BBMap has a special flag designed specifically for this purpose, "kfilter", which bans alignments that do not have at least k consecutive matching bases. For example, let's say you assemble some data using K=61. Then:

bbmap.sh in=reads.fq ref=assembly.fa outm=mapped.sam outu=unmapped.fq kfilter=61 maxindel=100

That will capture unmapped reads as unmapped.fq. And "kfilter=61" ensures that all of the mapped reads share a 61-mer with the contig they mapped to; in other words, that read WAS used to build that contig.

In practice, I don't think kfilter usually makes a lot of difference except in very specific scenarios, but RNA-seq assembly may be one of them.
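A quick sanity check after a run like the one above is to count how many reads ended up in unmapped.fq. The following is a minimal sketch, not part of BBMap: it assumes a plain FASTQ file with strict 4-line records (no wrapped sequence lines), and `count_reads` is a hypothetical helper name. The tiny demo file is made up so the sketch runs without real data.

```shell
#!/bin/sh
# Count reads in a FASTQ file, assuming strict 4-line records.
# count_reads is a hypothetical helper, not a BBMap tool.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Made-up two-read FASTQ so the sketch is runnable on its own.
DEMO=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$DEMO"
count_reads "$DEMO"
rm -f "$DEMO"
```

With real data, `count_reads unmapped.fq` and `count_reads reads.fq` would give the unmapped and total read counts for comparison.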

**Alirez** · 10-22-2015, 09:28 AM

Hi Brian,

Thank you very much for your reply, I will look into doing this.

One further complication that I have is that I have generated a final assembly through a multiple k-mer approach. I've created assemblies with every (odd) iteration of k from 19-31 and merged them all followed by removing redundant contigs. To find out which reads have not been used in this final assembly is there an option in BBMap to use multiple kfilters simultaneously?

If not, would you recommend running BBMap on each single-k assembly individually and then clustering all the unmapped reads together? Any cluster containing unmapped reads from all 7 assemblies would then consist of reads not incorporated into the final assembly.

Thanks again,
Ali

**Brian Bushnell** · 10-22-2015, 09:34 AM

The point of multiple kmers is to compensate for locally low coverage. If a contig shares a 19-mer with a read, then that read was used in the 19-mer phase. If a read shares a 31-mer with a contig, then it shares a 19-mer with it as well. So, just set kfilter=19.
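As a sketch of that suggestion: take the smallest k of the sweep and pass it as kfilter when mapping against the merged assembly. The file names reads.fq and final_assembly.fa are assumptions for illustration, and the other flags are simply those from the command earlier in the thread. The script only prints the command if bbmap.sh or the input files are not available.

```shell
#!/bin/sh
# k values used in the multi-k assembly (every odd k from 19 to 31).
KS="19 21 23 25 27 29 31"

# A read sharing a longer kmer with a contig also shares every shorter
# one, so the smallest k is the only filter needed.
MIN_K=$(printf '%s\n' $KS | sort -n | head -n 1)

# Hypothetical file names; adjust to your data.
READS=reads.fq
ASSEMBLY=final_assembly.fa

CMD="bbmap.sh in=$READS ref=$ASSEMBLY outm=mapped.sam outu=unmapped.fq kfilter=$MIN_K maxindel=100"

if command -v bbmap.sh >/dev/null 2>&1 && [ -f "$READS" ] && [ -f "$ASSEMBLY" ]; then
    $CMD
else
    echo "Not running; the command would be: $CMD"
fi
```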

19 is really short for assembly, though. I don't have any direct experience with Trinity, but normally I find that's below the bottom of the range of useful kmer sizes. In SPAdes, for example, which also supports multiple kmer lengths, I normally sweep from K=25 to K=127.

**westerman** · 10-22-2015, 09:51 AM

Trinity's normal kmer is 25. Sweeping in Trinity doesn't seem to make much difference. At least I do not recall anyone on the mailing list recommending a sweep.

That said, I would just take your final contigs and map the reads to them using your minimum kfilter. That would exclude from the unmapped output, on a very conservative basis, any reads that map to your final assembly, allowing you to concentrate on reads that absolutely do not map. As Brian hinted, kfilter probably won't make much of a difference, so personally I would skip it.

**Alirez** · 10-23-2015, 07:42 AM

Brian and westerman,

Thanks for both of your comments, I'll use BBMap as recommended.

With regard to the multiple-k approach, I've been using it as a way to optimise my assembly - my initial assemblies have been quite fragmented, with an unrealistic number of loci. This is likely partly due to Trinity being unable to reliably distinguish allelic variation from isoform information (as indicated by a few gene examples I looked at in my dataset). I've certainly found on tests of subsets of the data that a multiple-k assembly gives a much better assembly than any single k-mer alone. Conducting differential expression only on loci that can be well annotated should then resolve any misassembly issues. My next step is scaffolding through translation mapping to my closest genome/exome (tblastx) or proteome. Here is the paper I am taking the approach from (http://www.ncbi.nlm.nih.gov/pmc/arti...pid397977title) if you are interested to see it or have any further comments.


Acquiring reads not incorporated into a Trinity assembly?
