SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
Trinity Assembly Bang_Didi RNA Sequencing 6 01-11-2015 08:45 PM
Low mapping of reads to trinity assembly horvathdp RNA Sequencing 3 11-19-2013 07:30 AM
Acquiring contigs, de Bruijn graphs, velvet bioinf Bioinformatics 8 10-14-2012 08:27 PM
get wrong when I do de novo assembly by trinity for two 22.6G reads file taoxiang180 Bioinformatics 4 08-20-2012 10:52 PM
Fraction of reads incorporated in Velvet assemblies foram Bioinformatics 3 02-04-2009 06:36 PM

Reply
 
Thread Tools
Old 10-22-2015, 09:14 AM   #1
Alirez
Junior Member
 
Location: Durham, UK

Join Date: Sep 2014
Posts: 6
Default Acquiring reads not incorporated into a Trinity assembly?

Hi all,

I am conducting mRNASeq (Illumina HiSeq 2500 with PE data) using Trinity. I am finding it a bit difficult to see if there is an option to provide you with the reads which were not used to create the final .fasta assembly (the 'orphan' reads). Is there a way to do this?

I have considered looking at reads that that did not align back to the assembly, but I'm aware this may not be a true representation of reads not being used in the assembly.

Many thanks
Alirez is offline   Reply With Quote
Old 10-22-2015, 10:01 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Mapping is the approach normally used for this purpose. Kmer-based assemblers operate in kmer-space and do not (usually) associate kmers with the reads they came from. So, the information about which contig came from which read is lost.

BBMap has a special flag designed specifically for this purpose, "kfilter", which bans alignments that do not have at least k consecutive matching bases. For example, let's say you assemble some data using K=61. Then:

bbmap.sh in=reads.fq ref=assembly.fa outm=mapped.sam outu=unmapped.fq kfilter=61 maxindel=100


That will capture unmapped reads as unmapped.fq. And "kfilter=61" ensures that all of the mapped reads share a 61-mer with the contig they mapped to; in other words, that read WAS used to build that contig.

In practice, I don't think kfilter usually makes a lot of difference except in very specific scenarios, but RNA-seq assembly may be one of them.
Brian Bushnell is offline   Reply With Quote
Old 10-22-2015, 10:28 AM   #3
Alirez
Junior Member
 
Location: Durham, UK

Join Date: Sep 2014
Posts: 6
Default

Quote:
Originally Posted by Brian Bushnell View Post
Mapping is the approach normally used for this purpose. Kmer-based assemblers operate in kmer-space and do not (usually) associate kmers with the reads they came from. So, the information about which contig came from which read is lost.

BBMap has a special flag designed specifically for this purpose, "kfilter", which bans alignments that do not have at least k consecutive matching bases. For example, let's say you assemble some data using K=61. Then:

bbmap.sh in=reads.fq ref=assembly.fa outm=mapped.sam outu=unmapped.fq kfilter=61 maxindel=100


That will capture unmapped reads as unmapped.fq. And "kfilter=61" ensures that all of the mapped reads share a 61-mer with the contig they mapped to; in other words, that read WAS used to build that contig.

In practice, I don't think kfilter usually makes a lot of difference except in very specific scenarios, but RNA-seq assembly may be one of them.
Hi Brian,

Thank you very much for your reply, I will look into doing this.

One further complication that I have is that I have generated a final assembly through a multiple k-mer approach. I've created assemblies with every (odd) iteration of k from 19-31 and merged them all followed by removing redundant contigs. To find out which reads have not been used in this final assembly is there an option in BBMap to use multiple kfilters simultaneously?

If not, would you recommend conducting BBMap on each iterated-k assembly individually and clustering all unmapped reads together - anything which forms a cluster containing unmapped reads from all 7 assemblies will be reads not incorporated into the final assembly.

Thanks again,
Ali
Alirez is offline   Reply With Quote
Old 10-22-2015, 10:34 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The point of multiple kmers is to compensate for locally low coverage. If a contig shares a 19-mer with a read, then that read was used in the 19-mer phase. If a read shares a 31-mer with a read, then it shares a 19-mer as well. So, just set kfilter=19.

19 is really short for assembly, though. I don't have any direct experience with Trinity, but normally I find that's below the bottom of the range of useful kmer sizes. In Spades, for example, which also supports multiple kmer lengths, I normally sweep from K=25 to K=127.
Brian Bushnell is offline   Reply With Quote
Old 10-22-2015, 10:51 AM   #5
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Trinity's normal kmer is 25. Sweeping in Trinity doesn't seem to make much difference. At least I do not recall anyone on the mailing list recommending a sweep.

That said, I would just take your final contigs and map the reads to them using your minimum kfilter. That would keep any reads that map to your final assembly on a very conservative basis from being output thus allowing you to concentrate on reads that absolutely do not map. As Brian hinted, kfilter probably won't make much of a difference so personally I would skip it.
westerman is offline   Reply With Quote
Old 10-23-2015, 08:42 AM   #6
Alirez
Junior Member
 
Location: Durham, UK

Join Date: Sep 2014
Posts: 6
Default

Brian and Westermen,

Thanks for both of your comments, I'll use the bbmap as recommended.

With regards to the multiple-k approach I've been using it as a way to optimise my assembly - my initial assemblies have been quite fragmented with an unrealistic amount of loci. This is likely partly due to Trinity being unable to reliably distinguish allelic variation from isoform information (as indicated by a few gene examples I looked at in my dataset). I've certainly found on tests of subsets of the data that a multiple-k assembly gives a much better assembly than any single k-mer alone. Conducting differential expression on loci that can only be well annotated should then resolve any misassembly issues. My next step is a scaffolding through translation mapping to my closest genome/exome (tblastx)or proteome. Here is the paper I am using the approach from (http://www.ncbi.nlm.nih.gov/pmc/arti...pid397977title) if you are interested to see or have any further comments.
Alirez is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:24 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO