SEQanswers

02-10-2012, 08:58 AM   #1
LizBent
Member
 
Location: Guelph, Ontario, Canada

Join Date: Jan 2012
Posts: 31
De novo transcriptome quality metrics?

Hi everyone

I'm going to be making several de novo transcriptome assemblies (using different software), and I wish to compare them. What metrics are best for this? I don't have a reference genome.

Also, is there a software package for generating these metrics from output files? So far I've tried running Trinity, and I get a lot of output files, but none that seem to summarize the number of contigs, their lengths, etc. How can I calculate this from a FASTA file of assembled contigs?

02-10-2012, 11:13 AM   #2
ssing
Member
 
Location: usa

Join Date: Jan 2009
Posts: 20

Hi LizBent,

I have been working on the exact same problem and have come up with some metrics to estimate the quality of a transcriptome in the absence of a ref genome. Some stats that I have used are:
* N50
* percent annotated to my closest reference
* percent of annotated proteins that have (what seem to be) premature stop codons
* percent of reads used/percent of paired reads used
* contiguity & completeness (see http://www.nature.com/nrg/journal/v1...l/nrg3068.html)
* incidence of chimeric transcripts

As for calculating simple metrics like n50, max contig size, etc, I use the command line program abyss-fac, which is available as part of the general ABySS package.
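
For anyone who wants to see what is being computed, here is a minimal Python sketch of contig lengths and N50 from a FASTA file. It is not how abyss-fac is implemented; the function names `read_fasta_lengths` and `n50` are made up for illustration, and real tools may apply a minimum-length cutoff or handle ties differently, so the numbers may not match exactly.

```python
def read_fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the contig length at which the running total of lengths,
    taken from longest to shortest, first reaches half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input
```

To use it on a file, something like `with open("contigs.fasta") as fh: lengths = list(read_fasta_lengths(fh))` gives the list of lengths; `len(lengths)`, `max(lengths)` and `n50(lengths)` then give the contig count, the largest contig and the N50.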

Good luck!

10-12-2012, 01:03 PM   #3
nepossiver
Junior Member
 
Location: São Paulo

Join Date: Oct 2012
Posts: 8

Quote:
Originally Posted by ssing
*incidence of chimeric transcripts
hi ssing,

How do you calculate the incidence of chimeric transcripts? Do you have a reference genome? I don't, and I don't know of a good way to find chimeric contigs in my assemblies.

thanks

05-12-2015, 06:59 PM   #4
student-t
Member
 
Location: Garvan Institute

Join Date: Mar 2015
Posts: 16

There are a few options for calculating metrics for an assembly.

1. https://github.com/ajmazurie/velvet-stats
2. Biopieces
3. http://korflab.ucdavis.edu/datasets/...athon_stats.pl
4. abyss-fac

I don't recommend 1-3. The documentation is bad, and I didn't have the time to go through the source code. Biopieces required a multi-stage workflow, which I think is a very stupid idea.

Use abyss-fac, don't waste your time. On a Mac, install it via "brew install abyss"

05-12-2015, 07:24 PM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Old thread, but BBMap has a stats.sh program that will summarize basic assembly stats (N50, L50, distribution of contig sizes, GC%, etc); it's very fast even on assemblies with millions of contigs, and extremely easy to use:

stats.sh contigs.fasta
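
As a rough illustration of two of the stats mentioned above, here is a small Python sketch of L50 and GC%. It is not taken from BBMap; `l50` and `gc_percent` are hypothetical helper names, and the sketch assumes the common convention that N50 is a length and L50 is a contig count. Tools differ on which name means which, so check the documentation of whatever you compare against.

```python
def l50(lengths):
    """Return the smallest number of contigs, taken from longest to
    shortest, whose lengths sum to at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count
    return 0  # empty input


def gc_percent(sequences):
    """Return GC content as a percentage of unambiguous bases (A, C, G, T).
    Ambiguous bases such as N are ignored (an assumption of this sketch)."""
    gc = 0
    total = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / total if total else 0.0
```

For example, `l50([6, 3, 1])` is 1, because the single longest contig already holds more than half of the 10 bases.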

For more advanced statistics, particularly if you have a reference and are evaluating different assembly methodologies, I recommend Quast because it also does alignment to the reference to calculate the number of misassemblies. Also, even if you don't have a reference, it does neat things like gene prediction. Not sure how that feature would work on a transcriptome, though.

Last edited by Brian Bushnell; 05-12-2015 at 07:26 PM.

05-13-2015, 08:41 AM   #6
nepossiver
Junior Member
 
Location: São Paulo

Join Date: Oct 2012
Posts: 8

Quote:
Originally Posted by Brian Bushnell
For more advanced statistics, particularly if you have a reference and are evaluating different assembly methodologies, I recommend Quast because it also does alignment to the reference to calculate the number of misassemblies.
Their group (excellent, I love SPAdes and QUAST) is developing rnaQUAST to evaluate transcriptome assemblies. Version 0.1.1 (the current version at the time of my message) has a bug, though: the reference transcriptome file name has to strictly follow:

Code:
name.extension
I could not use a reference which had:

Code:
name.middle.extension
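
One possible workaround, sketched in Python, is to give the reference a name with a single dot before running the tool. This is only an illustration: `single_extension_name` is a hypothetical helper, replacing the inner dots with underscores is an arbitrary choice, and it assumes the issue is purely about extra dots in the file name as described above for version 0.1.1.

```python
from pathlib import Path


def single_extension_name(filename):
    """Return a path in the same directory whose file name has only one dot
    before the extension, e.g. name.middle.extension -> name_middle.extension."""
    path = Path(filename)
    # path.stem is the file name without its final suffix
    safe_stem = path.stem.replace(".", "_")
    return path.with_name(safe_stem + path.suffix)
```

You would then copy or symlink the reference to the returned path and point the tool at that copy instead of the original file.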

Last edited by nepossiver; 05-13-2015 at 09:06 AM. Reason: added rnaQUAST link.

06-04-2015, 08:03 AM   #7
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98

There are tools like CGAL and RSEM-EVAL, which calculate the likelihood that the reads belong to the assembly. That can help when you have more than one assembly to compare.

Since the size of the assembly can sometimes vary too, I also like to have an estimate of the genome size beforehand; tools for this are kmerspectrumanalyzer and kmergenie.

And depending on how fragmented you can/want to get with the data: the most likely correct genome (not necessarily contiguous) comes from taking the consensus of all your assemblies and breaking the contigs where they don't agree.

<s>If you arrive at a chromosome, and you have a prokaryote, then you need to take a look at the GC skew of the chromosome to detect obvious misassemblies.</s> scratch that, didn't see the transcriptome part.
EDIT: Eh, no strike through tags in this forum?

06-25-2015, 04:56 AM   #8
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153

I should say Biopieces is pretty nifty for this task:


https://code.google.com/p/biopieces/...embled_contigs

You simply do:

Code:
read_fasta -i contigs.fna |
grab -e "SEQ_LEN>=200" |
analyze_assembly -x
and get:

Code:
N50: 9082
MAX: 52038
MIN: 200
MEAN: 4170
TOTAL: 3057214
COUNT: 733
---
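
For comparison, the same kind of summary (drop contigs under 200 bp, then report the six values above) can be sketched in plain Python. This is not the Biopieces implementation; `summarize` is a hypothetical function that takes a list of contig lengths, and the truncated integer mean is an assumption chosen because it matches the output above (3057214 / 733 ≈ 4170.8, reported as 4170).

```python
def summarize(lengths, min_len=200):
    """Summarize contig lengths after discarding those shorter than min_len."""
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    if not kept:
        return {}
    total = sum(kept)
    running = 0
    n50 = 0
    for n in kept:
        running += n
        if running * 2 >= total:  # reached at least half of all bases
            n50 = n
            break
    return {
        "N50": n50,
        "MAX": kept[0],
        "MIN": kept[-1],
        "MEAN": total // len(kept),  # truncated, as in the output above
        "TOTAL": total,
        "COUNT": len(kept),
    }
```

Printing the result with `for key, value in summarize(lengths).items(): print(f"{key}: {value}")` gives one `NAME: value` line per statistic.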

All times are GMT -8.