

Old 09-21-2011, 09:25 AM   #1
Senior Member
Location: China

Join Date: Sep 2009
Posts: 199
Minimum short read required for transcriptome assembly

I have Illumina short reads (2x50 bp), around 14 Gb of data.
I am curious whether there is a parameter or formula for calculating the minimum number of short reads a transcriptome assembler needs in order to assemble a comprehensive set of transcripts.
E.g., "you must have at least 1 Mb of Illumina short reads in order to assemble a transcript".

Do we also need to consider the coverage and depth of the data when determining the minimum number of short reads required for transcriptome assembly?

Many thanks for any advice.
edge
Old 09-21-2011, 11:31 AM   #2
Rick Westerman
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Ah, I should have noted that you are a "Senior Member" and thus undoubtedly already know more about sequencing than many of us. My response below was aimed more at the many new people we get on SEQanswers, so it may not be applicable to you. I wish I had more than a rough guide to offer in place of an actual formula.


Originally Posted by edge
Do we need consider coverage and depth of data...
Yes, you do. In particular, for a non-normalized transcriptome or a sample that has not been rRNA-depleted, you need to be concerned with picking up low-expression genes.

You do not give enough information for us to make an intelligent decision for your particular case (e.g., we would need information on the organism you are sequencing, the complexity of the genes for the organism, whether your sequence sample is normalized or not, etc.). However, we can play around with some very rough numbers.

Let us assume that your sample is completely normalized. In other words each transcript (gene) is present once and only once in your sample. Assume a complex eukaryotic organism. Then our numbers could look like:

100,000 genes at 1,000 bases each ... equals a sequence space of 100 Mbase

Desired 30x sequencing coverage ... means we need 3 Gbase of sequence.

Your 14 Gb will do quite nicely.

On the other hand, let us assume that you do not have a normalized sample. Then some genes will be present thousands of times, others only once. I am sure that there is some graph out there that describes this behavior and provides a multiplication factor, but I'll make a wild guess that this increases the amount of sequence needed by at least a factor of 10. Thus you would need 30 Gbase of sequence.

The numbers above are very, very rough so do not base your research off of them. The numbers are more meant as a way to say "... it depends ..."
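
The back-of-envelope arithmetic above can be written as a small Python sketch. The function names and the `skew_factor` parameter are hypothetical, chosen only for illustration; the inputs (100,000 transcripts, 1,000 bases each, 30x coverage, a guessed factor of 10 for a non-normalized sample) are the rough numbers from this post, not measured values.

```python
import math


def required_bases(n_transcripts, mean_length_bp, coverage, skew_factor=1):
    """Rough sequencing requirement in bases.

    skew_factor is a guessed multiplier for uneven expression in a
    non-normalized sample (1 means a perfectly normalized sample).
    """
    return n_transcripts * mean_length_bp * coverage * skew_factor


def required_read_pairs(total_bases, read_length_bp):
    """Paired-end read pairs needed to reach total_bases (two reads per pair)."""
    return math.ceil(total_bases / (2 * read_length_bp))


# Normalized sample: 100,000 transcripts x 1,000 bases x 30x coverage.
normalized = required_bases(100_000, 1_000, 30)

# Non-normalized sample: same numbers with the guessed factor of 10.
non_normalized = required_bases(100_000, 1_000, 30, skew_factor=10)

# The same requirement expressed as 2x50 bp read pairs.
pairs_2x50 = required_read_pairs(normalized, 50)

print(f"normalized:     {normalized / 1e9:.0f} Gbase")
print(f"non-normalized: {non_normalized / 1e9:.0f} Gbase")
print(f"2x50 bp pairs for the normalized case: {pairs_2x50:,}")
```

Changing any one input (transcript count, mean length, target coverage, or the skew guess) changes the answer proportionally, which is the "it depends" point in numeric form.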

Last edited by westerman; 09-21-2011 at 11:34 AM. Reason: Realized that 'edge' is not a newbie.
westerman
Old 09-21-2011, 12:22 PM   #3
Location: St. Catharines

Join Date: Mar 2010
Posts: 11

The following publication shows a number of simulations on transcriptome assembly and the effects of coverage and sequencing technology. It's a bit dated now but should help you out. I believe they also have some online software so you can do your own rough simulation.

Wall PK, Leebens-Mack J, Chanderbali AS, Barakat A, Wolcott E, Liang H, Landherr L, Tomsho LP, Hu Y, Carlson JE, Ma H, Schuster SC, Soltis DE, Soltis PS, Altman N, dePamphilis CW. Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics. 2009 Aug 1;10:347.
tbanks
Old 09-22-2011, 12:18 AM   #4
Senior Member
Location: China

Join Date: Sep 2009
Posts: 199

Many thanks, westerman.

I have an RNA-seq human lung sample: 2x100 bp paired-end reads, with a total file size of 14 GB right now.
I plan to map my RNA-seq data against a transcriptome database downloaded from NCBI.
After that, I plan to cluster the short reads according to the transcript group they map to.
The problem I am facing is how to determine the minimum number of read pairs to use as a cut-off for assembly.
In the mapping results, some transcript groups have only about a thousand read pairs mapped to them.
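
One way to think about such a cut-off is in terms of per-transcript coverage and not raw pair counts, since a thousand pairs means something very different for a short transcript than for a long one. The Python sketch below is only an illustration under stated assumptions: the transcript names, lengths, and pair counts are made-up toy data, the helper names are hypothetical, and the 30x threshold is borrowed from the rough figure earlier in this thread, not from any established rule.

```python
from collections import Counter


def per_transcript_coverage(pair_assignments, transcript_lengths, read_length_bp):
    """Approximate coverage per transcript from mapped read pairs.

    pair_assignments: one transcript ID per mapped read pair.
    transcript_lengths: transcript ID -> length in bases.
    Each pair contributes 2 * read_length_bp bases.
    """
    pair_counts = Counter(pair_assignments)
    return {
        tx: pair_counts[tx] * 2 * read_length_bp / transcript_lengths[tx]
        for tx in pair_counts
    }


def transcripts_passing(coverage, min_coverage):
    """Transcript IDs whose approximate coverage meets the threshold."""
    return sorted(tx for tx, cov in coverage.items() if cov >= min_coverage)


# Toy data (hypothetical): 1,000 pairs on a 2 kb transcript, 50 pairs on a 4 kb one.
assignments = ["tx_A"] * 1000 + ["tx_B"] * 50
lengths = {"tx_A": 2000, "tx_B": 4000}

coverage = per_transcript_coverage(assignments, lengths, read_length_bp=100)
kept = transcripts_passing(coverage, min_coverage=30)

for tx, cov in sorted(coverage.items()):
    print(f"{tx}: {cov:.1f}x")
print("passing the 30x cut-off:", kept)
```

With 2x100 bp reads, a thousand pairs on a 2 kb transcript works out to roughly 100x, so a pair count that looks small can still represent substantial coverage.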

Thanks for any advice.
edge
Old 08-25-2013, 10:16 PM   #5
Location: Santiago

Join Date: Apr 2013
Posts: 22
Minimum depth of coverage in transcriptome assembly

Hi everyone, I have 4.46 Gb of Illumina MiSeq paired-end reads from transcript sequencing of various tissues. I have assembled all of these reads and found that the mean depth of coverage is 27.9X (depth of coverage = efficiency of sequencing / efficiency of assembly).
My question is: what is the minimum depth of coverage needed to obtain robust information from the assembled transcriptome in a de novo transcriptome analysis?
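
The formula quoted in this post is not fully specified. A common way to compute a mean depth, which may or may not be what was meant here, is total sequenced bases divided by total assembly length. The Python sketch below uses that definition as an explicit assumption; it also assumes the 4.46 figure refers to bases and not to file size, and the 160 Mbase assembly length is a hypothetical value chosen for illustration.

```python
def mean_depth(total_read_bases, assembly_length_bp):
    """Mean depth of coverage, assumed here to be read bases / assembly length."""
    if assembly_length_bp <= 0:
        raise ValueError("assembly_length_bp must be positive")
    return total_read_bases / assembly_length_bp


def implied_assembly_length(total_read_bases, depth):
    """Assembly length implied by a reported depth under the same assumption."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return total_read_bases / depth


# Assumption: 4.46 Gbase of reads; hypothetical 160 Mbase total assembly length.
depth = mean_depth(4_460_000_000, 160_000_000)
print(f"mean depth: {depth:.1f}x")

# Working backwards from the reported 27.9x.
length = implied_assembly_length(4_460_000_000, 27.9)
print(f"implied assembly length: {length / 1e6:.0f} Mbase")
```

A mean like this hides the spread: in a non-normalized transcriptome, highly expressed transcripts can sit far above the mean while rare ones sit far below it.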

Best regards!
mruizm
