SEQanswers

Old 05-28-2010, 10:53 PM   #1
apratap
Member
 
Location: Bay Area

Join Date: Jan 2009
Posts: 58
Size of human transcriptome/exome for coverage calculation

Hi Guys

This might seem like a very trivial question but strangely enough I am not able to come up with an acceptable answer.

I am trying to calculate the size of human exome and human transcriptome in #bases for coverage purposes.

Here is what I did:

I downloaded the mRNA, exon, and RefSeq gene BED files from UCSC and summed the total number of bases in each file/feature. Clearly there are overlapping regions in each of these annotation files, but the base counts I am getting are far from the numbers one would expect. Here is what I am seeing:

1. Total #bases in mRNA : 14,881,824,369
2. Total #bases in exons : 99,752,470
3. Total # bases in RefSeq Genes : 2,011,862,672

Just wondering whether I should count the bases common to two genes twice, or whether only unique regions should be counted.

Any pointers from your experience will help.

Thanks!
-Abhi
Old 05-29-2010, 08:00 AM   #2
Bio.X2Y
Member
 
Location: Europe

Join Date: Apr 2010
Posts: 46
Default

Hi Abhi,

I'm not sure what other people do, but we count an exon base only once (regardless of the number of transcripts it appears in, and regardless of whether it is in an exon on one or both strands).

We're using the UCSC Known Gene human annotation (hg19), and these are the counts we've come up with:

Chromosome   Total Bases   Exon Bases
chr1           249250621      8079409
chr2           243199373      5781424
chr3           198022430      4706998
chr4           191154276      3364332
chr5           180915260      3820351
chr6           171115067      4241245
chr7           159138663      4049692
chr8           146364022      2909471
chr9           141213431      3430407
chr10          135534747      3398919
chr11          135006516      4439924
chr12          133851895      4144621
chr13          115169878      1655075
chr14          107349540      2665222
chr15          102531392      2897969
chr16           90354753      3248662
chr17           81195210      4348983
chr18           78077248      1377184
chr19           59128983      4500567
chr20           63025520      2034342
chr21           48129895       888164
chr22           51304566      1888002
chrX           155270560      2951340
chrY            59373566       271506
chrM               16571        11925
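
For anyone who wants the genome-wide total from these per-chromosome counts, here is a minimal Python sketch that just adds up the two columns. The numbers are copied from the table above; the variable names are only illustrative.

```python
# Per-chromosome counts copied from the table above: (total bases, exon bases).
COUNTS = {
    "chr1": (249250621, 8079409),
    "chr2": (243199373, 5781424),
    "chr3": (198022430, 4706998),
    "chr4": (191154276, 3364332),
    "chr5": (180915260, 3820351),
    "chr6": (171115067, 4241245),
    "chr7": (159138663, 4049692),
    "chr8": (146364022, 2909471),
    "chr9": (141213431, 3430407),
    "chr10": (135534747, 3398919),
    "chr11": (135006516, 4439924),
    "chr12": (133851895, 4144621),
    "chr13": (115169878, 1655075),
    "chr14": (107349540, 2665222),
    "chr15": (102531392, 2897969),
    "chr16": (90354753, 3248662),
    "chr17": (81195210, 4348983),
    "chr18": (78077248, 1377184),
    "chr19": (59128983, 4500567),
    "chr20": (63025520, 2034342),
    "chr21": (48129895, 888164),
    "chr22": (51304566, 1888002),
    "chrX": (155270560, 2951340),
    "chrY": (59373566, 271506),
    "chrM": (16571, 11925),
}

# Sum each column over all chromosomes.
total_bases = sum(total for total, _ in COUNTS.values())
exon_bases = sum(exon for _, exon in COUNTS.values())
exon_fraction = exon_bases / total_bases

print(f"genome bases: {total_bases:,}")
print(f"exon bases:   {exon_bases:,}")
print(f"exon share:   {exon_fraction:.2%}")
```

This gives 81,105,734 exon bases out of 3,095,693,983, i.e. about 2.6% of the assembly.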
Old 05-31-2010, 01:05 AM   #3
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

Quote:
Originally Posted by apratap View Post
Clearly there are overlapping regions in each of these annotation files [...] Just wondering if I should count the bases common to two genes twice or only uniq regions should be counted.
Most of the transcribed nucleotides of the human genome are represented in several different transcripts (whether or not those transcripts are considered the same "gene"). As Bio.X2Y pointed out, you definitely have to remove the redundancy. You can send your annotations to Galaxy or use BEDtools to "collapse" ("project"/"fuse"/"merge") your annotated exons before adding up the lengths.
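
To make the "collapse before adding the lengths" idea concrete, here is a small self-contained Python sketch. It is not what Galaxy or BEDtools run internally, just the same principle on toy data: sort the intervals per chromosome, merge the ones that overlap, then sum the merged lengths. It assumes BED-style 0-based, half-open coordinates and ignores strand (so a base is counted once even if it is exonic on both strands); the function name and the toy intervals are made up for illustration.

```python
from collections import defaultdict


def merged_length(intervals):
    """Count bases covered by at least one interval.

    `intervals` is an iterable of (chrom, start, end) tuples in BED-style
    0-based, half-open coordinates. Strand is ignored.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    covered = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or directly adjacent: extend the current block.
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the current block and start a new one.
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    return covered


# Toy exons: the first two overlap by 50 bases.
exons = [
    ("chr1", 100, 200),
    ("chr1", 150, 250),
    ("chr1", 300, 400),
    ("chr2", 100, 200),
]

naive = sum(end - start for _, start, end in exons)  # counts the overlap twice
unique = merged_length(exons)                        # counts each base once
print(naive, unique)
```

On the toy data the naive sum is 400 while the merged count is 350; the gap between those two numbers is the redundancy discussed in this thread.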
Old 05-31-2010, 03:14 AM   #4
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

If you just want a base-pair count for different annotations, you can use the UCSC Table Browser: choose the genome build you are using and the annotation you are interested in, then press "summary/statistics" at the bottom. For example, for hg18 RefSeq you get:

item count 34,702
item bases 1,166,592,699 (40.49%)
item total 2,020,112,601 (70.11%)
smallest item 33
average item 58,213
biggest item 2,304,634
block count 347,347
block bases 66,601,430 (2.31%)
block total 104,526,351 (3.63%)
smallest block 3
average block 301
biggest block 59,461


The "block" lines are what you are interested in: 347,347 exons from 34,702 RefSeq genes, with a total size of about 104.5 Mb; after removing redundancy, about 66.6 Mb is covered.
Old 06-01-2010, 09:57 AM   #5
apratap
Member
 
Location: Bay Area

Join Date: Jan 2009
Posts: 58
Default

Thanks Guys. I understand that it is acceptable to remove redundancy at exon level.

@frozenlyse : your end number (exons) seems to match mine.

How do I deal with gene-level coverage? Many genes overlap each other, as noted in my first post:

Total # bases in RefSeq Genes : 2,011,862,672

Is it acceptable to remove redundancy while counting bases in all human genes? In a way this will lead us to underestimate coverage. I say so because overlapping genes can be coexpressed, right?

Thanks for your time to help me understand this.

Best,
-Abhi
Old 06-01-2010, 01:22 PM   #6
NextGenSeq
Senior Member
 
Location: USA

Join Date: Apr 2009
Posts: 482
Default

I assume you are interested in this since you are doing whole exome sequence enrichment and subsequent sequencing.

Different vendors have different amounts of "whole exome" coverage. We found that the Agilent Sure Select only enriches for ~89% of the human whole exome.
Old 06-09-2010, 10:45 AM   #7
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

Quote:
Originally Posted by NextGenSeq View Post
I assume you are interested in this since you are doing whole exome sequence enrichment and subsequent sequencing.

Different vendors have different amounts of "whole exome" coverage. We found that the Agilent Sure Select only enriches for ~89% of the human whole exome.
NextGenSeq, how did you get the figure of ~89% of the exome targeted by Agilent? Could you share some detail on that?

Thanks,
sm
__________________
--
bioinfosm
Old 06-10-2010, 10:51 AM   #8
NextGenSeq
Senior Member
 
Location: USA

Join Date: Apr 2009
Posts: 482
Default

By comparing the genes listed in the bed file to the UCSC annotation. I tried attaching the bed file but it's too large for this site to allow it.
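
A rough sketch of what such a gene-level comparison could look like in Python. This is only a guess at the general approach, not the actual procedure used here: it assumes the kit's BED file carries a gene name in its 4th column (not every vendor file does), and the gene names and helper functions below are hypothetical.

```python
def names_from_bed(lines, name_col=3):
    """Collect the values of one column (default: 4th, 0-based index 3)
    from tab-separated BED lines, skipping header and comment lines."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > name_col:
            names.add(fields[name_col])
    return names


def fraction_targeted(annotation_genes, kit_genes):
    """Fraction of annotated genes that also appear in the kit's target list."""
    annotation_genes = set(annotation_genes)
    if not annotation_genes:
        return 0.0
    return len(annotation_genes & set(kit_genes)) / len(annotation_genes)


# Toy stand-ins for the vendor BED file and the UCSC gene list.
kit_bed = [
    "track name=kit_targets\n",
    "chr1\t100\t200\tGENE_A\n",
    "chr1\t500\t650\tGENE_B\n",
    "chr2\t10\t90\tGENE_B\n",
]
annotation = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}

kit_genes = names_from_bed(kit_bed)
print(f"{fraction_targeted(annotation, kit_genes):.0%} of annotated genes are targeted")
```

Note that a gene-level match says nothing about how many of a gene's exon bases are actually covered by baits; a base-level comparison of merged intervals would answer that question instead.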
Old 01-03-2011, 08:52 AM   #9
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 48
Default

I keep seeing a figure of 30-33Mb for the human exome e.g.

This 2009 Nature paper
http://www.nature.com/nature/journal...090910-11.html
"Protein-coding regions constitute ~1% of the human genome or ~30 megabases (Mb), split across ~180,000 exons."

30-33Mb is also the figure cited in Illumina's "Sequencing Output Calculator", sent to me by tech support.

Anyone know why the number is so much higher on this thread?
Old 02-14-2011, 06:29 PM   #10
rstarke
Junior Member
 
Location: earth

Join Date: Feb 2011
Posts: 7
Default

I would also like to know why the huge discrepancy between what's in the literature (~30-40Mb) and the numbers cited in this thread. I just checked the GENCODE v6 annotations and the total annotated base count is over a billion, supporting the estimates in this thread. I'm confused. Can anyone clear up the discrepancy?
Old 02-14-2011, 07:08 PM   #11
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Our friend Mr. Ref Seq says ...

Back of the envelope calculations:
The sum of the values for base coverage of the exons for the data above in the hg19/UCSCknown table (posted above) is
81,105,734

The RefSeq table from UCSC for hg19 (Jan 2011 version) gives: 63,995,498
[Method: load the table into a data structure, sort by name, and traverse; if currentname == previousname, don't count the entry, otherwise calculate the sum of its exons and add it to the total.] Nota bene: this won't eliminate some overlapping situations.
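
The method in brackets can be sketched in a few lines of Python. This is an illustration of the described steps, not the original program: rows are simplified to (name, exonStarts, exonEnds), with the exon coordinates as comma-separated strings with a trailing comma as in the UCSC gene tables, and the accession-like names and coordinates are toy values.

```python
def sum_exons_once_per_name(rows):
    """Sum exon lengths, counting each transcript name only once.

    `rows` is an iterable of (name, exon_starts, exon_ends) where the exon
    coordinates are comma-separated strings (a trailing comma is allowed).
    """
    total = 0
    previous_name = None
    for name, exon_starts, exon_ends in sorted(rows):  # sort by name
        if name == previous_name:
            continue  # same name as the previous row: don't count it again
        previous_name = name
        starts = [int(s) for s in exon_starts.split(",") if s]
        ends = [int(e) for e in exon_ends.split(",") if e]
        total += sum(end - start for start, end in zip(starts, ends))
    return total


# Toy rows (hypothetical names and coordinates).
rows = [
    ("NM_0002", "500,700,", "600,900,"),
    ("NM_0001", "100,300,", "200,400,"),
    ("NM_0001", "100,300,", "200,400,"),  # duplicate name: skipped
    ("NM_0003", "150,", "250,"),          # overlaps NM_0001 but still counted
]

print(sum_exons_once_per_name(rows))
```

The last toy row shows the caveat mentioned above: exons of differently named transcripts that overlap on the genome are still counted in full, so this total is an upper bound compared with a merge of the intervals.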

Refseq is more conservative than UCSCknown and relies more on hand curation and less on computation.

I don't know about GENCODE but if it's that for human only and that number is right then it's probably any transcript ever measured. I could only speculate on what that extra bonus coverage is. A free trip to Sweden goes to the guy that can explain and prove it (if it's functionally real).
Old 02-14-2011, 11:05 PM   #12
ulz_peter
Senior Member
 
Location: Graz, Austria

Join Date: Feb 2010
Posts: 219
Default

Just to throw my 2 Cents in. As far as I know most exome-enriching kits use the CDS database for generating the exome library. As this database is less comprehensive than the Refseq or knownGene annotations in UCSC some exons will be missed due to that. Of course others are discarded because of hybridization difficulties (repetitive regions, etc).
Old 02-15-2011, 02:12 AM   #13
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

Quote:
Originally Posted by ssully View Post
I keep seeing a figure of 30-33Mb for the human exome e.g.

This 2009 Nature paper
http://www.nature.com/nature/journal...090910-11.html
"Protein-coding regions constitute ~1% of the human genome or ~30 megabases (Mb), split across ~180,000 exons."

30-33Mb is also the figure cited in Illumina's "Sequencing Output Calculator", sent to me by tech support.

Anyone know why the number is so much higher on this thread?
Because "protein coding regions" and "exons" are different things. UTRs can be long, especially in human.

I think it is important to know what we are talking about:

1. number of genomic positions that are annotated as coding (included in CDS)
2. number of genomic positions that are annotated as exonic (included in exons)

As frozenlyse and Richard Finney indicated, values for 2. range from roughly 60 to 80 Mb, depending on the annotation source.
Ssully, the citation you mention with the number of 30Mb refers to 1. ("protein coding regions").
Rstarke, what is this number of 1 billion referring to? "Annotated bases" can be anything, on a genome you can annotate introns, promoters, repeated regions.. a link to this information would help.
Now, is there a precise definition of "exome" or is it a loose term? Is it supposed to include coding regions only, or can anyone put in there some UTR, promoters, intronic flanks, etc?
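
Since the whole point of the thread is coverage, here is a hedged Python sketch of why the choice of definition matters. It uses the usual back-of-the-envelope formula (mean coverage = on-target sequenced bases / target size) with the target sizes quoted in this thread. The read count, read length, and on-target fraction are invented for illustration and are not from any real run.

```python
def mean_coverage(n_reads, read_length, target_size, on_target_fraction=1.0):
    """Expected mean depth over a target: on-target bases / target size."""
    return n_reads * read_length * on_target_fraction / target_size


# Target sizes quoted in this thread (in bases).
TARGETS = {
    "coding only (~30 Mb)": 30_000_000,
    "RefSeq exons (hg19)": 63_995_498,
    "UCSC known gene exons (hg19)": 81_105_734,
}

# Hypothetical run: 40 million 100-base reads, 60% of bases on target.
N_READS = 40_000_000
READ_LENGTH = 100
ON_TARGET = 0.6

for label, size in TARGETS.items():
    depth = mean_coverage(N_READS, READ_LENGTH, size, ON_TARGET)
    print(f"{label:<30}{size:>12,} bp  {depth:6.1f}x")
```

With the same amount of sequence, the coding-only definition gives a mean depth more than twice as high as the UCSC exon definition, so it is worth stating which "exome" a coverage figure refers to.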
Old 02-15-2011, 02:13 AM   #14
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

Quote:
Originally Posted by ulz_peter View Post
Just to throw my 2 Cents in. As far as I know most exome-enriching kits use the CDS database for generating the exome library. As this database is less comprehensive than the Refseq or knownGene annotations in UCSC some exons will be missed due to that. Of course others are discarded because of hybridization difficulties (repetitive regions, etc).
That makes sense, thanks.
Old 03-02-2014, 07:08 PM   #15
zzta
Junior Member
 
Location: Hong Kong

Join Date: Dec 2013
Posts: 5
Default

Sorry to revive this thread, but exons or CDSs are not the only thing transcribed, so how can we account for non-coding RNAs? My understanding is that they are also part of the transcriptome...
Old 12-01-2019, 10:11 AM   #16
Ric69
Junior Member
 
Location: So Cal

Join Date: Oct 2019
Posts: 1
Default

Here's a nice uncomplicated summary of hg19... https://grch37.ensembl.org/Homo_sapiens/Info/Annotation

Tags
coverage, rna-seq
