Seqanswers Leaderboard Ad

**arendon** · 07-14-2009, 11:21 AM

I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.

for example for a solexa export file:

Code:

require(ShortRead)
aln<-readAligned("/path/to/file","filename_export.txt",type="SolexaExport")
# Filtering of reads e.g.:
aln <- aln[alignData(aln)$filtering=="Y" & !is.na(strand(aln)) ]
#Remove duplicated reads
aln<-aln[!srduplicated(aln)]
#Coverage
cvg<-coverage(aln,extend=60L) #in this case reads are extended 60 bp 3'

One can then use the package rtracklayer to export it as a wig

Code:

require(rtracklayer)
export.ucsc(as(cvg,"RangedData")),"test.wig",subformat="wig")

you might need to change the chromosome names afterwards if your original names already contained chr.

**dePhi** · 07-17-2009, 04:42 AM

Originally posted by westerman View Post

From my understanding yes they are different and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced how many times (or X) does the sequencing potentially cover the genome.

% coverage is how well the genome is actually covered after all mapping and assembly is done.

As an example let's say we have 300M reads of 50 bases or 1.5 Gbase total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

So our 'X coverage' is 10X (1.5 Gbases / 150 Mbases)
Our '% coverage' is 66.6% (100 Mbases / 150 Mbases)

One way to think about this is that percentages generally range from 0% to 100% and so having a percentage greater that 100 can be confusing.

I use the haploid genome size or more specifically the C-value times 965Mbases/pg.

Maybe I misunderstood what coverage is about, but...

Using coverage to determine how many times you COULD cover a total genome is not really usefull, is it? (Unless you want to brag during coffee breaks, of course) You would like to know how many times the sequenced part of your genome actually IS covered, right? Giving you an idea of how many reads confirm a found base at a position. And coverage gives you an estimate of just that; how many reads do you have covering the parts you actually sequenced.

Then shouldn't the result be 15X coverage? Because actual covering of the genome was 66.6% which contains 100Mb. So coverage would be 1.5Gb/100Mb which makes it 15X?
So on average your sequenced part of the genome has 15X coverage, while in theory you could cover the entire genome 10X.

Adding to that, I think that using the average coverage is not really adequate in describing the trustworthiness of your data. Using coverage distribution seems to make more sense to me. What if you have 50% covered only 5 times, and 50% covered 55 times? Still gives you 30X coverage, but it's not a good representation.

Sorry for being such a pain...
Cheers, dePhi

**xguo** · 07-22-2009, 07:00 AM

Hi, there,

I have a related question about coverage calculation for RNASeq data. The average exon coverage may be computed as the sum of # reads mapped to each exonic position divided by exon length. However, as dePhi said, "using the average coverage is not really adequate in describing the trustworthiness of your data." Drawing a distribution graph along the transcript length is helpful. It is easy to get the distribution graph for an individual transcript. But to get a summary graph for all transcripts, coverage values have to be scaled to transcript length and summed over all transcripts. I'm having trouble to do that. Could someone help me on that, probably some simple R code would do it?

thanks a lot

Xiang

**Nix** · 07-23-2009, 09:42 AM

I wrote an app called "ReadCoverage" which I think does exactly what you want for defined regions (e.g. exons, repeatmasked regions). Also generates track data. It's part of http://useq.sourceforge.net . Here's the command line menu. There's also a GUI if you prefer that approach.

**************************************************************************************
** Read Coverage: June 2009 **
**************************************************************************************
Generates read coverage stair-step xxx.bar graph files for visualization in IGB. Will
also calculate per base coverage stats for a given file of interrogated regions.

Options:
-p Point Data directories, full path, comma delimited. Should contain chromosome
specific xxx.bar.zip or xxx_-_.bar files.
-s Save directory, full path.
-i (Optional) Full path file name for a tab delimited bed file (chr start stop ...)
containing interrogated regions to use in calculating a per base coverage
statistics. Interbase coordinates assumed.
-b Just calculate stats, skip coverage graph generation.

Example: java -Xmx1500M -jar pathTo/USeq/Apps/ReadCoverage -p
/Data/Ets1Rep1/,/Data/Ets1Rep2/ -s /Data/MergedHitTrckEts1 -i
/CapSeqDesign/interrogatedExonsChrX.bed

**************************************************************************************

**xguo** · 07-24-2009, 09:53 AM

Thanks. Could you please let me know how you scale the base coverage for transcripts with different length?

**Nix** · 07-26-2009, 12:31 PM

It calculates a per base read coverage over the regions you provide. There is no scaling. It also generates a histogram of the fraction of interrogated bases with 1,2,3,4,.... or more reads. Good for assessing what % of your array capture and sequencing hit 10 or more reads. It also gives an estimate of what your coverage would be with additional sequencing.

**srao** · 08-19-2009, 05:46 PM

Does readAligned support Paired End export.txt files?

Hello,
I figured I'd ask, since this might be faster than me figuring it out - does readAligned from the ShortRead package allow data from PE export files to be plotted together?

Thanks a lot!

Originally posted by arendon View Post

I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.

for example for a solexa export file:

Code:

require(ShortRead)
aln<-readAligned("/path/to/file","filename_export.txt",type="SolexaExport")
# Filtering of reads e.g.:
aln <- aln[alignData(aln)$filtering=="Y" & !is.na(strand(aln)) ]
#Remove duplicated reads
aln<-aln[!srduplicated(aln)]
#Coverage
cvg<-coverage(aln,extend=60L) #in this case reads are extended 60 bp 3'

One can then use the package rtracklayer to export it as a wig

Code:

require(rtracklayer)
export.ucsc(as(cvg,"RangedData")),"test.wig",subformat="wig")

you might need to change the chromosome names afterwards if your original names already contained chr.

**ohofmann** · 08-20-2009, 03:51 AM

Originally posted by arendon View Post

I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.

And a question related to the one above -- does readAligned also support (or plans to support) the SAM format?

**arendon** · 08-20-2009, 03:59 AM

Hey ohofmann,

At the moment readAligned does not support SAM format. The developers are planning to include it for the next release. Take a look at https://stat.ethz.ch/pipermail/bioc-...ly/000474.html

Also, it is very worthwhile to have a look at http://htseq.ucr.edu They have a nice tutorial on all things R and seq.

**ohofmann** · 08-20-2009, 04:01 AM

Somehow missed Martin's post, strange. Thanks for the pointer! And yes, the UCR pages are a great starting point.

-- Oliver

**jamal** · 07-07-2010, 06:27 AM

Originally posted by westerman View Post

That is what I get for posting early Monday morning.

'X' coverage is raw bases divided by genome size. So the above should be 1.5 Gbases / 150 Mbases or 10X. I will correct my original post.

I believe that all published C-values for for haploid genomes. Although if someone knows for certain then please chime in.

thank you for your answer

I'm new in this field but I was searching about x covarage to calculate it for my reads and found this qoute. I would like to ask you that in your example 300M* 50 is 15G, so what is the point for this caculation?

**mattanswers** · 07-08-2010, 10:31 AM

Using Tablet, a genome viewer program, you can export a file in which the number of reads at each base is printed out. However, the base number is not printed out, just the number of reads.

Is there a graphing program that could take this txt file with the number of reads at each base and graph it as a line graph ? Or, if I added the appropriate base number along with the number of reads at each base, then is there a graphics program which could print out a line graph from this txt file ?

**jordi** · 07-27-2010, 08:06 AM

Hi all,
I would like to know the coverage of a 454 tissue specific transcriptome of a non-model specie, so I have not a reference sequence, neither EST nor genome sequence. I understood coverage as number of reads covering a given region (as dePhi said a year ago), that is, reads / bp. If I have a mean of 6 reads per contig and a mean of 209 bp per read, it is correct to say that I have a ~0.03X (6/209) coverage??
Thanks a lot!

**westerman** · 07-28-2010, 10:58 AM

jordi:

6 reads per contig with an average read length of 209 will tell you only that you have 1254 bases of data in a contig. What you really need is know is how many base *should* be covered. For example if your contig is suppose to be 1000 bases then your coverage is 1254/1000 or about 1.25 coverage. If your contig is suppose to be 300 bases then you have 1254/300 or about 4x coverage.

Unfortunately it appears that you have no idea of how many bases you are suppose to be covering. So you will have to take a wild guess. Fortunately inter-species transcriptomes tend to be a bit more uniform in size than inter-species genomes. I suggest you take the transcriptome size of a related species and use that. It won't be exact but it probably will not be off by more than a factor of 2.

To be clear what you want to calculate is:

transcriptome_size divided by (#_of_reads times avg_length_of_reads)

**westerman** · 07-28-2010, 11:11 AM

Originally posted by jamal View Post

I'm new in this field but I was searching about x covarage to calculate it for my reads and found this qoute. I would like to ask you that in your example 300M* 50 is 15G, so what is the point for this caculation?

And finally catching up with Jamal's question. Hum. It looks like I made another Monday morning calculation mistake. 300M reads times 50 would be indeed be 15 Gbases or 15000 Mbases. Then the 'X' coverage in my example is 15000/150 or 100x coverage.

Oh well, numbers being wrong or not, my original post was about 'X coverage' (where you divide how many bases you sequenced by the size of the genome/transcriptome/whatever you wish to sequence) versus 'percentage coverage' (which is how many bases you end up after mapping/assembly by the size of the genome/transcriptome/whatever).

This doesn't seem to be complex concept to me. Of course determining the size of what you wish to sequence can be complex. And even after you get the average coverage -- what does it really mean? The average does not help if your region of interest does not have good coverage in that region. Variance of coverage can be a helpful metric.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News