SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to calculate P-value&FDR lynn012 RNA Sequencing 2 09-18-2011 10:51 PM
Calculate p-value of SNP cybercot Bioinformatics 0 04-20-2011 09:55 PM
How to calculate the sequencing coverage from bioscope result coonya SOLiD 0 12-28-2010 10:14 PM
low 454 coverage combined with high solexa coverage strob Bioinformatics 7 10-07-2010 10:14 AM
how calculate MAF ? dmartinez Bioinformatics 1 09-23-2010 09:39 AM

Reply
 
Thread Tools
Old 05-28-2009, 08:42 AM   #1
arendon
Junior Member
 
Location: Cambridge, UK

Join Date: Mar 2009
Posts: 6
Default How to calculate coverage

Hello,

I would like to sequence coverage along the genome from aligned reads. Any suggestions or tools to do this efficiently? Maybe also being able to choose different bin sizes?

Many thanks,

a
arendon is offline   Reply With Quote
Old 05-29-2009, 06:23 AM   #2
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Maybe it is early in the morning but I am not able to parse the phrase "I would like to sequence coverage along the genome from aligned reads". Perhaps you meant "... like to determine sequence coverage ...". In which case, yes, binning would work. Or just taking the number of reads times the average length of the reads all divided the genome length.

Many of the alignment programs will give you something that you can toss into a spreadsheet and come up with a fancy graph. Basically via binning with a bin size of 1.

Really it all depends on exactly what you want in the end. A single "X coverage" number? A graph? A mean with standard deviation? In any case the computational portion does not seem that complex to me.
westerman is offline   Reply With Quote
Old 05-29-2009, 06:45 AM   #3
arendon
Junior Member
 
Location: Cambridge, UK

Join Date: Mar 2009
Posts: 6
Default

Terribly sorry, I did make little sense.

I would like to know at each base along the genome how many reads saw that base. Alternatively, one can ask not at the single base level but at some interval distance, say 10bp. The output could be a wig file. This does not sound terribly complicated to do if the reads are sorted. I am more wondering whether there are tools that already do this.

Many thanks,

a
arendon is offline   Reply With Quote
Old 06-02-2009, 01:41 PM   #4
ewilbanks
Member
 
Location: Davis, CA

Join Date: Mar 2009
Posts: 82
Default

Hi! I was working on just this issue a while back and was surprised by the relative lack of tools. Binning should work just fine. I'd recommend Aaron Quinlan's "BEDTools". the CoverageBed and genomeCoverageBed seem applicable, though I haven't used them yet.

http://people.virginia.edu/~arq5x/bedtools.html

A friend of mine wrote a nifty script in R for me that calculates the coverage at each base-pair across the genome, using input of a text file with read genome coordinates. I'd be happy to send it to you, if that seems helpful. The data from this is quite noisy and usually needs smoothing of some sort (I did rolling means, using the R "zoo package"). I don't know what genome size you're working with, but I was using this on microbial genomes (~4 Mb) and the program runs in ~20-30 min.

Cheers,
Lizzy
ewilbanks is offline   Reply With Quote
Old 06-03-2009, 10:06 AM   #5
v_kisand
Member
 
Location: Eesti

Join Date: Jan 2009
Posts: 37
Default

I do not remember exactly but some time ago I messed with velvet I think it calculated coverage (I must admit I do not remember was that single base coverage)
and there was a tutorial how to get nice graphs using R.

Quote:
Originally Posted by ewilbanks View Post
Hi! I was working on just this issue a while back and was surprised by the relative lack of tools. Binning should work just fine. I'd recommend Aaron Quinlan's "BEDTools". the CoverageBed and genomeCoverageBed seem applicable, though I haven't used them yet.

http://people.virginia.edu/~arq5x/bedtools.html

A friend of mine wrote a nifty script in R for me that calculates the coverage at each base-pair across the genome, using input of a text file with read genome coordinates. I'd be happy to send it to you, if that seems helpful. The data from this is quite noisy and usually needs smoothing of some sort (I did rolling means, using the R "zoo package"). I don't know what genome size you're working with, but I was using this on microbial genomes (~4 Mb) and the program runs in ~20-30 min.

Cheers,
Lizzy
v_kisand is offline   Reply With Quote
Old 06-08-2009, 10:42 PM   #6
strob
Member
 
Location: Belgium

Join Date: Nov 2008
Posts: 79
Default

Hello all,

instead of having a complete chromosome/genome as a reference, I use many gene-scale sequences as my reference. I now want to see what the coverage per base is when I map my solexa reads against these many reference sequences. Is there already a tool out there that can do the job?
Can programs like soap, bowtie, ... provide me this type of information?
Is it also possible to do a blast (with stringent parameters) and than parse these blast results?

Any help/comments are more than welcome
strob is offline   Reply With Quote
Old 06-09-2009, 02:17 AM   #7
Jonathan
Member
 
Location: Germany

Join Date: Jun 2009
Posts: 36
Default

There's an easy way to do it using output from the maq-pipeline:

Code:
...
[maq-steps]
...
maq pileup -p [your bfa] [your map] > pileup.out

cut -f 2-4 pileup.out > croppedpileup.out

#then launch R
R
#following are R commands
data <-read.table(file="croppedpileup.out",sep="\t",header=F)
colnames(data)<-c("pos","consensus","coverage")
depth<-mean(data[,"coverage"])
# depth now has the mean (overall)coverage
#set the bin-size
window<-101
rangefrom<-0
rangeto<-length(data[,"pos"])
data.smoothed<-runmed(data[,"coverage"],k=window)
png(file="cov_out.png",width=1900,height=1000)
plot(x=data[rangefrom:rangeto,"pos"],y=data.smoothed[rangefrom:rangeto],pch=".", cex=1,xlab="bp position",ylab="depth",type="l")
dev.off()
Feel free to leave R afterwards,
you should (unless some error occured) find a PNG-file containing the coverageplot in your directory;
Of course window can be changed (needs to be odd-numbered, though)
as well as rangefrom and rangeto values.

Edit:
Of course when using many sequences in maq,
you will most likely be interessted in keeping the first column of the pileup.out.
However, this will leed to much bigger files (longer R-load-times), and will require R-handling
as you probably want to slice and dice and plot them by sequence-ID I take it?

Any questions?
Best
-Jonathan

Last edited by Jonathan; 06-09-2009 at 02:20 AM.
Jonathan is offline   Reply With Quote
Old 07-11-2009, 05:32 AM   #8
Malabady
Member
 
Location: Illinois, USA

Join Date: Apr 2009
Posts: 12
Default

hi all;
some papers mention X coverage and some says % coverage, is there any difference between both? we get % coverage by dividing total (reads*length) by genome size. if these two terms are different, how the X coverage is calculated? do we use haploid genome size instead of genome size?
Malabady is offline   Reply With Quote
Old 07-13-2009, 07:14 AM   #9
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by Malabady View Post
hi all;
some papers mention X coverage and some says % coverage, is there any difference between both? we get % coverage by dividing total (reads*length) by genome size. if these two terms are different, how the X coverage is calculated? do we use haploid genome size instead of genome size?
From my understanding yes they are different and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced how many times (or X) does the sequencing potentially cover the genome.

% coverage is how well the genome is actually covered after all mapping and assembly is done.

As an example let's say we have 300M reads of 50 bases or 1.5 Gbase total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

So our 'X coverage' is 10X (1.5 Gbases / 150 Mbases)
Our '% coverage' is 66.6% (100 Mbases / 150 Mbases)


One way to think about this is that percentages generally range from 0% to 100% and so having a percentage greater that 100 can be confusing.


I use the haploid genome size or more specifically the C-value times 965Mbases/pg.

Last edited by westerman; 07-13-2009 at 10:25 AM. Reason: Corrected my division order. Color me red-faced!
westerman is offline   Reply With Quote
Old 07-13-2009, 07:38 AM   #10
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by westerman View Post
From my understanding yes they are different and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced how many times (or X) does the sequencing potentially cover the genome.

% coverage is how well the genome is actually covered after all mapping and assembly is done.

As an example let's say we have 300M reads of 50 bases or 1.5 Gbase total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
Our '% coverage' is 66.6% (100Mbases / 150Mbases)


One way to think about this is that percentages generally range from 0% to 100% and so having a percentage greater that 100 can be confusing.


I use the haploid genome size or more specifically the C-value times 965Mbases/pg.
Also, what is a genome? Is it the non-repetitive part? Is it the part that is sequencable with your X base-pair reads? Most genome-sequencing papers say >98% but I highly doubt this given the large fraction of ALU, SINE, and other repeat elements that confound short reads.
nilshomer is offline   Reply With Quote
Old 07-13-2009, 08:10 AM   #11
Malabady
Member
 
Location: Illinois, USA

Join Date: Apr 2009
Posts: 12
Default

Thanks Westerman,

In the example you gave:
So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
Our '% coverage' is 66.6% (100Mbases / 150Mbases)

Isn't the X coverage should be 0.1X in this case?

Also, did you mean that you use the haploid genome size (NOT the diploid, triploid, etc)?
Malabady is offline   Reply With Quote
Old 07-13-2009, 10:24 AM   #12
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by Malabady View Post
Thanks Westerman,

In the example you gave:
So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
Our '% coverage' is 66.6% (100Mbases / 150Mbases)

Isn't the X coverage should be 0.1X in this case?
That is what I get for posting early Monday morning.

'X' coverage is raw bases divided by genome size. So the above should be 1.5 Gbases / 150 Mbases or 10X. I will correct my original post.

Quote:
Also, did you mean that you use the haploid genome size (NOT the diploid, triploid, etc)?
I believe that all published C-values for for haploid genomes. Although if someone knows for certain then please chime in.
westerman is offline   Reply With Quote
Old 07-13-2009, 12:14 PM   #13
Malabady
Member
 
Location: Illinois, USA

Join Date: Apr 2009
Posts: 12
Default

I agree that all published C-values are for haploid genomes. But one can roughly estimated the total genome size by multiplying the haploid size by the number of genomes. so if we are talking about diploid plant, we multiply the haploid genome size by 2. Then use this estimated genome size in the coverage calculation. Does this sounds correct to you?.....many thanks
Malabady is offline   Reply With Quote
Old 07-14-2009, 08:04 AM   #14
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default computing coverage

Hi,
As ewilbanks suggested, BEDTools will do this for you.

http://people.virginia.edu/~arq5x/bedtools.html

If you want to compute coverage for "bins/windows" that march along the genome, you would use coverageBed. Let's say you've created a BED file called windows.bed for 10Kb windows across your genome and it looks like this (note BEDTools uses UCSC 0-based starts):
chr1 0 10000
chr1 9999 20000
...

Now, you also have a bed file of sequence reads called reads.bed. The following command with calculate for each window in windows.bed:
1) The number of reads in reads.bed that overlap the window
2) Coverage density (i.e. the fraction of base pairs in window.bed that are covered by >= 1 read in reads.bed)

coverageBed -a reads.bed -b windows.bed | sortBed -i stdin > windows.bed.coverage

Sample output:
<chr> <start> <end> <#reads> <# window bases covered> <windowSize> <fraction of window covered>
chr1 0 10000 0 0 10000 0
chr1 9999 20000 33 1000 10000 0.10

I hope this helps. If you need a script to create a BED file with windows of a given size, just let me know.

Best,
Aaron
quinlana is offline   Reply With Quote
Old 07-14-2009, 08:52 AM   #15
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

Also, "per base" coverage can be computed with genomeCoverageBed using the "-d" option. Unfortunately, it doesn't currently have an option to assume that the input BED file is sorted by chrom/start. Consequently, it loads the data into memory and sorts it internally. Thus, memory use is quite high if you have millions and millions of reads. A forthcoming release will fix this obvious limitation.

Note that genomeCoverageBed requires a "genome" file that tells it how long each chromosome is. One can quickly produce this by querying the UCSC databases.

For example, human (hg18):
> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg18.chromInfo" | grep -v ^chrom | head
chr1 247249719
chr1_random 1663265
chr10 135374737
chr10_random 113275
chr11 134452384
chr11_random 215294
chr12 132349534
chr13 114142980
chr13_random 186858
chr14 106368585

Or, of course, just use their browser.

Best,
Aaron
quinlana is offline   Reply With Quote
Old 07-14-2009, 11:21 AM   #16
arendon
Junior Member
 
Location: Cambridge, UK

Join Date: Mar 2009
Posts: 6
Default

I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.

for example for a solexa export file:
Code:
require(ShortRead)
aln<-readAligned("/path/to/file","filename_export.txt",type="SolexaExport")
# Filtering of reads e.g.:
aln <- aln[alignData(aln)$filtering=="Y" & !is.na(strand(aln)) ]
#Remove duplicated reads
aln<-aln[!srduplicated(aln)]
#Coverage
cvg<-coverage(aln,extend=60L) #in this case reads are extended 60 bp 3'
One can then use the package rtracklayer to export it as a wig
Code:
require(rtracklayer)
export.ucsc(as(cvg,"RangedData")),"test.wig",subformat="wig")
you might need to change the chromosome names afterwards if your original names already contained chr.
arendon is offline   Reply With Quote
Old 07-17-2009, 04:42 AM   #17
dePhi
Junior Member
 
Location: Amsterdam, NL

Join Date: Feb 2009
Posts: 5
Default

Quote:
Originally Posted by westerman View Post
From my understanding yes they are different and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced how many times (or X) does the sequencing potentially cover the genome.

% coverage is how well the genome is actually covered after all mapping and assembly is done.

As an example let's say we have 300M reads of 50 bases or 1.5 Gbase total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

So our 'X coverage' is 10X (1.5 Gbases / 150 Mbases)
Our '% coverage' is 66.6% (100 Mbases / 150 Mbases)


One way to think about this is that percentages generally range from 0% to 100% and so having a percentage greater that 100 can be confusing.


I use the haploid genome size or more specifically the C-value times 965Mbases/pg.
Maybe I misunderstood what coverage is about, but...

Using coverage to determine how many times you COULD cover a total genome is not really usefull, is it? (Unless you want to brag during coffee breaks, of course) You would like to know how many times the sequenced part of your genome actually IS covered, right? Giving you an idea of how many reads confirm a found base at a position. And coverage gives you an estimate of just that; how many reads do you have covering the parts you actually sequenced.

Then shouldn't the result be 15X coverage? Because actual covering of the genome was 66.6% which contains 100Mb. So coverage would be 1.5Gb/100Mb which makes it 15X?
So on average your sequenced part of the genome has 15X coverage, while in theory you could cover the entire genome 10X.

Adding to that, I think that using the average coverage is not really adequate in describing the trustworthiness of your data. Using coverage distribution seems to make more sense to me. What if you have 50% covered only 5 times, and 50% covered 55 times? Still gives you 30X coverage, but it's not a good representation.

Sorry for being such a pain...
Cheers, dePhi
dePhi is offline   Reply With Quote
Old 07-22-2009, 07:00 AM   #18
xguo
Member
 
Location: Maryland

Join Date: Jul 2008
Posts: 48
Default

Hi, there,

I have a related question about coverage calculation for RNASeq data. The average exon coverage may be computed as the sum of # reads mapped to each exonic position divided by exon length. However, as dePhi said, "using the average coverage is not really adequate in describing the trustworthiness of your data." Drawing a distribution graph along the transcript length is helpful. It is easy to get the distribution graph for an individual transcript. But to get a summary graph for all transcripts, coverage values have to be scaled to transcript length and summed over all transcripts. I'm having trouble to do that. Could someone help me on that, probably some simple R code would do it?

thanks a lot

Xiang
xguo is offline   Reply With Quote
Old 07-23-2009, 09:42 AM   #19
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default

I wrote an app called "ReadCoverage" which I think does exactly what you want for defined regions (e.g. exons, repeatmasked regions). Also generates track data. It's part of http://useq.sourceforge.net . Here's the command line menu. There's also a GUI if you prefer that approach.

**************************************************************************************
** Read Coverage: June 2009 **
**************************************************************************************
Generates read coverage stair-step xxx.bar graph files for visualization in IGB. Will
also calculate per base coverage stats for a given file of interrogated regions.

Options:
-p Point Data directories, full path, comma delimited. Should contain chromosome
specific xxx.bar.zip or xxx_-_.bar files.
-s Save directory, full path.
-i (Optional) Full path file name for a tab delimited bed file (chr start stop ...)
containing interrogated regions to use in calculating a per base coverage
statistics. Interbase coordinates assumed.
-b Just calculate stats, skip coverage graph generation.

Example: java -Xmx1500M -jar pathTo/USeq/Apps/ReadCoverage -p
/Data/Ets1Rep1/,/Data/Ets1Rep2/ -s /Data/MergedHitTrckEts1 -i
/CapSeqDesign/interrogatedExonsChrX.bed

**************************************************************************************
Nix is offline   Reply With Quote
Old 07-24-2009, 09:53 AM   #20
xguo
Member
 
Location: Maryland

Join Date: Jul 2008
Posts: 48
Default

Thanks. Could you please let me know how you scale the base coverage for transcripts with different length?
xguo is offline   Reply With Quote
Reply

Tags
coverage, depth

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:05 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO