SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to calculate P-value&FDR lynn012 RNA Sequencing 2 09-18-2011 10:51 PM
Calculate p-value of SNP cybercot Bioinformatics 0 04-20-2011 09:55 PM
How to calculate the sequencing coverage from bioscope result coonya SOLiD 0 12-28-2010 10:14 PM
low 454 coverage combined with high solexa coverage strob Bioinformatics 7 10-07-2010 10:14 AM
how calculate MAF ? dmartinez Bioinformatics 1 09-23-2010 09:39 AM

Reply
 
Thread Tools
Old 07-26-2009, 12:31 PM   #21
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default

It calculates a per base read coverage over the regions you provide. There is no scaling. It also generates a histogram of the fraction of interrogated bases with 1,2,3,4,.... or more reads. Good for assessing what % of your array capture and sequencing hit 10 or more reads. It also gives an estimate of what your coverage would be with additional sequencing.
Nix is offline   Reply With Quote
Old 08-19-2009, 05:46 PM   #22
srao
Junior Member
 
Location: CA

Join Date: Feb 2008
Posts: 4
Default Does readAligned support Paired End export.txt files?

Hello,
I figured I'd ask, since this might be faster than me figuring it out - does readAligned from the ShortRead package allow data from PE export files to be plotted together?

Thanks a lot!


Quote:
Originally Posted by arendon View Post
I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.

for example for a solexa export file:
Code:
require(ShortRead)
aln<-readAligned("/path/to/file","filename_export.txt",type="SolexaExport")
# Filtering of reads e.g.:
aln <- aln[alignData(aln)$filtering=="Y" & !is.na(strand(aln)) ]
#Remove duplicated reads
aln<-aln[!srduplicated(aln)]
#Coverage
cvg<-coverage(aln,extend=60L) #in this case reads are extended 60 bp 3'
One can then use the package rtracklayer to export it as a wig
Code:
require(rtracklayer)
export.ucsc(as(cvg,"RangedData")),"test.wig",subformat="wig")
you might need to change the chromosome names afterwards if your original names already contained chr.
srao is offline   Reply With Quote
Old 08-20-2009, 03:51 AM   #23
ohofmann
Member
 
Location: Melbourne, Australia

Join Date: Jan 2009
Posts: 37
Default

Quote:
Originally Posted by arendon View Post
I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.
And a question related to the one above -- does readAligned also support (or plans to support) the SAM format?
ohofmann is offline   Reply With Quote
Old 08-20-2009, 03:59 AM   #24
arendon
Junior Member
 
Location: Cambridge, UK

Join Date: Mar 2009
Posts: 6
Default

Hey ohofmann,

At the moment readAligned does not support SAM format. The developers are planning to include it for the next release. Take a look at https://stat.ethz.ch/pipermail/bioc-...ly/000474.html

Also, it is very worthwhile to have a look at http://htseq.ucr.edu They have a nice tutorial on all things R and seq.
arendon is offline   Reply With Quote
Old 08-20-2009, 04:01 AM   #25
ohofmann
Member
 
Location: Melbourne, Australia

Join Date: Jan 2009
Posts: 37
Default

Somehow missed Martin's post, strange. Thanks for the pointer! And yes, the UCR pages are a great starting point.

-- Oliver
ohofmann is offline   Reply With Quote
Old 07-07-2010, 06:27 AM   #26
jamal
Member
 
Location: Denmark

Join Date: Jan 2010
Posts: 10
Default

Quote:
Originally Posted by westerman View Post
That is what I get for posting early Monday morning.

'X' coverage is raw bases divided by genome size. So the above should be 1.5 Gbases / 150 Mbases or 10X. I will correct my original post.



I believe that all published C-values for for haploid genomes. Although if someone knows for certain then please chime in.
thank you for your answer

I'm new in this field but I was searching about x covarage to calculate it for my reads and found this qoute. I would like to ask you that in your example 300M* 50 is 15G, so what is the point for this caculation?
jamal is offline   Reply With Quote
Old 07-08-2010, 10:31 AM   #27
mattanswers
Member
 
Location: Boston

Join Date: Oct 2009
Posts: 65
Default

Using Tablet, a genome viewer program, you can export a file in which the number of reads at each base is printed out. However, the base number is not printed out, just the number of reads.

Is there a graphing program that could take this txt file with the number of reads at each base and graph it as a line graph ? Or, if I added the appropriate base number along with the number of reads at each base, then is there a graphics program which could print out a line graph from this txt file ?
mattanswers is offline   Reply With Quote
Old 07-27-2010, 08:06 AM   #28
jordi
Member
 
Location: València, Spain

Join Date: Apr 2009
Posts: 48
Default

Hi all,
I would like to know the coverage of a 454 tissue specific transcriptome of a non-model specie, so I have not a reference sequence, neither EST nor genome sequence. I understood coverage as number of reads covering a given region (as dePhi said a year ago), that is, reads / bp. If I have a mean of 6 reads per contig and a mean of 209 bp per read, it is correct to say that I have a ~0.03X (6/209) coverage??
Thanks a lot!
jordi is offline   Reply With Quote
Old 07-28-2010, 10:58 AM   #29
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

jordi:

6 reads per contig with an average read length of 209 will tell you only that you have 1254 bases of data in a contig. What you really need is know is how many base *should* be covered. For example if your contig is suppose to be 1000 bases then your coverage is 1254/1000 or about 1.25 coverage. If your contig is suppose to be 300 bases then you have 1254/300 or about 4x coverage.

Unfortunately it appears that you have no idea of how many bases you are suppose to be covering. So you will have to take a wild guess. Fortunately inter-species transcriptomes tend to be a bit more uniform in size than inter-species genomes. I suggest you take the transcriptome size of a related species and use that. It won't be exact but it probably will not be off by more than a factor of 2.

To be clear what you want to calculate is:

transcriptome_size divided by (#_of_reads times avg_length_of_reads)
westerman is offline   Reply With Quote
Old 07-28-2010, 11:11 AM   #30
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by jamal View Post

I'm new in this field but I was searching about x covarage to calculate it for my reads and found this qoute. I would like to ask you that in your example 300M* 50 is 15G, so what is the point for this caculation?
And finally catching up with Jamal's question. Hum. It looks like I made another Monday morning calculation mistake. 300M reads times 50 would be indeed be 15 Gbases or 15000 Mbases. Then the 'X' coverage in my example is 15000/150 or 100x coverage.

Oh well, numbers being wrong or not, my original post was about 'X coverage' (where you divide how many bases you sequenced by the size of the genome/transcriptome/whatever you wish to sequence) versus 'percentage coverage' (which is how many bases you end up after mapping/assembly by the size of the genome/transcriptome/whatever).

This doesn't seem to be complex concept to me. Of course determining the size of what you wish to sequence can be complex. And even after you get the average coverage -- what does it really mean? The average does not help if your region of interest does not have good coverage in that region. Variance of coverage can be a helpful metric.
westerman is offline   Reply With Quote
Old 07-29-2010, 12:45 AM   #31
jordi
Member
 
Location: València, Spain

Join Date: Apr 2009
Posts: 48
Default

thank you very much for your answer westerman. You're right, I have no idea about the transcriptome size. It's a tissue specific from a non-model specie. I got 62M from a 454 pyrosequencing process from 8 different organisms (same plate) and I suspect that there were too much samples for a single plate. I would like to guess the ideal ratio sample/plate so I thought that coverage could be a good way to estimate this.
jordi is offline   Reply With Quote
Old 07-29-2010, 02:37 AM   #32
rdeborja
Member
 
Location: Toronto

Join Date: Aug 2008
Posts: 42
Default

We've started using BED Tools as part of our pipeline and so far it's been good. We calculate 2 types of coverage, non-collapsed (total mapped reads x read length / genome size) and collapsed coverage (total mapped reads with PCR duplicates removed x read length / genome size).

The tough part is identifying those reads aligning to multiple locations. What do people do in that case? Are people using uniquely aligned reads only and using a mappable genome size (different read lengths will change this number). We use Picard to collapse our reads. If you use BEDTools, you'll need to remove the duplicates since BEDTools doesn't seem to do a flag check for the read.
rdeborja is offline   Reply With Quote
Old 12-01-2010, 07:44 PM   #33
fhb
Member
 
Location: illinois

Join Date: Aug 2010
Posts: 10
Default

Quote:
Originally Posted by quinlana View Post
Hi,
As ewilbanks suggested, BEDTools will do this for you.

http://people.virginia.edu/~arq5x/bedtools.html

If you want to compute coverage for "bins/windows" that march along the genome, you would use coverageBed. Let's say you've created a BED file called windows.bed for 10Kb windows across your genome and it looks like this (note BEDTools uses UCSC 0-based starts):
chr1 0 10000
chr1 9999 20000
...

Now, you also have a bed file of sequence reads called reads.bed. The following command with calculate for each window in windows.bed:
1) The number of reads in reads.bed that overlap the window
2) Coverage density (i.e. the fraction of base pairs in window.bed that are covered by >= 1 read in reads.bed)

coverageBed -a reads.bed -b windows.bed | sortBed -i stdin > windows.bed.coverage

Sample output:
<chr> <start> <end> <#reads> <# window bases covered> <windowSize> <fraction of window covered>
chr1 0 10000 0 0 10000 0
chr1 9999 20000 33 1000 10000 0.10

I hope this helps. If you need a script to create a BED file with windows of a given size, just let me know.

Best,
Aaron
Hi Dr. Quinlan,

Thanks for the bedtools. I am using the genomeCoverageBed, but I would like to use the coverageBed.
I am not sure if you have posted this somewhere else, but I would appreciate if you could provide this code that you offered that creates a BED file with intervals of a given size.
Thanks in advance,
Best
Fernando
fhb is offline   Reply With Quote
Old 11-17-2011, 05:52 PM   #34
msutada@gmail.com
Junior Member
 
Location: japan

Join Date: May 2011
Posts: 6
Default

Hi Jonathan,

About R script above, I have tried and go these error. Any suggestion is great as I'm R beginner.


> data <-read.table(file="./illusmp454merCbed",sep="\t",header=F)
> colnames(data)<-c("Scaffold","sca_position","coverage")
> depth<-mean(data[,"coverage'])
+ #depth now has the mean (overall)coverage
+ #set the bin-size
+ window<-1001
+ rangefrom<-0
+ rangeto<-length(data[,"sca_position"])
Error: unexpected symbol in:
"rangefrom<-0
rangeto<-length(data[,"sca_position"
> data.1kb<-runmed(data[,"coverage"],k=window)
Error in as.integer(k) :
cannot coerce type 'closure' to vector of type 'integer'
> png(file="cov_1k.png",width=400,height=1000)
> plot(x=data[rangefrom:rangeto,"sca_position"],y=data.1kb[rangefrom:rangeto],pch=".",cex1,xlab="bp position",ylab="depth",type="p")
Error in `[.data.frame`(data, rangefrom:rangeto, "sca_position") :
object 'rangefrom' not found
> dev.off()
msutada@gmail.com is offline   Reply With Quote
Old 11-20-2011, 07:29 AM   #35
frymor
Senior Member
 
Location: Germany

Join Date: May 2010
Posts: 149
Default

Quote:
Originally Posted by msutada@gmail.com View Post
Hi Jonathan,
> data <-read.table(file="./illusmp454merCbed",sep="\t",header=F)
> colnames(data)<-c("Scaffold","sca_position","coverage")
> depth<-mean(data[,"coverage'])
+ #depth now has the mean (overall)coverage
+ #set the bin-size
+ window<-1001
+ rangefrom<-0
+ rangeto<-length(data[,"sca_position"])
You have two different types of quoting in coverage
It should be like that:
Code:
depth<-mean(data[,"coverage"])
and than you will also define the windows size and the ranges
frymor is offline   Reply With Quote
Old 11-22-2011, 01:21 AM   #36
msutada@gmail.com
Junior Member
 
Location: japan

Join Date: May 2011
Posts: 6
Default

Hi frymor,

Thank you for the suggestion.
I have fix the script and now I got another message from coverPlot.Rout.

> data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=F)
> colnames(data)<-c("Scaffold","sca_position","coverage")
> depth<-mean(data[,"coverage"])
Warning message:
In mean.default(data[, "coverage"]) :
argument is not numeric or logical: returning NA
> #depth now has the mean (overall)coverage
> #set the bin-size
> window<-1001
> rangefrom<-0
> rangeto<-length(data[,"sca_position"])
> data.1kb<-runmed(data[,"coverage"],k=window)
> png(file="cov_1k.png",width=400,height=1000)
coverPlot.Rout

Any suggestion would be great.

Best regards,
Sutada
msutada@gmail.com is offline   Reply With Quote
Old 11-22-2011, 09:48 AM   #37
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 836
Default

Based on the options you've given, "~/q20snpref/illusmp454merCbed" [note no extension] should be a headerless file with fields separated by tabs.

The warning message suggests that you either have a header line in your file (or possibly a comment line), or a non-numeric line in the 'coverage' column.

Just to make sure the input looks like what is expected, what is the output of this command [which displays the first 5 lines of a file]?
Code:
readLines(con = "~/q20snpref/illusmp454merCbed", n = 5)
gringer is offline   Reply With Quote
Old 11-23-2011, 01:05 AM   #38
msutada@gmail.com
Junior Member
 
Location: japan

Join Date: May 2011
Posts: 6
Default

hi gringer,

Thank you very much for suggestion.
I got this with this code.
> readLines(con = "~/q20snpref/illusmp454merCbed", n = 5)
[1] "Scaffold\tsca_position\tcoverage" "Scaffold1\t1\t0"
[3] "Scaffold1\t2\t0" "Scaffold1\t3\t0"
[5] "Scaffold1\t4\t0"

I have also tried
data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T) and I got
> data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T)
Killed
^@^@^Killed.

Should I put # for the header line. Actually illusmp454merCbed is a text file contains 3 column separate by tab. I thought that to run R we need the column name to define the data.

Any suggestion will be great.
Best,
msutada@gmail.com is offline   Reply With Quote
Old 11-23-2011, 01:10 AM   #39
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 836
Default

With a file looking like that, just what you've put should work (i.e. changing the header value to 'TRUE'):

Quote:
data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T)
Is there a particular reason why the reading of the table was killed? How big is the input file? How much memory does your computer have? Did you break out of it manually? How long did it take before the 'Killed' message appeared?
gringer is offline   Reply With Quote
Old 11-24-2011, 03:27 AM   #40
Valerie2011
Junior Member
 
Location: United Kingdom

Join Date: Nov 2011
Posts: 1
Default

Hi you all,

I have a little question concerning the best way to estimate the RNASeq coverage, or more precisely global transcription rate distribution.

To put it simple let's assume a bacterium from which an RNASeq has been performed. For this analysis we work only with the pairs (positive and negative strands) of .wig files, filtering out rRNAs, tRNAs and other known RNAs that are transcribed at very high levels.

Briefly (ommiting a normalization step between the samples) we then calculate a global trancription rate for each strand by simply adding up all the values in the corresponding .wig file and by dividing this sum by the genome size: we get the global average read per nucleotide. Easy. A standard deviation on this mean is also easily calculated at the same time.

My question is, is the standard deviation the best parameter to describe the distribution around this mean? As the majority of the nucleotides are not read a single time and as some transcripts are very abundant this standard deviation is obviously huge compared to the mean (e.g. mean = 8, stdev = 420). Would then the standard error (= stdev / sqrt (genome size)) be more relevant? Or should the 0 values be skipped from the analysis, counting only nucleotides called and their number of reads?

Sorry for not being very good at stats, I forgot most of what I learnt ages ago. Basically the aim is to test whether one strand is being transcribed at (statistically) the same rate as the other... We would prefer not to use fancy software to do this analysis but to do it ourselves "manually" (with a little help of Perl, of course).

Note: we are interested in the transcripts independently of the coding domain sequences and other annotations.

Thanks for any suggestion!
Valerie2011 is offline   Reply With Quote
Reply

Tags
coverage, depth

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:30 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO