Thank you very much for your answer, westerman. You're right, I have no idea about the transcriptome size. It's tissue-specific data from a non-model species. I got 62M from a 454 pyrosequencing run of 8 different organisms (same plate), and I suspect there were too many samples for a single plate. I would like to work out the ideal sample/plate ratio, so I thought coverage could be a good way to estimate this.
We've started using BEDTools as part of our pipeline and so far it's been good. We calculate two types of coverage: non-collapsed (total mapped reads x read length / genome size) and collapsed (total mapped reads with PCR duplicates removed x read length / genome size).
The tough part is identifying reads that align to multiple locations. What do people do in that case? Are people using only uniquely aligned reads together with a mappable genome size (different read lengths will change this number)? We use Picard to collapse our reads. If you use BEDTools, you'll need to remove the duplicates yourself, since BEDTools doesn't seem to check the duplicate flag on each read.
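The two formulas above can be written down directly; here is a minimal sketch (the function name and example numbers are illustrative, not part of any pipeline):

```python
def coverage(mapped_reads, read_length, genome_size):
    # Non-collapsed coverage: total mapped reads x read length / genome size.
    # For collapsed coverage, pass the read count after PCR-duplicate removal.
    return mapped_reads * read_length / genome_size

# e.g. one million 100 bp reads over a 50 Mb genome give 2x coverage
depth = coverage(1_000_000, 100, 50_000_000)
```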
-
Originally posted by quinlana: Hi,
As ewilbanks suggested, BEDTools will do this for you.
If you want to compute coverage for "bins/windows" that march along the genome, you would use coverageBed. Let's say you've created a BED file called windows.bed for 10Kb windows across your genome and it looks like this (note BEDTools uses UCSC 0-based starts):
chr1 0 10000
chr1 9999 20000
...
Now, you also have a BED file of sequence reads called reads.bed. The following command will calculate, for each window in windows.bed:
1) The number of reads in reads.bed that overlap the window
2) Coverage density (i.e. the fraction of base pairs in window.bed that are covered by >= 1 read in reads.bed)
coverageBed -a reads.bed -b windows.bed | sortBed -i stdin > windows.bed.coverage
Sample output:
<chr> <start> <end> <#reads> <# window bases covered> <windowSize> <fraction of window covered>
chr1 0 10000 0 0 10000 0
chr1 9999 20000 33 1000 10000 0.10
I hope this helps. If you need a script to create a BED file with windows of a given size, just let me know.
Best,
Aaron
Thanks for BEDTools. I am using genomeCoverageBed, but I would like to use coverageBed.
I am not sure if you have posted this somewhere else, but I would appreciate it if you could provide the script you offered that creates a BED file with windows of a given size.
Thanks in advance,
Best
Fernando
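In case it helps others landing here: a window-maker like the one offered above can be sketched in a few lines (this is a hypothetical stand-in, not Aaron's actual script; chromosome names and sizes come from you):

```python
def make_windows(chrom_sizes, window=10000):
    # Emit 0-based, half-open BED windows of the given size per chromosome;
    # the last window is truncated at the chromosome end.
    lines = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, window):
            lines.append(f"{chrom}\t{start}\t{min(start + window, size)}")
    return lines

# e.g. print a windows.bed for a single 25 kb scaffold
print("\n".join(make_windows({"chr1": 25000})))
```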
-
Hi Jonathan,
Regarding the R script above, I tried it and got these errors. Any suggestion would be great, as I'm an R beginner.
> data <-read.table(file="./illusmp454merCbed",sep="\t",header=F)
> colnames(data)<-c("Scaffold","sca_position","coverage")
> depth<-mean(data[,"coverage'])
+ #depth now has the mean (overall)coverage
+ #set the bin-size
+ window<-1001
+ rangefrom<-0
+ rangeto<-length(data[,"sca_position"])
Error: unexpected symbol in:
"rangefrom<-0
rangeto<-length(data[,"sca_position"
> data.1kb<-runmed(data[,"coverage"],k=window)
Error in as.integer(k) :
cannot coerce type 'closure' to vector of type 'integer'
> png(file="cov_1k.png",width=400,height=1000)
> plot(x=data[rangefrom:rangeto,"sca_position"],y=data.1kb[rangefrom:rangeto],pch=".",cex1,xlab="bp position",ylab="depth",type="p")
Error in `[.data.frame`(data, rangefrom:rangeto, "sca_position") :
object 'rangefrom' not found
> dev.off()
-
Originally posted by [email protected]: Hi Jonathan,
> data <-read.table(file="./illusmp454merCbed",sep="\t",header=F)
> colnames(data)<-c("Scaffold","sca_position","coverage")
> depth<-mean(data[,"coverage'])
+ #depth now has the mean (overall)coverage
+ #set the bin-size
+ window<-1001
+ rangefrom<-0
+ rangeto<-length(data[,"sca_position"])
It should be like this (note the matching double quotes around "coverage"):
Code: depth<-mean(data[,"coverage"])
-
Hi frymor,
Thank you for the suggestion.
I have fixed the script, and now I get another message in coverPlot.Rout.
> data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=F)
> colnames(data)<-c("Scaffold","sca_position","coverage")
> depth<-mean(data[,"coverage"])
Warning message:
In mean.default(data[, "coverage"]) :
argument is not numeric or logical: returning NA
> #depth now has the mean (overall)coverage
> #set the bin-size
> window<-1001
> rangefrom<-0
> rangeto<-length(data[,"sca_position"])
> data.1kb<-runmed(data[,"coverage"],k=window)
> png(file="cov_1k.png",width=400,height=1000)
coverPlot.Rout
Any suggestion would be great.
Best regards,
Sutada
-
Based on the options you've given, "~/q20snpref/illusmp454merCbed" [note no extension] should be a headerless file with fields separated by tabs.
The warning message suggests that you either have a header line in your file (or possibly a comment line), or a non-numeric line in the 'coverage' column.
Just to make sure the input looks like what is expected, what is the output of this command [which displays the first 5 lines of a file]?
Code: readLines(con = "~/q20snpref/illusmp454merCbed", n = 5)
-
Hi gringer,
Thank you very much for the suggestion.
This is what I got with that code:
> readLines(con = "~/q20snpref/illusmp454merCbed", n = 5)
[1] "Scaffold\tsca_position\tcoverage" "Scaffold1\t1\t0"
[3] "Scaffold1\t2\t0" "Scaffold1\t3\t0"
[5] "Scaffold1\t4\t0"
I have also tried
data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T) and I got
> data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T)
Killed
^@^@^Killed.
Should I put a # in front of the header line? Actually, illusmp454merCbed is a text file containing 3 columns separated by tabs. I thought that to run R we needed the column names to define the data.
Any suggestion would be great.
Best,
-
With a file looking like that, just what you've put should work (i.e. changing the header value to 'TRUE'):
data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T)
-
Hi you all,
I have a little question concerning the best way to estimate the RNASeq coverage, or more precisely global transcription rate distribution.
To put it simply, let's assume a bacterium on which RNA-Seq has been performed. For this analysis we work only with the pairs (positive and negative strand) of .wig files, filtering out rRNAs, tRNAs and other known RNAs that are transcribed at very high levels.
Briefly (omitting a normalization step between the samples), we then calculate a global transcription rate for each strand by simply adding up all the values in the corresponding .wig file and dividing this sum by the genome size: we get the global average reads per nucleotide. Easy. A standard deviation on this mean is also easily calculated at the same time.
My question is: is the standard deviation the best parameter to describe the distribution around this mean? As the majority of the nucleotides are never read even once, and as some transcripts are very abundant, this standard deviation is obviously huge compared to the mean (e.g. mean = 8, stdev = 420). Would the standard error (= stdev / sqrt(genome size)) be more relevant? Or should the 0 values be skipped from the analysis, counting only nucleotides that were called and their numbers of reads?
Sorry for not being very good at stats; I forgot most of what I learnt ages ago. Basically the aim is to test whether one strand is being transcribed at (statistically) the same rate as the other... We would prefer not to use fancy software for this analysis but to do it ourselves "manually" (with a little help from Perl, of course).
Note: we are interested in the transcripts independently of the coding domain sequences and other annotations.
Thanks for any suggestion!
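For what it's worth, the per-strand calculation described above can be sketched as follows (a sketch in Python rather than Perl; `values` stands for the per-base counts read out of one strand's .wig file, with positions absent from the file counted as zeros):

```python
import math

def strand_stats(values, genome_size):
    # Global mean reads per nucleotide: sum of all per-base values
    # divided by the genome size (positions missing from the .wig
    # file contribute zero to the sum but still count in the size).
    mean = sum(values) / genome_size
    # Population variance over all genome positions, implicit zeros included.
    var = (sum(v * v for v in values) - genome_size * mean * mean) / genome_size
    return mean, math.sqrt(var)

# four covered positions out of a toy 8 bp "genome"
mean, sd = strand_stats([8, 0, 4, 20], genome_size=8)
```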
-
You're going to get a huge variation in expression for all transcripts, such that calculating a 'global deviation' doesn't really make sense. If you do want to do it globally, I'd recommend log-transforming the coverage values before working out mean coverage and deviation, because that transformation seems to better fit a normal curve.
On a per-gene (or per-isoform) basis, I've been calculating coverage using CV (i.e. SD / mean) as an indication of variation. With about 20 genes that I did this for, the CV ranged from about 40% to 400%, but anything above about 140% appeared to be a mis-mapped transcript.
It might be useful to only do this for expressed regions (e.g. coverage > 15) to attempt to control for these bad mappings.
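The per-gene CV described above is a one-liner; a minimal sketch using only the standard library (`pstdev` is the population SD):

```python
import statistics

def cv_percent(coverages):
    # Coefficient of variation (SD / mean) of per-base coverage, in percent;
    # values above ~140% flagged mis-mapped transcripts in the post above.
    return 100 * statistics.pstdev(coverages) / statistics.mean(coverages)
```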
-
cufflinks error
Hello all,
When I use Cufflinks without a GTF file for bacterial RNA-seq, I get the following result:
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
That is all there is in the result file.
I don't know why there are no values. Can anyone help me?
Thanks, best wishes.
-
Originally posted by fhb: Hi Dr. Quinlan,
Thanks for BEDTools. I am using genomeCoverageBed, but I would like to use coverageBed.
I am not sure if you have posted this somewhere else, but I would appreciate it if you could provide the script you offered that creates a BED file with windows of a given size.
Thanks in advance,
Best
Fernando
Thanks
-
Using BEDTools and the UCSC Genome Browser, I am trying to get a picture like this (a histogram of the bedgraph in wiggle form).
So far I have used a sorted BAM on which genomeCoverageBed was run with the -bg and -ibam options and -g mm8.fa.fai as the index.
The resulting bedgraph was uploaded to the UCSC Genome Browser and produced a "stripe" track which has the numbers 1, 2, 3 and so on before each stripe (where do I find the meaning of these numbers?).
Then I added a first line to the bedgraph file like this:
track type=bedGraph name="set11bamSorted" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
and uploaded the new file.
This time I am getting an error:
"Error File 'set11_fixed.bedgraph' - Error line 1224705 of custom track: chromEnd larger than chrom chr13 size (114176901 > 114142980)"
Why should I get this error this time, while the contents of the bedgraph file are exactly the same as before? Is this the correct way to get a wiggle histogram like the one in the first picture?
Below is the command that I used to produce the original bedgraph file:
genomeCoverageBed -bg -ibam ~/set11/tophatOut11/set11TopHataccepted_hits.Sorted.bam -g /opt/bowtie/indexes/mm8.fa.fai > ~/set11/set11.bedgraph &
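One way to track down the offending intervals before uploading is to compare each bedGraph line against the chromosome lengths in the .fai index (a hypothetical sanity check reproducing the comparison in UCSC's error message, not a BEDTools or UCSC tool):

```python
def oversized_intervals(bedgraph_lines, fai_lines):
    # Chromosome sizes come from the first two columns of the .fai index.
    sizes = {}
    for line in fai_lines:
        name, size = line.split("\t")[:2]
        sizes[name] = int(size)
    # Return 1-based line numbers whose chromEnd exceeds the chromosome size.
    bad = []
    for i, line in enumerate(bedgraph_lines, 1):
        if line.startswith("track"):
            continue  # skip the track definition line
        chrom, start, end = line.split("\t")[:3]
        if int(end) > sizes.get(chrom, 0):
            bad.append(i)
    return bad
```

If the list is non-empty with the same .fai you gave genomeCoverageBed, the index and the assembly selected in the browser likely disagree.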
-
I realize this is quite an old thread, but I am currently trying to calculate the per-bp coverage of my NGS data. I am new to NGS and bioinformatics, so my apologies if this does not make sense...
For quality control I would like to determine a sensible minimum and maximum coverage to exclude from my downstream analyses; however, I cannot find a "gold standard" for such purposes. It seems to me that if you are mapping to a reference genome and there are regions with more than twice the average coverage, that is probably the result of a duplication or something similar in the genome of the sequenced organism. Likewise, if a region has very poor coverage, the organism likely does not have that region in its genome and the mapping there is probably spurious. As such, I would like to exclude these regions before performing genome-wide population genetic analyses. Does anyone have suggestions on what cutoffs to use and how to go about applying them?
Thanks in advance!
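There is no gold standard as far as I know, but one common heuristic matching the reasoning above is to drop sites below half and above twice the mean depth; a sketch (the fractions are illustrative defaults, not established thresholds):

```python
import statistics

def coverage_cutoffs(depths, low_frac=0.5, high_frac=2.0):
    # Return (min, max) per-base depth bounds as fractions of the mean;
    # sites outside these bounds would be excluded downstream.
    mean = statistics.mean(depths)
    return low_frac * mean, high_frac * mean

low, high = coverage_cutoffs([10, 20, 30])
```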