  • #31
    Thank you very much for your answer, westerman. You're right, I have no idea about the transcriptome size. It's tissue-specific data from a non-model species. I got 62M from a 454 pyrosequencing run of 8 different organisms (same plate), and I suspect there were too many samples for a single plate. I would like to work out the ideal sample-to-plate ratio, so I thought coverage could be a good way to estimate it.



    • #32
      We've started using BEDTools as part of our pipeline and so far it's been good. We calculate two types of coverage: non-collapsed (total mapped reads × read length / genome size) and collapsed (total mapped reads with PCR duplicates removed × read length / genome size).

      The tough part is handling reads that align to multiple locations. What do people do in that case? Do you use only uniquely aligned reads together with a mappable genome size (different read lengths will change that number)? We use Picard to collapse our reads. If you use BEDTools, you'll need to remove the duplicates yourself first, since BEDTools doesn't appear to check the duplicate flag on each read.
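      As a back-of-envelope check of the non-collapsed formula above (the read count, read length, and genome size below are made-up illustration values, not from any real run):

```shell
# hypothetical run: 20 M mapped reads of 100 bp against a 3.1 Gb genome
awk 'BEGIN { reads = 20000000; readlen = 100; genome = 3.1e9;
             printf "%.2fx\n", reads * readlen / genome }'
```

      The collapsed figure is the same arithmetic with the duplicate-removed read count substituted in.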



      • #33
        Originally posted by quinlana View Post
        Hi,
        As ewilbanks suggested, BEDTools will do this for you.



        If you want to compute coverage for "bins/windows" that march along the genome, you would use coverageBed. Let's say you've created a BED file called windows.bed for 10Kb windows across your genome and it looks like this (note BEDTools uses UCSC 0-based starts):
        chr1 0 10000
        chr1 9999 20000
        ...

        Now, you also have a BED file of sequence reads called reads.bed. The following command will calculate, for each window in windows.bed:
        1) The number of reads in reads.bed that overlap the window
        2) Coverage density (i.e. the fraction of base pairs in windows.bed that are covered by >= 1 read in reads.bed)

        coverageBed -a reads.bed -b windows.bed | sortBed -i stdin > windows.bed.coverage

        Sample output:
        <chr> <start> <end> <#reads> <# window bases covered> <windowSize> <fraction of window covered>
        chr1 0 10000 0 0 10000 0
        chr1 9999 20000 33 1000 10000 0.10

        I hope this helps. If you need a script to create a BED file with windows of a given size, just let me know.

        Best,
        Aaron
        Hi Dr. Quinlan,

        Thanks for BEDTools. I am using genomeCoverageBed, but I would like to use coverageBed.
        I am not sure if you have posted it somewhere else, but I would appreciate it if you could provide the script you offered that creates a BED file with intervals of a given size.
        Thanks in advance,
        Best
        Fernando
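        Newer BEDTools releases ship a `makewindows` subcommand for exactly this (`bedtools makewindows -g genome.txt -w 10000`). Where that is unavailable, an awk one-liner does the same job; the chromosome names and sizes in `genome.txt` below are toy values for illustration:

```shell
# genome file: one "<chrom><TAB><size>" line per chromosome (toy values)
printf 'chr1\t25000\nchr2\t12000\n' > genome.txt

# emit non-overlapping 10 kb windows in 0-based, half-open BED coordinates
awk -v w=10000 'BEGIN{OFS="\t"}
  {for (s = 0; s < $2; s += w) {
     e = s + w; if (e > $2) e = $2   # clip the last window at the chromosome end
     print $1, s, e
   }}' genome.txt > windows.bed

cat windows.bed
```

        The resulting windows.bed can be fed straight to coverageBed as the -b file, as in the quoted post.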



        • #34
          Hi Jonathan,

          About the R script above: I have tried it and got these errors. Any suggestion is welcome, as I'm an R beginner.


          > data <-read.table(file="./illusmp454merCbed",sep="\t",header=F)
          > colnames(data)<-c("Scaffold","sca_position","coverage")
          > depth<-mean(data[,"coverage'])
          + #depth now has the mean (overall)coverage
          + #set the bin-size
          + window<-1001
          + rangefrom<-0
          + rangeto<-length(data[,"sca_position"])
          Error: unexpected symbol in:
          "rangefrom<-0
          rangeto<-length(data[,"sca_position"
          > data.1kb<-runmed(data[,"coverage"],k=window)
          Error in as.integer(k) :
          cannot coerce type 'closure' to vector of type 'integer'
          > png(file="cov_1k.png",width=400,height=1000)
          > plot(x=data[rangefrom:rangeto,"sca_position"],y=data.1kb[rangefrom:rangeto],pch=".",cex1,xlab="bp position",ylab="depth",type="p")
          Error in `[.data.frame`(data, rangefrom:rangeto, "sca_position") :
          object 'rangefrom' not found
          > dev.off()



          • #35
            Originally posted by [email protected] View Post
            Hi Jonathan,
            > data <-read.table(file="./illusmp454merCbed",sep="\t",header=F)
            > colnames(data)<-c("Scaffold","sca_position","coverage")
            > depth<-mean(data[,"coverage'])
            + #depth now has the mean (overall)coverage
            + #set the bin-size
            + window<-1001
            + rangefrom<-0
            + rangeto<-length(data[,"sca_position"])
            You have two different types of quoting around coverage.
            It should be like this:
            Code:
            depth<-mean(data[,"coverage"])
            and then you can also define the window size and the ranges



            • #36
              Hi frymor,

              Thank you for the suggestion.
              I have fixed the script and now I get another message in coverPlot.Rout.

              > data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=F)
              > colnames(data)<-c("Scaffold","sca_position","coverage")
              > depth<-mean(data[,"coverage"])
              Warning message:
              In mean.default(data[, "coverage"]) :
              argument is not numeric or logical: returning NA
              > #depth now has the mean (overall)coverage
              > #set the bin-size
              > window<-1001
              > rangefrom<-0
              > rangeto<-length(data[,"sca_position"])
              > data.1kb<-runmed(data[,"coverage"],k=window)
              > png(file="cov_1k.png",width=400,height=1000)
              coverPlot.Rout

              Any suggestion would be great.

              Best regards,
              Sutada



              • #37
                Based on the options you've given, "~/q20snpref/illusmp454merCbed" [note no extension] should be a headerless file with fields separated by tabs.

                The warning message suggests that you either have a header line in your file (or possibly a comment line), or a non-numeric value in the 'coverage' column.

                Just to make sure the input looks like what is expected, what is the output of this command [which displays the first 5 lines of a file]?
                Code:
                readLines(con = "~/q20snpref/illusmp454merCbed", n = 5)
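                The same sanity check can also be done from the shell; `cat -et` makes tabs visible as `^I` and line ends as `$`, which quickly shows whether the file really is tab-separated and whether a header line is present. (The file below is a two-line stand-in for the real one.)

```shell
# stand-in for ~/q20snpref/illusmp454merCbed
printf 'Scaffold\tsca_position\tcoverage\nScaffold1\t1\t0\n' > illusmp454merCbed

head -n 5 illusmp454merCbed | cat -et   # tabs show as ^I, line ends as $
```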



                • #38
                  hi gringer,

                  Thank you very much for the suggestion.
                  This is what I got with that command:
                  > readLines(con = "~/q20snpref/illusmp454merCbed", n = 5)
                  [1] "Scaffold\tsca_position\tcoverage" "Scaffold1\t1\t0"
                  [3] "Scaffold1\t2\t0" "Scaffold1\t3\t0"
                  [5] "Scaffold1\t4\t0"

                  I have also tried
                  data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T) and I got
                  > data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T)
                  Killed
                  ^@^@^Killed.

                  Should I put a # in front of the header line? illusmp454merCbed is actually a text file containing 3 columns separated by tabs. I thought that to work with the data in R we need the column names to refer to the columns.

                  Any suggestion will be great.
                  Best,



                  • #39
                    With a file looking like that, just what you've put should work (i.e. changing the header value to 'TRUE'):

                    data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T)
                    Is there a particular reason why the reading of the table was killed? How big is the input file? How much memory does your computer have? Did you break out of it manually? How long did it take before the 'Killed' message appeared?
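                    A per-base coverage table easily runs to millions of rows, and read.table holds all of it in memory, so an out-of-memory kill is a likely culprit here. One pragmatic workaround (a sketch over generated toy data; the real file and row count differ) is to thin the table before loading it, since a coverage plot does not need every base:

```shell
# build a toy 1000-row per-base coverage table with a header
printf 'Scaffold\tsca_position\tcoverage\n' > cov.tsv
seq 1 1000 | awk 'BEGIN{OFS="\t"} {print "Scaffold1", $1, $1 % 7}' >> cov.tsv

# keep the header plus every 100th position: ~100-fold smaller input for R
awk 'NR == 1 || (NR - 1) % 100 == 0' cov.tsv > cov.thin.tsv
wc -l < cov.thin.tsv
```

                    read.table on the thinned file then needs roughly 1% of the memory, at the cost of plotting a subsample rather than every base.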



                    • #40
                      Hi you all,

                      I have a little question concerning the best way to estimate the RNASeq coverage, or more precisely global transcription rate distribution.

                      To put it simply, let's assume a bacterium on which RNA-seq has been performed. For this analysis we work only with the pairs (positive and negative strands) of .wig files, filtering out rRNAs, tRNAs and other known RNAs that are transcribed at very high levels.

                      Briefly (omitting a normalization step between the samples), we then calculate a global transcription rate for each strand by simply adding up all the values in the corresponding .wig file and dividing this sum by the genome size: we get the global average reads per nucleotide. Easy. A standard deviation around this mean is also easily calculated at the same time.

                      My question is: is the standard deviation the best parameter to describe the distribution around this mean? As the majority of the nucleotides are not read even once, and some transcripts are very abundant, this standard deviation is huge compared to the mean (e.g. mean = 8, stdev = 420). Would the standard error (= stdev / sqrt(genome size)) be more relevant? Or should the 0 values be excluded from the analysis, counting only nucleotides covered at least once?

                      Sorry for not being very good at stats; I forgot most of what I learnt ages ago. Basically the aim is to test whether one strand is transcribed at (statistically) the same rate as the other. We would prefer not to use fancy software for this analysis but to do it ourselves "manually" (with a little help from Perl, of course).

                      Note: we are interested in the transcripts independently of the coding domain sequences and other annotations.

                      Thanks for any suggestion!
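                      The per-strand mean and standard deviation described above can be computed in one pass (a sketch over toy per-base values standing in for one strand's .wig track; note this is the population SD, dividing by n):

```shell
# toy per-base read counts for one strand (mostly zeros, one hot spot)
printf '0\n0\n8\n0\n32\n0\n0\n0\n' > plus.cov

awk '{s += $1; ss += $1 * $1; n++}
     END {m = s / n; printf "mean=%.2f sd=%.2f\n", m, sqrt(ss / n - m * m)}' plus.cov
```

                      Even in this tiny example the SD dwarfs the mean, for exactly the reason the post gives (mostly-zero positions plus a few very hot ones), which is why mean ± SD describes such a distribution poorly.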



                      • #41
                        You're going to get a huge variation in expression for all transcripts, such that calculating a 'global deviation' doesn't really make sense. If you do want to do it globally, I'd recommend log-transforming the coverage values before working out mean coverage and deviation, because that transformation seems to better fit a normal curve.

                        On a per-gene (or per-isoform) basis, I've been calculating coverage using CV (i.e. SD / mean) as an indication of variation. With about 20 genes that I did this for, the CV ranged from about 40% to 400%, but anything above about 140% appeared to be a mis-mapped transcript.

                        It might be useful to only do this for expressed regions (e.g. coverage > 15) to attempt to control for these bad mappings.
                        Last edited by gringer; 11-24-2011, 04:38 AM.
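                        The per-gene CV calculation above can be sketched in one awk pass over a per-base coverage table (the gene names and depth values below are made up for illustration):

```shell
# columns: gene, per-base depth (toy values)
printf 'geneA\t10\ngeneA\t12\ngeneA\t8\ngeneB\t0\ngeneB\t40\n' > genecov.tsv

# CV = SD / mean, reported per gene as a percentage
awk '{s[$1] += $2; ss[$1] += $2 * $2; n[$1]++}
     END {for (g in s) {m = s[g] / n[g];
          printf "%s\tCV=%.0f%%\n", g, 100 * sqrt(ss[g] / n[g] - m * m) / m}}' genecov.tsv > cv.out

cat cv.out
```

                        By the rule of thumb in the post, geneB's CV of 100% is approaching the suspect range, while geneA's ~16% looks clean.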



                        • #42
                          cufflinks Error

                          Hello all,
                          When I use cufflinks without a GTF file for bacterial RNA-seq, I get the following result:
                          tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
                          That header line is all there is in the result file.
                          I don't know why there are no values. Can anyone help me?
                          Thanks, best wishes.



                          • #43
                            Originally posted by fhb View Post
                            Hi Dr. Quinlan,

                            Thanks for the bedtools. I am using the genomeCoverageBed, but I would like to use the coverageBed.
                            I am not sure if you have posted this somewhere else, but I would appreciate if you could provide this code that you offered that creates a BED file with intervals of a given size.
                            Thanks in advance,
                            Best
                            Fernando
                            If possible I would like that too.

                            Thanks



                            • #44
                              By using BEDtools and the UCSC Genome Browser I am trying to get a picture like this (a histogram of the bedgraph in wiggle form).

                              So far I have used a SORTED BAM, on which genomeCoverageBed was run with the -bg -ibam options and -g mm8.fa.fai as the index.

                              The resulting bedgraph was uploaded to the UCSC GB and produced this "stripe" track, which has the numbers 1, 2, 3 and so on before each stripe (where do I find the meaning of these numbers?).

                              Then I added a first line to the bedgraph file like this:
                              track type=bedGraph name="set11bamSorted" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20

                              and uploaded the new file.

                              This time I am getting an error:
                              "Error File 'set11_fixed.bedgraph' - Error line 1224705 of custom track: chromEnd larger than chrom chr13 size (114176901 > 114142980)"

                              Why do I get this error this time, while the contents of the bedgraph file are exactly the same as before? Is this the correct way to get a wiggle histogram like the one in the first picture?

                              Below you may find the code that I used to get the original bedgraph file

                              genomeCoverageBed -bg -ibam ~/set11/tophatOut11/set11TopHataccepted_hits.Sorted.bam -g /opt/bowtie/indexes/mm8.fa.fai > ~/set11/set11.bedgraph &
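                              The usual cause of that browser error is that the chromosome sizes in the .fai used for genomeCoverageBed don't match the assembly the browser has: the error itself says the browser's chr13 is 114,142,980 bp while the track reaches 114,176,901, so the first thing to verify is that the BAM was really aligned against UCSC mm8. If only a few intervals overrun the ends, a sketch like this clips them against the browser's chrom sizes (toy file names and values below; the track line must be kept out of the awk input and re-added afterwards):

```shell
# toy stand-ins: browser chrom sizes and a bedGraph with one overrunning interval
printf 'chr13\t114142980\n' > chrom.sizes
printf 'chr13\t100\t200\t3\nchr13\t114142900\t114176901\t1\n' > toy.bedgraph

# first file loads the size table; second file is clipped/filtered against it
awk 'BEGIN{OFS="\t"}
     NR == FNR {len[$1] = $2; next}
     $3 > len[$1] {$3 = len[$1]}          # clip ends past the chromosome length
     $2 < len[$1]' chrom.sizes toy.bedgraph > toy.clipped.bedgraph

cat toy.clipped.bedgraph
```

                              That said, if more than a handful of lines trip the check, regenerating the coverage from a BAM aligned to the browser's own assembly is the safer fix than clipping.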



                              • #45
                                I realize this is quite an old thread, but I am currently trying to calculate the per-bp coverage of my NGS data. I am new to NGS and bioinformatics, so my apologies if this does not make sense...

                                For quality control I would like to determine a sensible minimum and maximum coverage to exclude from my downstream analyses; however, I cannot find a "gold standard" for this. It seems to me that if you are mapping to a reference genome and a region has more than twice the average coverage, that is probably the result of a duplication or something similar in the genome of the sequenced organism. Likewise, if a region has very poor coverage, the organism likely does not have that region in its genome and the mapped reads there are likely the result of improper mapping. As such, I would like to exclude these regions before performing genome-wide population genetic analyses. Does anyone have suggestions on what cutoffs to use and how to go about applying them?

                                Thanks in advance!
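                                There is no universal gold standard, but one common heuristic matches what is described above: drop sites below roughly half and above roughly twice the genome-wide mean depth (some pipelines instead use mean plus a few SDs for the upper bound). A two-pass sketch over a toy per-base depth table (column layout and thresholds are illustrative, not a standard):

```shell
# toy per-base table: chrom, position, depth
printf 'chr1\t1\t10\nchr1\t2\t12\nchr1\t3\t50\nchr1\t4\t2\nchr1\t5\t11\n' > cov.tsv

# pass 1: genome-wide mean depth; pass 2: keep sites within [0.5x, 2x] of it
mean=$(awk '{s += $3} END {printf "%.4f", s / NR}' cov.tsv)
awk -v m="$mean" '$3 >= 0.5 * m && $3 <= 2 * m' cov.tsv > cov.keep.tsv
cat cov.keep.tsv
```

                                The surviving positions can then be merged into BED intervals and used as an inclusion mask for the population genetic analyses.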
