SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
htseq-count for sam and gff3 sofia17 RNA Sequencing 45 11-04-2016 03:32 PM
htseq-count paolo.kunder Bioinformatics 10 10-22-2014 04:45 AM
Strange error when using htseq-count shhuang Bioinformatics 13 11-19-2012 12:40 AM
htseq-count output Palgrave Bioinformatics 7 03-05-2012 07:04 AM
htseq count missing last mapped fragment mapardo RNA Sequencing 2 11-08-2011 05:27 AM

Reply
 
Thread Tools
Old 03-05-2012, 12:11 PM   #1
Proteos
Junior Member
 
Location: USA

Join Date: Dec 2011
Posts: 6
Default HTSeq-count and BED-formatted coordinates

My task is to estimate the expression of certain unannotated intergenic regions (i.e. not genes) in several tissues.
The answer I need is pretty much Yes/No i.e. whether these regions are expressed or not.

On my previous steps, I used UCSC database and therefore my coordinates are in BED format.
So, I need to:
(1) count the reads mapping to my regions.
(2) calculate expression.

For (1)
I started searching for the package that would accept BED as an input and count the reads.
I found HTSeq. Great package! I really love its simplicity yet powerfulness; also the fact that it is in Python and I can easily incorporate it into my pipeline.
So I modified the script HTSeq-count to accept the BED file as an input instead of GFF and ran it with default parameters:
Code:
$>./tool.htseq_count_by_bed --countout <counts_file> <sam_file> <bed_file>
The script happily read SAM, BED and gave me the standard output of counts
I even created another script which generates the table of format

Code:
ID	tissue1	tissue2	tissue3
id1	1.3	1.2	0.0
id2	0.0	0.0	10
id3	0.5	0.7	0.9

For (2)
I don't know what to use to calculate actual expression (RPKM/FPKM) from this counts table.
I calculated RPKMs manually used maximum of expressions from all tissues, but I don't know how to deal with the 'background RPKM' that must be subtracted.

What would you suggest?

ADDITIONAL QUESTIONS:

* Would you suggest some other way of using HTSeq with BED that is better than mine?
(I tried converting BED to GFF via Galaxy and using Cufflinks without success. Cufflinks is very capricious)

* What would be your preferred way to go further i.e. estimate the actual expressions from HTSeq package?
I would prefer to stay in Python and not to use R, if possible.

Thanks,

Proteos
Proteos is offline   Reply With Quote
Old 03-07-2012, 05:43 AM   #2
Proteos
Junior Member
 
Location: USA

Join Date: Dec 2011
Posts: 6
Default

Is there anyone who can help me?
Proteos is offline   Reply With Quote
Old 03-07-2012, 06:39 AM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

RNA-Seq read counts usually show quite good proportionality to the concentration of transcripts, as several authors demonstrated with spike-ins. Hence, the counts are a good measure of expression strength, once you normalize for sequencing depth. For more information, see this thread, e.g. my post #13.

It might also be appropriate to divide by transcript length. If you have multiple isoforms, figuring out what length to use is a highly non-trivial (but nevertheless often ignored) problem.

There are bias effects due to CG content and transcript length etc. which you may want to look into, but this is not always that important.
Simon Anders is offline   Reply With Quote
Old 03-08-2012, 05:29 AM   #4
Proteos
Junior Member
 
Location: USA

Join Date: Dec 2011
Posts: 6
Default

Thank you very much Simon!
You saved me!

So, I've read your 'post #13' and used the function estimateSizeFactors() to estimate the factors instead of using the library sizes.
The full logics is as follows:
* get counts
* get newCountDataSet from counts and conditions (in my case simply the samples)
* calculate size factors
* calculate base means

I realized that by dividing count on sizeFactor you get what would be 'baseMeanA' if you did nbinomTest().
(The baseMeanB is simply the second condition's baseMean)
Although I don't know how to get 'baseMean' i.e. the first column in nbinomTest's result and whether it is the one I need.

I guess not, and what I need is dividing counts by sizeFactor i.e. baseMeanA. Is not it correct?


So, with my scarce R knowledge and with a help of Google, I created a little script that I am sharing:

Code:
#!/usr/bin/env Rscript
# Accepts counts file as an input ( fileIn )
# Outputs the file with baseMeans (fileOut)
#
# In the R shell, create variable:
# fileIn = "filename.counts"
# and run this script
# example:
# fileIn="counts_hs.tab"
# source("tool.counts_to_bmeans.r")

fileOut=paste(fileIn,"_bmeans.csv",sep="")

library( DESeq )

# read the counts data from file
countsTable <- read.delim( fileIn, header=TRUE, stringsAsFactors=TRUE )

# Convert column 1 to row names
rownames( countsTable ) <- countsTable$name
countsTable <- countsTable[ , -1 ]

# Get conditions from column names
# 'conds' should determine conditions, but in our case when every sample is separate, it is the same as 'samples'
conds <- colnames(countsTable)	

# Calculate factors
cds <- newCountDataSet( countsTable, conds )
cds <- estimateSizeFactors( cds )	# Factors to normalize from count data

# Calculate baseMeans
nfeatures <- nrow(assayData(cds)$counts)
nfactors  <- NROW(pData(cds)$sizeFactor)
mfactors <- matrix(pData(cds)$sizeFactor,nfeatures,nfactors, byrow=TRUE)
bmeans <- assayData(cds)$counts / mfactors

# Output to tab file
#write.table(bmeans, file="cn.tab", sep="\t")


# Output to CSV file
write.csv(bmeans, file=fileOut)
I've checked the script. It is working correctly.
(compared the results to baseMeanA of nbinomTest() results and they are the same.)

Now I will take these basemeans data and simply divide by the summary length of exons of the corresponding gene.
( because my case is simple: I don't need the particular isoform expression; I just need pretty much Yes/No value.)

Right now, I will simply import basemeans to Excel and do the length normalization there. Later on, maybe I will do it in R or better in python and will share that too

Please let me know if I am doing something wrong.
I hope not, because I have to present these data next Wednesday before my group and I don't have much time

Thanks again!

P.S. I think it would be a good idea to create Python script for all this.
If you are interested and think it is worth, at some point when it is more mature, one might even add it to your HTSeq package; I will be more than glad to share what I have
Proteos is offline   Reply With Quote
Old 03-08-2012, 05:39 AM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

The 'baseMeans' itself is just the mean of the normalized counts over all samples, ignoring the condition.

If you want to get rid of your script's dependency on R, just reimplement the scheme I explained in the post in Python. With numpy, this should be just three or four lines.
Simon Anders is offline   Reply With Quote
Old 03-08-2012, 05:47 AM   #6
Proteos
Junior Member
 
Location: USA

Join Date: Dec 2011
Posts: 6
Default

And what would you say about the algorithm and logics?
Is everything correct or I am missing some point?


Right now, because the task is simple, it is not worth porting to Python and R is fine.
(I have very limited time in these weeks)

But at some point, when I am more 'hardcore HTS guy', I will probably try to do that
Proteos is offline   Reply With Quote
Old 03-08-2012, 05:52 AM   #7
Proteos
Junior Member
 
Location: USA

Join Date: Dec 2011
Posts: 6
Default

Oops, forgot to ask:
If I have replicates, should I do something in addition to this? (variance, dispersion etc)
At least for my simple task?
Proteos is offline   Reply With Quote
Old 10-30-2012, 04:30 PM   #8
ugolino
Member
 
Location: maryland, usa

Join Date: Oct 2011
Posts: 14
Default

Hi Proteos,

I am also interested in read coverage of intergenic regions and have only bed formatted coordinates for those. Could you possibly share or give instruction on how to modify htseq-count to accept a bed format with say 4 columns (chrom, start, end, gene_name)? Bed formatted coordinates are 1-based start and 0-based end, but it is opposite in htseq, correct? I am decent in perl, but clueless in python.

thanks much!
ugolino is offline   Reply With Quote
Old 10-30-2012, 05:09 PM   #9
ugolino
Member
 
Location: maryland, usa

Join Date: Oct 2011
Posts: 14
Default

correction, bed format is 0-based start and 1-based end, so it's the same as htseq.
ugolino is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:05 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO