SEQanswers

Old 03-19-2013, 03:25 AM   #1
dnet
Junior Member
 
Location: Israel

Join Date: Mar 2013
Posts: 1
Getting raw counts needed for DESeq/edgeR from TCGA RSEM files

Hi,

I want to run a differential expression (DE) analysis with DESeq or edgeR on RNA-seq data downloaded from TCGA. I would like to use the Level 3 RNA-seq data, which has already been processed with RSEM.

I wonder whether I can use the "raw_count" column in the un-normalized RSEM output as the raw read counts that DESeq and edgeR require as input.

For example, the raw_count column (the third column) in this file:

Filename:
unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A1-A0SB-01A-11R-A144-07__expression_rsem_gene.txt


barcode                       gene_id      raw_count  scaled_estimate  transcript_id
TCGA-A1-A0SB-01A-11R-A144-07  ?|100130426  0          0                uc011lsn.1
TCGA-A1-A0SB-01A-11R-A144-07  ?|100133144  34.05      1.23812E-06      uc010unu.1,uc010uoa.1


Thanks !
D. N.
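
For readers who want to inspect the column in question, here is a minimal Python sketch of pulling gene_id and raw_count out of one such file. It assumes the file is tab-delimited (the post only shows whitespace), and the function name read_raw_counts is made up for illustration. The two data rows are the ones quoted above. It does not settle whether these values are valid DESeq/edgeR input; it only shows that they can be fractional.

```python
import csv
import io

# The two data rows quoted in the post. The tab delimiter is an assumption.
SAMPLE = (
    "barcode\tgene_id\traw_count\tscaled_estimate\ttranscript_id\n"
    "TCGA-A1-A0SB-01A-11R-A144-07\t?|100130426\t0\t0\tuc011lsn.1\n"
    "TCGA-A1-A0SB-01A-11R-A144-07\t?|100133144\t34.05\t1.23812E-06\t"
    "uc010unu.1,uc010uoa.1\n"
)


def read_raw_counts(handle):
    """Return {gene_id: raw_count} for one gene-level RSEM file (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"]: float(row["raw_count"]) for row in reader}


counts = read_raw_counts(io.StringIO(SAMPLE))

# RSEM reports expected counts, so some values are not whole numbers.
non_integer = {gene: c for gene, c in counts.items() if c != int(c)}
print(counts)
print(non_integer)
```

With a real file, the same function could be called on an open file handle instead of the in-memory sample.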
Old 12-04-2013, 09:19 AM   #2
mmuurr
Junior Member
 
Location: Earth

Join Date: Oct 2013
Posts: 2

i don't have an answer, but i'm curious about essentially the same point.
i believe the TCGA Level 3 RNASeqv2 "unnormalized" data represents the 'raw' RSEM counts, and thus piping this input into edgeR would be fine... but perhaps i'm mistaken.
one point where i've seen conflicting opinions is how edgeR handles non-integer counts, which is what the RSEM output contains.

in a few test cases i've run, i haven't encountered any glaring errors, though i found a HUGE number of differentially expressed genes when comparing prostate cancer samples (both with unmatched cases using exact tests and with a subset of matched cases using a GLM approach to handle the paired samples).
at an FDR threshold of 0.05 (using B-H correction), nearly half the genome qualified as differentially expressed, which -- at first glance -- seemed high to me.
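
As a reminder of what the B-H correction mentioned above does, here is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment. The p-values are made up for illustration and are not from any TCGA analysis; in practice edgeR and R's p.adjust perform this step.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]  # made-up example values
adj = benjamini_hochberg(pvals)

# A gene is called differentially expressed if its adjusted p-value is <= 0.05.
significant = [p <= 0.05 for p in adj]
print(adj)
print(significant)
```

Note that the threshold is applied to the adjusted values, so the fraction of genes called significant depends on the whole p-value distribution and not only on each gene's own test.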
Old 12-05-2013, 12:18 AM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 991

Are you sure that you used counts per gene, and not counts per transcript, as input? RSEM outputs the latter by default, but these are unsuitable (even in principle) as input for downstream differential expression testing. (If you don't know why, see my earlier posts on the subject.)
Old 12-05-2013, 04:28 AM   #4
mmuurr
Junior Member
 
Location: Earth

Join Date: Oct 2013
Posts: 2

yes, the counts are represented at the gene level for the publicly available TCGA RNASeqv2 (unnormalized) data.
while i can't find the verbose output for the execution of their RNASeqv2 pipeline, my guess is that the RSEM mappings to transcripts are collapsed to the individual gene level by summing counts.
so, there are ~20,000 genes represented in their "Level 3" files (lower levels, representing increasingly raw data -- e.g. the reads themselves -- are not all publicly available).

as for the large number of differentially expressed genes, additional reading leads me to believe the non-integer counts do need to be rounded prior to edgeR analyses.
(though even this rounding step has been the focus of some debate on the R/Bioconductor-help mailing list.)
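
The two steps discussed above (collapsing transcript-level counts to genes by summing, then rounding) can be sketched in Python as follows. This only illustrates the guess made in this post, not the documented TCGA pipeline; the transcript and gene names, the counts, and both helper functions are hypothetical.

```python
from collections import defaultdict


def sum_to_gene_level(transcript_counts, transcript_to_gene):
    """Sum transcript-level expected counts into one total per gene."""
    gene_counts = defaultdict(float)
    for transcript, count in transcript_counts.items():
        gene_counts[transcript_to_gene[transcript]] += count
    return dict(gene_counts)


def round_counts(gene_counts):
    """Round fractional counts to integers.

    Python's round() sends exact .5 ties to the nearest even integer.
    """
    return {gene: int(round(count)) for gene, count in gene_counts.items()}


# Hypothetical transcript-level counts and transcript-to-gene mapping.
tx_counts = {"tx1": 20.25, "tx2": 13.80, "tx3": 7.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = sum_to_gene_level(tx_counts, tx2gene)
int_counts = round_counts(gene_counts)
print(gene_counts)
print(int_counts)
```

Rounding after summing, as here, loses less information than rounding each transcript count first; whether rounding is appropriate at all is the point under debate on the mailing list.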
Old 03-27-2014, 10:17 AM   #5
GenePool
Registered Vendor
 
Location: San Francisco, CA

Join Date: Mar 2014
Posts: 18

Hi All,

For what it's worth, we're committed to making this sort of data more freely available and usable by the community. In that spirit, we've included a freely available reference library of genomics data in our product, GenePool. This library happens to include the RNASeqV2 gene-level counts computed by the UNC pipeline that leveraged RSEM. We've also taken the time to extract and curate the sample-level metadata and make it easily available, so researchers can subset the samples and analyze the data accordingly. More advanced users can export the counts and sample-level metadata for more high-powered statistical analyses, which we hope to roll right back into the GenePool platform some day :-)

Incidentally, we've also included the isoform-, splice-junction-, and exon-level counts as part of GenePool's premium content.

If you're interested in learning more about GenePool's growing genomics library, please check out the following threads:

http://seqanswers.com/forums/showthread.php?t=42471
http://seqanswers.com/forums/showthread.php?t=48485

We'd love to have your feedback on this effort.

------------------------------
GenePool is making genomics data management, analysis, and sharing easier!
Products @ www.stationxinc.com

Last edited by GenePool; 11-23-2014 at 09:03 PM.
All times are GMT -8.

