yes, the counts are represented at the gene level for the publicly available TCGA RNASeqv2 (unnormalized) data.
while i can't find the verbose output for the execution of their RNASeqv2 pipeline, my guess is that the RSEM mappings to transcripts are collapsed to the individual gene level by summing counts.
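
to make that guess concrete, here is a minimal sketch of collapsing transcript-level counts to the gene level by summing. this is not the TCGA pipeline's actual code; the function name `collapse_to_genes`, the transcript/gene ids, and the count values are all made up for illustration.

```python
from collections import defaultdict


def collapse_to_genes(transcript_counts, transcript_to_gene):
    """Sum transcript-level expected counts into one value per gene.

    Both arguments are hypothetical inputs: a dict of transcript id -> count
    and a dict of transcript id -> gene id.
    """
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene_counts[transcript_to_gene[transcript_id]] += count
    return dict(gene_counts)


# toy data (invented): two transcripts of geneA, one of geneB
transcript_counts = {"tx1": 10.5, "tx2": 4.25, "tx3": 7.0}
transcript_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = collapse_to_genes(transcript_counts, transcript_to_gene)
print(gene_counts)
```

note that the summed values can still be non-integer, since RSEM reports expected counts.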
so, there are ~20,000 genes represented in their "Level 3" files (lower levels, representing increasingly raw data -- e.g. the reads themselves -- are not all publicly available).
as for the large number of differentially expressed genes, additional reading leads me to believe the non-integer counts do need to be rounded prior to edgeR analyses.
(though even this rounding step has been the focus of some debate on the R/Bioconductor-help mailing list.)
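
for illustration only, the rounding step amounts to something like the sketch below. the helper name `round_counts` and the values are hypothetical, and it only shows rounding gene-level expected counts to integers before handing them to a count-based tool; whether rounding is the right choice is exactly what the mailing-list debate is about.

```python
def round_counts(gene_counts):
    """Round non-integer expected counts to the nearest integer.

    gene_counts is a hypothetical dict of gene id -> expected count.
    """
    return {gene: int(round(value)) for gene, value in gene_counts.items()}


# toy gene-level expected counts (invented values)
expected = {"geneA": 14.75, "geneB": 7.0, "geneC": 0.4}

rounded = round_counts(expected)
print(rounded)
```

one caveat: python's `round` sends exact halves to the nearest even integer, so 2.5 becomes 2 rather than 3.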