SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
DESeq-newCountDataSet lynn012 RNA Sequencing 14 03-21-2015 09:00 AM
comparing results by cuffdiff, edgeR, DESeq PFS Bioinformatics 5 03-12-2014 03:01 AM
DESeq problem NicoBxl Bioinformatics 21 08-11-2012 09:57 AM
DESeq 1.5.30 - estimateDispersions horizon Bioinformatics 5 03-29-2012 11:08 AM
Inconsistency between cuffdiff 1.1.0 and cuffdiff 1.0.2 tleonardi Bioinformatics 3 11-16-2011 08:23 AM

Reply
 
Thread Tools
Old 02-14-2012, 10:35 AM   #1
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 310
Default Cuffdiff or DESeq

Without starting some kind of pointless back and forth, I am wondering what are the advantages of using Cuffdiff or DESeq to identify differentially expressed genes from RNA-seq data. The statistical side is a little over my head at the moment but is also what spurs my interest in this question. I find Simon Anders posts very informative so I am inclined to use DESeq but everything R and Bioconductor is a little biologist unfriendly. The Cuffdiff workflow is easier to follow but am I sacrificing anything by using Cuffdiff?
__________________
--------------
Ethan
ETHANol is offline   Reply With Quote
Old 02-14-2012, 10:43 AM   #2
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

The vignette for DESeq is exceptionally well written and makes it easy for a biologist with no R experience to follow it. I say this having at one time being that biologist with NO understanding of R.
chadn737 is offline   Reply With Quote
Old 02-14-2012, 11:47 AM   #3
turnersd
Senior Member
 
Location: Charlottesville, VA

Join Date: May 2011
Posts: 111
Default

There's a discussion of this going on over at the bioconductor mailing list. I'm quoting some of the recent posts from the last two days:

Quote:
Dear Bioconductor list,

Sometime ago Simon Anders explained the difference
between DeSeq and Cuffdiff as follows:

"If you have two samples, cuffdiff tests, for each transcript, whether
there is evidence that the concentration of this transcript is not the
same in the two samples.

If you have two different experimental conditions, with replicates for
each condition, DESeq tests, whether, for a given gene, the change in
expression strength between the two conditions is large as compared to
the variation within each replicate group."

Current language on the Cuffdiff site suggests that the current version
of that program tests for whether the change is significant compared to
changes in each condition.

http://cufflinks.cbcb.umd.edu/howitworks.html#hdif

http://cufflinks.cbcb.umd.edu/howitworks.html#reps

Can someone please comment on the relative merits of Cuffdiff and
DeSeq. I ask here because our sequencing core delivers results
based on Cuffdiff and I want to know if I should redo it using
DeSeq,I would greatly appreciate any guidance in this matter.

Thanks and best wishes,
Rich
Reply 1

Quote:
Not directly relevant to gene-level RNA-seq DE calls, but rather for
exon-level DE,
I found it useful to read this:
http://precedings.nature.com/documents/6837/version/1
In particular, section 4.3 on page 11, and supplementary figures S7 and S8
on page 19.

I was informed by a coworker that since everyone uses
BowTie-TopHat-Cufflinks-Cuffdiff, it is the sensible thing to do.
Conversations with people who know what they are doing (Terry Speed &
BCGSC) suggest the matter is not yet settled.
So I retrieved ~1TB of BAMs, extracted the reads, and started looking into
how that compares to DEXSeq and/or subread.

It would be incredibly informative if the Cufflinks and DEXSeq authors had
time to weigh in on their strengths/weaknesses. DEXSeq & cummeRbund both
offer nice tools for exploring the results; I am curious which pipeline
fits best for my needs.

Thanks for bringing this up.
Reply 2 (from me)

Quote:
I'm also in the same boat as Rich. I run a new bioinformatics core
here and I'm building a pipeline for RNA-seq. Cufflinks for some time
has supported biological replicates, and I'm also curious about the
relative merits of using
bowtie/tophat-cufflinks-cuffmerge-cuffdiff-cummeRbund versus using
tophat-HTSeq?-customScriptForCreatingMatrix?-DESeq. Cufflinks also
gives me a host of other tests (differential splicing load,
differential TSS usage, differential coding output, etc), which also
seem useful for certain applications.

On a related note, does anyone have a workflow for taking multiple bam
files, running HTSeq-count (or another program), plus some other
program or custom script, to produce a matrix of counts as input to
DESeq?

Stephen
Reply 4:

Quote:
Dear Stephen,

To your related note, you could have a look at the easyRNASeq package (bioC 2.10) for R (2.15). It reads in your annotation, your bam files and generate a count table for DESeq, all in R. It can actually do the first step of DESeq (estimating size library and dispersion) and give you back a normalized countDataSet object plus some validation plots as described in the DESeq vignettes (the same is true for edgeR). I'm about to push some changes in SVN to correct an issue that prevented the package vignette to be build. The package should be available in a couple of days as binary or you could install it directly from SVN. See http://wiki.fhcrc.org/bioc/SvnHowTo. The package URL is: https://hedgehog.fhcrc.org/bioconduc...cks/easyRNASeq.

Getting the proper set of annotation is definitely the most important step in the whole process and the one that requires most attention, the rest is then pretty straightforward. What I mean by annotation is the description of your feature of interest, be it gene, transcript, exon, enhancers, etc... as genomic loci (chr, start, width, etc...). The main issue in defining the annotation is to avoid counting reads multiple times and the kind of annotation needed does of course depends on your project. If you are interested in looking at isoforms differential expression, you probably want to define synthetic exons (to avoid double counting) and process the obtained count table with DEXseq. If you're looking at gene expression, you would want to create gene models to avoid multiple counting and use these to create your count table. If you are interested in eRNAs, you can define enhancer loci as the count "feature". All this can be done relatively easily in R. Once you have the proper annotation that suits your need, running easyRNASeq is very straightforward. easyRNASeq accepts both RangedData and GRangesList as annotation input, amo ng other formats. easyRNASeq is in addition able to fetch annotations for you from different sources, but most of the time these would need to be post-processed. You can look at my post: "u! sing easyRNASeq examples" from 2 days ago for some examples and comments in addition to the vignette content.

Cheers,

Nico
And finally, the post that Nico mentioned about easyRNAseq

Quote:
I've fixed a bug in easyRNASeq and made a couple of modifications to deal with novelties introduced in the IRanges and in the DESeq packages. I've just committed that whole bunch, so it will take ~2 days before a binary package is available. You can always install from SVN in the meanwhile.

I've listed below the different examples you sent me in our email exchange last week and I've added some comments which I hope will prove useful to you. I've created a new "thread" to make it easier to read, and I'm confident it will prove useful to others. This leads me to thanking you for your patience, for providing the data and for giving me the opportunity to post such an example

I hope this helps, do not hesitate to contact me if you have further questions and to suggest improvements.

Cheers,

Nico
Code:
## load the library
library(easyRNASeq)

## load the chromosome sizes
## Note that you can just provide a named list; using the BSgenome is not necessary (especially as the
Hsapiens ones are huge >800Mb)
library(BSgenome.Hsapiens.UCSC.hg19)
chr.sizes=as.list(seqlengths(Hsapiens))

## set the wd directory to wherever you have your bam files
## that's just for this example, you could give that directory
## directly to the filesDirectory argument of the function
setwd("Desktop/Francesco")

## get the bam filenames
bamfiles=dir(getwd(),pattern="*\\.bam$")

## run easyRNASeq to get an RNAseq object.
## we use biomaRt to fetch the annotation directly from ensembl
## 
rnaSeq <- easyRNASeq(filesDirectory=getwd(),
                     organism="Hsapiens",
                     chr.sizes=chr.sizes,
                     readLength=100L,
                     annotationMethod="biomaRt",
                     format="bam",
                     count="exons",
                     filenames=bamfiles[1],
                     outputFormat="RNAseq"
                     )

## Fetching the annotation that way is time consuming
## Using the RNAseq object you can extract the genic information
## and save them as an rda for a faster processing time
## later on
gAnnot <- genomicAnnotation(rnaSeq)

## There are 244 "chromosomes" in that annotation, let's keep only what we need
gAnnot <- gAnnot[space(gAnnot) %in% paste("chr",c(1:22,"X","Y","M"),sep=""),]
save(gAnnot,file="gAnnot.rda")

## you could use it this way
countTable <- easyRNASeq(filesDirectory=getwd(),
                         organism="Hsapiens",
                         chr.sizes=chr.sizes,
                         readLength=100L,
                         annotationMethod="rda",
                         annotationFile="gAnnot.rda",
                         format="bam",
                         count="exons",
                         filenames=bamfiles[1]
                         )

## using different annotation (makeTranscriptDB)
library(GenomicFeatures)
hg19.tx <- makeTranscriptDbFromUCSC(
                                    genome="hg19",
                                    tablename="refGene")

## easyRNASeq can deal with GRangesList object, so no need to modify it much, i.e. no need to convert it to a RangedData
gAnnot <- exons(hg19.tx)
## change the metadata column name to suit easyRNASeq
colnames(elementMetadata(gAnnot)) <- "exon"
## finally turn it into a GrangesList
gAnnot <- split(gAnnot,seqnames(gAnnot))

## you could save it as an rda object, you can as well use it directly
## by using the annotationMethod "env" and the annotationObject arguments.
## In addition we will be selecting for a single chromosome: chr1
## as the corresponding name in your bam file is 1, that's what we'll use
countTable <- easyRNASeq(filesDirectory=getwd(),
                         organism="Hsapiens",
                         chr.sizes=chr.sizes,
                         readLength=100L,
                         annotationMethod="env",
                         annotationObject=gAnnot,
                         format="bam",
                         count="exons",
                         filenames=bamfiles[1],
                         chr.sel="1"
                         )

## applying DESeq
## this will not yield very sensitive results as we have no replicates (biological)
## These are important for DESeq to accurately model the technical and biological variance
## With no replicates for every condition, the dispersion will be based on a "pooled" estimate
## making the differential expression call lose sensitivity. In addition, DESeq is in such case
## using a conservative approach (which is good) so you'd get even less significant results.

## Defining the conditions
conditions <- c("A","B")
names(conditions) <- bamfiles

## running DESeq
## NOTE that the two last arguments are for the DESeq estimateDispersions method. They are just
transferred through.
## There are necessary as we have only 1 replicate per condition; see the DESeq vignette for more details.
countDataSet <- easyRNASeq(filesDirectory=getwd(),
                           organism="Hsapiens",
                           chr.sizes=chr.sizes,
                           readLength=100L,
                           annotationMethod="env",
                           annotationObject=gAnnot,
                           format="bam",
                           count="exons",
                           filenames=bamfiles,
                           chr.sel="1",
                           outputFormat="DESeq",
                           normalize=TRUE,
                           conditions=conditions,
                           fitType="local",
                           method="blind"
                         )

## finally my session info
> sessionInfo()
R Under development (unstable) (2012-02-07 r58290)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] GenomicFeatures_1.7.16 AnnotationDbi_1.17.15  easyRNASeq_1.1.6      
 [4] ShortRead_1.13.12      latticeExtra_0.6-19    RColorBrewer_1.0-5    
 [7] Rsamtools_1.7.29       DESeq_1.7.6            locfit_1.5-6          
[10] lattice_0.20-0         akima_0.5-7            Biobase_2.15.3        
[13] BSgenome_1.23.2        GenomicRanges_1.7.24   Biostrings_2.23.6     
[16] IRanges_1.13.24        genomeIntervals_1.11.0 intervals_0.13.3      
[19] edgeR_2.5.9            limma_3.11.11          biomaRt_2.11.1        
[22] BiocGenerics_0.1.4    

loaded via a namespace (and not attached):
 [1] annotate_1.33.2    bitops_1.0-4.1     DBI_0.2-5          genefilter_1.37.0 
 [5] geneplotter_1.33.1 grid_2.15.0        hwriter_1.3        RCurl_1.91-1      
 [9] RSQLite_0.11.1     rtracklayer_1.15.7 splines_2.15.0     survival_2.36-12  
[13] tools_2.15.0       XML_3.9-4          xtable_1.6-0       zlibbioc_1.1.1
turnersd is offline   Reply With Quote
Old 02-14-2012, 12:06 PM   #4
turnersd
Senior Member
 
Location: Charlottesville, VA

Join Date: May 2011
Posts: 111
Default

One of the issues I faced with DESeq was going from alignments (bam files) to the counts table that DESeq requires, e.g.:

Code:
      Sample1 Sample2 Sample3 Sample4 Sample5
Gene1    2614    7000    7119    6689    4893
Gene2    1162    2939      72    4988     390
Gene3    5799    8533    9263    8413    5635
Gene4    8129    4023    1524    2270    1816
Gene5    3536    4320    4927    2659    6795
From numerous responses here and elsewhere, I started using HTSeq (specifically, htseq-count) to do this.

htseq-count is dead simple to use. It takes an alignment and a GTF annotation file (e.g. htseq-count [options] <alignment> <gtf>) and produces a list of features and counts of reads that overlap those features (how overlapping reads are counted is specified by the options).

WASH5P 3
WDR18 201
ZNF555 44
ZNF57 82
ZNF77 55
psiTPTE22 0
tAKR 0
no_feature 6321
ambiguous 196
too_low_aQual 0
not_aligned 0
alignment_not_unique 15890

I hacked out a perl script that takes a list of bam files, runs samtools view on them, pipes that output into htseq-count, and drops an output file. It does this for each bam file, dropping a temporary file containing all the features and counts for that particular alignment. I then pipe together a series of join commands to glue together all these temporary count text files that HTSeq produces. This gets me my table that I can use with DESeq.

I've heard far too many examples of people using tophat-cufflinks-cuffmerge-cuffdiff for their RNA-seq workflow, and getting extremely different results from one version of cufflinks to the next. But when it takes some perl hacking around a python script to munge alignments into a format that R can read, it's currently easier IMHO to use the cufflinks solution. In addition, cufflinks gives you other useful information about splicing load, differential TSS usage, differential coding output, etc.

I would like to see this discussion continue, and if anyone has any better solution for going from bam alignments to a DESeq-ready counts table, I'm all ears.
turnersd is offline   Reply With Quote
Old 02-14-2012, 03:21 PM   #5
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 815
Default

Quote:
Originally Posted by turnersd View Post
I hacked out a perl script that takes a list of bam files, runs samtools view on them, pipes that output into htseq-count, and drops an output file. It does this for each bam file, dropping a temporary file containing all the features and counts for that particular alignment.
...
I would like to see this discussion continue, and if anyone has any better solution for going from bam alignments to a DESeq-ready counts table, I'm all ears.
There's really no need to use Perl for that if you're also using HTSeq. HTSeq has been designed as a somewhat easy-to-use Python library -- I say 'somewhat' because in some very small points, it didn't quite do what I wanted.

I've got a python script lying around that takes in multiple BAM files (or SAM files, if that's your liking) and spits out counts for each feature in a three-column format (feature, experiment, count) using HTSeq. I'm mapping to a transcriptome for the reads so haven't bothered with GTF files yet, but that's something that can be fairly easily plugged in using the HTSeq library. It would probably also be fairly easy to modify HTSeq-count to support no GFF files and multiple experiments.

Once you have the three column format, it's a somewhat easy process to convert into a count matrix using the xtabs function in R.

... but I'm sitting at the wrong computer at the moment. I'll attach / embed the useful scripts tomorrow after I get in to work.

Taking note of what else has been mentioned, easyRNASeq (which I hadn't heard about before) sounds like an even better tool for this given that it stays in R. As long as the BAM file reading/processing is quick enough (e.g. considering 5-50GB BAM files for each experiment), there's no reason not to do everything in R.
gringer is offline   Reply With Quote
Old 02-15-2012, 01:03 AM   #6
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 310
Default

Quote:
Originally Posted by chadn737 View Post
The vignette for DESeq is exceptionally well written and makes it easy for a biologist with no R experience to follow it. I say this having at one time being that biologist with NO understanding of R.
Agree. Really, it is one of the few Bioconductor vignettes that make any sense.


Quote:
Originally Posted by gringer View Post
Taking note of what else has been mentioned, easyRNASeq (which I hadn't heard about before) sounds like an even better tool for this given that it stays in R. As long as the BAM file reading/processing is quick enough (e.g. considering 5-50GB BAM files for each experiment), there's no reason not to do everything in R.
Actually, I always look for any reason not to use R. R workflows are time consuming in both trying to put together a workflow and using a workflow once you have put it together.

Anyway, enough of my ranting against Bioconductor. I'm still interested in opinions about what is happening under the hood of these methods.

Quote:
Originally Posted by turnersd View Post
I've heard far too many examples of people using tophat-cufflinks-cuffmerge-cuffdiff for their RNA-seq workflow, and getting extremely different results from one version of cufflinks to the next.
Sometimes people look at data expression data sets that are nearly identical and end up concluding they are very different. For example, you are looking for differentially expressed genes in two data sets. You get a list of differentially expressed genes from software A. Software A gives you a value you use for cutoff say FDR. You make the cutoff and then run the same data set through software B and make the exact same arbitrary cutoff. You look at the list of differentially expressed genes in the two lists and they are quite different because of small variations put some genes above the arbitrary cutoff line and some below. In the conversion from a measurable variable to a nominal variable you lost a lot of information and it makes the data sets look very different when if you look before you made the arbitrary cutoff the data sets looked nearly identical. I'm not saying that's what people have done but it's at the very least possible.
__________________
--------------
Ethan
ETHANol is offline   Reply With Quote
Old 02-15-2012, 05:53 AM   #7
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

I have HTseq-count in my pipeline. At the end I just import all the individual count files into R and create the table there.

Or for those who are very unfamiliar with perl, R, or python. Just open the count files that are output by HTseq-count in excel. Then copy and paste to make your table and save it as a csv file. I realize this is labor intensive, but I would expect basically every scientist to at least know excel.
chadn737 is offline   Reply With Quote
Old 02-15-2012, 06:09 AM   #8
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

Quote:
Originally Posted by ETHANol
Anyway, enough of my ranting against Bioconductor. I'm still interested in opinions about what is happening under the hood of these methods.
Cuffdiff actually uses the same underlying model as DESeq:

For single isoform genes, Cuffdiff models the variance in fragment counts across replicates using the negative binomial distribution, similar to the method described by Anders and Huber (2010)

http://cufflinks.cbcb.umd.edu/howitworks.html

However since Cufflinks attempts to test differential expression for isoforms, their model gets a lot more complex and they attempt adjust expression values with reads that map to multiple isoforms/genes.

My understanding of statistics is far to rudimentary to actually speak on which performs better. For my own purposes, looking for expression differences between isoforms does not make my results anymore meaningful and also makes the types of analysis I am interested in far more difficult.

Also, in every paper I have seen that compares various methods of testing differential expression, DESeq and EdgeR give very similar results, and DESeq gives fairly similar results to BaySeq, whereas Cufflinks can differ considerably from all of these.

Cufflinks can also take a considerable time to run and even with extra steps, I can finish up using DESeq well before Cufflinks ever completes.

If its not obvious by now, I prefer DESeq and for that matter EdgeR, over Cufflinks. It fits my needs better.
chadn737 is offline   Reply With Quote
Old 02-15-2012, 11:49 AM   #9
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Default

Hi all,
Quote:
Originally Posted by chadn737 View Post
For my own purposes, looking for expression differences between isoforms does not make my results anymore meaningful and also makes the types of analysis I am interested in far more difficult.
There are certainly cases when evaluating differences between isoforms is not only the preferred way to conduct the analysis, but is in fact the only way that you would identify a significant change in the expression of a gene. Additionally, biologically significant differential expression of a isoform can be missed when the smallest unit analyzed is a summary model of a 'gene'.

The increased difficulty in interpretation of cuffdiff output is something that I have tried to address with the 'cummeRbund' R/Bioconductor package. The hope is that making the cuffdiff output manageable in a familiar environment, while at the same time maintaining the relationships between genes, isoforms, promoters, etc, will begin to make the interpretation of cuffdiff output more engaging.
lgoff is offline   Reply With Quote
Old 02-16-2012, 01:38 AM   #10
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 815
Default

Quote:
Originally Posted by gringer View Post
... but I'm sitting at the wrong computer at the moment. I'll attach / embed the useful scripts tomorrow after I get in to work.
I've attached my python script for doing basic contig counting. I use it like this (assuming all experiment IDs start with 'Sample_'):
Code:
$ ~/scripts/sam2rawCounts.py $(for x in $(ls -d Sample_*); do \
echo ${x}=$(ls ${x}/*contigs*.bam); done) > contig_reads.txt
In the absence of a consistent directory prefix for sample IDs, I create an 'experiment_dirs.txt' file that identifies the directories that contain the required BAM files:
Code:
$ ~/scripts/sam2rawCounts.py $(for x in $(cat experiment_dirs.txt); do \
echo ${x}=$(ls ${x}/*contigs*.bam); done) > contig_reads.txt
I then convert the output to a DESeq-compatible table in R using xtabs:
Code:
# get counts from transcriptome mapping
reads.3col.df <- read.table("contig_reads.txt", header = TRUE, stringsAsFactors = FALSE);

## convert to raw count table formatted for DESeq
experiment.counts <- xtabs(Reads ~ Contig + Experiment, data = reads.3col.df);

## assume no biological replicates
experiment.conditions <- sub("^Sample_","",colnames(experiment.counts));

## convert to DESeq CountDataSet object
experiment.cds <- newCountDataSet(unclass(experiment.counts), experiment.conditions);
Attached Files
File Type: txt sam2rawCounts.py.txt (4.6 KB, 179 views)
gringer is offline   Reply With Quote
Old 02-17-2012, 09:29 AM   #11
gaberudy
Junior Member
 
Location: Bozeman, MT

Join Date: Sep 2010
Posts: 5
Default

In regards to the question of how to get whole-number counts appropriate for using DESeq, I would recommend using RSEM for quantification: http://deweylab.biostat.wisc.edu/rsem/

RSEM only does quantification of known-transcripts but does so in a very stable manner (comapred to cufflinks) that gives gene-level counts that are the sum of the transcript-level counts for that gene (which cufflinks does not do). It also is more stable from version to version and has less issues in "failing" to produce a count for certain isoforms/genes.

RSEM can take BAM files as input, but expects them to be aligned to the transcriptome (i.e. won't take the output of TopHat that includes both genomic and txome alignments) so you may end up re-doing alignment to use it. But RSEM will happily call bowtie for you if you give it fastq as input.

It will do probablistic assignment of multi-reads giving you counts that can be fractions of whole numbers, but simply rounding these numbers before inputting them to DEseq should give you better results than simply doing a raw counting of reads in genes/exons.
gaberudy is offline   Reply With Quote
Old 02-22-2012, 04:51 AM   #12
sudders
Member
 
Location: Sheffield, UK

Join Date: Dec 2011
Posts: 32
Default

DESeq/edgeR also allow you to use generalized linear modelling to use more complicated statistical models. For example to do paired same analysis (e.g. healthy tissue vs cancer tissue from 10 patients). In this sort of situation I find that cuffdiff finds no genes/isoforms are differentially expressed because the patient-to-patient variance is too high.

On the other hand, I'm still to find a solution to the read counting that truly satisfies me. DESeq and edgeR get away without normalising for length because the two entities being compared are the same length. However, if you sum over all exons in a gene then this isn't necessarily the case: different exons might be being used in different samples. You can test exon-by-exon, but then have to work out how you go from differentially expressed exons to differentially expressed genes. I admit I haven't looked at RSEM, i'll go and do that.
sudders is offline   Reply With Quote
Old 03-09-2012, 08:09 AM   #13
epi
Member
 
Location: USA

Join Date: Jan 2012
Posts: 38
Default

Using cuffdiff means you are coming from tophat-cufflinks-cuffmerge workflow. I have found that this workflow is bugged and mixes up gene annotations. It can not be said with 100% certainty if the reported FPMKs indeed belong to the corresponding gene/transcript.

Please view my thread if interested:
http://seqanswers.com/forums/showthread.php?t=18070

I am now using DESeq for my analysis.

Last edited by epi; 03-09-2012 at 08:23 AM.
epi is offline   Reply With Quote
Old 03-12-2012, 05:18 AM   #14
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by epi View Post
Using cuffdiff means you are coming from tophat-cufflinks-cuffmerge workflow. I have found that this workflow is bugged and mixes up gene annotations. It can not be said with 100% certainty if the reported FPMKs indeed belong to the corresponding gene/transcript.

Please view my thread if interested:
http://seqanswers.com/forums/showthread.php?t=18070

I am now using DESeq for my analysis.
The thread you linked to doesn't show a bug, just an incorrect use of Cufflinks, as I just noted in that thread.

Regarding use of Cufflinks or RSEM with count-based tools like DEseq or edgeR - the authors of DEseq have noted on a number of occasions that it is NOT a good idea to use RSEM or Cufflinks with their tool. The reason is that those tools produce (either explicitly or internally during estimation of gene and transcript FPKM) an *estimate* of the number of fragments that came from each transcript. However, that estimate comes with *uncertainty*, and those tools don't account for it when testing for DE. Think of it this way - when you have multiple isoforms, you have to decide to assign your fragments among them. In addition to the biological variability that's present between replicates, there's now also variability in this assignment process. Those tools don't account for it, so using them with the estimated counts from transcript-level tools will lead to false positives.
Cole Trapnell is offline   Reply With Quote
Old 03-12-2012, 05:44 AM   #15
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by sudders View Post

On the other hand, I'm still to find a solution to the read counting that truly satisfies me. DESeq and edgeR get away without normalising for length because the two entities being compared are the same length. However, if you sum over all exons in a gene then this isn't necessarily the case: different exons might be being used in different samples. .
DEseq and edgeR (and other tools that test raw gene-level counts) assume that the features are the same length. However, if you have any differential splicing or isoform switching going on between conditions, this assumption gets you into trouble. Suppose for example, you have an gene with two isoforms, one twice as long as the other. In both conditions being compared, you have 100 reads on the gene. However, in condition A, all the reads are on the shorter one. In B, they're all on the longer one. Comparing the raw gene-level count will show no change, even though the expression of the gene is two-fold higher in condition A! This is a contrived example (the reads wouldn't *all* move like that in a real situation), but it illustrates the point.
Cole Trapnell is offline   Reply With Quote
Old 04-19-2012, 03:14 AM   #16
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 310
Default

This thread was pretty informative to me, so thanks everyone for your comments.

I have been using DESeq. I wrote a little script to wrap up the process, so I thought I would share it. Maybe someone will find it useful.

Anyway, it's here:
http://ethanomics.wordpress.com/2012/04/19/ezdeseq/
__________________
--------------
Ethan
ETHANol is offline   Reply With Quote
Old 04-19-2012, 03:48 AM   #17
turnersd
Senior Member
 
Location: Charlottesville, VA

Join Date: May 2011
Posts: 111
Default

Thanks Ethan. Looking forward to trying that one out, or possibly modifying it to work with DEXSeq.
turnersd is offline   Reply With Quote
Old 04-19-2012, 05:00 AM   #18
sudders
Member
 
Location: Sheffield, UK

Join Date: Dec 2011
Posts: 32
Default

Quote:
Originally Posted by Cole Trapnell View Post
DEseq and edgeR (and other tools that test raw gene-level counts) assume that the features are the same length. However, if you have any differential splicing or isoform switching going on between conditions, this assumption gets you into trouble. Suppose for example, you have an gene with two isoforms, one twice as long as the other. In both conditions being compared, you have 100 reads on the gene. However, in condition A, all the reads are on the shorter one. In B, they're all on the longer one. Comparing the raw gene-level count will show no change, even though the expression of the gene is two-fold higher in condition A! This is a contrived example (the reads wouldn't *all* move like that in a real situation), but it illustrates the point.
I noted this in my original post. Unfortunately, several of the experimental designs I use require more complex designs than just testing group A against group B. Most often this is because of a paired design, but there are several experiments that involve testing interactions between factors etc. which just can't be done in cufflinks. What we can do with the tag based system is test differential expression of exons, and then find some way of selecting an exon to represent a gene, with the understanding that this won't be valid if there is a differential splicing event that changes the usage of that exon.
sudders is offline   Reply With Quote
Old 04-19-2012, 10:07 AM   #19
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 991
Default

Well, Cole is certainly write with both that this is an issue and that it is a contrived example. In practice, it will be rare that a change in isoform usage and a change in overall expression cancel out that well, and hence I would not worry about it. It is true, however, that DESeq and edgeR cannot distinguish changes in number of transcripts from changes in the average transcript length, and hence, if the tools report significant evidence for differential expression, this should be read as "significant evidence for differential expression and/or differential isoform usage". In many cases, you will be able to live with that -- either because you intend to have a closer look at the top hits anyway and would then discover this by manual inspection, or because you just want an overview and don't mind if a few splicing changes are among your expression changes.

And if you are interested in splicing changes, have a look at DEXSeq, too.
Simon Anders is offline   Reply With Quote
Old 04-22-2012, 02:45 PM   #20
lpachter
Member
 
Location: Berkeley, cA

Join Date: Feb 2010
Posts: 40
Default

I'd like to point out that Cole's example is not only showing that DESeq and edgeR's assumption of features being the same length is problematic because it might lead to a canceling out that leads to non-DE calls when they should be made. In fact, the reverse can happen as well. In the example, suppose that in condition B you actually have 200 reads (on the longer isoform), instead of 100 on the shorter. In this case, the 2 fold change in read count is certain to look like differential expression, but in reality, there has been no change in expression at all. So its not only that you might get a few splicing changes among your expression changes with DESeq or edgeR. You will also miss significant expression changes.

The numbers of counts in Cole's example are made up (to make the point), but the differences in transcript length are not. Its common for transcripts in a gene to have variable lengths, even much more than the factor of 2 in the example. In DESeq/edgeR every time that happens in the structure of a gene, the computed fold change in expression, which is the basis for calling differential expression, is incorrect.

In summary, I agree with Simon that DESeq and edgeR are useful tools for analysis of differential expression in single isoform genes where all reads map uniquely.
lpachter is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:31 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO