My understanding of the RPKM calculation in TopHat is that, by default, it includes multi-reads that map to fewer than 40 locations in the genome.
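To make that concrete, here is a minimal sketch of the calculation as I understand it. This is not TopHat's code: the function names, the `(gene, n_hits)` input format, and the `max_hits` parameter are all hypothetical, and only the RPKM formula itself (reads per kilobase of feature per million mapped reads) is standard.

```python
def count_reads_per_gene(alignments, max_hits=40):
    """Count reads per gene, keeping only reads that map fewer than max_hits times.

    alignments: iterable of (gene, n_hits) pairs, one per read (hypothetical format).
    max_hits:   assumed cutoff from my description above, not a verified TopHat default.
    """
    counts = {}
    for gene, n_hits in alignments:
        if n_hits < max_hits:
            counts[gene] = counts.get(gene, 0) + 1
    return counts


def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# A 2 kb gene with 500 reads in a library of 10 million mapped reads.
example = rpkm(500, 2000, 10_000_000)
```

Note that in this sketch a multi-read that passes the filter counts fully toward the gene it is listed under, which is the behaviour I am asking about.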
It seems like Cufflinks does something more complex, involving some kind of allocation of multi-reads.
Cufflinks models the sequencing process by asking what the probability is of observing each read, given a set of transcripts and a set of abundances. The program then multiplies these probabilities to compute the overall likelihood that one would observe the reads in the experiment, given the proposed abundances on the transcripts. Because Cufflinks' statistical model is linear, the likelihood function has a unique maximum value, and Cufflinks finds it with a numerical optimization algorithm.
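A toy version of that idea, as far as I can tell, looks like the sketch below. It is heavily simplified and is not Cufflinks' implementation: it ignores transcript lengths and fragment-length distributions, it assumes the probability of a read is just the summed abundance of the transcripts it is compatible with, and it uses a plain EM loop as a stand-in for whatever numerical optimizer Cufflinks actually uses. All names are hypothetical.

```python
import math


def log_likelihood(read_compat, theta):
    """Log of the product of per-read probabilities under the toy model."""
    return sum(math.log(sum(theta[t] for t in compat)) for compat in read_compat)


def estimate_abundances(read_compat, n_transcripts, n_iter=200):
    """Estimate relative abundances by EM.

    read_compat: list of non-empty lists; each inner list holds the indices
                 of the transcripts that one read is compatible with.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        for compat in read_compat:
            # E-step: split the read among its transcripts in proportion
            # to the current abundance estimates.
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: renormalize the fractional counts into abundances.
        n = sum(counts)
        theta = [c / n for c in counts]
    return theta


# Three reads unique to transcript 0, one unique to transcript 1,
# and one ambiguous read compatible with both.
reads = [[0], [0], [0], [0, 1], [1]]
theta = estimate_abundances(reads, n_transcripts=2)
```

So the ambiguous read is not counted fully for either transcript; it ends up split according to the abundances that best explain all the reads together.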
I know we're supposed to start using Cufflinks' RPKM now, but I'd like to understand TopHat's as well. Does anyone know if my description is correct?