SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
Bug? duplicated genes in cufflinks output genes.expr silin284 Bioinformatics 3 05-18-2014 12:19 AM
transcripts missing from cufflinks djm2007 RNA Sequencing 3 05-30-2013 12:39 AM
PubMed: Quantification of Gene Transcripts with Deep Sequencing Analysis of Gene Expr Newsbot! Literature Watch 0 01-13-2011 03:00 AM
cuffdiif or cufflinks expr comparisons middlemale Bioinformatics 0 11-05-2010 03:44 AM
inconsistent gene names in genes.expr - Cufflinks Boel Bioinformatics 2 04-14-2010 06:16 AM

Reply
 
Thread Tools
Old 02-08-2010, 12:48 PM   #1
shurjo
Senior Member
 
Location: Rockville, MD

Join Date: Jan 2009
Posts: 126
Default Cufflinks transcripts.expr file

Hello,

I am looking for a way to derive the raw count of reads mapping to individual Cufflinks transcripts. My closest guess is that the "coverage" value (column 13) in the transcripts.expr file is what I need (the manual explains this as being an "Estimate for the absolute depth of read coverage across the whole transcript"). This number is non-integer for most transcripts, but I guess that may come from averaging across isoforms. Does anyone have an insight into this? Any advice would be much appreciated.

Thanks,

Shurjo
shurjo is offline   Reply With Quote
Old 02-08-2010, 01:03 PM   #2
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Multiplying the average depth of coverage by the transcript length will give you the *estimated* number of reads assigned to each transcript. Note that Cufflinks' statistical model probabilistically assigns reads to individual isoforms, because when isoforms share exons, reads from those exons can't be unambiguously assigned to one particular isoform.
Cole Trapnell is offline   Reply With Quote
Old 02-08-2010, 01:11 PM   #3
shurjo
Senior Member
 
Location: Rockville, MD

Join Date: Jan 2009
Posts: 126
Default

OK, cool, thanks.
shurjo is offline   Reply With Quote
Old 02-09-2010, 12:01 PM   #4
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

Watch out when considering these read densities if they come from overlapping transcripts though: although it is counter-intuitive, they cannot be directly compared to each other as if they could indicate a relative "expression" or "splicing" value.

Indeed, since a read that overlaps a common exonic part of two splicing variants is used to compute both density values (although it comes from one of the variants only), it would be mathematically incorrect to assume that read densities can be compared between such overlapping splicing variants. If reads are assigned to each variant independently, cases can appear where a variant A that is present at a lower number of copy than a variant B is given a higher density.. so RPKM are only relevant across "genes" (non overlapping transcripts).

Therefore, reads should actually be assigned across transcripts (considering the whole set of variants) in a discriminative way, and that is why proper transcriptome reconstruction requires linear programming algorithms in order to find the optimal distribution.

Unless of course i am totally wrong
steven is offline   Reply With Quote
Old 03-30-2010, 12:02 PM   #5
thinkRNA
Member
 
Location: Carlsbad,CA

Join Date: Jan 2010
Posts: 94
Default

Quote:
Originally Posted by Cole Trapnell View Post
Multiplying the average depth of coverage by the transcript length will give you the *estimated* number of reads assigned to each transcript. Note that Cufflinks' statistical model probabilistically assigns reads to individual isoforms, because when isoforms share exons, reads from those exons can't be unambiguously assigned to one particular isoform.
So, how do you determine the average depth. For example, consider
exon1| exon2|exon3

Lets say that Sample A exhibits
exon1|exon3 only 20% of the times
exon1|exon2|exon3 only 80% of the times

while Sample B exhibits
exon1|exon3 100% of the times

how will you determine average depth of this gene in sampleA?
I understand sampleB should be easier as you should see no reads spanning exon1| exon2 or exon2|exon3

How does cufflinks handle or calculate relative abundance of isoforms within a sample?
thinkRNA is offline   Reply With Quote
Old 03-30-2010, 04:02 PM   #6
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by thinkRNA View Post
So, how do you determine the average depth. For example, consider
exon1| exon2|exon3

Lets say that Sample A exhibits
exon1|exon3 only 20% of the times
exon1|exon2|exon3 only 80% of the times

while Sample B exhibits
exon1|exon3 100% of the times

how will you determine average depth of this gene in sampleA?
I understand sampleB should be easier as you should see no reads spanning exon1| exon2 or exon2|exon3

How does cufflinks handle or calculate relative abundance of isoforms within a sample?
In the upcoming Cufflinks paper, we have a large section in the supplement devoted to the math of how this is done. This isn't the best medium for explaining how the algorithm and the math work, so keep an eye out for the paper. We hope it will be out within a few weeks, and I'll post a message to our mailing list and link to the paper from the site when it appears.

For now, let me roughly describe how it works to give you some intuition. Suppose you have two isoforms A and B of a gene (we'll assume they are the same length for now). We want to calculate the abundances of A and B by counting up how many fragments came from A and how many came from B, and then dividing each count by the total number of fragments. This gives us the relative abundances for A and B. Now consider a fragment that could have come from either of two isoforms, A or B. We can't just assign this fragment to A or B - we need to "probabilistically assign" to both of them. The probability it came from A in this example is just equal to the relative abundance of A (and the same goes for B). Now we're in a bind, because to get the abundances we need the counts, and to get the (probabilistic) counts, we need the abundances! There's an additional complication that happens because of course transcripts have differing lengths.

Cufflinks calculates the abundances of transcripts by assuming that the fragments in an experiment were generated according to a basic model of how RNA-Seq works. With these assumptions one can write down a function that takes an assignment of relative abundances to each transcript as its argument. This function outputs a single number - the likelihood of observing the data in the experiment given the abundances supplied as the argument. The function is built by writing down the probability of observing each fragment in the actual data set, given some assignment of abundances to the transcript. It turns out that with this model, there's a single assignment of abundances to transcripts that *best* explains the data that came off of the sequencer. Cufflinks is able to find this value by systematically exploring the space of possible abundances for each transcript.

There are a lot of details in the paper, but this is the core idea. Once we have the abundances, we can use them to estimate how many fragments came from each transcript, and thus we can estimate the average depth of sequencing coverage for each one.
Cole Trapnell is offline   Reply With Quote
Old 03-30-2010, 04:35 PM   #7
thinkRNA
Member
 
Location: Carlsbad,CA

Join Date: Jan 2010
Posts: 94
Default

Quote:
Originally Posted by Cole Trapnell View Post

This function outputs a single number - the likelihood of observing the data in the experiment given the abundances supplied as the argument. The function is built by writing down the probability of observing each fragment in the actual data set, given some assignment of abundances to the transcript. It turns out that with this model, there's a single assignment of abundances to transcripts that *best* explains the data that came off of the sequencer. Cufflinks is able to find this value by systematically exploring the space of possible abundances for each transcript.
.
Great explanation and definitely helps intuitively understand the process! Since numbers/examples help a lot, if we continue with your example of two isoforms A and B of a gene. Lets say that reads spanning isoform A = 100 reads and reads spanning isoform B=10 reads. So, cufflinks explores all possible *isoforms* using these numbers and figures out that probabilistically isoform A is more prevalent than isoform B.
Did I get this right?

I think, it takes into account what are all the possible isoform for this gene by taking in the gene annotation that we provide (is this where the GTF file comes in as input). It then taken the counts of reads and determines which combination of exons (or which isoform) is most probable.
thinkRNA is offline   Reply With Quote
Old 05-15-2010, 12:04 PM   #8
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Cool

Quote:
Originally Posted by Cole Trapnell View Post
Multiplying the average depth of coverage by the transcript length will give you the *estimated* number of reads assigned to each transcript.
Please correct me if I am wrong (which most definitively I am) in interpreting how to calculate the number of mapped reads per transcript. I am total noob at this RNAseq stuff

if a transcript looks like this in transcripts.gtf:

Code:
chr1	Cufflinks	transcript	3204563	3661579	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";
chr1	Cufflinks	exon	3204563	3207049	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; exon_number "1"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";
chr1	Cufflinks	exon	3411783	3411982	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; exon_number "2"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";
chr1	Cufflinks	exon	3660633	3661579	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; exon_number "3"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";
then

transcript length = sum of length of exons = 3,631
cov = 0.020363

number of mapped reads/transcript = 3,631 * 0.020363 = 74

Is doesn't seem to be correct!

It would be great if cufflinks could report not only normalized FPKM values but also the raw number of mapped reads per transcript and exon (to use in software like DESeq, which I am using since I couldn't figure out how to use cuffcompare/cuffdiff with biological replicates for each conditition )

Thanx in advance for your help and consideration.

Last edited by marcora; 05-18-2010 at 04:59 AM.
marcora is offline   Reply With Quote
Old 09-07-2012, 11:06 AM   #9
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

marcora,

it's been a while since you posted this, but it appears to me that 74 is the number or *bases* that mapped to that transcript, as cufflinks coverage is i think on a base-level. so to get the number of reads you would have to divide this number by your read length, i.e, by 50 or 75, or the length of your reads.

Does this make sense?

Carmen
carmeyeii is offline   Reply With Quote
Old 10-28-2013, 07:20 AM   #10
hubery_Bio
Junior Member
 
Location: china

Join Date: Oct 2013
Posts: 5
Default

Quote:
Originally Posted by carmeyeii View Post
marcora,

it's been a while since you posted this, but it appears to me that 74 is the number or *bases* that mapped to that transcript, as cufflinks coverage is i think on a base-level. so to get the number of reads you would have to divide this number by your read length, i.e, by 50 or 75, or the length of your reads.

Does this make sense?

Carmen
yeah,it makes sense!
hubery_Bio is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:02 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO