Cufflinks Transcript & Protein predictions: When one happens but the other doesn't.

keebs42

Member

Join Date: May 2009
Posts: 17

10-01-2010, 06:23 PM

Hi gang,

I'm working through an RNA-seq project with tophat-cufflinks and I've come across a question that I haven't been able to answer by searching the forum or reading the manual.

After running cufflinks using a reference GTF, and looking at the transcripts.gtf file, I see transcripts both with and without an additional -Protein record. If I understand correctly, the -Protein transcript (which almost always has a CUFF.* gene name as opposed to the reference gene name) gives FPKM values for only the coding region of the transcript. However, not all genes have this additional -Protein listing.

For example, here is a transcript with the additional -Protein lines:

Code:

Chr1    Cufflinks       transcript      471990  473160  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
Chr1    Cufflinks       exon    471990  472507  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; exon_number "1"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
Chr1    Cufflinks       exon    472668  473160  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; exon_number "2"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
Chr1    Cufflinks       transcript      472138  473116  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";
Chr1    Cufflinks       exon    472138  472507  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; exon_number "1"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";
Chr1    Cufflinks       exon    472668  473116  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; exon_number "2"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";

... and here is one without...

Code:

Chr1    Cufflinks       transcript      61963   63811   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";
Chr1    Cufflinks       exon    61963   62124   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; exon_number "1"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";
Chr1    Cufflinks       exon    63431   63811   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; exon_number "2"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";

Sometimes only the lines specific to the -Protein record are printed. My question is: why would a transcript occasionally be predicted without a corresponding protein? Or why would a protein be predicted without a corresponding transcript?
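To see how often each case occurs, the transcript records can be tallied by reference ID. This is only a rough Python sketch: it assumes transcripts.gtf is tab-delimited with the attributes in column 9, and that a -Protein record is recognisable by the "-Protein" suffix inside transcript_id, as in the examples above. The function names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches key "value"; pairs in the 9th GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')
SUFFIX = "-Protein"


def parse_attributes(field):
    """Turn the GTF attribute column into a dict of key -> value."""
    return dict(ATTR_RE.findall(field))


def classify_transcripts(lines):
    """Map each reference transcript ID to the kinds of records seen for it.

    The value is a set holding 'transcript', 'protein', or both.
    Only lines whose feature column is 'transcript' are considered.
    """
    seen = defaultdict(set)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        ids = parse_attributes(cols[8]).get("transcript_id", "").split(",")
        is_protein = any(i.endswith(SUFFIX) for i in ids)
        # Strip the suffix so both record kinds share one key.
        base = ids[0][: -len(SUFFIX)] if ids[0].endswith(SUFFIX) else ids[0]
        seen[base].add("protein" if is_protein else "transcript")
    return dict(seen)


def summarise(classified):
    """Count IDs with both record kinds, transcript only, or protein only."""
    counts = {"both": 0, "transcript_only": 0, "protein_only": 0}
    for kinds in classified.values():
        if kinds == {"transcript", "protein"}:
            counts["both"] += 1
        elif kinds == {"transcript"}:
            counts["transcript_only"] += 1
        else:
            counts["protein_only"] += 1
    return counts
```

Calling summarise(classify_transcripts(open("transcripts.gtf"))) would then give the three counts for one sample.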

This becomes a larger issue when comparing samples with cuffcompare and lines like the following show up in a tracking file when comparing 3 samples:

Code:

TCONS_00000013  XLOC_000012     AT1G01260|AT1G01260.1   =       q1:AT1G01260.1|AT1G01260.1|100|1.636771|1.424183|1.849358|6.358318|2578 q2:CUFF.112|AT1G01260.2,AT1G01260.2-Protein|100|8.204362|7.299812|9.108912|34.086923|1773       -
TCONS_00000014  XLOC_000012     AT1G01260|AT1G01260.2   =       q1:AT1G01260.2|AT1G01260.2|100|0.472212|0.422116|0.522308|1.834390|2378 q2:AT1G01260.2|AT1G01260.2|100|0.995585|0.905067|1.086104|4.136391|2378 -
TCONS_00000015  XLOC_000012     AT1G01260|AT1G01260.1   =       q1:CUFF.90|AT1G01260.2,AT1G01260.2-Protein|100|4.929213|4.241811|5.616615|19.148380|1773        -       q3:CUFF.95|AT1G01260.2,AT1G01260.2-Protein|100|8.246326|7.368782|9.123870|34.015752|1773

In this case three lines are used to track two isoforms, and the transcript-level and the protein-level records from the three samples have been mixed, and even associated with the wrong isoform.
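Rows like these can be flagged automatically. The Python sketch below assumes the tracking file is tab-delimited, with four leading columns followed by one column per sample, and that each sample field has the order gene_id|transcript_id|FMI|FPKM|conf_lo|conf_hi|cov|len. That order is inferred from the lines above rather than taken from the manual, and the function names are hypothetical.

```python
def parse_tracking_line(line):
    """Split one tracking row into its reference part and per-sample records."""
    cols = line.rstrip("\n").split("\t")
    tcons, xloc, ref, class_code = cols[:4]
    ref_gene, _, ref_tx = ref.partition("|")
    samples = []
    for field in cols[4:]:
        if field == "-":          # transcript not found in this sample
            samples.append(None)
            continue
        label, _, rest = field.partition(":")
        parts = rest.split("|")
        samples.append({
            "sample": label,
            "gene_id": parts[0],
            "transcript_ids": parts[1].split(","),
            "fpkm": float(parts[3]),   # assumed position of FPKM
        })
    return {
        "tcons": tcons,
        "xloc": xloc,
        "ref_gene": ref_gene,
        "ref_transcript": ref_tx,
        "class_code": class_code,
        "samples": samples,
    }


def flag_row(row):
    """Return a list of human-readable problems found in one parsed row."""
    problems = []
    present = [s for s in row["samples"] if s is not None]
    # True for -Protein records, False for plain transcript records.
    kinds = {any(t.endswith("-Protein") for t in s["transcript_ids"])
             for s in present}
    if len(kinds) > 1:
        problems.append("mixes transcript-level and -Protein records")
    for s in present:
        first_id = s["transcript_ids"][0]
        if first_id != row["ref_transcript"]:
            problems.append(
                f'{s["sample"]} reports {first_id} '
                f'but the reference is {row["ref_transcript"]}'
            )
    return problems
```

Running flag_row(parse_tracking_line(line)) over every line of the tracking file would list the rows that need a closer look; an empty list means the row looks consistent under these two checks.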

I'll also add that the tss_id and p_id don't do much to clear up the picture. Here are the corresponding entries in the combined.gtf file:

Code:

Chr1    Cufflinks       exon    109032  111609  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000013"; exon_number "1"; oId "AT1G01260.1"; nearest_ref "AT1G01260.1"; class_code "="; p_id "P12";
Chr1    Cufflinks       exon    109076  109330  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000014"; exon_number "1"; oId "AT1G01260.2"; nearest_ref "AT1G01260.2"; class_code "="; tss_id "TSS10"; p_id "P12";
Chr1    Cufflinks       exon    109413  111535  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000014"; exon_number "2"; oId "AT1G01260.2"; nearest_ref "AT1G01260.2"; class_code "="; tss_id "TSS10"; p_id "P12";
Chr1    Cufflinks       exon    109595  111367  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000015"; exon_number "1"; oId "AT1G01260.2,AT1G01260.2-Protein"; nearest_ref "AT1G01260.1"; class_code "="; p_id "P12";

Why would TCONS_00..13 not be assigned a tss_id, yet get the same p_id as the second isoform? Are all -Protein predictions assigned only a p_id and no tss_id?
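To check whether the missing tss_id is limited to -Protein records, the attributes in combined.gtf can be collected per transcript. This is a minimal Python sketch, assuming a tab-delimited file with the attribute names shown above (oId, tss_id, p_id). The helper names are hypothetical.

```python
import re

# Matches key "value"; pairs in the 9th GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def collect_ids(lines):
    """Gather oId, tss_id and p_id for every transcript_id in a combined.gtf."""
    records = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        tx = attrs.get("transcript_id")
        if tx is None:
            continue
        rec = records.setdefault(
            tx, {"oId": attrs.get("oId"), "tss_id": None, "p_id": None}
        )
        # Keep a value if any exon line of the transcript carries it.
        for key in ("tss_id", "p_id"):
            if key in attrs:
                rec[key] = attrs[key]
    return records


def missing_tss(records):
    """Return the sorted transcript IDs that never carry a tss_id."""
    return sorted(tx for tx, rec in records.items() if rec["tss_id"] is None)


def protein_only_missing(records):
    """True if every transcript lacking a tss_id has a -Protein oId."""
    return all("-Protein" in (records[tx]["oId"] or "")
               for tx in missing_tss(records))
```

On the four lines above, protein_only_missing would come back False, since TCONS_00000013 lacks a tss_id while its oId is a plain transcript ID.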

I'm now going through the tracking file with a Perl script to make sure situations like this are sorted out, but I'm wondering if there's a reason why this might be expected.

Or am I just doing something wrong? Thanks for any feedback you might have... that's a lot of questions for one post.

Cheers,
Jonathan

Tags: cuffcompare, cufflinks, isoform prediction

keebs42

Member

Join Date: May 2009

Posts: 17
#2

10-05-2010, 05:44 AM

Bumping this during the work week hoping to get a response.
honey

Senior Member

Join Date: Feb 2010

Posts: 151
#3

01-27-2011, 03:31 PM

p_Id

Does anyone have any suggestions on how to get both p_ids and tss_ids? I am using the Ensembl hg19 GTF file.
Thanks