Cufflinks Transcript & Protein predictions: When one happens but the other doesn't.

keebs42

Member

Join Date: May 2009
Posts: 17

Cufflinks Transcript & Protein predictions: When one happens but the other doesn't.

10-01-2010, 06:23 PM

Hi gang,

I"m working through an RNA-seq project with tophat-cufflinks and I've come across a question that I haven't been able to answer with forum searches or looking through the manual.

After running cufflinks using a reference GTF, and looking at the transcripts.gtf file, I see transcripts both with and without an additional -Protein record. If I understand correctly, the -Protein transcript (which almost always has a CUFF.* gene name as opposed to the reference gene name) gives FPKM values for only the coding region of the transcript. However, not all genes have this additional -Protein listing.

For example, here is a transcript with the additional -Protein lines:

Code:

Chr1    Cufflinks       transcript      471990  473160  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
Chr1    Cufflinks       exon    471990  472507  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; exon_number "1"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
Chr1    Cufflinks       exon    472668  473160  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; exon_number "2"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
Chr1    Cufflinks       transcript      472138  473116  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";
Chr1    Cufflinks       exon    472138  472507  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; exon_number "1"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";
Chr1    Cufflinks       exon    472668  473116  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; exon_number "2"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";

... and here is one without...

Code:

Chr1    Cufflinks       transcript      61963   63811   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";
Chr1    Cufflinks       exon    61963   62124   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; exon_number "1"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";Chr1    Cufflinks       exon    63431   63811   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; exon_number "2"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";

Sometimes only lines specific to the Protein are printed. My question is why would a transcript occasionally be predicted without a corresponding protein. Or, why would a protein be predicted without a corresponding transcript?

This becomes a larger issue when comparing samples with cuffcompare and lines like the following show up in a tracking file when comparing 3 samples:

Code:

TCONS_00000013  XLOC_000012     AT1G01260|AT1G01260.1   =       q1:AT1G01260.1|AT1G01260.1|100|1.636771|1.424183|1.849358|6.358318|2578 q2:CUFF.112|AT1G01260.2,AT1G01260.2-Protein|100|8.204362|7.299812|9.108912|34.086923|1773       -
TCONS_00000014  XLOC_000012     AT1G01260|AT1G01260.2   =       q1:AT1G01260.2|AT1G01260.2|100|0.472212|0.422116|0.522308|1.834390|2378 q2:AT1G01260.2|AT1G01260.2|100|0.995585|0.905067|1.086104|4.136391|2378 -
TCONS_00000015  XLOC_000012     AT1G01260|AT1G01260.1   =       q1:CUFF.90|AT1G01260.2,AT1G01260.2-Protein|100|4.929213|4.241811|5.616615|19.148380|1773        -       q3:CUFF.95|AT1G01260.2,AT1G01260.2-Protein|100|8.246326|7.368782|9.123870|34.015752|1773

In this case three lines are used to track two isoforms, and the transcript-level and the protein-level records from the three samples have been mixed, and even associated with the wrong isoform.

I'll also add that the tss_id and p_id don't do much to clear up the picture. Here's the corresponding entries in the combined.gtf file:

Code:

Chr1    Cufflinks       exon    109032  111609  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000013"; exon_number "1"; oId "AT1G01260.1"; nearest_ref "AT1G01260.1"; class_code "="; p_id "P12";
Chr1    Cufflinks       exon    109076  109330  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000014"; exon_number "1"; oId "AT1G01260.2"; nearest_ref "AT1G01260.2"; class_code "="; tss_id "TSS10"; p_id "P12";
Chr1    Cufflinks       exon    109413  111535  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000014"; exon_number "2"; oId "AT1G01260.2"; nearest_ref "AT1G01260.2"; class_code "="; tss_id "TSS10"; p_id "P12";
Chr1    Cufflinks       exon    109595  111367  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000015"; exon_number "1"; oId "AT1G01260.2,AT1G01260.2-Protein"; nearest_ref "AT1G01260.1"; class_code "="; p_id "P12";

Why would TCONS_00..13 not be assigned a tss_id, but the same p_id as the 2nd isoform? Are all -Protein predictions assigned only p_ids and no tss_id ?

I"m now in the process of going through the tracking file with a perl script to make sure situations like this are sorted out, but I"m wondering if there's a reason why this might be expected.

Or am I just doing something wrong? Thanks for any feedback you might have... that's a lot of questions for one post.

Cheers,
Jonathan

Tags: cuffcompare, cufflinks, isoform prediction

keebs42

Member

Join Date: May 2009

Posts: 17
- Share
- Tweet
#2

10-05-2010, 05:44 AM

Bumping this during the work week hoping to get a response.
Comment
honey

Senior Member

Join Date: Feb 2010

Posts: 151
- Share
- Tweet
#3

01-27-2011, 03:31 PM

p_Id

Any one has any suggestion how to get p_Ids and TSS-IDs both I am using Ensemble Hg19 as GTF file.
Thanks
Comment

Previous template Next

Recent Advances in Sequencing Analysis Tools

by seqadmin

The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
- Channel: Articles
05-06-2024, 07:48 AM
Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM

Topics	Statistics	Last Post
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, Yesterday, 06:35 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, 05-09-2024, 02:46 PM	0 responses 21 views 0 likes	Last Post by seqadmin 05-09-2024, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 18 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 19 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM

Seqanswers Leaderboard Ad

Announcement

Cufflinks Transcript & Protein predictions: When one happens but the other doesn't.

Comment

Comment

Latest Articles

ad_right_rmr

News