SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
High performance computing & Tophat & Cufflinks paolo.kunder Bioinformatics 1 01-24-2012 08:02 AM
Cufflinks doesn't recognize read type marb Bioinformatics 4 12-13-2011 03:56 AM
Reads map in Tophat, but Cufflinks doesn't find transcripts janec Bioinformatics 0 07-07-2011 01:42 AM
Non uniform gene coverage and Cufflinks predictions : artifacts ? hugo D RNA Sequencing 0 03-02-2011 07:04 AM
RNA-Seq: Analysis of transcript and protein overlap in a human osteosarcoma cell line Newsbot! Literature Watch 0 12-04-2010 02:01 AM

Reply
 
Thread Tools
Old 10-01-2010, 06:23 PM   #1
keebs42
Member
 
Location: Raleigh, NC

Join Date: May 2009
Posts: 17
Default Cufflinks Transcript & Protein predictions: When one happens but the other doesn't.

Hi gang,

I"m working through an RNA-seq project with tophat-cufflinks and I've come across a question that I haven't been able to answer with forum searches or looking through the manual.

After running cufflinks using a reference GTF, and looking at the transcripts.gtf file, I see transcripts both with and without an additional -Protein record. If I understand correctly, the -Protein transcript (which almost always has a CUFF.* gene name as opposed to the reference gene name) gives FPKM values for only the coding region of the transcript. However, not all genes have this additional -Protein listing.

For example, here is a transcript with the additional -Protein lines:
Code:
Chr1    Cufflinks       transcript      471990  473160  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
Chr1    Cufflinks       exon    471990  472507  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; exon_number "1"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
Chr1    Cufflinks       exon    472668  473160  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; exon_number "2"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
Chr1    Cufflinks       transcript      472138  473116  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";
Chr1    Cufflinks       exon    472138  472507  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; exon_number "1"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";
Chr1    Cufflinks       exon    472668  473116  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; exon_number "2"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";
... and here is one without...

Code:
Chr1    Cufflinks       transcript      61963   63811   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";
Chr1    Cufflinks       exon    61963   62124   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; exon_number "1"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";Chr1    Cufflinks       exon    63431   63811   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; exon_number "2"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";
Sometimes only lines specific to the Protein are printed. My question is why would a transcript occasionally be predicted without a corresponding protein. Or, why would a protein be predicted without a corresponding transcript?

This becomes a larger issue when comparing samples with cuffcompare and lines like the following show up in a tracking file when comparing 3 samples:

Code:
TCONS_00000013  XLOC_000012     AT1G01260|AT1G01260.1   =       q1:AT1G01260.1|AT1G01260.1|100|1.636771|1.424183|1.849358|6.358318|2578 q2:CUFF.112|AT1G01260.2,AT1G01260.2-Protein|100|8.204362|7.299812|9.108912|34.086923|1773       -
TCONS_00000014  XLOC_000012     AT1G01260|AT1G01260.2   =       q1:AT1G01260.2|AT1G01260.2|100|0.472212|0.422116|0.522308|1.834390|2378 q2:AT1G01260.2|AT1G01260.2|100|0.995585|0.905067|1.086104|4.136391|2378 -
TCONS_00000015  XLOC_000012     AT1G01260|AT1G01260.1   =       q1:CUFF.90|AT1G01260.2,AT1G01260.2-Protein|100|4.929213|4.241811|5.616615|19.148380|1773        -       q3:CUFF.95|AT1G01260.2,AT1G01260.2-Protein|100|8.246326|7.368782|9.123870|34.015752|1773
In this case three lines are used to track two isoforms, and the transcript-level and the protein-level records from the three samples have been mixed, and even associated with the wrong isoform.

I'll also add that the tss_id and p_id don't do much to clear up the picture. Here's the corresponding entries in the combined.gtf file:

Code:
Chr1    Cufflinks       exon    109032  111609  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000013"; exon_number "1"; oId "AT1G01260.1"; nearest_ref "AT1G01260.1"; class_code "="; p_id "P12";
Chr1    Cufflinks       exon    109076  109330  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000014"; exon_number "1"; oId "AT1G01260.2"; nearest_ref "AT1G01260.2"; class_code "="; tss_id "TSS10"; p_id "P12";
Chr1    Cufflinks       exon    109413  111535  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000014"; exon_number "2"; oId "AT1G01260.2"; nearest_ref "AT1G01260.2"; class_code "="; tss_id "TSS10"; p_id "P12";
Chr1    Cufflinks       exon    109595  111367  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000015"; exon_number "1"; oId "AT1G01260.2,AT1G01260.2-Protein"; nearest_ref "AT1G01260.1"; class_code "="; p_id "P12";
Why would TCONS_00..13 not be assigned a tss_id, but the same p_id as the 2nd isoform? Are all -Protein predictions assigned only p_ids and no tss_id ?

I"m now in the process of going through the tracking file with a perl script to make sure situations like this are sorted out, but I"m wondering if there's a reason why this might be expected.

Or am I just doing something wrong? Thanks for any feedback you might have... that's a lot of questions for one post.

Cheers,
Jonathan
keebs42 is offline   Reply With Quote
Old 10-05-2010, 05:44 AM   #2
keebs42
Member
 
Location: Raleigh, NC

Join Date: May 2009
Posts: 17
Default

Bumping this during the work week hoping to get a response.
keebs42 is offline   Reply With Quote
Old 01-27-2011, 02:31 PM   #3
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default p_Id

Any one has any suggestion how to get p_Ids and TSS-IDs both I am using Ensemble Hg19 as GTF file.
Thanks
honey is offline   Reply With Quote
Reply

Tags
cuffcompare, cufflinks, isoform prediction

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:21 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO