Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cufflinks Transcript & Protein predictions: When one happens but the other doesn't.

    Hi gang,

    I"m working through an RNA-seq project with tophat-cufflinks and I've come across a question that I haven't been able to answer with forum searches or looking through the manual.

    After running cufflinks using a reference GTF, and looking at the transcripts.gtf file, I see transcripts both with and without an additional -Protein record. If I understand correctly, the -Protein transcript (which almost always has a CUFF.* gene name as opposed to the reference gene name) gives FPKM values for only the coding region of the transcript. However, not all genes have this additional -Protein listing.

    For example, here is a transcript with the additional -Protein lines:
    Code:
    Chr1    Cufflinks       transcript      471990  473160  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
    Chr1    Cufflinks       exon    471990  472507  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; exon_number "1"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
    Chr1    Cufflinks       exon    472668  473160  1000    -       .       gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; exon_number "2"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879";
    Chr1    Cufflinks       transcript      472138  473116  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";
    Chr1    Cufflinks       exon    472138  472507  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; exon_number "1"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";
    Chr1    Cufflinks       exon    472668  473116  1000    -       .       gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; exon_number "2"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";
    ... and here is one without...

    Code:
    Chr1    Cufflinks       transcript      61963   63811   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";
    Chr1    Cufflinks       exon    61963   62124   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; exon_number "1"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";Chr1    Cufflinks       exon    63431   63811   1000    -       .       gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; exon_number "2"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";
    Sometimes only lines specific to the Protein are printed. My question is why would a transcript occasionally be predicted without a corresponding protein. Or, why would a protein be predicted without a corresponding transcript?

    This becomes a larger issue when comparing samples with cuffcompare and lines like the following show up in a tracking file when comparing 3 samples:

    Code:
    TCONS_00000013  XLOC_000012     AT1G01260|AT1G01260.1   =       q1:AT1G01260.1|AT1G01260.1|100|1.636771|1.424183|1.849358|6.358318|2578 q2:CUFF.112|AT1G01260.2,AT1G01260.2-Protein|100|8.204362|7.299812|9.108912|34.086923|1773       -
    TCONS_00000014  XLOC_000012     AT1G01260|AT1G01260.2   =       q1:AT1G01260.2|AT1G01260.2|100|0.472212|0.422116|0.522308|1.834390|2378 q2:AT1G01260.2|AT1G01260.2|100|0.995585|0.905067|1.086104|4.136391|2378 -
    TCONS_00000015  XLOC_000012     AT1G01260|AT1G01260.1   =       q1:CUFF.90|AT1G01260.2,AT1G01260.2-Protein|100|4.929213|4.241811|5.616615|19.148380|1773        -       q3:CUFF.95|AT1G01260.2,AT1G01260.2-Protein|100|8.246326|7.368782|9.123870|34.015752|1773
    In this case three lines are used to track two isoforms, and the transcript-level and the protein-level records from the three samples have been mixed, and even associated with the wrong isoform.

    I'll also add that the tss_id and p_id don't do much to clear up the picture. Here's the corresponding entries in the combined.gtf file:

    Code:
    Chr1    Cufflinks       exon    109032  111609  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000013"; exon_number "1"; oId "AT1G01260.1"; nearest_ref "AT1G01260.1"; class_code "="; p_id "P12";
    Chr1    Cufflinks       exon    109076  109330  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000014"; exon_number "1"; oId "AT1G01260.2"; nearest_ref "AT1G01260.2"; class_code "="; tss_id "TSS10"; p_id "P12";
    Chr1    Cufflinks       exon    109413  111535  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000014"; exon_number "2"; oId "AT1G01260.2"; nearest_ref "AT1G01260.2"; class_code "="; tss_id "TSS10"; p_id "P12";
    Chr1    Cufflinks       exon    109595  111367  .       +       .       gene_id "XLOC_000012"; transcript_id "TCONS_00000015"; exon_number "1"; oId "AT1G01260.2,AT1G01260.2-Protein"; nearest_ref "AT1G01260.1"; class_code "="; p_id "P12";
    Why would TCONS_00..13 not be assigned a tss_id, but the same p_id as the 2nd isoform? Are all -Protein predictions assigned only p_ids and no tss_id ?

    I"m now in the process of going through the tracking file with a perl script to make sure situations like this are sorted out, but I"m wondering if there's a reason why this might be expected.

    Or am I just doing something wrong? Thanks for any feedback you might have... that's a lot of questions for one post.

    Cheers,
    Jonathan

  • #2
    Bumping this during the work week hoping to get a response.

    Comment


    • #3
      p_Id

      Any one has any suggestion how to get p_Ids and TSS-IDs both I am using Ensemble Hg19 as GTF file.
      Thanks

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Recent Advances in Sequencing Analysis Tools
        by seqadmin


        The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
        05-06-2024, 07:48 AM
      • seqadmin
        Essential Discoveries and Tools in Epitranscriptomics
        by seqadmin




        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
        04-22-2024, 07:01 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, Yesterday, 06:35 AM
      0 responses
      15 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 05-09-2024, 02:46 PM
      0 responses
      21 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 05-07-2024, 06:57 AM
      0 responses
      18 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 05-06-2024, 07:17 AM
      0 responses
      19 views
      0 likes
      Last Post seqadmin  
      Working...
      X