Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • epi
    Member
    • Jan 2012
    • 38

    Cufflinks annotation handling business

    Help me understand if you know, or comment if you find similar

    Cufflinks run with -g, using GTF (from illumina igenomes).



    ORIGINAL ANNOTATION
    Code:
    chr2	unknown	exon	152163161	152164423	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	stop_codon	152163943	152163945	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	CDS	152163946	152164423	.	-	1	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	CDS	152165451	152165743	.	-	0	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	exon	152165451	152165743	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	CDS	152168773	152169063	.	-	0	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	exon	152168773	152169063	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	start_codon	152169061	152169063	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	exon	152169593	152169796	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    ;




    CUFFLINKS OUTPUT:
    What happened to gene ID?? where did it go
    Code:
    chr2	Cufflinks	transcript	152163161	152169796	1000	-	.	gene_id "CUFF.9451"; transcript_id "NM_175093"; FPKM "35.0692808303"; frac "0.811271"; conf_lo "33.828739"; conf_hi "36.309823"; cov "154.402273"; full_read_support "yes";
    chr2	Cufflinks	exon	152163161	152164423	1000	-	.	gene_id "CUFF.9451"; transcript_id "NM_175093"; exon_number "1"; FPKM "35.0692808303"; frac "0.811271"; conf_lo "33.828739"; conf_hi "36.309823"; cov "154.402273";
    chr2	Cufflinks	exon	152165451	152165743	1000	-	.	gene_id "CUFF.9451"; transcript_id "NM_175093"; exon_number "2"; FPKM "35.0692808303"; frac "0.811271"; conf_lo "33.828739"; conf_hi "36.309823"; cov "154.402273";
    chr2	Cufflinks	exon	152168773	152169063	1000	-	.	gene_id "CUFF.9451"; transcript_id "NM_175093"; exon_number "3"; FPKM "35.0692808303"; frac "0.811271"; conf_lo "33.828739"; conf_hi "36.309823"; cov "154.402273";
    chr2	Cufflinks	exon	152169593	152169796	1000	-	.	gene_id "CUFF.9451"; transcript_id "NM_175093"; exon_number "4"; FPKM "35.0692808303"; frac "0.811271"; conf_lo "33.828739"; conf_hi "36.309823"; cov "154.402273";



    CUFFMERGE OUTPUT:
    ok, got my gene name back!
    What about gene_id? What happened to transcript ID? Where did CUFF.7092.2 come from? Where happened to p_ID?
    Code:
    chr2	Cufflinks	exon	152163157	152164423	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023065"; exon_number "1"; gene_name "Trib3"; oId "CUFF.7092.2"; nearest_ref "NM_175093"; class_code "j"; tss_id "TSS16814";
    chr2	Cufflinks	exon	152165420	152165427	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023065"; exon_number "2"; gene_name "Trib3"; oId "CUFF.7092.2"; nearest_ref "NM_175093"; class_code "j"; tss_id "TSS16814";
    chr2	Cufflinks	exon	152163157	152164423	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023066"; exon_number "1"; gene_name "Trib3"; oId "CUFF.7092.1"; nearest_ref "NM_175093"; class_code "j"; tss_id "TSS16815";
    chr2	Cufflinks	exon	152166010	152166019	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023066"; exon_number "2"; gene_name "Trib3"; oId "CUFF.7092.1"; nearest_ref "NM_175093"; class_code "j"; tss_id "TSS16815";
    chr2	Cufflinks	exon	152163161	152164423	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023067"; exon_number "1"; gene_name "Trib3"; oId "NM_175093"; nearest_ref "NM_175093"; class_code "="; tss_id "TSS16816"; p_id "P12953";
    chr2	Cufflinks	exon	152165451	152165743	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023067"; exon_number "2"; gene_name "Trib3"; oId "NM_175093"; nearest_ref "NM_175093"; class_code "="; tss_id "TSS16816"; p_id "P12953";
    chr2	Cufflinks	exon	152168773	152169063	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023067"; exon_number "3"; gene_name "Trib3"; oId "NM_175093"; nearest_ref "NM_175093"; class_code "="; tss_id "TSS16816"; p_id "P12953";
    chr2	Cufflinks	exon	152169593	152169796	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023067"; exon_number "4"; gene_name "Trib3"; oId "NM_175093"; nearest_ref "NM_175093"; class_code "="; tss_id "TSS16816"; p_id "P12953";




    CUFFDIFF OUTPUT:
    I will be damned!
    Whats with the new TSS ID. IN my original GTF, this ID belongs to an entirely different gene (LOC100040911) on chrY. For which gene am i looking at expression difference -- Trib3 or LOC100040911?
    Code:
    tracking_id	class_code	nearest_ref_id	gene_id	gene_short_name	tss_id	locus	length	coverage	q1_FPKM	q1_conf_lo	q1_conf_hi	q1_status	q2_FPKM	q2_conf_lo	q2_conf_hi	q2_status
    TSS16815	XLOC_013512	Trib3	chr2:152163156-152169796	q1	q2	OK	0.00575202	14.359	11.2856	-2.70945	0.0067394	0.0480599	yes
    Last edited by epi; 03-01-2012, 08:00 AM.
  • epi
    Member
    • Jan 2012
    • 38

    #2
    To all Cufflinks users and author, can anyone please comment if I have stated my problem clearly.

    Unfortunately it may be a serious bug in this software, which is causing it to mix FPKMs between different genes. Not only it obviously makes cufflinks highly unreliable, but a puts a question mark over the published findings.

    Thanks for your attention and replies.

    Comment

    • jcgrenier
      Member
      • May 2011
      • 12

      #3
      Hi,

      The option "-g" that you are using is usually used to discover new transcripts. So the CUFF.* names that you got there are for those newly found transcripts.
      If you only want to do it on known transcripts, use the "-G" option.

      That is, I don't know why CuffMerge gives us certain oId with CUFF.* names.

      Comment

      • jcgrenier
        Member
        • May 2011
        • 12

        #4
        For the tss_id, there is probably a bug, because it's numerated in ascending order but the TSS number as nothing to do with the gene itself.


        ***Update***
        Cuffmerge merge together multiple transcripts.gtf files coming from different analyses. The TSS information is not present in that file. Maybe it doesn't use the information contained in the given reference GTF file and just rename the tss_id...
        Last edited by jcgrenier; 03-09-2012, 12:24 PM.

        Comment

        • Cole Trapnell
          Senior Member
          • Nov 2008
          • 213

          #5
          As far as I can tell, Cufflinks is working correctly and as described in the manual. As noted above, you want -G, not -g.

          Comment

          • Luyi Tian
            Member
            • Mar 2012
            • 15

            #6
            There is a nature protocol published several days before, which may help you:
            "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks"

            Comment

            • epi
              Member
              • Jan 2012
              • 38

              #7
              Thanks you all for replies, and I apologize if I was panicking more than needed. Perhaps I should explain my question a bit further.

              As I understand, -g adds new transcripts "in addition to" existing ones, which is my goal under this analysis.

              From the manual:
              Output will include all reference transcripts as well as any novel genes and isoforms that are assembled
              While for many other genes, it is using the original gene name, for Trib3, it is converting it to CUFF.9451, and after cuffmerge to CUFF.7092.2. This is the source of all confusion, from manual and published protocol it is not clear why it should do that. After all, the annotation of Trib3 transcript and its exons is identical to CUFF.9451. They why this behaves like a novel gene now.

              From jcgrenier comments, I went back and found that it is indeed correct that transcript_id are incremented in ascending order, which has made a chance identity of IDs between chr2 and chrY, could very well be other cases which I don't know of.

              Actually, apart form a specific case, the real question here is how come cufflinks is not using the transcript_ID, p_ID and in some cases gene_ID from the original GFF as it is supposed to.
              Thanks again for comments.
              Last edited by epi; 03-12-2012, 08:02 AM.

              Comment

              • Cole Trapnell
                Senior Member
                • Nov 2008
                • 213

                #8
                Originally posted by epi View Post
                Actually, apart form a specific case, the real question here is how come cufflinks is not using the transcript_ID, p_ID and in some cases gene_ID from the original GFF as it is supposed to.
                Hmm, that's annoying... can you show us how you're running cuffmerge? One limitation of the current version of Cuffmerge is that it doesn't do any of its own ORF prediction. So new isoforms of coding genes won't get assigned a new p_id or attached to an existing one. New genes will not be declared as coding or non-coding (this is a hard problem in general). However, they will get grouped under a tss_id. The thing is, Cuffmerge will reassign tss_ids just like it reassigns transcript_ids, because those also have to be unique. The only thing it really tries to propogate from the reference in terms of metadata is the gene short name.

                Comment

                • epi
                  Member
                  • Jan 2012
                  • 38

                  #9
                  Originally posted by Cole Trapnell View Post
                  Hmm, that's annoying... can you show us how you're running cuffmerge? One limitation of the current version of Cuffmerge is that it doesn't do any of its own ORF prediction. So new isoforms of coding genes won't get assigned a new p_id or attached to an existing one. New genes will not be declared as coding or non-coding (this is a hard problem in general). However, they will get grouped under a tss_id. The thing is, Cuffmerge will reassign tss_ids just like it reassigns transcript_ids, because those also have to be unique. The only thing it really tries to propogate from the reference in terms of metadata is the gene short name.
                  Hi Cole, Thanks again for replying. While your comment clarifies some things, it also raises some more question. First here is my command

                  Code:
                  cuffmerge -o test1 -g genes.gtf -p 10 -s ~/path/mm9 assemblies1.txt
                  Assemblies1.txt contains absolute path to cufflinks gtf
                  Code:
                  path-to-dir/transcripts_sample1.gtf
                  path-to-dir/transcripts_sample2.gtf
                  .
                  .
                  .
                  The behavior you described is what I am observing as well. As I mentioned earlier, I am using -g and not -G to allow known as well as novel genes. Should I not expect it to preserve transcript, TSS and pids for the known genes.

                  Also I am still unclear why there are new predictions (with CUFF. IDs) that are identical to existing annotations. I have manually looked at one of the test runs and all new predictions (with CUFF. IDs) were original genes/transcripts in the source GTF annotation (I must have looked only about 20-25 ). So far I have not encountered any novel predictions which do not correspond to existing annotation. In some cases, original annotation still exists and in some cases not. Similar to the example about Trib3 I posted above.

                  Comment

                  • Lee Sam
                    Member
                    • Oct 2008
                    • 57

                    #10
                    Has anyone figured this issue out? I'm running a small study with cancer samples and I don't even see TP53 annotated in the genes.fpkm_tracking using either the Illumina-provided UCSC or Ensembl files (with chromosome names edited for compatibility). I'm running using the "-g" option as well and I was expecting that existing genes observed by cufflinks would have entries with the annotated gene names.
                    Last edited by Lee Sam; 04-05-2012, 10:05 AM.

                    Comment

                    • AsoBioInfo
                      Member
                      • Dec 2011
                      • 37

                      #11
                      Hi, I also encountered the same issue. The ./cufflinks also gave me some *.CUFF id's.
                      I used -g option, as I am interested in novel transcripts. How can you get all novel transcripts or genes?

                      Comment

                      • epi
                        Member
                        • Jan 2012
                        • 38

                        #12
                        Thanks for commenting here, it is good to know I am not the only one in the this boat. I have pinged Cole again, hope he writes something here.

                        Comment

                        • goudurix
                          Junior Member
                          • Mar 2012
                          • 6

                          #13
                          Hi all,
                          Epi, did you find a workaround ? I ran into the same problem. Running cufflinks with -g options, some original gene_ids were converted to new identifiers sharing the CUFF prefix.
                          Il2ra (NM_008367) became CUFF.265243...
                          Regards

                          Comment

                          • andremho
                            Junior Member
                            • May 2012
                            • 3

                            #14
                            Hi, does anyone have an answer to this question?

                            I am fairly new to this game, but I am running a small RNA-seq study on cancer samples.
                            I have mapped the reads using tophat, and quantified transcript/gene expression using cufflinks and the -g option with ENSEMBL annotation .gtf file.

                            However, I also don't find many of the expected expressed gene IDs in my genes.fpkm_tracking file, like TP53 etc. But when i do a manual inspection of the assembly using IGV, I clearly see that these genes are highly covered by reads....

                            Also, in my isoforms.fpkm_tracking file I can find some of the transcript Ids for these genes, but as i said not their gene IDs..

                            Can anyone explain this/give a solution to my problem?

                            Comment

                            • manianslab
                              Junior Member
                              • Jun 2012
                              • 9

                              #15
                              Unfortunately, cuffdiff does not seem to carry over the original transcript ids from the merged.gtf file in to the output files. However, you can either use Excel (vlookup) or a perlscript or awk to populate an additional field in your 'diff' output files by searching merged.gtf. The original transcript ids are listed as 'oId' in the merged.gtf created by cuffmerge.

                              I will post a script if I find time to get one done.

                              hope this helps!

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              20 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...