When cuffcompare is run with multiple input GTF files from different sequencing experiments, how does it decide which transcript to list as the "representative" transcript in the combined.gtf file? If multiple transcripts have the same intron structure but are not identical, which one does it put in combined.gtf? It does not always choose the longest one, which is what I expected it to do.
For example, from these two input GTF files,

Code:
contig00177 Cufflinks transcript 230 711 1000 . . gene_id "CUFF.68933"; transcript_id "CUFF.68933.1"; FPKM "107.2929288684"; frac "1.000000"; conf_lo "86.576469"; conf_hi "128.009389"; cov "2.269504";
contig00177 Cufflinks exon 230 711 1000 . . gene_id "CUFF.68933"; transcript_id "CUFF.68933.1"; exon_number "1"; FPKM "107.2929288684"; frac "1.000000"; conf_lo "86.576469"; conf_hi "128.009389"; cov "2.269504";

and

Code:
contig00177 Cufflinks transcript 230 1047 1000 . . gene_id "CUFF.71009"; transcript_id "CUFF.71009.1"; FPKM "86.1874620509"; frac "1.000000"; conf_lo "67.620022"; conf_hi "104.754903"; cov "2.055336";
contig00177 Cufflinks exon 230 1047 1000 . . gene_id "CUFF.71009"; transcript_id "CUFF.71009.1"; exon_number "1"; FPKM "86.1874620509"; frac "1.000000"; conf_lo "67.620022"; conf_hi "104.754903"; cov "2.055336";

the transcript listed in the combined.gtf file is the first and shorter transcript:

Code:
contig00177 Cufflinks exon 230 711 . ^@ . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.68933.1"; class_code ".";

I am running version 0.9.2 of cufflinks/cuffcompare, and am running it without a "reference" annotation GTF.

Incidentally, my goal here is to create a transcriptome for a mostly unannotated, novel genome. I have RNA-seq data from two different sequencing runs. I ran TopHat (with a couple of parameter sets) and Cufflinks on the reads to predict transcripts. I am now using cuffcompare to consolidate the transcript predictions from the two sequencing/TopHat runs to create the most "comprehensive" transcriptome.
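To make concrete what I expected, here is a minimal Python sketch of a "keep the longest transcript per locus" rule. It is only my assumption about what "representative" could mean, not cuffcompare's actual logic. The function names (`parse_transcripts`, `length`, `longest_per_locus`) are made up for illustration, and the sketch only looks at overlapping `transcript` records on the same contig, ignoring strand and intron structure, so it really only fits single-exon cases like the one above.

```python
import re


def parse_transcripts(gtf_text):
    """Return one dict per 'transcript' record in whitespace-delimited GTF text."""
    records = []
    for line in gtf_text.splitlines():
        # Split into the 8 fixed columns plus the attribute column.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        records.append({
            "seqname": fields[0],
            "start": int(fields[3]),
            "end": int(fields[4]),
            "id": match.group(1),
        })
    return records


def length(rec):
    """Transcript span in bases (GTF coordinates are 1-based, inclusive)."""
    return rec["end"] - rec["start"] + 1


def longest_per_locus(records):
    """Group overlapping transcripts per contig and keep the longest of each group."""
    chosen = []
    group = []
    group_end = None
    ordered = sorted(records, key=lambda r: (r["seqname"], r["start"], r["end"]))
    for rec in ordered:
        same_locus = (
            group
            and rec["seqname"] == group[0]["seqname"]
            and rec["start"] <= group_end
        )
        if same_locus:
            group.append(rec)
            group_end = max(group_end, rec["end"])
        else:
            if group:
                chosen.append(max(group, key=length))
            group = [rec]
            group_end = rec["end"]
    if group:
        chosen.append(max(group, key=length))
    return chosen


# The two transcript records from my example (attributes trimmed).
run1 = 'contig00177 Cufflinks transcript 230 711 1000 . . gene_id "CUFF.68933"; transcript_id "CUFF.68933.1";'
run2 = 'contig00177 Cufflinks transcript 230 1047 1000 . . gene_id "CUFF.71009"; transcript_id "CUFF.71009.1";'

for rec in longest_per_locus(parse_transcripts(run1) + parse_transcripts(run2)):
    print(rec["id"], length(rec))  # prints: CUFF.71009.1 818
```

Under this rule the 818-base transcript CUFF.71009.1 would be kept, whereas combined.gtf lists the 482-base CUFF.68933.1.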