SEQanswers > Bioinformatics

silin284 12-03-2010 11:53 AM

Bug? duplicated genes in cufflinks output genes.expr

When I supplied a reference GTF to cufflinks (-G), I found duplicated gene IDs in the output file "genes.expr". This seems odd to me, and it is very rare (3 out of 50k genes). I checked those 3, and it turns out that cufflinks treats their isoforms as individual genes but still reports them under the gene_id supplied in the GTF file. All 3 genes share one characteristic: the genomic positions of each isoform's transcript/exon/CDS features are completely different. My guess is that cufflinks uses this positional information, rather than the gene_id supplied in the GTF, to judge whether different transcripts belong to the same gene.

I can remove them by hand but is there a way to "force" cufflinks to recognize them as a single gene?


original GTF file
chr06 SZ transcript 3851140 3853473 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ CDS 3851140 3851247 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ CDS 3853062 3853304 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ exon 3853305 3853473 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ transcript 3851392 3852964 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
chr06 SZ exon 3851392 3851900 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
chr06 SZ CDS 3851901 3852434 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.1";
chr06 SZ exon 3852435 3852964 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";

cufflinks output "genes.expr"
Os06g07923 141826 chr06 3851139 3853473 0 0 0 OK
Os06g07923 141826 chr06 3851391 3852964 0 0 0 OK
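
To find such genes before running cufflinks, something like the following might help. It is only a rough sketch, not how cufflinks itself decides: it assumes that two transcripts belong together when any of their exon/CDS blocks overlap, and the function names (exon_clusters, find) are made up for illustration. A gene_id that comes back with more than one group is a candidate for the duplication shown above.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')

def find(parent, tx):
    """Union-find lookup with path halving."""
    while parent[tx] != tx:
        parent[tx] = parent[parent[tx]]
        tx = parent[tx]
    return tx

def exon_clusters(gtf_lines):
    """Return {gene_id: [set of transcript_ids, ...]}.

    Transcripts of one gene land in the same set when they are linked
    by overlapping exon/CDS blocks (an assumption of this sketch).
    """
    blocks = defaultdict(list)  # gene_id -> [(chrom, start, end, transcript_id)]
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(None, 8)  # works for tabs or spaces
        if len(fields) < 9 or fields[2] not in ("exon", "CDS"):
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        blocks[attrs["gene_id"]].append(
            (fields[0], int(fields[3]), int(fields[4]), attrs["transcript_id"])
        )

    result = {}
    for gene, ivs in blocks.items():
        parent = {tx: tx for _, _, _, tx in ivs}
        ivs.sort()
        cur_chrom, cur_end, cur_tx = None, -1, None
        # Sweep left to right, joining transcripts whose blocks overlap.
        for chrom, start, end, tx in ivs:
            if chrom == cur_chrom and start <= cur_end:
                parent[find(parent, tx)] = find(parent, cur_tx)
                cur_end = max(cur_end, end)
            else:
                cur_chrom, cur_end, cur_tx = chrom, end, tx
        groups = defaultdict(set)
        for tx in parent:
            groups[find(parent, tx)].add(tx)
        result[gene] = list(groups.values())
    return result
```

Called on the lines of the GTF file (e.g. exon_clusters(open(path)) with your own path), it returns one list of transcript groups per gene_id; for the Os06g07923 records above, the two isoforms end up in separate groups because none of their exon/CDS blocks overlap.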

apadr007 12-13-2011 08:12 AM

I have the same question. Why is cufflinks repeating genes?

kenphi 02-24-2012 02:48 AM

Dear silin

I think this is because your reference annotation has "unrelated" transcripts annotated to the same gene. I noticed that this happens when there are independent transcript groups, i.e. groups of transcripts that do not overlap in exon coordinates. They can be side by side, or one can lie in an intron of the other. Some examples are in Ensembl 64.


In some of these cases, I would say that Ensembl didn't follow its own guideline of assigning the same gene identifier to transcripts with overlapping positions, because these are clearly independent clusters.

I keep them and use the gene_id column of cufflinks to make tables unique.
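
One possible way to keep the duplicated rows and still get unique row keys is sketched below. It is an illustration only: it assumes the columns of genes.expr are gene_id, bundle_id, chr, left, right, ... in the order shown in the first post, and unique_gene_keys is a hypothetical helper name. Rows whose gene_id occurs more than once get the locus appended to the ID; all other rows are left unchanged.

```python
from collections import Counter

def unique_gene_keys(rows):
    """Make the first column unique by appending chr:left-right to repeated IDs.

    rows: list of lists of strings, one per genes.expr line
    (assumed column order: gene_id, bundle_id, chr, left, right, ...).
    """
    counts = Counter(row[0] for row in rows)
    out = []
    for row in rows:
        key = row[0]
        if counts[key] > 1:
            # Repeated gene_id: disambiguate with its coordinates.
            key = f"{row[0]}|{row[2]}:{row[3]}-{row[4]}"
        out.append([key] + list(row[1:]))
    return out
```

For the two Os06g07923 rows in the first post, this gives the keys Os06g07923|chr06:3851139-3853473 and Os06g07923|chr06:3851391-3852964.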


emanlee 05-18-2014 12:19 AM

Another thread on this issue:

A solution based on mgogol's code:
