Hi,
When I supplied a reference GTF to Cufflinks (-G), I found duplicated gene IDs in the output "genes.expr". That seems a bit weird to me, although it is very rare (3 out of 50k genes). I checked those 3, and it turns out that Cufflinks treats their isoforms as individual genes but still uses the same gene_id supplied in the GTF file. All 3 genes share one characteristic: the genomic positions of each isoform's transcript/exon/CDS features are completely different. I guess Cufflinks uses this information to judge whether different transcripts belong to the same gene, instead of using the gene_id supplied in the GTF.
I can remove them by hand, but is there a way to "force" Cufflinks to recognize them as a single gene?
cheers
silin
Original GTF file:
chr06 SZ transcript 3851140 3853473 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ CDS 3851140 3851247 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ CDS 3853062 3853304 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ exon 3853305 3853473 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.2";
###
chr06 SZ transcript 3851392 3852964 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
chr06 SZ exon 3851392 3851900 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
chr06 SZ CDS 3851901 3852434 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.1";
chr06 SZ exon 3852435 3852964 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
Cufflinks output "genes.expr":
Os06g07923 141826 chr06 3851139 3853473 0 0 0 OK
Os06g07923 141826 chr06 3851391 3852964 0 0 0 OK
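For anyone who wants to find such genes before running Cufflinks, here is a minimal Python sketch. It assumes my guess is right, i.e. that transcripts get grouped by coordinate overlap, so it is not Cufflinks' actual logic. It flags every gene_id whose transcripts fall into more than one group when transcripts are linked by overlapping exon/CDS intervals. The function names are made up for illustration, and the parser splits on any whitespace, so it assumes no spaces inside the first eight GTF columns.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

EXAMPLE_GTF = '''\
chr06 SZ transcript 3851140 3853473 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ CDS 3851140 3851247 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ CDS 3853062 3853304 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ exon 3853305 3853473 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.2";
###
chr06 SZ transcript 3851392 3852964 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
chr06 SZ exon 3851392 3851900 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
chr06 SZ CDS 3851901 3852434 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.1";
chr06 SZ exon 3852435 3852964 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
'''


def parse_gtf_lines(lines):
    """Yield (gene_id, transcript_id, start, end) for exon and CDS features."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and "###" separators
        fields = line.split(None, 8)  # tolerant of tabs turned into spaces
        if len(fields) < 9 or fields[2] not in ("exon", "CDS"):
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            yield (attrs["gene_id"], attrs["transcript_id"],
                   int(fields[3]), int(fields[4]))


def overlaps(a, b):
    """True if any closed interval in list a overlaps any interval in list b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def count_groups(transcripts):
    """Count groups of transcripts linked by overlapping features."""
    ids = list(transcripts)
    parent = {t: t for t in ids}

    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if overlaps(transcripts[a], transcripts[b]):
                parent[find(a)] = find(b)
    return len({find(t) for t in ids})


def genes_with_disjoint_isoforms(lines):
    """Return sorted gene_ids whose isoforms form more than one overlap group."""
    by_gene = defaultdict(lambda: defaultdict(list))
    for gene_id, tx_id, start, end in parse_gtf_lines(lines):
        by_gene[gene_id][tx_id].append((start, end))
    return sorted(g for g, txs in by_gene.items() if count_groups(txs) > 1)


if __name__ == "__main__":
    print(genes_with_disjoint_isoforms(EXAMPLE_GTF.splitlines()))
```

To run it on a full annotation, pass an open file handle instead of the example string. With the two isoforms above, Os06g07923 is flagged because none of the exon/CDS intervals of Os06g07923.1 overlap those of Os06g07923.2.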