Hi,
I searched for a while for an answer to my problem running cufflinks, but have found none so far.
I ran tophat + bowtie on RNA-seq data (single-end reads) and got a wild-type .sam file plus a treated .sam file. I supplied the -G GFF option to tophat; that file was converted from the Danio rerio GTF file downloaded from http://www.ensembl.org/info/data/ftp/index.html.
Then I tried to run cufflinks with the following command:
[mMi@devaP Felipa]$ cufflinks -G /home/RNASeq/FishGenome/Danio_rerio_Zv8_57.gtf ./WT_accepted_hits.sam
Counting hits in map
Error: duplicate GFF ID 'ENSDART00000099599' (or exons too far apart)!
#####################
I cannot find the string 'ENSDART00000099599' in the WT_accepted_hits.sam file, so I wrote a Perl script to look for it in the Danio_rerio_Zv8_57.gtf file:
mMi@mMi-Ubuntu:/A01 RNA-seq$ perl FindTargetRecord.pl
18 protein_coding exon 16261480 16262025 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "1"; gene_name "zgc:162977"; transcript_name "zgc:162977";
**
18 protein_coding CDS 16261480 16262025 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "1"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
**
18 protein_coding start_codon 16262023 16262025 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "1"; gene_name "zgc:162977"; transcript_name "zgc:162977";
**
18 protein_coding exon 14234408 14234520 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "2"; gene_name "zgc:162977"; transcript_name "zgc:162977";
**
18 protein_coding CDS 14234408 14234520 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "2"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
**
18 protein_coding exon 14234169 14234325 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "3"; gene_name "zgc:162977"; transcript_name "zgc:162977";
**
18 protein_coding CDS 14234169 14234325 . - 1 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "3"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
**
18 protein_coding exon 14231851 14232003 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "4"; gene_name "zgc:162977"; transcript_name "zgc:162977";
**
18 protein_coding CDS 14231851 14232003 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "4"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
**
18 protein_coding exon 14223590 14224135 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "5"; gene_name "zgc:162977"; transcript_name "zgc:162977";
**
18 protein_coding CDS 14223593 14224135 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "5"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
**
18 protein_coding stop_codon 14223590 14223592 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "5"; gene_name "zgc:162977"; transcript_name "zgc:162977";
**
Does this mean I need to delete some lines from the reference GTF file?
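In the listing above, exon 1 starts at 16261480 while exon 2 ends at 14234520, so the two are roughly 2 Mb apart, which fits the "exons too far apart" half of the error message. To see whether other transcripts in the GTF have the same shape, a minimal Python sketch along these lines might help. It assumes a standard tab-separated GTF; the function names and the `max_gap` value of 1,000,000 are my own placeholders, not the limit cufflinks actually applies.

```python
import re
from collections import defaultdict

# Pulls the transcript_id value out of the GTF attribute column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def exon_spans(gtf_lines):
    """Group the (start, end) of every exon record by transcript_id."""
    spans = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are considered
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            spans[match.group(1)].append((int(fields[3]), int(fields[4])))
    return spans


def largest_gap(exons):
    """Largest distance between consecutive exons, sorted by start."""
    ordered = sorted(exons)
    gaps = [nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])]
    return max(gaps, default=0)


def suspicious_transcripts(gtf_lines, max_gap=1_000_000):
    """Map transcript_id -> largest exon gap, for gaps above max_gap.

    max_gap is an arbitrary assumption, not the cufflinks threshold.
    """
    flagged = {}
    for transcript_id, exons in exon_spans(gtf_lines).items():
        gap = largest_gap(exons)
        if gap > max_gap:
            flagged[transcript_id] = gap
    return flagged
```

It could be called as `suspicious_transcripts(open("Danio_rerio_Zv8_57.gtf"))`; any transcript it reports is a candidate for removal or correction before passing the file to -G.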
#################
Then I dropped the -G option:
[mMi@devaP Felipa]$ cufflinks WT_accepted_hits.sam
Now it seems to run fine and produces the .gtf, gene.expr and transcripts.expr files, but all IDs are cufflinks-assigned (cuffID), not gene or transcript IDs.
#####################
Any suggestions for sorting this out?
Cheers