#1 |
Member
Location: Kansas City Join Date: Oct 2009
Posts: 88
I have been trying to supply a GTF for annotation with Cufflinks/Cuffcompare and I have been having no success at all.
I started with only GFF files. The organism I work with, Arabidopsis, does not have any published GTF annotation files that I have been able to locate, and I saw that someone else on here was unable to locate any as well. So I attempted to convert the GFFs I had into GTFs by converting the ninth column, using http://mblab.wustl.edu/GTF22.html as my reference. On the first try I simply took the feature column and made it both the gene_id and the transcript_id. Having the names would be nice, but for our purposes just knowing what the reads represent (mRNA, miRNA, siRNA, pseudogene, etc.) is sufficient.
Code:
Chr1 TAIR9 gene 3631 5899 . + . gene_id "gene"; transcript_id "gene";
Chr1 TAIR9 mRNA 3631 5899 . + . gene_id "mRNA"; transcript_id "mRNA";
Chr1 TAIR9 protein 3760 5630 . + . gene_id "protein"; transcript_id "protein";
Code:
cuffcompare -r *.gtf -R -V -o 162.162E -p 4 transcripts1.gtf transcripts2.gtf
Loading reference transcripts..
Error: duplicate GFF ID 'mRNA' encountered!
Code:
Chr1 TAIR9 gene 3631 5899 . + . gene_id "gene2"; transcript_id "gene-2";
Chr1 TAIR9 mRNA 3631 5899 . + . gene_id "mRNA3"; transcript_id "mRNA-3";
Chr1 TAIR9 protein 3760 5630 . + . gene_id "protein4"; transcript_id "protein-4";
Code:
cuffcompare -r *.gtf -R -V -o 162.162E -p 4 transcripts1.gtf transcripts2.gtf
Loading reference transcripts..
GList error (GList.hh:592):Invalid list index: -1
Code:
Chr1 TAIR9 gene 3631 5899 . + . gene_id "gene2"; transcript_id "gene12";
Chr1 TAIR9 mRNA 3631 5899 . + . gene_id "mRNA3"; transcript_id "mRNA13";
Chr1 TAIR9 protein 3760 5630 . + . gene_id "protein4"; transcript_id "protein14";
Code:
cuffcompare -r *.gtf -R -V -o 162.162E -p 4 transcripts1.gtf transcripts2.gtf
Loading reference transcripts..
GList error (GList.hh:592):Invalid list index: -1
Can anyone make a recommendation on changing a GFF into a GTF? Tophat accepted GFF files for annotation, but for some reason Cufflinks only allows GTF files to provide annotation. That's fine for some of the more mainstream organisms, but a lot of them (Arabidopsis in my case) only have annotations in GFF and GFF3, which creates a wall in being able to process the expression data.
Any and all help/suggestions would be greatly appreciated. I've been hung up on this problem for some time now and I have no more ideas on how to proceed. Thanks as always.
#2 |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
Ignore everything except for exons and CDS lines; those are all that matter to cufflinks. Every exon or CDS entry which is part of the same gene must have the same "gene_id". Every exon or CDS which is part of the same transcript must have the same "transcript_id". Here is an example of one gene (AT1G01020) which has two transcripts (AT1G01020.1 and AT1G01020.2).
The GFF3 (TAIR9 annotation):
Code:
Chr1 TAIR9 gene 5928 8737 . - . ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1 TAIR9 mRNA 5928 8737 . - . ID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1;Index=1
Chr1 TAIR9 protein 6915 8666 . - . ID=AT1G01020.1-Protein;Name=AT1G01020.1;Derives_from=AT1G01020.1
Chr1 TAIR9 five_prime_UTR 8667 8737 . - . Parent=AT1G01020.1
Chr1 TAIR9 CDS 8571 8666 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR9 exon 8571 8737 . - . Parent=AT1G01020.1
Chr1 TAIR9 CDS 8417 8464 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR9 exon 8417 8464 . - . Parent=AT1G01020.1
Chr1 TAIR9 CDS 8236 8325 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR9 exon 8236 8325 . - . Parent=AT1G01020.1
Chr1 TAIR9 CDS 7942 7987 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR9 exon 7942 7987 . - . Parent=AT1G01020.1
Chr1 TAIR9 CDS 7762 7835 . - 2 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR9 exon 7762 7835 . - . Parent=AT1G01020.1
Chr1 TAIR9 CDS 7564 7649 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR9 exon 7564 7649 . - . Parent=AT1G01020.1
Chr1 TAIR9 CDS 7384 7450 . - 1 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR9 exon 7384 7450 . - . Parent=AT1G01020.1
Chr1 TAIR9 CDS 7157 7232 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR9 exon 7157 7232 . - . Parent=AT1G01020.1
Chr1 TAIR9 CDS 6915 7069 . - 2 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR9 three_prime_UTR 6437 6914 . - . Parent=AT1G01020.1
Chr1 TAIR9 exon 6437 7069 . - . Parent=AT1G01020.1
Chr1 TAIR9 three_prime_UTR 5928 6263 . - . Parent=AT1G01020.1
Chr1 TAIR9 exon 5928 6263 . - . Parent=AT1G01020.1
Chr1 TAIR9 mRNA 6790 8737 . - . ID=AT1G01020.2;Parent=AT1G01020;Name=AT1G01020.2;Index=1
Chr1 TAIR9 protein 7315 8666 . - . ID=AT1G01020.2-Protein;Name=AT1G01020.2;Derives_from=AT1G01020.2
Chr1 TAIR9 five_prime_UTR 8667 8737 . - . Parent=AT1G01020.2
Chr1 TAIR9 CDS 8571 8666 . - 0 Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1 TAIR9 exon 8571 8737 . - . Parent=AT1G01020.2
Chr1 TAIR9 CDS 8417 8464 . - 0 Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1 TAIR9 exon 8417 8464 . - . Parent=AT1G01020.2
Chr1 TAIR9 CDS 8236 8325 . - 0 Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1 TAIR9 exon 8236 8325 . - . Parent=AT1G01020.2
Chr1 TAIR9 CDS 7942 7987 . - 0 Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1 TAIR9 exon 7942 7987 . - . Parent=AT1G01020.2
Chr1 TAIR9 CDS 7762 7835 . - 2 Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1 TAIR9 exon 7762 7835 . - . Parent=AT1G01020.2
Chr1 TAIR9 CDS 7564 7649 . - 0 Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1 TAIR9 exon 7564 7649 . - . Parent=AT1G01020.2
Chr1 TAIR9 CDS 7315 7450 . - 1 Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1 TAIR9 three_prime_UTR 7157 7314 . - . Parent=AT1G01020.2
Chr1 TAIR9 exon 7157 7450 . - . Parent=AT1G01020.2
Chr1 TAIR9 three_prime_UTR 6790 7069 . - . Parent=AT1G01020.2
Chr1 TAIR9 exon 6790 7069 . - . Parent=AT1G01020.2
The corresponding GTF:
Code:
Chr1 TAIR9 CDS 8571 8666 . - 0 gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 EXON 8571 8737 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 CDS 8417 8464 . - 0 gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 EXON 8417 8464 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 CDS 8236 8325 . - 0 gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 EXON 8236 8325 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 CDS 7942 7987 . - 0 gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 EXON 7942 7987 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 CDS 7762 7835 . - 2 gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 EXON 7762 7835 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 CDS 7564 7649 . - 0 gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 EXON 7564 7649 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 CDS 7384 7450 . - 1 gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 EXON 7384 7450 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 CDS 7157 7232 . - 0 gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 EXON 7157 7232 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 CDS 6915 7069 . - 2 gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 EXON 6437 7069 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 EXON 5928 6263 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1 TAIR9 CDS 8571 8666 . - 0 gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 EXON 8571 8737 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 CDS 8417 8464 . - 0 gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 EXON 8417 8464 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 CDS 8236 8325 . - 0 gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 EXON 8236 8325 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 CDS 7942 7987 . - 0 gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 EXON 7942 7987 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 CDS 7762 7835 . - 2 gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 EXON 7762 7835 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 CDS 7564 7649 . - 0 gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 EXON 7564 7649 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 CDS 7315 7450 . - 1 gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 EXON 7157 7450 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1 TAIR9 EXON 6790 7069 . - . gene_id "AT1G01020"; transcript_id "AT1G01020.2";
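For anyone who wants to script this kind of conversion, here is a minimal Python sketch of the rule described above. It is not the tool used to produce the GTF in this post; the function names are made up for illustration, and it assumes a tab-delimited GFF3 in which each transcript line carries both an `ID` and a `Parent` (the gene), as in the TAIR9 excerpt. Only exon and CDS lines are kept, and each one is written once per parent transcript.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for field in column9.strip().strip(";").split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def gff3_to_gtf(gff3_lines):
    """Return GTF lines for the exon and CDS records in gff3_lines.

    Assumption: a feature that has both ID and Parent (e.g. an mRNA line)
    is a transcript, and its first Parent is the gene it belongs to.
    """
    records = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            records.append(fields)

    # First pass: map transcript ID -> gene ID.
    transcript_to_gene = {}
    for fields in records:
        attributes = parse_gff3_attributes(fields[8])
        if "ID" in attributes and "Parent" in attributes:
            gene_id = attributes["Parent"].split(",")[0]
            transcript_to_gene[attributes["ID"]] = gene_id

    # Second pass: emit exon/CDS lines with gene_id and transcript_id.
    gtf_lines = []
    for fields in records:
        if fields[2] not in ("exon", "CDS"):
            continue
        attributes = parse_gff3_attributes(fields[8])
        for parent in attributes.get("Parent", "").split(","):
            # Parents that are not known transcripts (e.g. the
            # "-Protein" IDs on TAIR9 CDS lines) are skipped.
            if parent in transcript_to_gene:
                column9 = 'gene_id "{}"; transcript_id "{}";'.format(
                    transcript_to_gene[parent], parent
                )
                gtf_lines.append("\t".join(fields[:8] + [column9]))
    return gtf_lines


# Small demonstration on a few lines taken from the excerpt above.
SAMPLE_GFF3 = [
    "Chr1\tTAIR9\tgene\t5928\t8737\t.\t-\t.\tID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020",
    "Chr1\tTAIR9\tmRNA\t5928\t8737\t.\t-\t.\tID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1;Index=1",
    "Chr1\tTAIR9\tprotein\t6915\t8666\t.\t-\t.\tID=AT1G01020.1-Protein;Name=AT1G01020.1;Derives_from=AT1G01020.1",
    "Chr1\tTAIR9\tfive_prime_UTR\t8667\t8737\t.\t-\t.\tParent=AT1G01020.1",
    "Chr1\tTAIR9\tCDS\t8571\t8666\t.\t-\t0\tParent=AT1G01020.1,AT1G01020.1-Protein;",
    "Chr1\tTAIR9\texon\t8571\t8737\t.\t-\t.\tParent=AT1G01020.1",
]

for gtf_line in gff3_to_gtf(SAMPLE_GFF3):
    print(gtf_line)
```

To convert a whole file, pass an open file handle (for example `gff3_to_gtf(open("annotation.gff3"))`, where the file name is a placeholder) and write the returned lines out one per line. Annotations that nest features differently would need a different parent lookup.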
#3 |
Member
Location: Kansas City Join Date: Oct 2009
Posts: 88
Thank you for the reply; that clears some things up for me.
I do have a few questions though:
1.) How were you able to convert the TAIR9 GFF3 files into GTF format?
2.) We are mostly interested in investigating small RNA such as miRNA, siRNA, and other non-coding RNA. We have files for them in GFF. The siRNA data started out as just sequences in supplementary data. I aligned those sequences to the genome and created a GFF from the alignments. How could I supply files such as those to Cufflinks? Example:
Code:
Chr1 TAIR9 Jacobsen_siRNA 10002796 10002812 . . . .
Chr1 TAIR9 Jacobsen_siRNA 10004771 10004794 . . . .
Chr1 TAIR9 Jacobsen_siRNA 10004925 10004941 . . . .
Chr1 TAIR9 Jacobsen_siRNA 10007606 10007626 . . . .
#4 |
Member
Location: Singapore Join Date: Jan 2010
Posts: 36
Hi, I'm encountering a similar issue with cuffcompare. When I tried to run it with the transcripts.gtf generated by cufflinks, it gave me the following error:
GList error (GList.hh:592):Invalid list index: 0
This is very strange, because the file was generated by cufflinks and is supposed to work with cuffcompare. Could someone please help? Thanks!
-EDIT- I found out that it could be because of the missing strand information. Sorry about that.
Last edited by Haneko; 04-07-2010 at 08:25 PM. Reason: Problem may be solved
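If missing strand information is the suspect, a quick check like the Python sketch below lists the records whose strand column is neither `+` nor `-`. The function name is hypothetical, the input is assumed to be tab-delimited GTF, and whether such records actually trigger the GList error is only the guess from the edit above, not something this code verifies.

```python
def records_missing_strand(gtf_lines):
    """Return (line_number, feature, strand) for records without + or -."""
    problems = []
    for line_number, line in enumerate(gtf_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            # Not a complete nine-column record; skip rather than guess.
            continue
        strand = fields[6]
        if strand not in ("+", "-"):
            problems.append((line_number, fields[2], strand))
    return problems


# Illustrative records (made-up IDs), one without strand information.
EXAMPLE_GTF = [
    'Chr1\tCufflinks\texon\t100\t200\t.\t.\t.\tgene_id "G1"; transcript_id "T1";',
    'Chr1\tCufflinks\texon\t300\t400\t.\t+\t.\tgene_id "G2"; transcript_id "T2";',
]

for line_number, feature, strand in records_missing_strand(EXAMPLE_GTF):
    print("line {}: {} has strand '{}'".format(line_number, feature, strand))
```

Running it over a real transcripts.gtf (pass the open file handle instead of the example list) shows how many records lack a strand before deciding whether to filter them out.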
#5 |
Member
Location: Oxford Join Date: Feb 2010
Posts: 16
Same situation for me. I cannot run cuffcompare because of duplicate errors. What I did was use a perl script to delete all duplicated exon lines (the exon numbers vary, though) while keeping the transcript lines. Compared to the original gtf file generated by cufflinks, this new "transcript only" gtf file seems to have all the information, including strand.
However, I still got the error "GList error (GList.hh:592):Invalid list index: 0". Haneko, can you share your idea of what is going on? Cheers
#6 | |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
Note: I was going to post the entire TAIR9 GTF, but the gzipped file is too large to attach and I don't have an accessible server. If you desperately need it, send me a PM and I could e-mail it to you.
#7 |
Junior Member
Location: Kannapolis, NC Join Date: Feb 2009
Posts: 1
Hi kmcarr,
Would it be possible for you to email me the TAIR9 GTF file? Thanks.
#8 |
Junior Member
Location: France Join Date: Jan 2010
Posts: 4
Hi kmcarr,
I am also interested in your TAIR9 GTF file. Would it be possible to email me this file (cek5767@yahoo.fr)? Thanks!
#9 |
Junior Member
Location: Virginia Join Date: Oct 2010
Posts: 2
Hi kmcarr,
Could you post your gff.pm hack? I need to do this conversion and need to worry about frame.
Thanks, Bob
#10 |
Member
Location: Durham Join Date: Oct 2010
Posts: 19
It seems that a GTF file is now provided by TAIR. Has anyone tried it?
ftp://ftp.arabidopsis.org/home/tair/...enes_exons.gtf
Thanks,
#11 |
Junior Member
Location: Pune, India Join Date: Dec 2012
Posts: 8
Hello All,
I was having this issue while running "cuffmerge" on assemblies built using cufflinks 2.1.1. It turned out that the duplicated entries were not in the gencode gtf file I was using as the reference, but in the "transcripts.gtf" file created during the cufflinks step. Updating cufflinks to the newer version 2.2.1 and re-running the cufflinks step resolved the issue. Hope that helps. Good luck.