SEQanswers

Old 04-01-2010, 08:29 AM   #1
DrD2009
Member
 
Location: Kansas City

Join Date: Oct 2009
Posts: 88
Problems creating GTF for Cufflinks annotation

I have been trying to supply a GTF for annotation with Cufflinks/Cuffcompare and I have been having no success at all.

I started by only having GFF files. The organism I work with, Arabidopsis, does not have any published GTF annotation files that I have been able to locate and I saw someone else on here was unable to locate any as well. So I attempted to convert the GFFs I had into GTFs by converting the ninth column. I used http://mblab.wustl.edu/GTF22.html as my reference.

On the first try I simply used the feature column as both the gene_id and the transcript_id. Having the names would be nice, but for our purposes it is sufficient to know what the reads represent (mRNA, miRNA, siRNA, pseudogene, etc.).

Code:
Chr1	TAIR9	gene	3631	5899	.	+	.	gene_id "gene"; transcript_id "gene";

Chr1	TAIR9	mRNA	3631	5899	.	+	.	gene_id "mRNA"; transcript_id "mRNA";

Chr1	TAIR9	protein	3760	5630	.	+	.	gene_id "protein"; transcript_id "protein";
This resulted in an error in Cuffcompare:

Code:
cuffcompare -r *.gtf -R -V -o 162.162E -p 4 transcripts1.gtf transcripts2.gtf
Loading reference transcripts..
Error: duplicate GFF ID 'mRNA' encountered!
Based on the error, I redid my GFF-to-GTF conversion, numbering each gene_id and transcript_id so that every value in the file is unique:

Code:
Chr1	TAIR9	gene	3631	5899	.	+	.	gene_id "gene2"; transcript_id "gene-2";

Chr1	TAIR9	mRNA	3631	5899	.	+	.	gene_id "mRNA3"; transcript_id "mRNA-3";

Chr1	TAIR9	protein	3760	5630	.	+	.	gene_id "protein4"; transcript_id "protein-4";
Result:

Code:
cuffcompare -r *.gtf -R -V -o 162.162E -p 4 transcripts1.gtf transcripts2.gtf
Loading reference transcripts..
GList error (GList.hh:592):Invalid list index: -1
I investigated the error but was unable to find anything, so I figured that the format of my transcript_id (*****-N) might be causing it and altered the GTF again, replacing "-" with "1":

Code:
Chr1	TAIR9	gene	3631	5899	.	+	.	gene_id "gene2"; transcript_id "gene12";

Chr1	TAIR9	mRNA	3631	5899	.	+	.	gene_id "mRNA3"; transcript_id "mRNA13";

Chr1	TAIR9	protein	3760	5630	.	+	.	gene_id "protein4"; transcript_id "protein14";
Result:

Code:
cuffcompare -r *.gtf -R -V -o 162.162E -p 4 transcripts1.gtf transcripts2.gtf
Loading reference transcripts..
GList error (GList.hh:592):Invalid list index: -1
I have no idea what the "GList error (GList.hh:592):Invalid list index: -1" error means or how to correct it.

Can anyone make a recommendation on changing a GFF into a GTF? Tophat was able to supply GFF files for annotation, but for some reason Cufflinks only allows GTF files to provide annotation. It's great for some of the more mainstream organisms, but a lot of them (Arabidopsis in my case) only have annotations in GFF and GFF3 which creates a wall in being able to process the expression data.

Any and all help/suggestions would be greatly appreciated. I've been hung up on this problem for some time now and I have no more ideas on how to proceed.


Thanks as always.
Old 04-01-2010, 08:25 PM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177

Ignore everything except the exon and CDS lines; those are all that matter to cufflinks. Every exon or CDS entry which is part of the same gene must have the same "gene_id". Every exon or CDS entry which is part of the same transcript must have the same "transcript_id". Here is an example of one gene (AT1G01020) which has two transcripts (AT1G01020.1 and AT1G01020.2).

The GFF3 (TAIR9 annotation):

Code:
Chr1	TAIR9	gene	5928	8737	.	-	.	ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1	TAIR9	mRNA	5928	8737	.	-	.	ID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1;Index=1
Chr1	TAIR9	protein	6915	8666	.	-	.	ID=AT1G01020.1-Protein;Name=AT1G01020.1;Derives_from=AT1G01020.1
Chr1	TAIR9	five_prime_UTR	8667	8737	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	CDS	8571	8666	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1	TAIR9	exon	8571	8737	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	CDS	8417	8464	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1	TAIR9	exon	8417	8464	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	CDS	8236	8325	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1	TAIR9	exon	8236	8325	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	CDS	7942	7987	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1	TAIR9	exon	7942	7987	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	CDS	7762	7835	.	-	2	Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1	TAIR9	exon	7762	7835	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	CDS	7564	7649	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1	TAIR9	exon	7564	7649	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	CDS	7384	7450	.	-	1	Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1	TAIR9	exon	7384	7450	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	CDS	7157	7232	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1	TAIR9	exon	7157	7232	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	CDS	6915	7069	.	-	2	Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1	TAIR9	three_prime_UTR	6437	6914	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	exon	6437	7069	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	three_prime_UTR	5928	6263	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	exon	5928	6263	.	-	.	Parent=AT1G01020.1
Chr1	TAIR9	mRNA	6790	8737	.	-	.	ID=AT1G01020.2;Parent=AT1G01020;Name=AT1G01020.2;Index=1
Chr1	TAIR9	protein	7315	8666	.	-	.	ID=AT1G01020.2-Protein;Name=AT1G01020.2;Derives_from=AT1G01020.2
Chr1	TAIR9	five_prime_UTR	8667	8737	.	-	.	Parent=AT1G01020.2
Chr1	TAIR9	CDS	8571	8666	.	-	0	Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1	TAIR9	exon	8571	8737	.	-	.	Parent=AT1G01020.2
Chr1	TAIR9	CDS	8417	8464	.	-	0	Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1	TAIR9	exon	8417	8464	.	-	.	Parent=AT1G01020.2
Chr1	TAIR9	CDS	8236	8325	.	-	0	Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1	TAIR9	exon	8236	8325	.	-	.	Parent=AT1G01020.2
Chr1	TAIR9	CDS	7942	7987	.	-	0	Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1	TAIR9	exon	7942	7987	.	-	.	Parent=AT1G01020.2
Chr1	TAIR9	CDS	7762	7835	.	-	2	Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1	TAIR9	exon	7762	7835	.	-	.	Parent=AT1G01020.2
Chr1	TAIR9	CDS	7564	7649	.	-	0	Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1	TAIR9	exon	7564	7649	.	-	.	Parent=AT1G01020.2
Chr1	TAIR9	CDS	7315	7450	.	-	1	Parent=AT1G01020.2,AT1G01020.2-Protein;
Chr1	TAIR9	three_prime_UTR	7157	7314	.	-	.	Parent=AT1G01020.2
Chr1	TAIR9	exon	7157	7450	.	-	.	Parent=AT1G01020.2
Chr1	TAIR9	three_prime_UTR	6790	7069	.	-	.	Parent=AT1G01020.2
Chr1	TAIR9	exon	6790	7069	.	-	.	Parent=AT1G01020.2
Same information in GTF:

Code:
Chr1	TAIR9	CDS	8571	8666	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	EXON	8571	8737	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	CDS	8417	8464	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	EXON	8417	8464	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	CDS	8236	8325	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	EXON	8236	8325	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	CDS	7942	7987	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	EXON	7942	7987	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	CDS	7762	7835	.	-	2	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	EXON	7762	7835	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	CDS	7564	7649	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	EXON	7564	7649	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	CDS	7384	7450	.	-	1	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	EXON	7384	7450	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	CDS	7157	7232	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	EXON	7157	7232	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	CDS	6915	7069	.	-	2	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	EXON	6437	7069	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	EXON	5928	6263	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
Chr1	TAIR9	CDS	8571	8666	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	EXON	8571	8737	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	CDS	8417	8464	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	EXON	8417	8464	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	CDS	8236	8325	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	EXON	8236	8325	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	CDS	7942	7987	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	EXON	7942	7987	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	CDS	7762	7835	.	-	2	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	EXON	7762	7835	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	CDS	7564	7649	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	EXON	7564	7649	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	CDS	7315	7450	.	-	1	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	EXON	7157	7450	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
Chr1	TAIR9	EXON	6790	7069	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
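The mapping shown above can be sketched in a few lines of Python. This is only an illustration, not the Perl script used for the actual conversion: the function name gff3_to_gtf_lines is made up, and it assumes that every transcript is an "mRNA" row whose Parent attribute names its gene, as in the TAIR9 excerpt. Feature names are copied unchanged from the GFF3 input.

```python
def gff3_to_gtf_lines(gff3_lines, transcript_types=("mRNA",)):
    """Turn GFF3 exon/CDS rows into GTF rows (illustrative sketch).

    Assumption: each transcript row (here only "mRNA") carries ID=<transcript>
    and Parent=<gene>, as in the TAIR9 excerpt above.
    """
    transcript_to_gene = {}
    pending = []  # (columns, list of Parent values) for exon/CDS rows

    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        # Parse "key=value;key=value" attributes; empty fields are ignored.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if cols[2] in transcript_types and "ID" in attrs and "Parent" in attrs:
            transcript_to_gene[attrs["ID"]] = attrs["Parent"]
        elif cols[2] in ("exon", "CDS") and "Parent" in attrs:
            pending.append((cols, attrs["Parent"].split(",")))

    gtf = []
    for cols, parents in pending:
        for parent in parents:
            gene = transcript_to_gene.get(parent)
            if gene is None:
                # e.g. the "AT1G01020.1-Protein" parent on CDS rows
                continue
            attr = f'gene_id "{gene}"; transcript_id "{parent}";'
            gtf.append("\t".join(cols[:8] + [attr]))
    return gtf
```

Rows are resolved in a second pass so that an exon listed before its mRNA row is still assigned to the right gene. Everything other than exon and CDS rows is dropped, as described above.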
Old 04-02-2010, 10:08 AM   #3
DrD2009
Member
 
Location: Kansas City

Join Date: Oct 2009
Posts: 88

Thank you for the reply; that clears some things up for me.

I do have a few questions though:

1.) How were you able to convert the TAIR9 GFF3 files into GTF format?

2.) We are mostly interested in investigating small RNA such as miRNA, siRNA, and other non-coding RNA. We have files for them in GFF. The siRNA data started out as just sequences in supplementary data. From those I aligned them to the genome and created a GFF from that data. How could I supply files such as those to Cufflinks?

Example:
Code:
Chr1	TAIR9	    Jacobsen_siRNA	10002796	10002812	.	.	.	.
Chr1	TAIR9       Jacobsen_siRNA	10004771	10004794	.	.	.	.
Chr1	TAIR9       Jacobsen_siRNA	10004925	10004941	.	.	.	.
Chr1	TAIR9	    Jacobsen_siRNA	10007606	10007626	.	.	.	.
Old 04-07-2010, 07:21 PM   #4
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36

Hi, I'm encountering a similar issue with cuffcompare. When I ran it on the transcripts.gtf generated by cufflinks, it gave me the following error:

GList error (GList.hh:592):Invalid list index: 0

This is very strange because the file was generated by cufflinks, so it is supposed to work with cuffcompare. Could someone please help?

Thanks!

-EDIT-
I found out that it could be because of the missing strand information. Sorry about that.
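For anyone who wants to check their own file for this, here is a small Python sketch that lists the rows whose strand column is neither "+" nor "-". It only locates such rows; it does not confirm that they are what triggers the GList error. The function name rows_missing_strand is made up for this example.

```python
import re

def rows_missing_strand(gtf_lines):
    """Return (line_number, transcript_id) for rows whose strand is not + or -.

    transcript_id is None when the attribute column has no transcript_id.
    """
    hits = []
    for number, line in enumerate(gtf_lines, start=1):
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[0].startswith("#"):
            continue  # skip comments and malformed rows
        if cols[6] not in ("+", "-"):
            match = re.search(r'transcript_id "([^"]*)"', cols[8])
            hits.append((number, match.group(1) if match else None))
    return hits
```

Calling it as rows_missing_strand(open("transcripts.gtf")) would scan a whole file line by line.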

Last edited by Haneko; 04-07-2010 at 07:25 PM. Reason: Problem may be solved
Old 04-13-2010, 01:54 AM   #5
middlemale
Member
 
Location: Oxford

Join Date: Feb 2010
Posts: 16
GList.hh:592 error

Same situation for me. I cannot run cuffcompare because of duplicate errors. With a Perl script, I deleted all duplicated exon lines (the exon numbers vary, though) but kept the transcript lines. Compared to the original GTF file generated by cufflinks, this new "transcript only" GTF file seems to have all the information, including strand.

However, I still got the error "GList error (GList.hh:592):Invalid list index: 0".

Haneko, can you share your idea of what is going on?

cheers
Old 04-13-2010, 10:44 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177

Quote:
Originally Posted by DrD2009 View Post
Thank you for the reply; that clears some things up for me.

I do have a few questions though:

1.) How were you able to convert the TAIR9 GFF3 files into GTF format?

2.) We are mostly interested in investigating small RNA such as miRNA, siRNA, and other non-coding RNA. We have files for them in GFF. The siRNA data started out as just sequences in supplementary data. From those I aligned them to the genome and created a GFF from that data. How could I supply files such as those to Cufflinks?

Example:
Code:
Chr1	TAIR9	    Jacobsen_siRNA	10002796	10002812	.	.	.	.
Chr1	TAIR9       Jacobsen_siRNA	10004771	10004794	.	.	.	.
Chr1	TAIR9       Jacobsen_siRNA	10004925	10004941	.	.	.	.
Chr1	TAIR9	    Jacobsen_siRNA	10007606	10007626	.	.	.	.
I converted the TAIR9 GFF3 file using the attached Perl script. This script uses BioPerl, specifically Bio::FeatureIO. However, there appears to be a bug in Bio::FeatureIO::gff related to the phase/frame value. To get this script to work properly I actually had to hack up Bio/FeatureIO/gff.pm a little. I am properly ashamed for having done this. Since frame/phase is irrelevant to your siRNA annotations, you would not have to worry about this issue. You would need to install BioPerl to run the script, though.

Note: I was going to post the entire TAIR9 GTF but the gzipped file is too large to attach and I don't have an accessible server. If you desperately need it, send me a PM and I could e-mail it to you.
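For the siRNA rows in the quoted example, which have no attribute column to convert, one possible approach is to treat each row as a single-exon transcript and give it its own ID. The Python sketch below does that under several assumptions: the function name small_rna_rows_to_gtf and the "<feature>_<n>" ID scheme are invented for illustration, rows are split on any whitespace because the example mixes tabs and spaces, and the strand is passed through unchanged. Whether cufflinks accepts records whose strand is "." is not established in this thread (post #4 suggests missing strand information may cause trouble).

```python
def small_rna_rows_to_gtf(rows, feature="exon"):
    """Rewrite attribute-less 9-column rows as single-exon GTF records.

    Hypothetical ID scheme: the original feature name plus a running
    number, used for both gene_id and transcript_id.
    """
    gtf = []
    count = 0
    for row in rows:
        cols = row.split()  # tolerate mixed tabs and spaces
        if not cols or cols[0].startswith("#"):
            continue
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, got {len(cols)}: {row!r}")
        seqname, source, kind, start, end, score, strand, frame, _ = cols
        count += 1
        ident = f"{kind}_{count}"
        attr = f'gene_id "{ident}"; transcript_id "{ident}";'
        gtf.append("\t".join(
            [seqname, source, feature, start, end, score, strand, frame, attr]
        ))
    return gtf
```

The original feature name (Jacobsen_siRNA) survives only inside the generated IDs, since the feature column itself has to become "exon" for the record to be considered.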
Attached Files
File Type: pl bp_gff3_to_gtf.pl (1.0 KB, 325 views)
Old 07-27-2010, 10:23 AM   #7
kpatel
Junior Member
 
Location: Kannapolis, NC

Join Date: Feb 2009
Posts: 1

Hi kmcarr,

Would it be possible for you to email me the TAIR9 gtf file?

thanks
Old 10-11-2010, 12:14 PM   #8
cek
Junior Member
 
Location: France

Join Date: Jan 2010
Posts: 4

Hi kmcarr,

I am also interested in your TAIR9 gtf file. Would it be possible to email me this file (cek5767@yahoo.fr) ?
Thanks !
Old 10-28-2010, 05:58 PM   #9
Bob Settlage
Junior Member
 
Location: Virginia

Join Date: Oct 2010
Posts: 2
gff.pm

Hi kmcarr,
Could you post your gff.pm hack? I need to do this conversion and need to worry about frame.
Thanks,
Bob
Old 11-16-2010, 09:14 AM   #10
SongLi
Member
 
Location: Durham

Join Date: Oct 2010
Posts: 19

It seems that TAIR provides a GTF file now. Has anyone tried it?

ftp://ftp.arabidopsis.org/home/tair/...enes_exons.gtf

thanks,

Quote:
Originally Posted by kpatel View Post
Hi kmcarr,

Would it be possible for you to email me the TAIR9 gtf file?

thanks
Old 02-23-2015, 06:20 AM   #11
amolkolte
Junior Member
 
Location: Pune, India

Join Date: Dec 2012
Posts: 8

Hello All,

I ran into this issue while running "cuffmerge" on assemblies built with cufflinks 2.1.1.

It turned out that the duplicated entries were not in the GENCODE GTF file I was using as the reference, but in the "transcripts.gtf" file created during the cufflinks step.

Updating cufflinks to the newer version 2.2.1 and re-running the cufflinks step resolved this issue.

Hope that helps.
Good luck
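To see whether a "transcripts.gtf" has this kind of problem before re-running anything, a quick check along the following lines may help. It is a sketch under an assumption: "duplicate" is taken here to mean two rows with the same sequence name, feature, coordinates, strand and transcript_id, which may not be exactly what cuffmerge objects to. The function name duplicated_records is made up.

```python
import re
from collections import Counter

def duplicated_records(gtf_lines):
    """Return record keys that occur more than once in a GTF.

    A key is (seqname, feature, start, end, strand, transcript_id).
    """
    counts = Counter()
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip comments and malformed rows
        match = re.search(r'transcript_id "([^"]*)"', cols[8])
        if match is None:
            continue
        key = (cols[0], cols[2], cols[3], cols[4], cols[6], match.group(1))
        counts[key] += 1
    return sorted(key for key, n in counts.items() if n > 1)
```

An empty result only means no exact repeats of that key were found; it does not rule out other causes of the error.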