SEQanswers

Old 08-21-2012, 04:29 PM   #1
drosoform
Junior Member
 
Location: Oregon

Join Date: Apr 2012
Posts: 6
Question Cuffdiff error - duplicate GFF ID encountered?

Hi all,

I am very new to all things RNA-seq, so please bear with me if the questions are really basic.
I am trying to compare two things for differential expression.

The pipeline I am using is: Tophat -> Cuffdiff
(with the newest versions of each, TopHat 2.0.4 and Cufflinks 2.0.2)

I am skipping running Cufflinks separately before Cuffdiff, because I'm not really interested in new gene/transcript discovery.

The problem is, when I try to run Cuffdiff, it quits with an error saying the reference annotation contains duplicate GFF IDs:

Code:
You are using Cufflinks v2.0.2, which is the most recent release.
[16:22:49] Loading reference annotation.
Error: duplicate GFF ID 'FBtr0100868' encountered!
The reference annotation I am using is the gff downloaded from Flybase: dmel-all-r5.46.gff.

However, when I searched this gff file, I didn't see duplicate lines containing this id, FBtr0100868.
Just to experiment though, I tried removing the lines containing the offending GFF id from the gff file, and running Cuffdiff again to see if it would fix the problem, but then it just had the same error with a different GFF id.
I tried doing this more times with each duplicate GFF id, but every time it just comes back with the same error and a different GFF id.
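
A minimal sketch of how one might check a GFF3 file for repeated IDs in bulk instead of searching for them one at a time. It assumes the IDs live in an `ID=` attribute in the ninth tab-separated column, as in GFF3; the function names are made up for illustration. Note that GFF3 allows the same ID on several lines when a feature is discontinuous, so a repeated ID is not necessarily a mistake in the file, and this sketch does not establish what Cuffdiff is objecting to here.

```python
import re
from collections import Counter

# Matches an ID=... attribute at the start of column 9 or after a ';'.
ID_PATTERN = re.compile(r"(?:^|;)ID=([^;]+)")


def count_gff3_ids(lines):
    """Count how many feature lines carry each ID= attribute value."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a feature line (e.g. embedded sequence data)
        match = ID_PATTERN.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def repeated_ids(lines):
    """Return {ID: line count} for every ID that appears on more than one line."""
    return {fid: n for fid, n in count_gff3_ids(lines).items() if n > 1}
```

Both functions only iterate over lines, so an open file handle for dmel-all-r5.46.gff can be passed in directly.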

Has anyone else encountered this error using the gff file from Flybase, or anywhere else for that matter? I don't know if I'm doing the right thing by removing the "bad" IDs from the reference annotation either, especially since there seem to be an endless number of them. Is there any other way I should fix the reference annotation? Or would it be easier to just run Cufflinks and use its output gtf, instead of trying to fix the Flybase gff?

Any help would be very much appreciated!
Old 08-22-2012, 02:26 AM   #2
mticlla
Junior Member
 
Location: Brussels, Belgium

Join Date: May 2012
Posts: 1
Default Same problem!

Hi all!

I have the same problem. Has anyone found a solution? Please help us.
Old 08-24-2012, 07:18 PM   #3
drosoform
Junior Member
 
Location: Oregon

Join Date: Apr 2012
Posts: 6
Default

Update: I wasn't able to exactly fix the issue, but I was able to get around it:

So, I realized I couldn't use the Flybase gff in Cuffmerge either (same error), so my idea of possibly using the Cufflinks->Cuffmerge gtf didn't work out.

However, I was reading through this thread (mostly on the 2nd page) where people were having similar problems:

http://seqanswers.com/forums/showthread.php?t=3493

Someone was able to fix theirs by removing the duplicates, sort of like what I was trying to do, since their gff only seemed to have a few problem lines.

Others mentioned just trying a different gff file, if one was available. Since removing the duplicates by hand wasn't an option, I just tried the gtf from Ensembl instead, and it worked without a problem!

The Flybase gff I had for some reason worked with an older version of Cufflinks, so I guess that's why I didn't think of trying a different gff/gtf before.

Old 02-13-2013, 01:13 PM   #4
kwatts59
Member
 
Location: nevada

Join Date: Apr 2011
Posts: 46
Default

I had the same problem running Cufflinks v2.0 on dmel release 5.49.
To get around the problem, I wrote a PERL script to pull out the lines from the GFF file containing the word "gene" in the third column. That seemed to fix the problem.

There is no GTF file from Ensembl that corresponds to dmel release 5.49 from flybase.
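
The Perl script itself was not posted. As a rough sketch of the kind of filter described, here is a Python version (not Perl) that assumes "pull out" means removing the lines whose third column is exactly "gene" and keeping everything else. The function name and the default value are hypothetical.

```python
def drop_feature_type(lines, feature_type="gene"):
    """Yield every line except feature lines whose third column equals feature_type."""
    for line in lines:
        fields = line.split("\t")
        if len(fields) >= 3 and fields[2] == feature_type:
            continue  # skip this feature line
        yield line  # comments, directives and all other features pass through
```

Comparing the third column exactly, rather than searching the whole line, avoids dropping lines that merely mention "gene" somewhere in their attributes.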
Old 04-16-2013, 12:05 AM   #5
Boel
Member
 
Location: Stockholm, Sweden

Join Date: Oct 2009
Posts: 62
Default One possible solution

Dear All,

I am using Cufflinks v2.1.1 and encountered the same error upon running CuffDiff with a mask file and the gencode v15 annotation ("Error: duplicate GFF ID 'ENST00000389680.2' encountered!"). (Might be interesting to note that when NOT using a mask file I did not get an error at this point at all, but CuffDiff got stuck at a locus for > 12 hours). However, there was no duplicated ID:

Code:
$ grep ENST00000389680.2 gencode.v15.annotation.gtf
chrM	ENSEMBL	transcript	648	1601	.	+	.	gene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2"; gene_type "Mt_rRNA"; gene_status "KNOWN"; gene_name "J01415.23"; transcript_type "Mt_rRNA"; transcript_status "KNOWN"; transcript_name "J01415.23-201"; level 3; tag "basic";
chrM	ENSEMBL	exon	648	1601	.	+	.	gene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2"; gene_type "Mt_rRNA"; gene_status "KNOWN"; gene_name "J01415.23"; transcript_type "Mt_rRNA"; transcript_status "KNOWN"; transcript_name "J01415.23-201"; exon_number 1;  level 3; tag "basic";
As you can see, these entries are for a transcript and for an exon, not duplicates at all. Yet it did get me thinking: Cufflinks only really uses CDS and exon entries, so the existence of all the other identifiers ('gene', 'transcript' etc. in the third column of my GTF) might be the cause of the hassle. So I removed all entries in the GTF except the 'exon' and 'CDS' lines and Voila! Now it is working.

If there are in fact no duplicated entries in your annotation file, I would suggest trying this approach.

Best,
Boel
Old 05-07-2013, 01:09 AM   #6
lzhdennisdn
Junior Member
 
Location: Germany

Join Date: Jan 2013
Posts: 3
Default

Dear all,

I ran into the same problem with Cufflinks 2.1.1, and I do find duplicates in the GTF file produced by Cufflinks:
Error: duplicate GFF ID 'SL1sc04444 | LOCATED IN chloroplast chloroplast inner encountered!

I traced it back to the transcripts file and found the duplicates below:

scaffold3562 Cufflinks transcript 123676 142383 1 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner
scaffold3562 Cufflinks exon 123676 138285 1 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner
scaffold3562 Cufflinks exon 139443 142383 1 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner

scaffold3562 Cufflinks transcript 139443 142383 1000 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner
scaffold3562 Cufflinks exon 139443 142383 1000 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner

The coordinates of the second transcript are exactly the same as in the annotation file, but the first one is extended at the 5' end. Although the 3' end is the same, it should belong to another transcript of the same gene, right? Why does Cufflinks still group it under the same transcript ID and cause this duplicate ID error?

Does anyone have a suggestion for how to fix it?
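
To find every case like the one above in a Cufflinks output file, one option is to count how many "transcript" records each transcript_id has. The sketch below assumes a tab-separated GTF with well-formed, double-quoted attributes in column 9 (the lines pasted above appear truncated, so they would need the full attribute text). The function names are hypothetical.

```python
import re
from collections import Counter

# Assumes attributes of the form: transcript_id "SOME_ID";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')


def transcript_record_counts(lines):
    """Count 'transcript' feature lines per transcript_id."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript records are counted, not exons
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def duplicated_transcripts(lines):
    """Return the sorted transcript_ids that have more than one transcript record."""
    counts = transcript_record_counts(lines)
    return sorted(tid for tid, n in counts.items() if n > 1)
```

This only reports the duplicated IDs; deciding whether to rename, merge or drop those records is a separate question.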

Best,
Zhihao
Old 07-31-2013, 04:08 AM   #7
cylsae
Junior Member
 
Location: Denmark

Join Date: Mar 2012
Posts: 1
Default it works

Quote:
Originally Posted by Boel
So I removed all entries in the GTF except the 'exon' and 'CDS' lines and Voila! Now it is working.

Thanks for posting the solution. I had the same duplicate GFF ID issue when I ran cuffmerge. It started to work after I ran " awk '($3 == "exon" || $3 == "CDS")' " on all the input GTF files (both the GTF files from Cufflinks and the reference).
Old 02-06-2014, 12:15 PM   #8
sugo
Junior Member
 
Location: Canada

Join Date: Nov 2013
Posts: 8
Default

Quote:
Originally Posted by kwatts59
I had the same problem running Cufflinks v2.0 on dmel release 5.49.
To get around the problem, I wrote a PERL script to pull out the lines from the GFF file containing the word "gene" in the third column. That seemed to fix the problem.

Hi, I am having this problem as well, but sadly have no idea how to write Perl scripts. I was wondering if you'd be willing to share your Perl script with the rest of us who may be having this problem?

Thanks
Old 02-11-2014, 12:44 PM   #9
MDonlin
Member
 
Location: St. Louis, MO

Join Date: May 2010
Posts: 14
Default Cuffdiff error-- duplicate GFF IDs & using VIM to edit transcript files

You can use the vi or vim editor (installed by default on most Unix systems) to edit the GTF files.
>vim transcript.gtf
:g/dup/d
This is a global command that finds every line containing "dup" and deletes the entire line.
Similarly, you can remove the lines containing "gene":
:g/gene/d

To save the changes:
:wq

Search online for vim commands to help you if you get stuck.

I edited the reference gene transcript file to remove lines with "gene".
After running cufflinks, I edited the transcript.gtf file to remove any lines with "dup" in them.
Cuffmerge ran quite happily after that.
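
Before deleting by pattern, it can help to preview which lines would go. The sketch below lists the lines containing a given piece of text anywhere, which is what a `:g/text/d` with a plain literal pattern would delete. It treats the pattern as literal text, not a vim regular expression, and the function name is made up. In a GTF file every feature line normally carries a `gene_id` attribute, so depending on the file format, matching "gene" anywhere on the line may hit far more than the gene records.

```python
def lines_matching(lines, text):
    """Return the lines containing `text` anywhere on the line.

    For literal text this is the set of lines that the vim command
    :g/text/d would delete.
    """
    return [line for line in lines if text in line]
```

Comparing `len(lines_matching(lines, "gene"))` with the total number of lines shows quickly whether the pattern is broader than intended.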
Old 09-10-2014, 04:12 PM   #10
dhir_kumar
Postdoc
 
Location: India

Join Date: Oct 2013
Posts: 4
Default Duplicate GFF ID: a possible solution

Hi,

I am using Cufflinks v2.1.1 with the Ensembl human genome annotation GTF and was getting the same duplicate GFF ID error. Following the previously posted solutions, I tried the following:

(1) awk '($3 == "exon" || $3 == "CDS")'
It worked, but the resulting GTF loses 1/4 of the annotation lines (mostly UTR), and that might affect the transcript assembly in an unknown (minor or major?) way.

(2) I tried the iGenomes GTF and still got the same error.

When I tried to locate the problem in the GTF itself, it seemed that the duplicate entries were associated with transcripts having "Selenocysteine" annotation lines (114 lines in recent annotation GTFs, both Ensembl and iGenomes). Once I removed these 114 lines from the GTF file using
"awk '!/Selenocysteine/' Homo_sapiens.GRCh38.76.gtf >Homo_sapiens.GRCh38.76.gtf_seleno_filtered", it worked without any error and without losing too much information from the annotation GTF.
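
To compare how much each filter throws away, one can tally the feature types in the third column first. This is a small sketch, assuming a tab-separated GTF; the function names are hypothetical and the numbers it gives will depend on the annotation release.

```python
from collections import Counter


def feature_type_counts(lines):
    """Tally the values in the third (feature type) column of a GTF/GFF."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def fraction_removed(counts, removed_types):
    """Fraction of feature lines that dropping `removed_types` would remove."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    removed = sum(counts[t] for t in removed_types)
    return removed / total
```

Calling `fraction_removed(counts, {"Selenocysteine"})` and then the same with every type other than "exon" and "CDS" gives a direct comparison of the two approaches on a given file.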

Best
Dhirendra

Old 11-01-2014, 04:31 AM   #11
Kristoffer Vitting-Seerup
Junior Member
 
Location: Denmark

Join Date: Nov 2014
Posts: 1
Default Dhirendra's solution works

so +1 to Dhirendra.

Kind regards
Old 02-23-2015, 06:17 AM   #12
amolkolte
Junior Member
 
Location: Pune, India

Join Date: Dec 2012
Posts: 8
Default

Dear All,

I had this issue while running "cuffmerge" on assemblies built with Cufflinks 2.1.1.

I checked whether the duplicated entry existed in my reference GTF, but it did not. Later I found out that the problem with duplicated entries was not in the GENCODE GTF file I was using as the reference, but in the "transcripts.gtf" file created during the Cufflinks step.

Updating Cufflinks to the newer version 2.2.1 and re-running the Cufflinks step resolved this issue.

Hope that helps.
Good luck