SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
cufflinks - having difficulty reading reference GTF tguo Bioinformatics 31 07-07-2019 07:03 AM
cufflinks : analysis comparison with and without a gtf reference file sohnic Bioinformatics 3 07-07-2019 06:40 AM
cufflinks GTF files incompatible with cuffmerge? shurjo Bioinformatics 21 09-23-2013 05:47 PM
Tophat v1.1 with GTF files hyjkim Bioinformatics 7 12-17-2012 08:11 AM
Best source for GTF file for use with TopHat/Cufflinks sdarko Bioinformatics 17 12-14-2012 12:48 PM

Reply
 
Thread Tools
Old 12-06-2010, 12:09 PM   #1
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Default GTF reference files that work with TopHat/Cufflinks

Maybe it's just me, but the only GTF reference file that seems to work in TopHat (when using the -G option that is) is refFlat (RefSeq) downloaded from the UCSC Table Browser. Any other GTF file I download from there (e.g., ensGene) or ENSEMBL (following the recommendation in the TopHat website) just doesn't work (i.e., I get the "TopHat did not find any junctions in GTF file" warning) .

I know the problem has popped up in this forum multiple times. Therefore, I would like to get as much feedback as possible as to which GTF files people are using for their RNAseq analysis. And whether anybody has managed to get the ENSEMBL gene annotation to work.

Thank you in advance for your help and consideration.

Last edited by marcora; 12-06-2010 at 01:09 PM.
marcora is offline   Reply With Quote
Old 01-05-2011, 09:58 AM   #2
bcalder
Junior Member
 
Location: NYC

Join Date: Sep 2008
Posts: 3
Default

There seem to be many conditions under which various GTF files will cause both tophat and cufflinks to unceremoniously segfault when analyzing stranded RNA-seq data. I've been bisecting my sequence files to try to find the reads that are making these applications unhappy, but nothing has completely helped me other than moving from ENSEMBL GTF to UCSC refFlat as you suggested. Some official guidelines on GTF files and more meaningful error messages would be very useful.
bcalder is offline   Reply With Quote
Old 01-05-2011, 10:21 AM   #3
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Default

Quote:
Originally Posted by bcalder View Post
There seem to be many conditions under which various GTF files will cause both tophat and cufflinks to unceremoniously segfault when analyzing stranded RNA-seq data. I've been bisecting my sequence files to try to find the reads that are making these applications unhappy, but nothing has completely helped me other than moving from ENSEMBL GTF to UCSC refFlat as you suggested. Some official guidelines on GTF files and more meaningful error messages would be very useful.
I don't quite understand yet why (even with the correct naming of chromosomes etc in the bowtie index) tophat and cufflinks only seem to work with refFlat against mm9, instead of with any well-formed GTF file (such as the Ensembl one).

Feedback from tophat/cufflinks authors would be very welcome!
marcora is offline   Reply With Quote
Old 01-05-2011, 07:11 PM   #4
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default

I have been struggling with the same issue and only refflat is working. I have even posted several messages on this forum. I will join in requesting authors of Tophat and Cufflinker to please share some information even if it is some issue which cannot be corrected, so that other researchers should not waste time in trying to which is not possible and may seek some alternatives. Alternatively in the best interest of scientific community authors may like to share, if some improvement is being made which can take care of this problem made to address this issue.
Thanks.
honey is offline   Reply With Quote
Old 01-06-2011, 07:11 AM   #5
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

Have you tried renaming the chromosomes? If you are aligning against the UCSC hg19 genome chromosomes will have names such as chr1, chr2, ..., chrX, chrY, but Ensembl denotes them by 1, 2, ..., X, Y. Only exception is chrM which Ensembl denotes MT.

Try runnings this on your Ensembl GTF:
$ awk '{print "chr"$0}' Homo_sapiens.GRCh37.56.gtf | sed 's/chrMT/chrM/g' > hg19.ensembl-for-tophat.gtf

The UCSC gtf files work out of the box if you are using a UCSC fasta reference genome.

EDIT
Just saw that you had already tried with renaming the chromosomes (although in the bowtie index). Just be sure you have used the same chromosome names right from the beginning. Simply fixing the Ensembl GTF chromosome names with the above command was enough for it to work on my station.

Last edited by Thomas Doktor; 01-06-2011 at 07:15 AM.
Thomas Doktor is offline   Reply With Quote
Old 01-06-2011, 07:37 AM   #6
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Red face

Quote:
Originally Posted by Thomas Doktor View Post
Have you tried renaming the chromosomes? If you are aligning against the UCSC hg19 genome chromosomes will have names such as chr1, chr2, ..., chrX, chrY, but Ensembl denotes them by 1, 2, ..., X, Y. Only exception is chrM which Ensembl denotes MT.

Try runnings this on your Ensembl GTF:
$ awk '{print "chr"$0}' Homo_sapiens.GRCh37.56.gtf | sed 's/chrMT/chrM/g' > hg19.ensembl-for-tophat.gtf

The UCSC gtf files work out of the box if you are using a UCSC fasta reference genome.

EDIT
Just saw that you had already tried with renaming the chromosomes (although in the bowtie index). Just be sure you have used the same chromosome names right from the beginning. Simply fixing the Ensembl GTF chromosome names with the above command was enough for it to work on my station.
I have already tried that before, but just in case I re-tried today.

1) I downloaded the latest gtf file for mus musculus from ftp://ftp.ensembl.org/pub/current/gtf/

2) I ran your awk one-liner on it

3) I bowtie-inspected my mm9 index to make sure chromosome names match and I get the following output:
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chrX
chrY
chrM

4) I ran tophat (with -G and --no-novel-juncs options) using the mm9 index and the modified gtf file and got the same error as before!!!

"Reading known junctions from GTF file
Warning: TopHat did not find any junctions in GTF file"


Has anybody had any luck running tophat using mm9 and ensembl gtf???

Thanx in advance for your help and consideration,

Dado

p.s.: I run bowtie v0.12.7.0 and tophat v1.1.4, the mm9 bowtie index is the one from the bowtie home page

I have also run both the original and modified GTF file through this GTF validator and obtained the following output, identical for both files:

Code:
Transcript ENSMUST00000146893 contains no CDS features.
Missing stop_codon for transcript "ENSMUST00000132028".
Transcript ENSMUST00000125766 contains no CDS features.
Transcript ENSMUST00000123080 contains no CDS features.
Transcript ENSMUST00000134441 contains no CDS features.
Transcript ENSMUST00000119451 contains no CDS features.
Missing start_codon for transcript "ENSMUST00000110970".
Missing stop_codon for transcript "ENSMUST00000046878".
Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000156610".
Missing start_codon for transcript "ENSMUST00000057060".
Missing stop_codon for transcript "ENSMUST00000130267".
Missing stop_codon for transcript "ENSMUST00000125833".
Missing stop_codon for transcript "ENSMUST00000136539".
Missing start_codon for transcript "ENSMUST00000126611".
Missing start_codon for transcript "ENSMUST00000150269".
2 short (<20bp) intron(s) found on in transcript "ENSMUST00000090094".
Missing start_codon for transcript "ENSMUST00000027440".
1 short (<20bp) intron(s) found on in transcript "ENSMUST00000100381".
1 short (<20bp) intron(s) found on in transcript "ENSMUST00000111825".
1 short (<20bp) intron(s) found on in transcript "ENSMUST00000112718".
1 short (<20bp) intron(s) found on in transcript "ENSMUST00000055026".
Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000112673".
Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000121280".
Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000103276".
Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000115127".
Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000112382".
Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000120549".
Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000117830".
Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000093816".
Start Codon length is not three in transcript "ENSMUST00000144976".
Inconsistent frame accross transcript_id = "ENSMUST00000144976".
Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000133425".
Start Codon length is not three in transcript "ENSMUST00000139210".
Inconsistent frame accross transcript_id = "ENSMUST00000139210".
Start Codon length is not three in transcript "ENSMUST00000156223".
Inconsistent frame accross transcript_id = "ENSMUST00000156223".
Start Codon length is not three in transcript "ENSMUST00000127994".
Inconsistent frame accross transcript_id = "ENSMUST00000127994".
Start Codon length is not three in transcript "ENSMUST00000123108".
Inconsistent frame accross transcript_id = "ENSMUST00000123108".

Warnings encountered:
Count	Description
37227	Transcript contains no CDS features.
23	Start Codon length is not three.
4929	Transcript has no start codon.
7599	Transcript has no stop codon.
92	Stop codon location inconsistent with terminal exon location.
192	Complete transcript length not multiple of 3.
23	Inconsistant phase value accross transcript.
916	Transcript contains short (<20bp) intron(s).

Statistics:
	36536 genes with 88186 transcripts containing 436860 cds.
Bad Genes:

Last edited by marcora; 01-06-2011 at 07:53 AM.
marcora is offline   Reply With Quote
Old 01-06-2011, 08:37 AM   #7
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Unhappy

I tried one more thing:

1) I loaded the awk-modified GTF file as a custom track into the UCSC genome browser against NCBI37/mm9. The browser complained about annotations starting with chrNT and did not load.

2) I cleaned the GTF file using
Code:
awk '/^chr[1-9XYM]|^chr1[0-9]/' mm.ncbim37.60.gtf > mm.ncbim37.60.clean.gtf
3) Reloaded the GTF file into the UCSC genome browser... everything works fine with the "cleaned" GTF file

... too bad that even with the "cleaned" GTF file tophat still gives the same error!


Quote:
Originally Posted by marcora View Post
I have already tried that before, but just in case I re-tried today.

1) I downloaded the latest gtf file for mus musculus from ftp://ftp.ensembl.org/pub/current/gtf/

2) I ran your awk one-liner on it

3) I bowtie-inspected my mm9 index to make sure chromosome names match and I get the following output:
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chrX
chrY
chrM

4) I ran tophat (with -G and --no-novel-juncs options) using the mm9 index and the modified gtf file and got the same error as before!!!

"Reading known junctions from GTF file
Warning: TopHat did not find any junctions in GTF file"


Has anybody had any luck running tophat using mm9 and ensembl gtf???

Thanx in advance for your help and consideration,

Dado

Last edited by marcora; 01-06-2011 at 08:42 AM.
marcora is offline   Reply With Quote
Old 01-06-2011, 09:18 AM   #8
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

There seems to be an error (possibly a duplication) with the entries for the ENSMUST00000127664 transcript in the Ensembl mm9 gtf file.

You can correct your Ensembl file this way (I have succesfully used this with Tophat 1.1.4):
$ awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | grep -v "ENSMUST00000127664" > mm9.Ensembl.gtf

Note that this will of course remove the information about the ENSMUST00000127664 transcript, so it's just a quick fix, you should probably look more into the gtf file to try and correct it.

EDIT
TopHat itself is a bit silent about the error causing this, but if you run the junction finding script (gtf_juncs, located in the TopHat bin folder), it will report a much more meaningful error message:
Quote:
gtf_juncs v1.1.4 (1709)
---------------------------
Error: duplicate GFF ID 'ENSMUST00000127664' (or exons too far apart)!

Last edited by Thomas Doktor; 01-06-2011 at 09:22 AM. Reason: How to find the error
Thomas Doktor is offline   Reply With Quote
Old 01-06-2011, 09:59 AM   #9
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Unhappy

Quote:
Originally Posted by Thomas Doktor View Post
There seems to be an error (possibly a duplication) with the entries for the ENSMUST00000127664 transcript in the Ensembl mm9 gtf file.

You can correct your Ensembl file this way (I have succesfully used this with Tophat 1.1.4):
$ awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | grep -v "ENSMUST00000127664" > mm9.Ensembl.gtf

Note that this will of course remove the information about the ENSMUST00000127664 transcript, so it's just a quick fix, you should probably look more into the gtf file to try and correct it.

EDIT
TopHat itself is a bit silent about the error causing this, but if you run the junction finding script (gtf_juncs, located in the TopHat bin folder), it will report a much more meaningful error message:
First and foremost, thank you for your help and patience.

I removed the offending lines as you indicated and now my GTF file passes the gtf_juncs test without warnings or errors.

However, tophat still gives the same warning!
marcora is offline   Reply With Quote
Old 01-06-2011, 10:02 AM   #10
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

TopHat did not produce any errors here. What commands did you use?
Thomas Doktor is offline   Reply With Quote
Old 01-06-2011, 10:05 AM   #11
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Default

Quote:
Originally Posted by Thomas Doktor View Post
TopHat did not produce any errors here. What commands did you use?
Code:
#!/bin/bash

GENOME=mm9
GENES=mm9.ensembl

for f in $( ls *.fq ); do
    if [ -f $f ]; then
        mkdir $f.tophat
        tophat -p 8 --library-type fr-unstranded --solexa1.3-quals -G $BOWTIE_INDEXES/$GENES.gtf --no-novel-juncs -o $f.tophat $BOWTIE_INDEXES/$GENOME $f
    fi
done
marcora is offline   Reply With Quote
Old 01-06-2011, 10:06 AM   #12
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

And you are sure the mm9.ensembl.gtf is the corrected one?
Thomas Doktor is offline   Reply With Quote
Old 01-06-2011, 10:22 AM   #13
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Talking

Quote:
Originally Posted by marcora View Post
First and foremost, thank you for your help and patience.

I removed the offending lines as you indicated and now my GTF file passes the gtf_juncs test without warnings or errors.

However, tophat still gives the same warning!
My bad! This last time when I ran tophat I didn't rename the cleaned GTF file to "mm9.ensembl" and therefore tophat couldn't find it. Surprisingly, instead of reporting a missing file, tophat gave me the same exact warning as before.

In conclusion, with the squeaky clean GTF file obtained from Mus_musculus.NCBIM37.60.gtf as such:

Code:
awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | awk '/^chr[1-9XYM]|^chr1[0-9]/' | grep -v "ENSMUST00000127664" > mm9.ensembl.gtf
I am finally able to run tophat against the ENSEMBL annotation.

You are my hero!

Thank you very much for your help.
marcora is offline   Reply With Quote
Old 01-09-2011, 07:00 PM   #14
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default

Thanks macrora and Thomas,
I was able to use Enseml GTF file finally. One question (may be trivial) in the HG18 build there are several entries in the chromosome column like c6_COX, c6_QBL (haplotypes assembly) as well as NT_113944, 113923, 112884 and so on. How we deal with them in RNA seq. Should we just skip them in RNA seq analysis or there is any other suggestion please.
honey is offline   Reply With Quote
Old 01-09-2011, 07:04 PM   #15
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default Suggestion to moderator

I would suggest making this thread sticky. This problem of using Ensembl file in RNA seq analysis has been raised several times here and in other forums. So that other fellow researchers should not waste too much time and are benefited.

Last edited by honey; 01-09-2011 at 10:59 PM.
honey is offline   Reply With Quote
Old 03-03-2011, 05:19 AM   #16
maximilianh
Member
 
Location: UK

Join Date: Oct 2009
Posts: 15
Default Avoid changing the GTF file

I would caution against messing around with the GTF file, as in the future this should break, as soon as Ensembl switches to the next version. In addition, it doesn't work for any other organisms.

I am running cufflinks and solved the problem differently: I indexed the original Ensembl genome, ran bowtie on it, converted the samfile to bam, sorted it, removed this weird transcript from the Ensembl file and ran cufflinks normally, without any awk script.

here is a log file of what I did:

Code:
bowtie-build -C Mus_musculus.NCBIM37.61.dna.toplevel.fa Mus_musculus.NCBIM37.61.dna.toplevel_c
bowtie -f -C -m 1 -p4 --sam $(BOWTIEINDEX) $$i > $(MAPPEDREADS)/`basename $$i .txt`.sam
samtools view -Sb sam/SL005_R00002_RME033_01pg_F3.csfasta.sam > sam/SL005_R00002_RME033_01pg_F3.csfasta.bam
samtools sort -Sb SL005_R00002_RME033_01pg_F3.csfasta.bam SL005_R00002_RME033_01pg_F3.csfasta.sorted
512391
grep -v ENSMUST00000127664 Mus_musculus.NCBIM37.61.gtf > Mus_musculus.NCBIM37.61.corrected.gtf
~/software/cufflinks-0.9.3/cufflinks -G Mus_musculus.NCBIM37.61.corrected.gtf -v sam.old/SL005_R00002_RME033_01pg_F3.csfasta.sorted.
maximilianh is offline   Reply With Quote
Old 04-22-2011, 02:12 PM   #17
Auction
Member
 
Location: california

Join Date: Jul 2009
Posts: 24
Default

Another solution is go to tophat-1.2.0/src/gff.cpp
and change
const uint GFF_MAX_LOCUS = 4000000;
to
const uint GFF_MAX_LOCUS = 5000000;
then recompile the tophat

Ensemble database indicate that this transcript ENSMUST00000127664 is 4.43Mb, bigger than previous cut-off GFF_MAX_LOCUS = 4000000.

Quote:
Originally Posted by marcora View Post
My bad! This last time when I ran tophat I didn't rename the cleaned GTF file to "mm9.ensembl" and therefore tophat couldn't find it. Surprisingly, instead of reporting a missing file, tophat gave me the same exact warning as before.

In conclusion, with the squeaky clean GTF file obtained from Mus_musculus.NCBIM37.60.gtf as such:

Code:
awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | awk '/^chr[1-9XYM]|^chr1[0-9]/' | grep -v "ENSMUST00000127664" > mm9.ensembl.gtf
I am finally able to run tophat against the ENSEMBL annotation.

You are my hero!

Thank you very much for your help.
Auction is offline   Reply With Quote
Old 04-22-2011, 03:11 PM   #18
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default Gtf

Will try new approach in next run uptill now we have been changing GTF file
honey is offline   Reply With Quote
Old 05-09-2012, 01:09 PM   #19
arielpaulson
Member
 
Location: Kansas City

Join Date: May 2008
Posts: 14
Default Additional error source

Just wanted to add -- as of tophat 1.4.0 -- there is another way to get this "did not find any junctions" error message. If your gtf has any entries with nonstandard strand symbols, for instance '*', parsing will apparently fail for the the entire gtf, even though all other entries are OK.

Reading gtf_juncs.log will show you the offending line.
arielpaulson is offline   Reply With Quote
Old 04-14-2013, 08:56 PM   #20
Charitra
Member
 
Location: Seoul, Korea

Join Date: Feb 2013
Posts: 57
Default

Quote:
Originally Posted by marcora View Post
My bad! This last time when I ran tophat I didn't rename the cleaned GTF file to "mm9.ensembl" and therefore tophat couldn't find it. Surprisingly, instead of reporting a missing file, tophat gave me the same exact warning as before.

In conclusion, with the squeaky clean GTF file obtained from Mus_musculus.NCBIM37.60.gtf as such:

Code:
awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | awk '/^chr[1-9XYM]|^chr1[0-9]/' | grep -v "ENSMUST00000127664" > mm9.ensembl.gtf
I am finally able to run tophat against the ENSEMBL annotation.

You are my hero!

Thank you very much for your help.
Dear All

Please tell me , I got the same error as follows. I am using everything from UCSC hg19 and downloaded everything, unpacked them but still getting this error:
What to do ?

Warning: couldn't find fasta record for 'chr17_ctg5_hap1'!
Warning: couldn't find fasta record for 'chr17_gl000205_random'!
Warning: couldn't find fasta record for 'chr19_gl000209_random'!
Warning: couldn't find fasta record for 'chr1_gl000191_random'!
Warning: couldn't find fasta record for 'chr4_ctg9_hap1'!
Warning: couldn't find fasta record for 'chr4_gl000193_random'!
Warning: couldn't find fasta record for 'chr4_gl000194_random'!
Warning: couldn't find fasta record for 'chr6_apd_hap1'!
Warning: couldn't find fasta record for 'chr6_cox_hap2'!
Warning: couldn't find fasta record for 'chr6_dbb_hap3'!
Warning: couldn't find fasta record for 'chr6_mann_hap4'!
Warning: couldn't find fasta record for 'chr6_mcf_hap5'!
Warning: couldn't find fasta record for 'chr6_qbl_hap6'!
Warning: couldn't find fasta record for 'chr6_ssto_hap7'!
Warning: couldn't find fasta record for 'chr7_gl000195_random'!
Warning: couldn't find fasta record for 'chrUn_gl000211'!
Warning: couldn't find fasta record for 'chrUn_gl000212'!
Warning: couldn't find fasta record for 'chrUn_gl000218'!
Warning: couldn't find fasta record for 'chrUn_gl000219'!
Warning: couldn't find fasta record for 'chrUn_gl000220'!
Warning: couldn't find fasta record for 'chrUn_gl000222'!
Warning: couldn't find fasta record for 'chrUn_gl000223'!
Warning: couldn't find fasta record for 'chrUn_gl000228'!
Charitra is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:17 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO