Seqanswers Leaderboard Ad

**bcalder** · 01-05-2011, 09:58 AM

There seem to be many conditions under which various GTF files will cause both tophat and cufflinks to unceremoniously segfault when analyzing stranded RNA-seq data. I've been bisecting my sequence files to try to find the reads that are making these applications unhappy, but nothing has completely helped me other than moving from ENSEMBL GTF to UCSC refFlat as you suggested. Some official guidelines on GTF files and more meaningful error messages would be very useful.

**marcora** · 01-05-2011, 10:21 AM

Originally posted by bcalder View Post

There seem to be many conditions under which various GTF files will cause both tophat and cufflinks to unceremoniously segfault when analyzing stranded RNA-seq data. I've been bisecting my sequence files to try to find the reads that are making these applications unhappy, but nothing has completely helped me other than moving from ENSEMBL GTF to UCSC refFlat as you suggested. Some official guidelines on GTF files and more meaningful error messages would be very useful.

I don't quite understand yet why (even with the correct naming of chromosomes etc in the bowtie index) tophat and cufflinks only seem to work with refFlat against mm9, instead of with any well-formed GTF file (such as the Ensembl one).

Feedback from tophat/cufflinks authors would be very welcome!

**honey** · 01-05-2011, 07:11 PM

I have been struggling with the same issue and only refflat is working. I have even posted several messages on this forum. I will join in requesting authors of Tophat and Cufflinker to please share some information even if it is some issue which cannot be corrected, so that other researchers should not waste time in trying to which is not possible and may seek some alternatives. Alternatively in the best interest of scientific community authors may like to share, if some improvement is being made which can take care of this problem made to address this issue.
Thanks.

**Thomas Doktor** · 01-06-2011, 07:11 AM

Have you tried renaming the chromosomes? If you are aligning against the UCSC hg19 genome chromosomes will have names such as chr1, chr2, ..., chrX, chrY, but Ensembl denotes them by 1, 2, ..., X, Y. Only exception is chrM which Ensembl denotes MT.

Try runnings this on your Ensembl GTF:
$ awk '{print "chr"$0}' Homo_sapiens.GRCh37.56.gtf | sed 's/chrMT/chrM/g' > hg19.ensembl-for-tophat.gtf

The UCSC gtf files work out of the box if you are using a UCSC fasta reference genome.

EDIT
Just saw that you had already tried with renaming the chromosomes (although in the bowtie index). Just be sure you have used the same chromosome names right from the beginning. Simply fixing the Ensembl GTF chromosome names with the above command was enough for it to work on my station.

**marcora** · 01-06-2011, 07:37 AM

Originally posted by Thomas Doktor View Post

Have you tried renaming the chromosomes? If you are aligning against the UCSC hg19 genome chromosomes will have names such as chr1, chr2, ..., chrX, chrY, but Ensembl denotes them by 1, 2, ..., X, Y. Only exception is chrM which Ensembl denotes MT.

Try runnings this on your Ensembl GTF:
$ awk '{print "chr"$0}' Homo_sapiens.GRCh37.56.gtf | sed 's/chrMT/chrM/g' > hg19.ensembl-for-tophat.gtf

The UCSC gtf files work out of the box if you are using a UCSC fasta reference genome.

EDIT
Just saw that you had already tried with renaming the chromosomes (although in the bowtie index). Just be sure you have used the same chromosome names right from the beginning. Simply fixing the Ensembl GTF chromosome names with the above command was enough for it to work on my station.

I have already tried that before, but just in case I re-tried today.

1) I downloaded the latest gtf file for mus musculus from ftp://ftp.ensembl.org/pub/current/gtf/

2) I ran your awk one-liner on it

3) I bowtie-inspected my mm9 index to make sure chromosome names match and I get the following output:
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chrX
chrY
chrM

4) I ran tophat (with -G and --no-novel-juncs options) using the mm9 index and the modified gtf file and got the same error as before!!!

"Reading known junctions from GTF file
Warning: TopHat did not find any junctions in GTF file"

Has anybody had any luck running tophat using mm9 and ensembl gtf???

Thanx in advance for your help and consideration,

Dado

p.s.: I run bowtie v0.12.7.0 and tophat v1.1.4, the mm9 bowtie index is the one from the bowtie home page

I have also run both the original and modified GTF file through this GTF validator and obtained the following output, identical for both files:

Code:

Transcript ENSMUST00000146893 contains no CDS features.
Missing stop_codon for transcript "ENSMUST00000132028".
Transcript ENSMUST00000125766 contains no CDS features.
Transcript ENSMUST00000123080 contains no CDS features.
Transcript ENSMUST00000134441 contains no CDS features.
Transcript ENSMUST00000119451 contains no CDS features.
Missing start_codon for transcript "ENSMUST00000110970".
Missing stop_codon for transcript "ENSMUST00000046878".
Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000156610".
Missing start_codon for transcript "ENSMUST00000057060".
Missing stop_codon for transcript "ENSMUST00000130267".
Missing stop_codon for transcript "ENSMUST00000125833".
Missing stop_codon for transcript "ENSMUST00000136539".
Missing start_codon for transcript "ENSMUST00000126611".
Missing start_codon for transcript "ENSMUST00000150269".
2 short (<20bp) intron(s) found on in transcript "ENSMUST00000090094".
Missing start_codon for transcript "ENSMUST00000027440".
1 short (<20bp) intron(s) found on in transcript "ENSMUST00000100381".
1 short (<20bp) intron(s) found on in transcript "ENSMUST00000111825".
1 short (<20bp) intron(s) found on in transcript "ENSMUST00000112718".
1 short (<20bp) intron(s) found on in transcript "ENSMUST00000055026".
Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000112673".
Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000121280".
Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000103276".
Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000115127".
Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000112382".
Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000120549".
Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000117830".
Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000093816".
Start Codon length is not three in transcript "ENSMUST00000144976".
Inconsistent frame accross transcript_id = "ENSMUST00000144976".
Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000133425".
Start Codon length is not three in transcript "ENSMUST00000139210".
Inconsistent frame accross transcript_id = "ENSMUST00000139210".
Start Codon length is not three in transcript "ENSMUST00000156223".
Inconsistent frame accross transcript_id = "ENSMUST00000156223".
Start Codon length is not three in transcript "ENSMUST00000127994".
Inconsistent frame accross transcript_id = "ENSMUST00000127994".
Start Codon length is not three in transcript "ENSMUST00000123108".
Inconsistent frame accross transcript_id = "ENSMUST00000123108".

Warnings encountered:
Count Description
37227 Transcript contains no CDS features.
23 Start Codon length is not three.
4929 Transcript has no start codon.
7599 Transcript has no stop codon.
92 Stop codon location inconsistent with terminal exon location.
192 Complete transcript length not multiple of 3.
23 Inconsistant phase value accross transcript.
916 Transcript contains short (<20bp) intron(s).

Statistics:
36536 genes with 88186 transcripts containing 436860 cds.
Bad Genes:

**marcora** · 01-06-2011, 08:37 AM

I tried one more thing:

1) I loaded the awk-modified GTF file as a custom track into the UCSC genome browser against NCBI37/mm9. The browser complained about annotations starting with chrNT and did not load.

2) I cleaned the GTF file using

Code:

awk '/^chr[1-9XYM]|^chr1[0-9]/' mm.ncbim37.60.gtf > mm.ncbim37.60.clean.gtf

3) Reloaded the GTF file into the UCSC genome browser... everything works fine with the "cleaned" GTF file

... too bad that even with the "cleaned" GTF file tophat still gives the same error!

Originally posted by marcora View Post

I have already tried that before, but just in case I re-tried today.

1) I downloaded the latest gtf file for mus musculus from ftp://ftp.ensembl.org/pub/current/gtf/

2) I ran your awk one-liner on it

3) I bowtie-inspected my mm9 index to make sure chromosome names match and I get the following output:
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chrX
chrY
chrM

4) I ran tophat (with -G and --no-novel-juncs options) using the mm9 index and the modified gtf file and got the same error as before!!!

"Reading known junctions from GTF file
Warning: TopHat did not find any junctions in GTF file"

Has anybody had any luck running tophat using mm9 and ensembl gtf???

Thanx in advance for your help and consideration,

Dado

**Thomas Doktor** · 01-06-2011, 09:18 AM

There seems to be an error (possibly a duplication) with the entries for the ENSMUST00000127664 transcript in the Ensembl mm9 gtf file.

You can correct your Ensembl file this way (I have succesfully used this with Tophat 1.1.4):
$ awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | grep -v "ENSMUST00000127664" > mm9.Ensembl.gtf

Note that this will of course remove the information about the ENSMUST00000127664 transcript, so it's just a quick fix, you should probably look more into the gtf file to try and correct it.

EDIT
TopHat itself is a bit silent about the error causing this, but if you run the junction finding script (gtf_juncs, located in the TopHat bin folder), it will report a much more meaningful error message:

gtf_juncs v1.1.4 (1709)
---------------------------
Error: duplicate GFF ID 'ENSMUST00000127664' (or exons too far apart)!

**marcora** · 01-06-2011, 09:59 AM

Originally posted by Thomas Doktor View Post

There seems to be an error (possibly a duplication) with the entries for the ENSMUST00000127664 transcript in the Ensembl mm9 gtf file.

You can correct your Ensembl file this way (I have succesfully used this with Tophat 1.1.4):
$ awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | grep -v "ENSMUST00000127664" > mm9.Ensembl.gtf

Note that this will of course remove the information about the ENSMUST00000127664 transcript, so it's just a quick fix, you should probably look more into the gtf file to try and correct it.

EDIT
TopHat itself is a bit silent about the error causing this, but if you run the junction finding script (gtf_juncs, located in the TopHat bin folder), it will report a much more meaningful error message:

First and foremost, thank you for your help and patience.

I removed the offending lines as you indicated and now my GTF file passes the gtf_juncs test without warnings or errors.

However, tophat still gives the same warning!

**Thomas Doktor** · 01-06-2011, 10:02 AM

TopHat did not produce any errors here. What commands did you use?

**marcora** · 01-06-2011, 10:05 AM

Originally posted by Thomas Doktor View Post

TopHat did not produce any errors here. What commands did you use?

Code:

#!/bin/bash

GENOME=mm9
GENES=mm9.ensembl

for f in $( ls *.fq ); do
    if [ -f $f ]; then
        mkdir $f.tophat
        tophat -p 8 --library-type fr-unstranded --solexa1.3-quals -G $BOWTIE_INDEXES/$GENES.gtf --no-novel-juncs -o $f.tophat $BOWTIE_INDEXES/$GENOME $f
    fi
done

**Thomas Doktor** · 01-06-2011, 10:06 AM

And you are sure the mm9.ensembl.gtf is the corrected one?

**marcora** · 01-06-2011, 10:22 AM

Originally posted by marcora View Post

First and foremost, thank you for your help and patience.

I removed the offending lines as you indicated and now my GTF file passes the gtf_juncs test without warnings or errors.

However, tophat still gives the same warning!

My bad! This last time when I ran tophat I didn't rename the cleaned GTF file to "mm9.ensembl" and therefore tophat couldn't find it. Surprisingly, instead of reporting a missing file, tophat gave me the same exact warning as before.

In conclusion, with the squeaky clean GTF file obtained from Mus_musculus.NCBIM37.60.gtf as such:

Code:

awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | awk '/^chr[1-9XYM]|^chr1[0-9]/' | grep -v "ENSMUST00000127664" > mm9.ensembl.gtf

I am finally able to run tophat against the ENSEMBL annotation.

You are my hero!

Thank you very much for your help.

**honey** · 01-09-2011, 07:00 PM

Thanks macrora and Thomas,
I was able to use Enseml GTF file finally. One question (may be trivial) in the HG18 build there are several entries in the chromosome column like c6_COX, c6_QBL (haplotypes assembly) as well as NT_113944, 113923, 112884 and so on. How we deal with them in RNA seq. Should we just skip them in RNA seq analysis or there is any other suggestion please.

**honey** · 01-09-2011, 07:04 PM

Suggestion to moderator

I would suggest making this thread sticky. This problem of using Ensembl file in RNA seq analysis has been raised several times here and in other forums. So that other fellow researchers should not waste too much time and are benefited.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 47 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

GTF reference files that work with TopHat/Cufflinks

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News