![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
cufflinks - having difficulty reading reference GTF | tguo | Bioinformatics | 31 | 07-07-2019 07:03 AM |
cufflinks : analysis comparison with and without a gtf reference file | sohnic | Bioinformatics | 3 | 07-07-2019 06:40 AM |
cufflinks GTF files incompatible with cuffmerge? | shurjo | Bioinformatics | 21 | 09-23-2013 05:47 PM |
Tophat v1.1 with GTF files | hyjkim | Bioinformatics | 7 | 12-17-2012 08:11 AM |
Best source for GTF file for use with TopHat/Cufflinks | sdarko | Bioinformatics | 17 | 12-14-2012 12:48 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]()
Maybe it's just me, but the only GTF reference file that seems to work in TopHat (when using the -G option that is) is refFlat (RefSeq) downloaded from the UCSC Table Browser. Any other GTF file I download from there (e.g., ensGene) or ENSEMBL (following the recommendation in the TopHat website) just doesn't work (i.e., I get the "TopHat did not find any junctions in GTF file" warning)
![]() I know the problem has popped up in this forum multiple times. Therefore, I would like to get as much feedback as possible as to which GTF files people are using for their RNAseq analysis. And whether anybody has managed to get the ENSEMBL gene annotation to work. Thank you in advance for your help and consideration. ![]() Last edited by marcora; 12-06-2010 at 01:09 PM. |
![]() |
![]() |
![]() |
#2 |
Junior Member
Location: NYC Join Date: Sep 2008
Posts: 3
|
![]()
There seem to be many conditions under which various GTF files will cause both tophat and cufflinks to unceremoniously segfault when analyzing stranded RNA-seq data. I've been bisecting my sequence files to try to find the reads that are making these applications unhappy, but nothing has completely helped me other than moving from ENSEMBL GTF to UCSC refFlat as you suggested. Some official guidelines on GTF files and more meaningful error messages would be very useful.
|
![]() |
![]() |
![]() |
#3 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
Feedback from tophat/cufflinks authors would be very welcome! ![]() |
|
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: Pittsburgh Join Date: Feb 2010
Posts: 151
|
![]()
I have been struggling with the same issue and only refflat is working. I have even posted several messages on this forum. I will join in requesting authors of Tophat and Cufflinker to please share some information even if it is some issue which cannot be corrected, so that other researchers should not waste time in trying to which is not possible and may seek some alternatives. Alternatively in the best interest of scientific community authors may like to share, if some improvement is being made which can take care of this problem made to address this issue.
Thanks. |
![]() |
![]() |
![]() |
#5 |
Senior Member
Location: University of Southern Denmark (SDU), Denmark Join Date: Apr 2009
Posts: 105
|
![]()
Have you tried renaming the chromosomes? If you are aligning against the UCSC hg19 genome chromosomes will have names such as chr1, chr2, ..., chrX, chrY, but Ensembl denotes them by 1, 2, ..., X, Y. Only exception is chrM which Ensembl denotes MT.
Try runnings this on your Ensembl GTF: $ awk '{print "chr"$0}' Homo_sapiens.GRCh37.56.gtf | sed 's/chrMT/chrM/g' > hg19.ensembl-for-tophat.gtf The UCSC gtf files work out of the box if you are using a UCSC fasta reference genome. EDIT Just saw that you had already tried with renaming the chromosomes (although in the bowtie index). Just be sure you have used the same chromosome names right from the beginning. Simply fixing the Ensembl GTF chromosome names with the above command was enough for it to work on my station. Last edited by Thomas Doktor; 01-06-2011 at 07:15 AM. |
![]() |
![]() |
![]() |
#6 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
1) I downloaded the latest gtf file for mus musculus from ftp://ftp.ensembl.org/pub/current/gtf/ 2) I ran your awk one-liner on it 3) I bowtie-inspected my mm9 index to make sure chromosome names match and I get the following output: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY chrM 4) I ran tophat (with -G and --no-novel-juncs options) using the mm9 index and the modified gtf file and got the same error as before!!! "Reading known junctions from GTF file Warning: TopHat did not find any junctions in GTF file" Has anybody had any luck running tophat using mm9 and ensembl gtf??? Thanx in advance for your help and consideration, Dado ![]() p.s.: I run bowtie v0.12.7.0 and tophat v1.1.4, the mm9 bowtie index is the one from the bowtie home page I have also run both the original and modified GTF file through this GTF validator and obtained the following output, identical for both files: Code:
Transcript ENSMUST00000146893 contains no CDS features. Missing stop_codon for transcript "ENSMUST00000132028". Transcript ENSMUST00000125766 contains no CDS features. Transcript ENSMUST00000123080 contains no CDS features. Transcript ENSMUST00000134441 contains no CDS features. Transcript ENSMUST00000119451 contains no CDS features. Missing start_codon for transcript "ENSMUST00000110970". Missing stop_codon for transcript "ENSMUST00000046878". Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000156610". Missing start_codon for transcript "ENSMUST00000057060". Missing stop_codon for transcript "ENSMUST00000130267". Missing stop_codon for transcript "ENSMUST00000125833". Missing stop_codon for transcript "ENSMUST00000136539". Missing start_codon for transcript "ENSMUST00000126611". Missing start_codon for transcript "ENSMUST00000150269". 2 short (<20bp) intron(s) found on in transcript "ENSMUST00000090094". Missing start_codon for transcript "ENSMUST00000027440". 1 short (<20bp) intron(s) found on in transcript "ENSMUST00000100381". 1 short (<20bp) intron(s) found on in transcript "ENSMUST00000111825". 1 short (<20bp) intron(s) found on in transcript "ENSMUST00000112718". 1 short (<20bp) intron(s) found on in transcript "ENSMUST00000055026". Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000112673". Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000121280". Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000103276". Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000115127". Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000112382". Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000120549". Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000117830". Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000093816". Start Codon length is not three in transcript "ENSMUST00000144976". Inconsistent frame accross transcript_id = "ENSMUST00000144976". Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000133425". Start Codon length is not three in transcript "ENSMUST00000139210". Inconsistent frame accross transcript_id = "ENSMUST00000139210". Start Codon length is not three in transcript "ENSMUST00000156223". Inconsistent frame accross transcript_id = "ENSMUST00000156223". Start Codon length is not three in transcript "ENSMUST00000127994". Inconsistent frame accross transcript_id = "ENSMUST00000127994". Start Codon length is not three in transcript "ENSMUST00000123108". Inconsistent frame accross transcript_id = "ENSMUST00000123108". Warnings encountered: Count Description 37227 Transcript contains no CDS features. 23 Start Codon length is not three. 4929 Transcript has no start codon. 7599 Transcript has no stop codon. 92 Stop codon location inconsistent with terminal exon location. 192 Complete transcript length not multiple of 3. 23 Inconsistant phase value accross transcript. 916 Transcript contains short (<20bp) intron(s). Statistics: 36536 genes with 88186 transcripts containing 436860 cds. Bad Genes: Last edited by marcora; 01-06-2011 at 07:53 AM. |
|
![]() |
![]() |
![]() |
#7 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]()
I tried one more thing:
1) I loaded the awk-modified GTF file as a custom track into the UCSC genome browser against NCBI37/mm9. The browser complained about annotations starting with chrNT and did not load. 2) I cleaned the GTF file using Code:
awk '/^chr[1-9XYM]|^chr1[0-9]/' mm.ncbim37.60.gtf > mm.ncbim37.60.clean.gtf ... too bad that even with the "cleaned" GTF file tophat still gives the same error! ![]() Quote:
Last edited by marcora; 01-06-2011 at 08:42 AM. |
|
![]() |
![]() |
![]() |
#8 | |
Senior Member
Location: University of Southern Denmark (SDU), Denmark Join Date: Apr 2009
Posts: 105
|
![]()
There seems to be an error (possibly a duplication) with the entries for the ENSMUST00000127664 transcript in the Ensembl mm9 gtf file.
You can correct your Ensembl file this way (I have succesfully used this with Tophat 1.1.4): $ awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | grep -v "ENSMUST00000127664" > mm9.Ensembl.gtf Note that this will of course remove the information about the ENSMUST00000127664 transcript, so it's just a quick fix, you should probably look more into the gtf file to try and correct it. EDIT TopHat itself is a bit silent about the error causing this, but if you run the junction finding script (gtf_juncs, located in the TopHat bin folder), it will report a much more meaningful error message: Quote:
Last edited by Thomas Doktor; 01-06-2011 at 09:22 AM. Reason: How to find the error |
|
![]() |
![]() |
![]() |
#9 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
I removed the offending lines as you indicated and now my GTF file passes the gtf_juncs test without warnings or errors. However, tophat still gives the same warning! ![]() |
|
![]() |
![]() |
![]() |
#10 |
Senior Member
Location: University of Southern Denmark (SDU), Denmark Join Date: Apr 2009
Posts: 105
|
![]()
TopHat did not produce any errors here. What commands did you use?
|
![]() |
![]() |
![]() |
#11 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
Code:
#!/bin/bash GENOME=mm9 GENES=mm9.ensembl for f in $( ls *.fq ); do if [ -f $f ]; then mkdir $f.tophat tophat -p 8 --library-type fr-unstranded --solexa1.3-quals -G $BOWTIE_INDEXES/$GENES.gtf --no-novel-juncs -o $f.tophat $BOWTIE_INDEXES/$GENOME $f fi done |
|
![]() |
![]() |
![]() |
#12 |
Senior Member
Location: University of Southern Denmark (SDU), Denmark Join Date: Apr 2009
Posts: 105
|
![]()
And you are sure the mm9.ensembl.gtf is the corrected one?
|
![]() |
![]() |
![]() |
#13 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
In conclusion, with the squeaky clean GTF file obtained from Mus_musculus.NCBIM37.60.gtf as such: Code:
awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | awk '/^chr[1-9XYM]|^chr1[0-9]/' | grep -v "ENSMUST00000127664" > mm9.ensembl.gtf You are my hero! ![]() Thank you very much for your help. |
|
![]() |
![]() |
![]() |
#14 |
Senior Member
Location: Pittsburgh Join Date: Feb 2010
Posts: 151
|
![]()
Thanks macrora and Thomas,
I was able to use Enseml GTF file finally. One question (may be trivial) in the HG18 build there are several entries in the chromosome column like c6_COX, c6_QBL (haplotypes assembly) as well as NT_113944, 113923, 112884 and so on. How we deal with them in RNA seq. Should we just skip them in RNA seq analysis or there is any other suggestion please. |
![]() |
![]() |
![]() |
#15 |
Senior Member
Location: Pittsburgh Join Date: Feb 2010
Posts: 151
|
![]()
I would suggest making this thread sticky. This problem of using Ensembl file in RNA seq analysis has been raised several times here and in other forums. So that other fellow researchers should not waste too much time and are benefited.
Last edited by honey; 01-09-2011 at 10:59 PM. |
![]() |
![]() |
![]() |
#16 |
Member
Location: UK Join Date: Oct 2009
Posts: 15
|
![]()
I would caution against messing around with the GTF file, as in the future this should break, as soon as Ensembl switches to the next version. In addition, it doesn't work for any other organisms.
I am running cufflinks and solved the problem differently: I indexed the original Ensembl genome, ran bowtie on it, converted the samfile to bam, sorted it, removed this weird transcript from the Ensembl file and ran cufflinks normally, without any awk script. here is a log file of what I did: Code:
bowtie-build -C Mus_musculus.NCBIM37.61.dna.toplevel.fa Mus_musculus.NCBIM37.61.dna.toplevel_c bowtie -f -C -m 1 -p4 --sam $(BOWTIEINDEX) $$i > $(MAPPEDREADS)/`basename $$i .txt`.sam samtools view -Sb sam/SL005_R00002_RME033_01pg_F3.csfasta.sam > sam/SL005_R00002_RME033_01pg_F3.csfasta.bam samtools sort -Sb SL005_R00002_RME033_01pg_F3.csfasta.bam SL005_R00002_RME033_01pg_F3.csfasta.sorted 512391 grep -v ENSMUST00000127664 Mus_musculus.NCBIM37.61.gtf > Mus_musculus.NCBIM37.61.corrected.gtf ~/software/cufflinks-0.9.3/cufflinks -G Mus_musculus.NCBIM37.61.corrected.gtf -v sam.old/SL005_R00002_RME033_01pg_F3.csfasta.sorted. |
![]() |
![]() |
![]() |
#17 | |
Member
Location: california Join Date: Jul 2009
Posts: 24
|
![]()
Another solution is go to tophat-1.2.0/src/gff.cpp
and change const uint GFF_MAX_LOCUS = 4000000; to const uint GFF_MAX_LOCUS = 5000000; then recompile the tophat Ensemble database indicate that this transcript ENSMUST00000127664 is 4.43Mb, bigger than previous cut-off GFF_MAX_LOCUS = 4000000. Quote:
|
|
![]() |
![]() |
![]() |
#18 |
Senior Member
Location: Pittsburgh Join Date: Feb 2010
Posts: 151
|
![]()
Will try new approach in next run uptill now we have been changing GTF file
|
![]() |
![]() |
![]() |
#19 |
Member
Location: Kansas City Join Date: May 2008
Posts: 14
|
![]()
Just wanted to add -- as of tophat 1.4.0 -- there is another way to get this "did not find any junctions" error message. If your gtf has any entries with nonstandard strand symbols, for instance '*', parsing will apparently fail for the the entire gtf, even though all other entries are OK.
Reading gtf_juncs.log will show you the offending line. |
![]() |
![]() |
![]() |
#20 | |
Member
Location: Seoul, Korea Join Date: Feb 2013
Posts: 57
|
![]() Quote:
Please tell me , I got the same error as follows. I am using everything from UCSC hg19 and downloaded everything, unpacked them but still getting this error: What to do ? ![]() Warning: couldn't find fasta record for 'chr17_ctg5_hap1'! Warning: couldn't find fasta record for 'chr17_gl000205_random'! Warning: couldn't find fasta record for 'chr19_gl000209_random'! Warning: couldn't find fasta record for 'chr1_gl000191_random'! Warning: couldn't find fasta record for 'chr4_ctg9_hap1'! Warning: couldn't find fasta record for 'chr4_gl000193_random'! Warning: couldn't find fasta record for 'chr4_gl000194_random'! Warning: couldn't find fasta record for 'chr6_apd_hap1'! Warning: couldn't find fasta record for 'chr6_cox_hap2'! Warning: couldn't find fasta record for 'chr6_dbb_hap3'! Warning: couldn't find fasta record for 'chr6_mann_hap4'! Warning: couldn't find fasta record for 'chr6_mcf_hap5'! Warning: couldn't find fasta record for 'chr6_qbl_hap6'! Warning: couldn't find fasta record for 'chr6_ssto_hap7'! Warning: couldn't find fasta record for 'chr7_gl000195_random'! Warning: couldn't find fasta record for 'chrUn_gl000211'! Warning: couldn't find fasta record for 'chrUn_gl000212'! Warning: couldn't find fasta record for 'chrUn_gl000218'! Warning: couldn't find fasta record for 'chrUn_gl000219'! Warning: couldn't find fasta record for 'chrUn_gl000220'! Warning: couldn't find fasta record for 'chrUn_gl000222'! Warning: couldn't find fasta record for 'chrUn_gl000223'! Warning: couldn't find fasta record for 'chrUn_gl000228'! |
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|