SEQanswers

Old 05-25-2010, 02:08 PM   #1
Bio.X2Y
Member
 
Location: Europe

Join Date: Apr 2010
Posts: 46
TopHat GFF3 for UCSC Gene HG19

Hi,

TopHat can accept user-specified junctions via a GFF3 file, so I'm trying to find a GFF3 file that represents the UCSC Gene model for Human (hg19).

There are a lot of posts asking for similar files, and the gist of the replies seems to be that the SONG gtf2gff3 Perl script can convert an Ensembl GTF file to valid GFF3, but that it doesn't work on the UCSC GTF files.

Does anybody know of a reliable tool for creating GFF3 from UCSC GTF?

If I need to write my own, would anyone be comfortable enough with TopHat or the GFF3 format to help answer these:

(1) does TopHat care if each transcript is modeled independently of the other transcripts in its cluster? I suspect the proper way to create a GFF3 would be to model the UCSC clusters (from knownIsoforms) as top level gene features, with the transcripts (from knownGene) modeled as child features. A side effect of this is that exon definitions can be shared across transcripts. If I ignore the top level and model transcripts independently, will TopHat be happy?

(2) does TopHat need the GFF records to be sorted in some way?

Thanks,
Bio.X2Y
Old 05-26-2010, 03:07 PM   #2
gtb
Junior Member
 
Location: San Francisco

Join Date: May 2010
Posts: 5

I'm interested in the same thing. The problem is that GTF files created with the UCSC Table Browser have the same value for gene_id and transcript_id, which trips up the gtf2gff3.pl converter script. If you download knownGene.txt from the UCSC annotations, you have this information, but not in GTF format. I think one way to solve this is to use knownGene.txt to find the mapping from transcript IDs to gene IDs, use that mapping to correct the GTF file, and then run the gtf2gff3 converter script.
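
The "correct the GTF file" step could look roughly like the Python sketch below. It is only an illustration: the function name `fix_gene_ids`, the `tx_to_gene` dictionary, and the gene label `cluster1` are hypothetical, and the sketch assumes the transcript-to-gene mapping has already been built from whichever UCSC table you trust. It rewrites the `gene_id` attribute of every record whose `transcript_id` appears in the mapping and leaves all other lines alone.

```python
import re

def fix_gene_ids(gtf_lines, tx_to_gene):
    """Return GTF lines with gene_id replaced via a transcript -> gene mapping.

    tx_to_gene is a hypothetical dict prepared beforehand; records whose
    transcript_id is not in the mapping are passed through unchanged.
    """
    fixed = []
    for line in gtf_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Keep comments, blank lines, and malformed rows as they are.
        if line.startswith("#") or len(fields) < 9:
            fixed.append(line)
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match and match.group(1) in tx_to_gene:
            gene = tx_to_gene[match.group(1)]
            # A function replacement avoids backslash surprises in re.sub.
            fields[8] = re.sub(r'gene_id "[^"]*"',
                               lambda _m: 'gene_id "%s"' % gene,
                               fields[8])
        fixed.append("\t".join(fields))
    return fixed

# Illustrative record in the style of a Table Browser GTF export.
example = ['chr1\thg19_knownGene\texon\t11874\t12227\t0.000000\t+\t.\t'
           'gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";']
print(fix_gene_ids(example, {"uc001aaa.3": "cluster1"})[0])
```

The transcript IDs are untouched, so the result can still be checked against the original export before it is handed to gtf2gff3.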

On another thread, sdriscoll posted the following, but it does not look like a proper solution:
When I was initially getting TopHat to run a few weeks ago, I had a hard time getting the GFF file to work. To make my GFF file, I used the knownGene table from the UCSC site and had it produce a GTF file. I found a conversion script that changed it to a GFF3 format file. On top of that, I had to replace every occurrence of "transcript" with "mRNA". At first this didn't work. The Bowtie index I was using turned out to be the real issue. It worked fine without the GFF3 file, but when I included it I'd get that same "junctions database is empty" error. I was using a pre-compiled Bowtie index linked from the Bowtie site. To resolve the issue, I built a new Bowtie index myself using FASTA files, sorted by chromosome, downloaded from the UCSC site. Since my GFF3 file came from there, I figured maybe my Bowtie index should come from there as well. Sure enough, that fixed it.

Last edited by gtb; 05-26-2010 at 03:32 PM. Reason: more information
Old 05-27-2010, 01:12 PM   #3
gtb
Junior Member
 
Location: San Francisco

Join Date: May 2010
Posts: 5

I tried my own suggestion but got several instances of the following error from gtf2gff3:
ERROR: strand conflict: validate_and_build_gene
and finally:
FATAL: Can't determine strand in: sort_feature_types.

Clearly there are more problems with the UCSC Table browser output. I haven't determined the exact cause yet.
Old 05-27-2010, 03:04 PM   #4
Bio.X2Y
Member
 
Location: Europe

Join Date: Apr 2010
Posts: 46

Thanks gtb for reporting this, I was about to try myself.

I've abandoned the GFF3 approach for now, and am instead going to provide junctions to TopHat via a "raw junctions file" (the only alternative to a GFF3 as far as I know).

I've written a Perl script to create this file from UCSC hg19 (knownGene); it's attached in case you want to try the same approach. It simply walks through each transcript in isolation, so the raw output will contain duplicates wherever transcripts share a junction.

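I don't know if TopHat cares, but I've sorted my output and removed duplicates just in case. Note that `sort -u` compares only the listed keys, so the strand column is included as a key to avoid merging junctions that differ only by strand:
sort -u -k 1,1 -k 2,2n -k 3,3n -k 4,4 tophat.juncs.tmp > tophat.juncs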
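
For readers without the attachment, the idea can be sketched in Python as follows. This is not the attached Perl script. It assumes the standard knownGene.txt column layout (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...) with zero-based, half-open exon coordinates. It also assumes that a raw junction line is `chrom<TAB>left<TAB>right<TAB>strand`, where `left` is the zero-based last base of the upstream exon and `right` is the zero-based first base of the downstream exon. Please check that convention against the manual for your TopHat version. The function name is hypothetical.

```python
def junctions_from_known_gene(lines):
    """Collect unique, sorted (chrom, left, right, strand) junction tuples."""
    juncs = set()  # a set removes junctions shared by several transcripts
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        chrom, strand = f[1], f[2]
        # exonStarts / exonEnds are comma-separated with a trailing comma.
        starts = [int(x) for x in f[8].split(",") if x]
        ends = [int(x) for x in f[9].split(",") if x]
        for i in range(len(starts) - 1):
            left = ends[i] - 1     # exon end is exclusive, so step back one
            right = starts[i + 1]  # exon start is already zero-based
            juncs.add((chrom, left, right, strand))
    return sorted(juncs)

# Illustrative three-exon row in knownGene.txt layout.
row = ("uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t"
       "11873,12612,13220,\t12227,12721,14409,\t\tuc001aaa.3")
for chrom, left, right, strand in junctions_from_known_gene([row]):
    print("%s\t%d\t%d\t%s" % (chrom, left, right, strand))
```

Because the tuples are collected in a set and sorted before output, no separate de-duplication pass is needed for this sketch.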
Attached Files
File Type: pl create_tophat_raw_junction_file.pl (2.9 KB, 104 views)
Old 05-27-2010, 06:31 PM   #5
gtb
Junior Member
 
Location: San Francisco

Join Date: May 2010
Posts: 5

I found out that most, if not all, of the trouble comes from genes that have transcripts on both strands. Most of these have transcript names ending in _dupX in the UCSC Table Browser files. If I remove these, gtf2gff3 runs to completion. A few other genes still show the same strand problem. I will test the resulting GFF file later.
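
One way to automate this kind of filtering is sketched below in Python. It is only an illustration with hypothetical function names, not the procedure actually used in this post. It assumes the X in `_dupX` is a number and that `gene_id` values have already been set so that transcripts of one gene share an ID. It reports genes whose records appear on more than one strand, then drops those genes together with any `_dup` transcripts.

```python
import re
from collections import defaultdict

def mixed_strand_genes(gtf_lines):
    """Return the set of gene_ids whose records occur on more than one strand."""
    strands = defaultdict(set)
    for line in gtf_lines:
        f = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(f) < 9:
            continue
        m = re.search(r'gene_id "([^"]+)"', f[8])
        if m:
            strands[m.group(1)].add(f[6])  # column 7 of a GTF row is the strand
    return {gene for gene, seen in strands.items() if len(seen) > 1}

def drop_problem_records(gtf_lines):
    """Remove mixed-strand genes and transcripts named like *_dup1.

    gtf_lines must be a list, because it is traversed twice.
    """
    bad = mixed_strand_genes(gtf_lines)
    kept = []
    for line in gtf_lines:
        g = re.search(r'gene_id "([^"]+)"', line)
        t = re.search(r'transcript_id "([^"]+)"', line)
        if g and g.group(1) in bad:
            continue
        # Assumption: the duplicate suffix is "_dup" followed by digits.
        if t and re.search(r"_dup\d+$", t.group(1)):
            continue
        kept.append(line)
    return kept
```

Printing the result of `mixed_strand_genes` before filtering gives a list of the affected genes, which makes it easier to see how much annotation is lost by dropping them.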
Old 06-07-2010, 12:43 PM   #6
gtb
Junior Member
 
Location: San Francisco

Join Date: May 2010
Posts: 5

The GFF3 file I created can be used as input to TopHat without throwing errors. I'm still not 100% sure it is completely correct.
Tags
gff3, hg19, tophat, ucsc
