02-22-2012, 07:13 PM   #1
Senior Member
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
UCSC refseq to gff3

Hi everyone,

Does anyone know of a good script to only print one transcript from each gene in a GFF file?

I've been trying to get the Anolis NCBI annotation into a gff3 or gtf file for some time and have had very little luck. I've scoured their FTP site and can only find .gbk files, which have all kinds of issues converting to any kind of gff. The conversion gives me utter nonsense, and tens of GB of it.

So I moved on to UCSC's table annotations, the xenoRefSeq.txt files. I used the script, but now I have many, many more genes and mRNAs than should be there.

I'm not particularly concerned with getting the annotation file 100% perfect; some of the alternative splice forms that are actually real can go. I'd much rather have the single best transcript from each gene than the roughly 120K genes and 300K mRNAs I have now.

So, does anyone know of a good script to print just one, preferably the longest, transcript from each gene into a new non-redundant gff file?
Wallysb01
03-29-2012, 07:14 AM   #2
Location: California

Join Date: Feb 2011
Posts: 16
one transcript from each gene in a GFF file


I was looking for a way to do this too and stumbled upon this thread. Nobody has answered, so this was my solution:

1) use Ensembl BioMart to download the following to a tab-delimited text file
EnsemblGeneID GeneStart GeneEnd ChromosomeName Strand

2) convert this file to a gff with sed/awk:

# Strand is mapped inside awk so that a start or end coordinate of 1 is never mistaken for the strand column.
tail -n +2 human_genes_ENSrel62.txt | awk 'BEGIN{FS="\t";OFS="\t"}{strand = ($5 == "-1") ? "-" : "+"; print "chr"$4, "hg19_EnsGene", "CDS", $2, $3, ".", strand, ".", "gene_id \""$1"\";"}' > hg19_EnsGene.gff
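
The command above gives one line per gene rather than one transcript per gene. For the original question (keep only the longest transcript of each gene), here is a minimal Python sketch. It is not a tested tool, and it rests on assumptions that may not hold for every file: the input is GFF3, transcript features are typed mRNA or transcript and carry ID= and Parent= attributes, exons carry Parent= pointing at their transcript, and "longest" means the most exonic bases (falling back to the transcript span when a transcript has no exon lines). The function names pick_longest and filter_gff3 are made up for this example.

```python
from collections import defaultdict

TRANSCRIPT_TYPES = ("mRNA", "transcript")  # assumption: adjust to the types in your file


def parse_attrs(field):
    """Parse a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def split_feature(line):
    """Return (type, start, end, attrs), or None for comment, blank, or malformed lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    return cols[2], int(cols[3]), int(cols[4]), parse_attrs(cols[8])


def pick_longest(lines):
    """Return (IDs of the longest transcript per gene, IDs of all transcripts seen)."""
    gene_of, span, exon_len = {}, {}, defaultdict(int)
    for line in lines:
        feat = split_feature(line)
        if feat is None:
            continue
        ftype, start, end, attrs = feat
        if ftype in TRANSCRIPT_TYPES and "ID" in attrs and "Parent" in attrs:
            gene_of[attrs["ID"]] = attrs["Parent"]
            span[attrs["ID"]] = end - start + 1
        elif ftype == "exon" and "Parent" in attrs:
            for parent in attrs["Parent"].split(","):
                exon_len[parent] += end - start + 1
    best = {}  # gene ID -> (length, transcript ID); the first transcript seen wins ties
    for tid, gid in gene_of.items():
        length = exon_len.get(tid) or span[tid]
        if gid not in best or length > best[gid][0]:
            best[gid] = (length, tid)
    return {tid for _, tid in best.values()}, set(gene_of)


def filter_gff3(lines):
    """Drop every transcript that is not the longest for its gene, plus its child features."""
    lines = list(lines)
    keep, known = pick_longest(lines)
    out = []
    for line in lines:
        feat = split_feature(line)
        if feat is not None:
            ftype, _, _, attrs = feat
            if ftype in TRANSCRIPT_TYPES:
                owners = {attrs.get("ID")}
            else:
                owners = set(attrs.get("Parent", "").split(","))
            owners &= known  # only transcripts we know about can cause a drop
            if owners and not owners & keep:
                continue
        out.append(line)
    return out
```

To use it, read the GFF3 into a list of lines, pass it to filter_gff3, and write the returned lines to a new file. Gene lines, comments, and anything without a known transcript parent pass through unchanged. An exon shared by several transcripts is kept if any of them survives, and its Parent attribute is left as it was.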

dglemay
03-29-2012, 10:06 AM   #3
Senior Member
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

How about the A. carolinensis GTF from the Ensembl FTP site?
kmcarr
