SEQanswers

02-22-2012, 06:13 PM   #1
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
UCSC refseq to gff3

Hi everyone,

Does anyone know of a good script to print only one transcript from each gene in a GFF file?

I've been trying to get the Anolis NCBI annotation into a GFF3 or GTF file for some time, with very little luck. I've scoured their FTP site and can only find .gbk files, which have all kinds of issues converting to any kind of GFF. The bp_genbank2gff3.pl script gives me utter nonsense, and tens of GB of it.

So I moved on to UCSC's table annotations, the xenoRefSeq.txt files. I used the UCSC_table2gff3.pl script, but now I have many, many more genes and mRNAs than should be there.

I'm not particularly concerned with getting the annotation file 100% perfect; some of those alternate splicings that are actually real can go. I'd much rather have just the best single transcript from each gene than the roughly 120K genes and 300K mRNAs I have now.

So, does anyone know of a good script to print just one, preferably the longest, transcript from each gene into a new non-redundant gff file?
03-29-2012, 06:14 AM   #2
dglemay
Member
 
Location: California

Join Date: Feb 2011
Posts: 16
one transcript from each gene in a GFF file

Hi,

I was looking for a way to do this too and stumbled upon this thread. Nobody has answered, so here is my solution:

1) Use Ensembl BioMart to download the following columns to a tab-delimited text file:
EnsemblGeneID GeneStart GeneEnd ChromosomeName Strand

2) convert this file to a gff with sed/awk:

tail -n +2 human_genes_ENSrel62.txt | awk 'BEGIN{FS=OFS="\t"}{strand = ($5 == "-1") ? "-" : "+"; print "chr"$4, "hg19_EnsGene", "CDS", $2, $3, ".", strand, ".", "gene_id \""$1"\";"}' > hg19_EnsGene.gff
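
For anyone who prefers a script to a one-liner, the same conversion might look roughly like this in Python. This is only a sketch: it assumes the BioMart export has exactly the five columns listed in step 1, in that order, with a single header row. The function names `biomart_to_gff` and `convert` are made up for illustration. The strand codes 1 and -1 are mapped to + and - by column, so a gene that happens to start at position 1 is not mistaken for a strand value.

```python
import csv
import io

# BioMart reports strand as 1 / -1; GFF expects + / -.
STRAND = {"1": "+", "-1": "-"}


def biomart_to_gff(rows, source="hg19_EnsGene", feature="CDS"):
    """Yield one GFF line per row of (gene_id, start, end, chrom, strand)."""
    for row in rows:
        if len(row) < 5:
            continue  # skip blank or truncated lines
        gene_id, start, end, chrom, strand = row[:5]
        attrs = 'gene_id "{}";'.format(gene_id)
        yield "\t".join(
            ["chr" + chrom, source, feature, start, end, ".",
             STRAND.get(strand, "."), ".", attrs]
        )


def convert(in_handle, out_handle):
    """Read a tab-delimited BioMart export and write GFF lines."""
    reader = csv.reader(in_handle, delimiter="\t")
    next(reader, None)  # drop the header row, like `tail -n +2`
    for line in biomart_to_gff(reader):
        out_handle.write(line + "\n")
```

To run it on real files, open `human_genes_ENSrel62.txt` for reading and `hg19_EnsGene.gff` for writing and pass both handles to `convert`.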

-Danielle
03-29-2012, 09:06 AM   #3
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177

How about the A. carolinensis GTF from the Ensembl FTP site?
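
Starting from a GTF like that, keeping only the longest transcript per gene could be sketched as below. The assumptions are mine and not from the thread: every exon line carries quoted `gene_id` and `transcript_id` attributes, "longest" means the largest summed exon length, and ties go to the transcript seen first. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attrs(field):
    """Return the quoted key/value attributes from GTF column 9 as a dict."""
    return dict(ATTR_RE.findall(field))


def longest_transcripts(gtf_lines):
    """Yield only the lines belonging to the longest transcript of each gene.

    Length is the summed exon length. Comment lines and lines without a
    transcript_id are passed through unchanged.
    """
    lines = [ln.rstrip("\n") for ln in gtf_lines]

    # Pass 1: total exon length per transcript, and each transcript's gene.
    exon_len = defaultdict(int)
    gene_of = {}
    for ln in lines:
        cols = ln.split("\t")
        if ln.startswith("#") or len(cols) < 9 or cols[2] != "exon":
            continue
        attrs = parse_attrs(cols[8])
        tid, gid = attrs.get("transcript_id"), attrs.get("gene_id")
        if tid is None or gid is None:
            continue
        exon_len[tid] += int(cols[4]) - int(cols[3]) + 1  # GTF is 1-based, inclusive
        gene_of[tid] = gid

    # Pick the longest transcript per gene (first one wins on ties).
    best = {}
    for tid, length in exon_len.items():
        gid = gene_of[tid]
        if gid not in best or length > exon_len[best[gid]]:
            best[gid] = tid
    keep = set(best.values())

    # Pass 2: emit lines for the chosen transcripts only.
    for ln in lines:
        cols = ln.split("\t")
        if ln.startswith("#") or len(cols) < 9:
            yield ln
            continue
        tid = parse_attrs(cols[8]).get("transcript_id")
        if tid is None or tid in keep:
            yield ln
```

This reads the whole file into memory, which should be fine for a single-genome annotation. If the UCSC-derived GFF3 is used instead, the attribute parsing would need to change to the `ID=`/`Parent=` style.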
All times are GMT -8.