Hi everyone,
Does anyone know of a good script to only print one transcript from each gene in a GFF file?
I've been trying to get the Anolis NCBI annotation into a gff3 or gtf file for some time and have had very little luck. I've scoured their FTP and can only find .gbk files, which have all kinds of issues converting to any kind of gff. The bp_genbank2gff3.pl script gives me utter nonsense, and tens of GB of it.
So I moved on to UCSC's table annotations, the xenoRefSeq.txt files. I used the UCSC_table2gff3.pl script, but now I have many, many more genes and mRNAs than should be there.
I'm not particularly concerned with getting the annotation file 100% perfect; some of those alternate splicings that are actually real can go. I'd much rather just have the best single transcript from each gene than the roughly 120K genes and 300K mRNAs I have now.
So, does anyone know of a good script to print just one, preferably the longest, transcript from each gene into a new non-redundant gff file?
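To make the request concrete, here is a rough Python sketch of the kind of filter I mean. It is not a tested tool, and it rests on assumptions about the file that may not hold for the UCSC_table2gff3.pl output: nine tab-separated columns, transcripts typed as mRNA or transcript with ID= and Parent= attributes, and exons pointing at their transcript through Parent=. The function name longest_transcript_per_gene is made up for illustration. "Longest" is taken as summed exon length, falling back to the transcript's own span when it has no exons, and ties go to the transcript that appears first.

```python
import sys
from collections import defaultdict

# Assumption: these feature types mark transcripts in the input file.
TRANSCRIPT_TYPES = {"mRNA", "transcript"}


def parse_attributes(field):
    """Turn a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def longest_transcript_per_gene(lines):
    """Return GFF3 feature lines with only the longest transcript per gene."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # comments and directives are dropped here
        cols = line.split("\t")
        if len(cols) == 9:
            records.append((cols, parse_attributes(cols[8])))

    span = {}                    # transcript ID -> end - start + 1
    exon_len = defaultdict(int)  # transcript ID -> summed exon length
    gene_of = {}                 # transcript ID -> parent gene ID
    for cols, attrs in records:
        length = int(cols[4]) - int(cols[3]) + 1
        if cols[2] in TRANSCRIPT_TYPES and "ID" in attrs:
            span[attrs["ID"]] = length
            # A transcript without a Parent is treated as its own gene.
            gene_of[attrs["ID"]] = attrs.get("Parent", attrs["ID"])
        elif cols[2] == "exon":
            for parent in attrs.get("Parent", "").split(","):
                exon_len[parent] += length

    best = {}  # gene ID -> (length, transcript ID)
    for tid, gene in gene_of.items():
        length = exon_len.get(tid) or span[tid]
        if gene not in best or length > best[gene][0]:
            best[gene] = (length, tid)
    keep = {tid for _, tid in best.values()}

    out = []
    for cols, attrs in records:
        parents = attrs.get("Parent", "").split(",")
        if attrs.get("ID") in span:
            wanted = attrs["ID"] in keep           # transcript line
        elif any(p in span for p in parents):
            wanted = any(p in keep for p in parents)  # exon, CDS, UTR, ...
        else:
            wanted = True                          # genes and anything else
        if wanted:
            out.append("\t".join(cols))
    return out


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        print("##gff-version 3")
        print("\n".join(longest_transcript_per_gene(handle)))
```

Something like this would only thin out the mRNAs. It does nothing about the inflated gene count, and a child feature shared by several transcripts would keep its full Parent= list even after some of those parents are dropped.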