krapulaxdoctor 12-23-2017 05:13 AM

Problem with UCSC GTF files?

I would like to ask for opinions and advice on the different sources of GTF files for annotated genes (mm10, but other assemblies as well).
I searched first to avoid posting a duplicate (sorry if this still is one).
The topic is briefly mentioned on other forums, but I never found a thorough discussion with a satisfactory explanation.

I wanted to download GTF files (mm10) from the UCSC Genome Browser to get reference genes and transcript variants for differential transcript variant expression and splicing analyses.

However, no matter how I set up the Table Browser (UCSC Genes, NCBI RefSeq, etc.), the GTF files I obtained from UCSC were not suitable for such analyses.
I noticed that these UCSC GTF files treat each transcript variant as a separate gene, since the "transcript_id" is identical to the "gene_id" in these files. (Did I do something wrong?)
For these analyses I need a GTF file in which each gene ID is linked to (i.e. repeated across) all of its transcript variants, if it has any. The only sources where I found such GTF files are Gencode and Ensembl.
However, these files contain approx. 50,000 genes and 150,000 transcript variants, which seems like too many to me, presumably because predicted transcripts are included. The UCSC file has approx. 38,000 entries, which might be less redundant and speculative? (No idea.)
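
To make the gene ID / transcript ID observation concrete, here is a minimal Python sketch that counts genes and transcripts in a GTF and reports whether every gene_id is simply a copy of its transcript_id. The function names and the toy records are made up for illustration, and the attribute parsing assumes the usual `key "value";` layout in the ninth column.

```python
import re
from collections import defaultdict

# Matches quoted attribute pairs such as: gene_id "G1"; transcript_id "T1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def transcripts_per_gene(gtf_lines):
    """Map each gene_id to the set of transcript_ids listed under it."""
    genes = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            genes[attrs["gene_id"]].add(attrs["transcript_id"])
    return genes

def summarize(genes):
    """Return (gene count, transcript count, True if gene_id == transcript_id everywhere)."""
    n_genes = len(genes)
    n_transcripts = sum(len(tx) for tx in genes.values())
    ids_identical = n_genes > 0 and all(tx == {g} for g, tx in genes.items())
    return n_genes, n_transcripts, ids_identical

# Toy records (hypothetical IDs), one set per annotation style.
ucsc_style = [
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "NM_A"; transcript_id "NM_A";\n',
    'chr1\ttoy\texon\t100\t250\t.\t+\t.\tgene_id "NM_B"; transcript_id "NM_B";\n',
]
gencode_style = [
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\ttoy\texon\t100\t250\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n',
]

print(summarize(transcripts_per_gene(ucsc_style)))     # (2, 2, True)
print(summarize(transcripts_per_gene(gencode_style)))  # (1, 2, False)
```

On a real file, passing an open file handle instead of the toy list should work the same way, since the function only iterates over lines.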

I would like to ask for advice on where to find, or how to make, an optimal GTF file that is suitable for differential splicing / transcript variant expression analyses.
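
One possible way to "make" such a file, sketched here under assumptions: if a transcript-to-gene lookup is available from some other table, the gene_id attribute can be rewritten so that variants of the same gene share one ID. The `relink_gene_ids` function, the `tx_to_gene` dictionary, and all IDs below are hypothetical; where the mapping comes from is left open.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "[^"]*";')
TX_ID_RE = re.compile(r'transcript_id "([^"]*)";')

def relink_gene_ids(gtf_lines, tx_to_gene):
    """Yield GTF lines whose gene_id is replaced via a transcript-to-gene mapping.

    tx_to_gene is a hypothetical dict {transcript_id: gene_id}; lines whose
    transcript is not in the mapping (or that have no transcript_id) pass
    through unchanged.
    """
    for line in gtf_lines:
        match = TX_ID_RE.search(line)
        gene = tx_to_gene.get(match.group(1)) if match else None
        if gene is None:
            yield line
        else:
            # A function replacement avoids backslash/escape surprises in gene names.
            yield GENE_ID_RE.sub(lambda m: f'gene_id "{gene}";', line, count=1)

# Toy input: two variants that the (made-up) mapping assigns to one gene.
toy_gtf = [
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "NM_A"; transcript_id "NM_A";\n',
    'chr1\ttoy\texon\t100\t250\t.\t+\t.\tgene_id "NM_B"; transcript_id "NM_B";\n',
]
toy_mapping = {"NM_A": "GeneX", "NM_B": "GeneX"}

for fixed in relink_gene_ids(toy_gtf, toy_mapping):
    print(fixed, end="")
```

After this step both toy records carry the same gene_id while keeping their own transcript_id, which is the layout described above as necessary for variant-level analyses.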

Would you recommend avoiding UCSC GTF files for expression analyses in general?

Thank you for your help.


doraemon 12-25-2017 11:55 PM


I'm not an expert, and my knowledge is limited to human genes, although I'd like to think that the principles outlined here extend to mouse genes as well.

1) RefSeq - transcripts are well supported by evidence and heavily used (NM_... for known protein-coding transcripts)
2) Ensembl / Gencode Comprehensive - contains both automatically annotated and manually curated transcripts
3) Ensembl / Gencode Basic - contains manually curated transcripts only

I'm not terribly familiar with UCSC. In the literature I have come across so far, the authors have almost always leaned towards RefSeq or Ensembl.

So the choice of transcript annotation depends on what you're trying to do.

If you're interested in transcript variant analysis and want to ensure that the transcripts are supported by evidence, RefSeq or Gencode Basic is your friend.

If you're concerned that limiting yourself to evidence-supported annotations might mean missing other, possibly novel, transcripts, then Gencode Comprehensive is the way to go.

These two papers go into significantly more detail on the pros and cons of using one annotation set versus another.

krapulaxdoctor 01-12-2018 09:06 AM

Dear doraemon,

Thank you for the response. I ended up with a similar conclusion. It is a bit confusing for a non-bioinformatician like me.

