
  • Problem with UCSC GTF files?

    Hi,

    I would like to ask for opinions and advice on the different sources of GTF files for annotated genes (mm10, but other assemblies as well).
    I searched first to avoid posting a duplicate (sorry if this is still one).
    The topic is briefly mentioned in other forums, but I never found a thorough discussion that gave a satisfactory explanation.

    I wanted to download GTF files (mm10) from the UCSC Genome Browser to get reference genes and transcript variants for differential transcript variant expression and splicing analyses.

    However, no matter how I set up the Table Browser (UCSC genes, NCBI RefSeq, etc.), the GTF files I obtained from UCSC were not suitable for such analyses.
    I noticed that these GTF files (from UCSC) treat each transcript variant as a separate gene, since the "transcript_id" is identical to the "gene_id" in these files. (Did I do something wrong?)
    For these analyses I need a GTF file in which each gene ID is linked to (i.e. repeated across) multiple transcript variants, where variants exist. The only sources I found for such GTF files are Gencode and Ensembl.
    However, these files contain approx. 50,000 genes and 150,000 transcript variants, which seems like too many to me, probably because of predicted transcripts. The UCSC file has approx. 38,000 entries, which might be less redundant and less speculative? (No idea.)
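
    A quick way to check whether a given GTF file has this problem is to count genes and transcripts and see how many genes have a single transcript whose ID equals the gene ID. The following is only a minimal sketch: the function names and the example records are made up for illustration, and it assumes the standard tab-separated GTF layout with `gene_id` and `transcript_id` in the ninth column.

```python
import re
from collections import defaultdict

# Matches key "value" pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def summarize_gtf(lines):
    """Return (genes, transcripts, genes whose only transcript ID equals the gene ID)."""
    transcripts_per_gene = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed GTF record
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("gene_id")
        tx_id = attrs.get("transcript_id")
        if gene_id and tx_id:
            transcripts_per_gene[gene_id].add(tx_id)
    n_genes = len(transcripts_per_gene)
    n_transcripts = sum(len(txs) for txs in transcripts_per_gene.values())
    n_same = sum(1 for gene, txs in transcripts_per_gene.items() if txs == {gene})
    return n_genes, n_transcripts, n_same


# Made-up records: two "genes" that are really transcripts, one properly grouped gene.
example = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "NM_A"; transcript_id "NM_A";\n',
    'chr1\tsrc\texon\t500\t600\t.\t+\t.\tgene_id "NM_B"; transcript_id "NM_B";\n',
    'chr1\tsrc\texon\t900\t950\t.\t+\t.\tgene_id "GeneX"; transcript_id "NM_C";\n',
    'chr1\tsrc\texon\t900\t990\t.\t+\t.\tgene_id "GeneX"; transcript_id "NM_D";\n',
]
print(summarize_gtf(example))  # (3, 4, 2)
```

    For a real file, pass an open file handle instead of the list, e.g. `summarize_gtf(open("my_annotation.gtf"))` (the file name is a placeholder). If the third number is close to the first, the file is effectively treating every transcript as its own gene.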

    I would like some advice on where to find, or how to make, a GTF file that is suitable for differential splicing / transcript variant expression analyses.

    Would you recommend avoiding UCSC GTF files for expression analyses in general?

    Thank you for your help.

    Best.

  • #2
    Hi,

    I'm not an expert and my knowledge is limited to human genes, although I'd like to think that the principles outlined here extend to mouse genes as well.

    1) RefSeq - transcripts are well supported by evidence and heavily used (NM_... for known protein-coding transcripts)
    2) Ensembl / Gencode Comprehensive - contains both automatically annotated and manually curated transcripts
    3) Ensembl / Gencode Basic - contains manually curated transcripts only

    I'm not terribly familiar with UCSC. In the literature I have come across so far, the authors have almost always leaned towards using RefSeq or Ensembl.

    So the choice of which transcript annotation to go with depends on what you're trying to do.

    If you're interested in performing variant analysis of transcripts and want to ensure that they're supported by evidence, RefSeq or Gencode Basic is your friend.

    If you're concerned that limiting yourself to evidence-supported annotations might cause you to miss other, possibly novel, transcripts, then Gencode Comprehensive is the way to go.

    These two papers go into significantly more detail on the pros and cons of using one annotation set versus another.

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4339237/
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4502323/



    • #3
      Dear doraemon,

      Thank you for the response. I ended up with a similar conclusion. It is a bit confusing for a non-bioinformatician like me.



      • #4
        Hi krapulaxdoctor,

        I hit the same problem you mentioned in this thread. I think it is a bug in the UCSC Table Browser. To work around it, I downloaded both the GTF file and the refFlat file using the Table Browser, and then applied a custom Perl script, "gtf_addGeneName_from_refFlat.pl", to add the gene name to the GTF file.

        For your and other people's convenience, I have put my custom Perl scripts at https://github.com/Qiongyi/custom_PERL_scripts
        Feel free to use them if you run into a similar problem.

        Usage: gtf_addGeneName_from_refFlat.pl mm10.refGene.gtf mm10.refGene.refFlat.txt output
        (output is the updated GTF file with gene IDs)

        Cheers,

        Qiongyi
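
        For readers who prefer Python, the same idea can be sketched roughly as below. This is not the Perl script linked above, just an illustration of the approach under stated assumptions: the refFlat file is tab-separated with the gene name in column 1 and the transcript accession in column 2 (the UCSC refFlat layout), and each GTF line carries `gene_id` and `transcript_id` attributes. The function names and the example records are made up.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "[^"]*"')
TX_ID_RE = re.compile(r'transcript_id "([^"]*)"')


def load_refflat(lines):
    """Map transcript accession -> gene name from refFlat rows."""
    tx_to_gene = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip a header line, if present
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            tx_to_gene[cols[1]] = cols[0]  # column 2 = transcript, column 1 = gene name
    return tx_to_gene


def fix_gene_ids(gtf_lines, tx_to_gene):
    """Yield GTF lines with gene_id replaced by the gene name of the transcript."""
    for line in gtf_lines:
        match = TX_ID_RE.search(line)
        gene = tx_to_gene.get(match.group(1)) if match else None
        if gene is not None:
            # A function replacement avoids any escape handling of the gene name.
            line = GENE_ID_RE.sub(lambda _m: f'gene_id "{gene}"', line, count=1)
        yield line  # lines without a known transcript are passed through unchanged


# Made-up example records, not real annotation data.
refflat = [
    "GeneX\tNM_C\tchr1\t+\t899\t990\t899\t990\t1\t899,\t990,\n",
    "GeneX\tNM_D\tchr1\t+\t899\t950\t899\t950\t1\t899,\t950,\n",
]
gtf = [
    'chr1\tsrc\texon\t900\t990\t.\t+\t.\tgene_id "NM_C"; transcript_id "NM_C";\n',
    'chr1\tsrc\texon\t900\t950\t.\t+\t.\tgene_id "NM_D"; transcript_id "NM_D";\n',
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "NM_Z"; transcript_id "NM_Z";\n',
]
fixed = list(fix_gene_ids(gtf, load_refflat(refflat)))
for line in fixed:
    print(line, end="")
```

        After this step, NM_C and NM_D share the gene ID "GeneX", while NM_Z, which has no refFlat entry in the example, keeps its original gene ID. Note that the same gene name can appear at more than one locus, so it is worth checking the output before using it for quantification.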
