  • TopHat GFF3 for UCSC Gene HG19

    Hi,

    TopHat can accept user-specified junctions via a GFF3 file, so I'm trying to find a GFF3 file that represents the UCSC Gene model for Human (hg19).

    There are a lot of posts asking for similar files, and the gist of the replies seems to be that the SONG gtf2gff3 Perl script can convert an Ensembl GTF file to a valid GFF3 file, but that it doesn't work on the UCSC GTF files.

    Does anybody know of a reliable tool for creating GFF3 from UCSC GTF?

    If I need to write my own, would anyone be comfortable enough with TopHat or the GFF3 format to help answer these:

    (1) does TopHat care if each transcript is modeled independently of the other transcripts in its cluster? I suspect the proper way to create a GFF3 would be to model the UCSC clusters (from knownIsoforms) as top level gene features, with the transcripts (from knownGene) modeled as child features. A side effect of this is that exon definitions can be shared across transcripts. If I ignore the top level and model transcripts independently, will TopHat be happy?

    (2) does TopHat need the GFF records to be sorted in some way?
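
    For reference, the hierarchy described in (1) could be sketched as below. This only illustrates the GFF3 layout: one gene feature per cluster, mRNA children, and each exon written once with every transcript that uses it listed in a comma-separated Parent attribute. The function name, the `cluster<N>` ID scheme and the `UCSC` source label are made up for the example, and whether TopHat accepts shared exons is exactly the open question.

```python
from collections import defaultdict


def cluster_to_gff3(cluster_id, chrom, strand, transcripts, source="UCSC"):
    """Build GFF3 rows for one cluster.

    transcripts: dict of transcript_id -> list of (start, end) exon tuples,
    1-based inclusive, as GFF3 expects. All names here are illustrative.
    """
    # Record each distinct exon once, with the transcripts that contain it.
    exon_parents = defaultdict(list)
    for tx_id, exons in transcripts.items():
        for exon in exons:
            exon_parents[exon].append(tx_id)

    def row(ftype, start, end, attrs):
        # Nine tab-separated GFF3 columns; score and phase are left as ".".
        return "\t".join(
            [chrom, source, ftype, str(start), str(end), ".", strand, ".", attrs]
        )

    gene_id = f"cluster{cluster_id}"
    gene_start = min(start for start, _ in exon_parents)
    gene_end = max(end for _, end in exon_parents)
    lines = [row("gene", gene_start, gene_end, f"ID={gene_id}")]

    # One mRNA feature per transcript, pointing at the cluster-level gene.
    for tx_id, exons in transcripts.items():
        tx_start = min(start for start, _ in exons)
        tx_end = max(end for _, end in exons)
        lines.append(row("mRNA", tx_start, tx_end, f"ID={tx_id};Parent={gene_id}"))

    # Shared exons get a multi-valued Parent attribute.
    for n, ((start, end), parents) in enumerate(sorted(exon_parents.items()), 1):
        attrs = f"ID={gene_id}.exon{n};Parent={','.join(parents)}"
        lines.append(row("exon", start, end, attrs))
    return lines
```

    Modeling each transcript independently would amount to calling this with one transcript per "cluster", so that no exon ever has more than one parent.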

    Thanks,
    Bio.X2Y

  • #2
    I'm interested in the same thing. The problem is that UCSC GTF files created with the Table browser have the same value for gene_id and transcript_id, and this trips up the gtf2gff3.pl converter script. If you download knownGene.txt from the UCSC annotations you have this information, but not in GTF format. I think one way to solve this is to use knownGene.txt to find the mapping from transcript IDs to gene IDs, use that to correct the GTF file, and then run the gtf2gff3 converter script.
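
    A minimal sketch of that correction step, assuming the mapping is available as a two-column, tab-separated table of gene (cluster) ID and transcript ID. That layout is how I understand the knownIsoforms table mentioned in the first post to be organised, but treat it as an assumption; the function names are hypothetical too.

```python
import re

TX_RE = re.compile(r'transcript_id "([^"]+)"')
GENE_RE = re.compile(r'gene_id "[^"]*"')


def load_tx_to_gene(rows):
    """Read 'gene_id<TAB>transcript_id' lines into a transcript -> gene dict.

    The two-column layout is an assumption about the mapping file.
    """
    mapping = {}
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            mapping[fields[1]] = fields[0]
    return mapping


def fix_gene_ids(gtf_lines, tx_to_gene):
    """Yield GTF lines with gene_id replaced by the mapped gene ID.

    Lines whose transcript is not in the mapping are passed through unchanged.
    """
    for line in gtf_lines:
        if not line.startswith("#"):
            match = TX_RE.search(line)
            if match and match.group(1) in tx_to_gene:
                gene = tx_to_gene[match.group(1)]
                # A function replacement avoids backslash handling in the ID.
                line = GENE_RE.sub(lambda _m: f'gene_id "{gene}"', line, count=1)
        yield line
```

    The corrected lines can then be written to a new GTF file and passed to the converter.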

    On another thread, sdriscoll posted the following, but it does not look like a proper solution:
    when i was initially getting Tophat to run a few weeks ago i had a hard time getting the GFF file to work. to make my GFF file i used the knownGene table from the UCSC site and had it produce a GTF file. I found a conversion script that changed it to a GFF3 format file. on top of that I had to do a text-replacement for any occurrence of "transcript" and replaced it with "mRNA". at first this didn't work. the bowtie index i was using turned out to be the real issue. it worked fine without the gff3 file but when i included it i'd get that same "junctions database is empty" error. I was using a bowtie index that was pre-compiled and linked from the bowtie site. to resolve the issue i built a new bowtie index myself using FASTA files sorted by chromosome downloaded from the UCSC site. since my gff3 file came from there i figured maybe my bowtie index should come from there as well. sure enough that fixed it.
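
    The "transcript" to "mRNA" replacement in that quote is described as a plain text substitution. A slightly safer variant, sketched here with a hypothetical function name, would touch only the feature-type column (column 3) so that attribute text such as transcript IDs is left alone. It assumes the usual nine tab-separated columns.

```python
def rename_feature_type(lines, old="transcript", new="mRNA"):
    """Yield GFF/GTF lines with the type column changed from `old` to `new`.

    Only column 3 is compared and rewritten; comments and short lines
    are passed through untouched.
    """
    for line in lines:
        fields = line.split("\t")
        if not line.startswith("#") and len(fields) >= 9 and fields[2] == old:
            fields[2] = new
            line = "\t".join(fields)
        yield line
```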
    Last edited by gtb; 05-26-2010, 03:32 PM. Reason: more information

    • #3
      I tried my own suggestion but got several instances of the following error from gtf2gff3:
      ERROR: strand conflict: validate_and_build_gene
      and finally:
      FATAL: Can't determine strand in: sort_feature_types.

      Clearly there are more problems with the UCSC Table browser output. I haven't determined the exact cause yet.

      • #4
        Thanks gtb for reporting this, I was about to try myself.

        I've abandoned the GFF3 approach for now, and am instead going to provide junctions to TopHat via a "raw junctions file" (the only alternative to a GFF3 as far as I know).

        I've written a Perl script to create this file from UCSC hg19 (knownGene) - it's attached in case you want to try the same approach. It simply walks through each transcript in isolation, so the output will contain duplicates.

        I don't know if TopHat cares, but I've sorted my output and removed duplicates just in case:
        sort -u -k 1,1 -k 2,2n -k 3,3n tophat.juncs.tmp > tophat.juncs
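
        For anyone without the attachment, the same idea can be sketched in Python. This is not the attached script, just an illustration under two assumptions: that knownGene.txt has chrom, strand, exonStarts and exonEnds in columns 2, 3, 9 and 10 (with 0-based starts and exclusive ends), and that the raw junctions format is tab-delimited `chrom left right strand` with zero-based coordinates for the last base of the left exon and the first base of the right exon, which is how I read the TopHat manual. Function names are made up.

```python
def knowngene_junctions(lines):
    """Collect unique (chrom, left, right, strand) junctions from knownGene rows."""
    juncs = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # skip blank or truncated rows
        chrom, strand = fields[1], fields[2]
        # exonStarts / exonEnds are comma-separated with a trailing comma.
        starts = [int(x) for x in fields[8].split(",") if x]
        ends = [int(x) for x in fields[9].split(",") if x]
        for i in range(len(starts) - 1):
            left = ends[i] - 1       # last base of the upstream exon (0-based)
            right = starts[i + 1]    # first base of the downstream exon (0-based)
            juncs.add((chrom, left, right, strand))
    # Sorted by chromosome name, then numerically by left and right position.
    return sorted(juncs)


def format_junctions(juncs):
    """Render junction tuples as tab-delimited lines."""
    return ["\t".join([c, str(l), str(r), s]) for c, l, r, s in juncs]
```

        Collecting into a set removes the duplicates that come from walking each transcript separately, so the separate `sort -u` pass is not needed with this version.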
        Attached Files

        • #5
          I found out that most if not all the trouble is coming from genes that have transcripts from both strands. Most of these have transcript names ending in _dupX in the UCSC Table browser files. If I remove these I can get gtf2gff3 to run to completion. There are still a few strand issues from other genes that have this problem. I will test the resulting gff file later.
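
          A rough sketch of that filtering, with hypothetical function names. It drops records whose transcript_id contains `_dup` and, going a step further than what I did by hand, also drops every gene whose records appear on more than one strand. Grouping by gene_id only makes sense once the gene IDs have been corrected, since the Table browser output uses the transcript ID for both.

```python
import re
from collections import defaultdict

ATTR_RE = re.compile(r'(gene_id|transcript_id) "([^"]*)"')


def parse_ids(line):
    """Return (strand, gene_id, transcript_id) for one GTF record."""
    fields = line.rstrip("\n").split("\t")
    attrs = dict(ATTR_RE.findall(fields[8]))
    return fields[6], attrs.get("gene_id"), attrs.get("transcript_id")


def drop_strand_conflicts(gtf_lines):
    """Return (kept_lines, conflicted_gene_ids).

    A gene is 'conflicted' when its records carry more than one strand value.
    """
    records = [l for l in gtf_lines if l.strip() and not l.startswith("#")]

    # First pass: collect the set of strands seen for each gene.
    strands = defaultdict(set)
    for line in records:
        strand, gene, _ = parse_ids(line)
        strands[gene].add(strand)
    conflicted = {gene for gene, seen in strands.items() if len(seen) > 1}

    # Second pass: keep only records from unconflicted, non-_dup transcripts.
    kept = []
    for line in records:
        _, gene, tx = parse_ids(line)
        if gene in conflicted or "_dup" in (tx or ""):
            continue
        kept.append(line)
    return kept, conflicted
```

          Printing the conflicted set is a quick way to see which genes are being lost before deciding whether dropping them is acceptable.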

          • #6
            The GFF3 file I created can be used as input to TopHat without throwing errors. I'm still not 100% sure it is completely correct.
