  • how to get a gtf file for cufflinks

    Hi,

    Can I ask how to get a GTF file for TopHat or Cufflinks?

    I used the UCSC Table Browser to get a GTF file. Its content looks like this:

    Code:
    chr1	hg19_refGene	start_codon	67000042	67000044	0.000000	+	.	gene_id "NM_032291"; transcript_id "NM_032291";
    Is that right? Do I need to add the gene symbol to this table?

    Thanks,
    Peter
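
A quick way to look into this question is to parse the attribute column and compare the two IDs. The following is only a minimal sketch in Python; the function names are made up for illustration. In the example line above, gene_id and transcript_id carry the same accession, so a tool that groups transcripts by gene_id would presumably treat every transcript as its own gene.

```python
import re

def parse_gtf_attributes(attr_field):
    """Return the key/value pairs from column 9 of a GTF line as a dict."""
    return dict(re.findall(r'(\S+) "([^"]*)";?', attr_field))

def gene_id_is_transcript_id(gtf_line):
    """True if gene_id and transcript_id hold the same value on this line."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    attrs = parse_gtf_attributes(fields[8])
    return attrs.get("gene_id") == attrs.get("transcript_id")

# The example line from the post above, with tabs written out explicitly.
line = ('chr1\thg19_refGene\tstart_codon\t67000042\t67000044\t0.000000\t+\t.\t'
        'gene_id "NM_032291"; transcript_id "NM_032291";')

print(gene_id_is_transcript_id(line))
```

If this returns True for every line of a file, the file has no separate gene-level ID, and adding one (for example a gene symbol) is probably worth considering.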

  • #2
    I recommend downloading GTF annotation from this page:


    These files were designed to go with Tophat/Cufflinks and have all the expected fields.


    • #3
      thanks so much


      • #4
        Hi camelbbs,

        Just saw your post and thought I'd give you a word of caution. Make sure you take a close look at your annotation. We just started to play around with the iGenomes stuff, but I'll tell you right now that using different annotations from UCSC (RefSeq, Ensembl, and Gencode, which should be pretty much the same as Ensembl) and from iGenomes has led to very different results in different cases. It sort of depends on your question, but make sure that the annotation you are looking at is good for the stuff you're most concerned about. One would hope that choice of annotation would be a robust parameter in these types of analyses, but we haven't found that to be the case. In the end, those of us who don't have an inordinate amount of time to spend vetting these things have to take a close look at them and then pick one to stick with. Good luck.

          • #6
            Thanks juefish,

            So do you have a recommendation for which GTF to choose? Which one has been proven in most work? And I am curious how they link the gene annotation to the alignment sequences. Is it by the coordinates?


            • #7
              As far as I know GTF/GFF annotations are indeed related to the reference sequences (which is what I presume you mean by 'alignment sequences') solely by coordinate positions. This, of course, means that you need to be very careful to pick and use the GTF version that was created for your reference version.
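
Because the link is purely by coordinates and sequence name, one cheap sanity check is to confirm that every sequence name in the GTF also exists in the reference FASTA. The sketch below assumes plain-text, uncompressed input passed in as lines; the function names and toy data are hypothetical. It only catches naming mismatches (such as `chr1` versus `1`) and says nothing about whether the coordinates come from the same assembly version.

```python
def fasta_sequence_names(fasta_lines):
    """Collect sequence names (first word of each '>' header line)."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names

def gtf_sequence_names(gtf_lines):
    """Collect the values of column 1 (seqname) from GTF lines."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names

def names_missing_from_reference(gtf_lines, fasta_lines):
    """Sequence names used by the GTF that the FASTA does not contain."""
    return sorted(gtf_sequence_names(gtf_lines) - fasta_sequence_names(fasta_lines))

# Toy data for illustration only.
toy_fasta = [">1 toy chromosome", "ACGT", ">2", "ACGT"]
toy_gtf = [
    'chr1\ttoy\texon\t1\t4\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    '2\ttoy\texon\t1\t4\t.\t+\t.\tgene_id "g2"; transcript_id "t2";',
]

print(names_missing_from_reference(toy_gtf, toy_fasta))
```

With real files, the same functions can be fed open file handles, since they only iterate over lines.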


              • #8
                Well, camelbbs, "work" is an interesting way to put it. I can let you in on what little I know, but I'm still trying to work through these things myself. Someone else out there might have more info or insight on this issue than I do. We've looked at four different annotations so far: Gencode, Ensembl, RefSeq, and iGenomes (which should be a derivative of Ensembl, I think). We haven't really vetted Gencode or iGenomes, because Gencode (hypothetically) should be very similar to Ensembl, while iGenomes we only just found and briefly ran through some stuff.

                So that brings us to RefSeq and Ensembl, and I think most people find those two databases generally acceptable for whatever you would be interested in. Just from some rough calculations, Ensembl appears to have about twice as many nucleotides annotated as RefSeq. This is likely because of a higher level of isoform annotation in Ensembl, so some nucleotides may be doubly annotated. (NOTE: when you download the Ensembl ensGene GTF from UCSC and use it in Cufflinks, you end up with the transcript IDs for your genes, not the gene IDs. This can be very important depending on what you want.) Just open up a chromosome in hg19 in the UCSC genome browser and you can see how different they look.

                As to why they are different, again I'm not incredibly knowledgeable here, but each of these resources uses slightly different evidence to add to its database. Mostly, I think they probably vary in two ways: 1) the computational methods they use for predicted gene tracks and 2) the curation of these databases. In the past, RefSeq was more submission based, so the evidence requirements would have appeared to be higher, but that's all conjecture on my part. In the end, we've basically come to view RefSeq as more conservative and gene-oriented and Ensembl as more computationally developed and more transcript-oriented. Anyone else out there have some better insight than me? I would love to hear it.

                To answer your other question, yes, coordinates, chromosome, and ID determine genomic location and annotation.
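
The "doubly annotated" point can be made concrete by counting each annotated base only once. This is a rough sketch, not the calculation used in the post: it merges overlapping exon intervals per sequence name, ignores strand, and the function name and toy lines are invented for illustration.

```python
from collections import defaultdict

def unique_exon_bases(gtf_lines):
    """Count bases covered by at least one 'exon' feature, each base once."""
    intervals = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[2] != "exon":
            continue
        # GTF coordinates are 1-based and inclusive on both ends.
        intervals[fields[0]].append((int(fields[3]), int(fields[4])))

    total = 0
    for spans in intervals.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:      # overlapping or directly adjacent
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total

# Toy annotation: two isoforms sharing part of an exon, plus one other exon.
toy_gtf = [
    'chr1\ttoy\texon\t100\t199\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttoy\texon\t150\t249\t.\t+\t.\tgene_id "g1"; transcript_id "t2";',
    'chr2\ttoy\texon\t1\t10\t.\t+\t.\tgene_id "g2"; transcript_id "t3";',
]

print(unique_exon_bases(toy_gtf))  # 160, versus 210 if overlaps are summed naively
```

Comparing this number between two annotation files gives a fairer size comparison than summing exon lengths, since shared exons across isoforms are not counted twice.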

                  • #10
                    Coordinates and ID indeed seem to be a substantial problem and require a careful look. Working on RNA-seq in cattle, I downloaded the iGenomes NCBI/UMD31 files from http://tophat.cbcb.umd.edu/igenomes.html (Jun 20). For the RNA-seq alignment, I used the genes.gtf annotation and the whole-genome sequence file genome.fa. While analyzing the RNA-seq data, I noticed an incongruence between the two files. Things seem to be fine for some chromosomes, but for BTA14 at least there is a substantial problem.
                    Example: Region 14:21,350,000: no gene is annotated in the UMD3.1 genes.gtf file. However, I very clearly have a transcript comprising several exons from RNA-seq in that region. At the respective position, the alternative NCBI BTAU4.2 assembly has an annotated gene, RB1CC1, which exactly matches the transcript I found. However, according to the UMD3.1 genes.gtf list, this gene is annotated as starting at 14:23,147,992 in the UMD3.1 assembly. Thus, I suspect that the UMD3.1 whole-genome FASTA file genome.fa contained in the iGenomes NCBI/UMD31 download falsely contains a BTAU4.2 sequence. Has anybody else experienced similar problems?
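
The kind of lookup described here (is anything annotated at a given position?) can be sketched as below. This is an illustration only: the function name is hypothetical, and the toy line reuses the reported start coordinate for RB1CC1 while the end coordinate, strand, and transcript ID are placeholders, not values from the real genes.gtf.

```python
import re

def features_at(gtf_lines, chrom, position):
    """Return (feature, start, end, gene_id) for features covering a position."""
    hits = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[0] != chrom:
            continue
        start, end = int(fields[3]), int(fields[4])  # 1-based, inclusive
        if start <= position <= end:
            match = re.search(r'gene_id "([^"]*)"', fields[8])
            hits.append((fields[2], start, end, match.group(1) if match else None))
    return hits

# Toy line: the end coordinate (23148100), strand, and transcript_id are placeholders.
toy_gtf = [
    '14\ttoy\texon\t23147992\t23148100\t.\t+\t.\t'
    'gene_id "RB1CC1"; transcript_id "toy_tx";',
]

print(features_at(toy_gtf, "14", 21350000))  # nothing annotated here in the toy data
print(features_at(toy_gtf, "14", 23147992))
```

Running the same query against annotations from both assemblies would show whether a transcript found by RNA-seq sits where one assembly's annotation places the gene but not the other's.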
