  • Best source for GTF file for use with TopHat/Cufflinks

    I've been grabbing the "refSeq genes" table (human, hg19) from UCSC in GTF file format for use with TopHat/Cufflinks.

    I'm curious what everyone else is using, and whether there is a better GTF file to use.

  • #2
    That's what I've been using, because I can understand how to use the UCSC table browser to output a GTF file (just change the output format). Just be mindful that it updates fairly frequently, and you might discover new annotations by looking again at something you mapped a few months ago.
    Last edited by gringer; 07-13-2011, 05:07 AM. Reason: got the URL wrong

    Comment


    • #3
      Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.

      Comment


      • #4
        Originally posted by gavin.oliver View Post
        Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.
        Thanks for the advice. I will try it out today.

        Comment


        • #5
          Originally posted by gavin.oliver View Post
          Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.
          Did you get your genome and gtf from the ensembl website?

          Other than renaming the gtf and fasta header entries so that they matched, was there anything else that you had to do to make the gtf file work with tophat? I'm getting an error that my gtf doesn't contain junctions.

          Comment


          • #6
            I didn't have to do anything else, no.

            What command are you using to execute Tophat?

            Comment


            • #7
              Originally posted by gavin.oliver View Post
              I didn't have to do anything else, no.

              What command are you using to execute Tophat?
              I want to make sure that I'm not messing up anything too basic first.

              I grabbed the reference genome in fasta format from here --> ftp://ftp.ensembl.org/pub/release-63...o_sapiens/dna/

              I grabbed the associated GTF file from here --> ftp://ftp.ensembl.org/pub/release-63/gtf/homo_sapiens/

              I then process them so that the entry names match in both files.

              Here are a few lines from my GTF from ensembl:
              Code:
              chr18           protein_coding  exon    49501   49557   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     49501   49557   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  start_codon     49555   49557   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  exon    49129   49237   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     49129   49237   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  exon    48940   49050   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     48940   49050   .       -       2        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  exon    47390   48447   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     47393   48447   .       -       2        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  stop_codon      47390   47392   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           miRNA   exon    48162   48272   .       +       .        gene_id "ENSG00000221441"; transcript_id "ENST00000408514"; exon_number "1"; gene_name "AP001005.1"; transcript_name "AP001005.1-201";
              chr18           protein_coding  exon    158483  158714  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  CDS     158699  158714  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
              chr18           protein_coding  start_codon     158699  158701  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  exon    163308  163453  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  CDS     163308  163453  .       +       2        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
              chr18           protein_coding  exon    166787  166819  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  CDS     166787  166819  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
              Here are the names of my chromosomes according to bowtie-inspect:
              Code:
              chr1
              chr2
              chr3
              chr4
              chr5
              chr6
              chr7
              chr8
              chr9
              chr10
              chr11
              chr12
              chr13
              chr14
              chr15
              chr16
              chr17
              chr18
              chr19
              chr20
              chr21
              chr22
              chrX
              chrY
              chrM
              In TopHat, I get the following error:
              Code:
              [Mon Jul 18 15:15:21 2011] Reading known junctions from GTF file
                      Warning: TopHat did not find any junctions in GTF file
              In Cufflinks, I get the following error:
              Code:
              [08:34:37] Loading reference annotation.
              Error: duplicate GFF ID 'ENST00000445581' encountered!

              Comment


              • #8
                The only difference I can see between my setup and yours is that I removed the 'chr' prefixes. There was a reason for this - but I can't remember what it was!

                Comment


                • #9
                  Well, if chromosome 22 is anything to go by, it looks like the chromosome labels in the fasta file don't include the 'chr' bit.
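
One way to confirm a mismatch like this before running TopHat is to compare the sequence names in the FASTA headers with the names in column 1 of the GTF. The following is only a minimal sketch: the function names and the toy input lines are made up for illustration and are not part of TopHat or Ensembl tooling.

```python
def fasta_names(fasta_lines):
    """Return the set of sequence names (first word of each '>' header)."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                names.add(header[0])
    return names


def gtf_names(gtf_lines):
    """Return the set of names in column 1 of a GTF, skipping comments and blanks."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_lines, gtf_lines):
    """Return (only_in_fasta, only_in_gtf) as sorted lists."""
    f, g = fasta_names(fasta_lines), gtf_names(gtf_lines)
    return sorted(f - g), sorted(g - f)


# Toy data (hypothetical): an unprefixed FASTA header and a 'chr'-prefixed GTF line.
fasta = [">22 some description", "ACGT"]
gtf = ['chr22\tprotein_coding\texon\t1\t10\t.\t+\t.\tgene_id "g1"; transcript_id "t1";']
only_fasta, only_gtf = compare_names(fasta, gtf)
print(only_fasta, only_gtf)  # ['22'] ['chr22']
```

If both lists come back empty, the names agree; anything listed appears in one file but not the other.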

                  Comment


                  • #10
                    Well, I figured out the problem. I put an extra tab between the chrom name and the second column in the gtf file when I renamed everything.

                    I fixed it and all is well.
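
A stray tab like this is easy to miss by eye. A GTF line should have exactly nine tab-separated fields, so a rough check is to flag any line with a different count or with an empty field. This is a sketch only; the function name is hypothetical, and the sample lines are shortened from the ones posted earlier in the thread.

```python
def malformed_gtf_lines(gtf_lines, expected_fields=9):
    """Return (line_number, field_count) for lines that do not look like valid GTF.

    A line is flagged when it does not split into `expected_fields`
    tab-separated fields, or when any field is empty or whitespace-only.
    Comment lines and blank lines are skipped.
    """
    problems = []
    for lineno, line in enumerate(gtf_lines, start=1):
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != expected_fields or any(f.strip() == "" for f in fields):
            problems.append((lineno, len(fields)))
    return problems


attrs = 'gene_id "ENSG00000173213"; transcript_id "ENST00000308911";'
good = "chr18\tprotein_coding\texon\t49501\t49557\t.\t-\t.\t" + attrs
bad = "chr18\t\tprotein_coding\texon\t49501\t49557\t.\t-\t.\t" + attrs  # extra tab after chr18
print(malformed_gtf_lines([good, bad]))  # [(2, 10)]
```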

                    Comment


                    • #11
                      Originally posted by sdarko View Post
                      Well, I figured out the problem. I put an extra tab between the chrom name and the second column in the gtf file when I renamed everything.

                      I fixed it and all is well.
                      Glad to hear it

                      Comment


                      • #12
                        @sdarko Could you explain a little about how you went about "renaming the gtf and fasta header entries so that they matched", please?
                        I'm keen to update the GTF I use with TopHat to the Ensembl version.

                        Many thanks for any advice you may be able to give.

                        Comment


                        • #13
                          Look here: http://cufflinks.cbcb.umd.edu/igenomes.html

                          Comment


                          • #14
                            If you download both the genome FASTA and the annotation from ENSEMBL, you shouldn't need to rename anything.
                            Last edited by kopi-o; 10-24-2011, 10:31 AM. Reason: clarity
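
For anyone who does need to pair an Ensembl GTF with an index that uses 'chr'-prefixed names (as in the bowtie-inspect output earlier in the thread), the renaming could be sketched roughly as below. The function name is hypothetical, and the sketch assumes that plain prefixing plus an MT to chrM mapping is enough; names for unplaced scaffolds or haplotype patches will not map this way and would need separate handling.

```python
def add_chr_prefix(gtf_lines):
    """Yield GTF lines with column 1 rewritten from Ensembl-style to 'chr'-style names.

    Only the text before the first tab is changed; the rest of each line,
    including its tab separators, is passed through untouched. Comment
    lines and blank lines are yielded as-is.
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        name, sep, rest = line.partition("\t")
        if name == "MT":
            name = "chrM"  # Ensembl's mitochondrial name vs. the chrM used above
        elif not name.startswith("chr"):
            name = "chr" + name
        yield name + sep + rest


example = ["#comment\n", "18\tprotein_coding\texon\t49501\t49557\n", "MT\tMt_rRNA\texon\t1\t10\n"]
for out in add_chr_prefix(example):
    print(out, end="")
```

Because the line is split on the first tab only and rejoined with the same separator, the rewrite cannot introduce an extra tab of the kind described in #10.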

                            Comment


                            • #15
                              Hi, gavin,
                              I have analysed my RNA-seq data with references (both the genome and the annotation) from UCSC and from Ensembl. I found that the UCSC references give many more mapping results than the Ensembl references. What I don't understand is how this is possible.
                              What confuses me even more is that the GTF from UCSC is less than half the size of the one from Ensembl. With a smaller reference I got more results!
                              Do you have any idea?
                              Thanks!

                              Originally posted by gavin.oliver View Post
                              Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.

                              Comment
