  • sdarko
    Member
    • Apr 2009
    • 52

    Best source for GTF file for use with TopHat/Cufflinks

    I've been grabbing the "refSeq genes" table (human, hg19) from UCSC in GTF file format for use with TopHat/Cufflinks.

    I was just curious as to what everyone else is using and if I might find a more optimal GTF file to use.
  • gringer
    David Eccles (gringer)
    • May 2011
    • 845

    #2
    That's what I've been using, because I can understand how to use the UCSC table browser to output a GTF file (just change the output format). Just be mindful that it updates fairly frequently, and you might discover new annotations by looking again at something you mapped a few months ago.
    Last edited by gringer; 07-13-2011, 05:07 AM. Reason: got the URL wrong

    • gavin.oliver
      Senior Member
      • Jan 2010
      • 110

      #3
      Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.

      • sdarko
        Member
        • Apr 2009
        • 52

        #4
        Originally posted by gavin.oliver:
        Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.
        Thanks for the advice. I will try it out today.

        • sdarko
          Member
          • Apr 2009
          • 52

          #5
          Originally posted by gavin.oliver:
          Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.
          Did you get your genome and gtf from the ensembl website?

          Other than renaming the gtf and fasta header entries so that they matched, was there anything else that you had to do to make the gtf file work with tophat? I'm getting an error that my gtf doesn't contain junctions.

          • gavin.oliver
            Senior Member
            • Jan 2010
            • 110

            #6
            I didn't have to do anything else, no.

            What command are you using to execute Tophat?

            • sdarko
              Member
              • Apr 2009
              • 52

              #7
              Originally posted by gavin.oliver:
              I didn't have to do anything else, no.

              What command are you using to execute Tophat?
              I want to make sure that I'm not messing up anything too basic first.

              I grabbed the reference genome in fasta format from here --> ftp://ftp.ensembl.org/pub/release-63...o_sapiens/dna/

              I grabbed the associated GTF file from here --> ftp://ftp.ensembl.org/pub/release-63/gtf/homo_sapiens/

              I then processed them so that the sequence names match in both files.

              Here are a few lines from my GTF from ensembl:
              Code:
              chr18           protein_coding  exon    49501   49557   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     49501   49557   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  start_codon     49555   49557   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  exon    49129   49237   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     49129   49237   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  exon    48940   49050   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     48940   49050   .       -       2        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  exon    47390   48447   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     47393   48447   .       -       2        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  stop_codon      47390   47392   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           miRNA   exon    48162   48272   .       +       .        gene_id "ENSG00000221441"; transcript_id "ENST00000408514"; exon_number "1"; gene_name "AP001005.1"; transcript_name "AP001005.1-201";
              chr18           protein_coding  exon    158483  158714  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  CDS     158699  158714  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
              chr18           protein_coding  start_codon     158699  158701  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  exon    163308  163453  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  CDS     163308  163453  .       +       2        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
              chr18           protein_coding  exon    166787  166819  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  CDS     166787  166819  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
              Here are the names of my chromosomes according to bowtie-inspect:
              Code:
              chr1
              chr2
              chr3
              chr4
              chr5
              chr6
              chr7
              chr8
              chr9
              chr10
              chr11
              chr12
              chr13
              chr14
              chr15
              chr16
              chr17
              chr18
              chr19
              chr20
              chr21
              chr22
              chrX
              chrY
              chrM
              In TopHat, I get the following warning:
              Code:
              [Mon Jul 18 15:15:21 2011] Reading known junctions from GTF file
                      Warning: TopHat did not find any junctions in GTF file
              In Cufflinks, I get the following error:
              Code:
              [08:34:37] Loading reference annotation.
              Error: duplicate GFF ID 'ENST00000445581' encountered!

              • gavin.oliver
                Senior Member
                • Jan 2010
                • 110

                #8
                The only difference I can see between my setup and yours is that I removed the 'chr' prefixes. There was a reason for this - but I can't remember what it was!

                • gringer
                  David Eccles (gringer)
                  • May 2011
                  • 845

                  #9
                  Well, if chromosome 22 is anything to go by, it looks like the chromosome labels in the fasta file don't include the 'chr' bit.
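
A mismatch like this can be spotted by listing the sequence names on both sides and comparing them. The following is a minimal Python sketch, not something posted in this thread. The function names are made up for illustration, and it assumes the sequence name is the first whitespace-delimited word of each FASTA header.

```python
def fasta_names(path):
    """Return the set of sequence names found in the '>' header lines of a FASTA file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    # Assumption: the name is the first word of the header.
                    names.add(fields[0])
    return names


def gtf_seqnames(path):
    """Return the set of values in the first (seqname) column of a GTF file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and '#' comment lines.
            if line.strip() and not line.startswith("#"):
                names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_path, gtf_path):
    """Return (names only in the GTF, names only in the FASTA), each sorted."""
    fasta = fasta_names(fasta_path)
    gtf = gtf_seqnames(gtf_path)
    return sorted(gtf - fasta), sorted(fasta - gtf)
```

If the first list is non-empty, some annotation records refer to sequences that the genome file does not contain under that name, for example `chr18` in the GTF versus `18` in the FASTA.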

                  • sdarko
                    Member
                    • Apr 2009
                    • 52

                    #10
                    Well, I figured out the problem. I put an extra tab between the chrom name and the second column in the gtf file when I renamed everything.

                    I fixed it and all is well.
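
The extra-tab problem described above can be caught mechanically, because a GTF line should have exactly nine tab-separated fields (seqname, source, feature, start, end, score, strand, frame, attributes). Here is a minimal Python sketch of such a check. The function name `find_malformed_gtf_lines` is hypothetical and is not part of TopHat or Cufflinks.

```python
import sys


def find_malformed_gtf_lines(path, expected_fields=9):
    """Yield (line_number, field_count) for GTF lines whose tab-separated
    field count differs from expected_fields."""
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            # Skip blank lines and '#' comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            field_count = len(line.rstrip("\n").split("\t"))
            if field_count != expected_fields:
                yield line_number, field_count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (script name is illustrative): python check_gtf_fields.py annotation.gtf
    for line_number, field_count in find_malformed_gtf_lines(sys.argv[1]):
        print(f"line {line_number}: {field_count} fields (expected 9)")
```

A line with a doubled tab after the chromosome name would be reported with 10 fields, since the empty string between the two tabs counts as a field.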

                    • gavin.oliver
                      Senior Member
                      • Jan 2010
                      • 110

                      #11
                      Originally posted by sdarko:
                      Well, I figured out the problem. I put an extra tab between the chrom name and the second column in the gtf file when I renamed everything.

                      I fixed it and all is well.
                      Glad to hear it

                      • hbt
                        Member
                        • Jan 2011
                        • 20

                        #12
                        @sdarko Could you explain a little about how you went about "renaming the gtf and fasta header entries so that they matched", please?
                        I'm keen to update the GTF I use with TopHat to the Ensembl version.

                        Many thanks for any advice you may be able to give.

                        • shurjo
                          Senior Member
                          • Jan 2009
                          • 132

                          #13
                          Look here: http://cufflinks.cbcb.umd.edu/igenomes.html

                          • kopi-o
                            Senior Member
                            • Feb 2008
                            • 319

                            #14
                            If you download both the genome FASTA and the annotation from ENSEMBL, you shouldn't need to rename anything.
                            Last edited by kopi-o; 10-24-2011, 10:31 AM. Reason: clarity
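
When the genome and the annotation do come from different sources, the renaming asked about in #12 might look something like the Python sketch below. This is not necessarily what sdarko did. It assumes Ensembl-style names (1 to 22, X, Y, MT) are being converted to UCSC-style names (chr1 to chr22, chrX, chrY, chrM), and it leaves any other sequence name untouched. All function names are hypothetical.

```python
# Assumption: only the primary chromosomes are renamed; scaffolds and
# patches keep their original names.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def to_ucsc_name(name):
    """Map an Ensembl-style chromosome name to a UCSC-style one."""
    if name == "MT":
        return "chrM"
    if name in PRIMARY:
        return "chr" + name
    return name


def rename_gtf_line(line):
    """Rename the first column of a GTF line, keeping exactly one tab after it."""
    if not line.strip() or line.startswith("#"):
        return line
    seqname, sep, rest = line.partition("\t")
    return to_ucsc_name(seqname) + sep + rest


def rename_fasta_line(line):
    """Rename the sequence name in a FASTA header; sequence lines pass through."""
    if not line.startswith(">"):
        return line
    name, sep, description = line[1:].rstrip("\n").partition(" ")
    return ">" + to_ucsc_name(name) + sep + description + "\n"


def rewrite(in_path, out_path, rename_line):
    """Apply rename_line to every line of in_path and write the result to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(rename_line(line))
```

For example, `rewrite("in.gtf", "out.gtf", rename_gtf_line)` handles the annotation and `rewrite("in.fa", "out.fa", rename_fasta_line)` handles the genome (the file names here are placeholders). Any bowtie index would need to be rebuilt from the renamed FASTA.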

                            • HSV-1
                              Member
                              • Jul 2012
                              • 38

                              #15
                              Hi, gavin,
                              I have analysed my RNA-seq data with references (both genome and annotation) from UCSC and from Ensembl. I found that the UCSC reference gives many more mapping results than the Ensembl reference. I don't understand how this is possible.
                              What confuses me even more is that the UCSC GTF is less than half the size of the Ensembl one. With a smaller reference I got more results!
                              Do you have any idea?
                              Thanks!

                              Originally posted by gavin.oliver:
                              Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.
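
One way to make the size comparison in the post above more concrete is to count what each GTF actually contains, rather than comparing file sizes. The Python sketch below is only an illustration and does not answer the mapping question itself. The function name `summarize_gtf` is hypothetical, and it assumes attributes are written as `key "value";` pairs, as in the Ensembl lines quoted earlier in the thread.

```python
import re
from collections import Counter

# Matches attribute pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def summarize_gtf(path):
    """Count distinct gene_id and transcript_id values and lines per feature type."""
    genes, transcripts = set(), set()
    features = Counter()
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and '#' comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 9:
                continue  # ignore malformed lines in this sketch
            features[fields[2]] += 1
            attributes = dict(ATTRIBUTE.findall(fields[8]))
            if "gene_id" in attributes:
                genes.add(attributes["gene_id"])
            if "transcript_id" in attributes:
                transcripts.add(attributes["transcript_id"])
    return {
        "genes": len(genes),
        "transcripts": len(transcripts),
        "features": dict(features),
    }
```

Running this on both annotation files gives gene, transcript, and per-feature counts that can be compared side by side.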
