  • genec
    Member
    • Oct 2009
    • 13

    Ensembl gtf to gff3 for tophat

    I found a number of questions about finding a gff3 format file for use by tophat and couldn't find any good answers. I found a few gff3 converters but they were part of larger packages or online tools. Since I'd prefer something simpler, I wrote the attached gtf to gff converter for use with Ensembl's gtf file.

    Feel free to use, modify, or distribute as you need.

    Gene
    Attached Files
  • HTS
    Member
    • Nov 2009
    • 24

    #2
    Thanks a lot for the coding effort and for sharing your script! But are you aware of this one <http://song.cvs.sourceforge.net/viewvc/song/software/scripts/gtf2gff3/>, which has been out there for quite a while? If yes, any improvements upon it? That tool works fine for me, although it does require a large amount of memory...

    -- Leo

    Comment

    • genec
      Member
      • Oct 2009
      • 13

      #3
      Yes, I had tried that gtf2gff3 script, but it wasn't working right for me. Maybe I didn't configure it correctly.

      The script I posted has trivial memory requirements since it only holds one gene's worth of data in memory at once. All the exons for a gene are assumed to be located together in the gtf file, which seems to hold true for the Ensembl file. This script won't work for non-Ensembl gtf files without modification.
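
A minimal Python sketch of the one-gene-at-a-time idea described above. It is not the attached Perl script (which isn't reproduced in this thread); the names `gene_id_of` and `genes` are made up for illustration, and it assumes tab-separated GTF lines with a `gene_id` attribute in column 9 and all lines of a gene sitting next to each other.

```python
import re
from itertools import groupby

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_id_of(line):
    """Return the gene_id from a GTF line's attribute column (column 9)."""
    attributes = line.rstrip("\n").split("\t")[8]
    match = GENE_ID.search(attributes)
    if match is None:
        raise ValueError("no gene_id in line: " + line)
    return match.group(1)


def genes(lines):
    """Yield (gene_id, records) one gene at a time.

    Assumes all lines for a gene are adjacent in the input; a gene whose
    lines are scattered would be yielded as several separate groups.
    """
    data_lines = (l for l in lines if l.strip() and not l.startswith("#"))
    for gene_id, group in groupby(data_lines, key=gene_id_of):
        # Only this gene's records are held in memory at this point.
        yield gene_id, [l.rstrip("\n").split("\t") for l in group]
```

Passing an open file handle as `lines` keeps memory use proportional to the largest single gene rather than to the whole file.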

      Gene

      Comment

      • HTS
        Member
        • Nov 2009
        • 24

        #4
        I see. Thanks for the explanation! gtf2gff3 probably didn't work for you because the chromosome names weren't converted from the Ensembl convention to the UCSC convention. I had forgotten that I wrote a small script to do that (among other things, to filter the downloaded GTF file to suit my needs) before running gtf2gff3 with the default configuration. I guess the real difference is that gtf2gff3 doesn't assume any particular ordering of the lines, so it loads everything into memory and tries to figure out appropriate gene models from there. Since Ensembl GTF files do group things by gene/transcript, it is good to exploit that property.
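
For anyone who wants the chromosome renaming step, here is a rough Python sketch. It is not the script mentioned above; `ensembl_to_ucsc` and `rename_line` are hypothetical names, and it only handles plain chromosome names. Scaffolds and patches would need a proper lookup table.

```python
def ensembl_to_ucsc(seqname, mito_name="chrM"):
    """Map an Ensembl-style sequence name to a UCSC-style one.

    Assumption: plain chromosome names only (1..22, X, Y, MT, I, II, ...).
    Scaffolds and patches are not handled here.
    """
    if seqname.startswith("chr"):
        return seqname        # already UCSC-style
    if seqname == "MT":
        return mito_name      # UCSC names the mitochondrion chrM
    return "chr" + seqname


def rename_line(line, mito_name="chrM"):
    """Rewrite column 1 of a tab-separated GTF/GFF line.

    Comment lines and blank lines are passed through unchanged.
    """
    if line.startswith("#") or not line.strip():
        return line
    seqname, rest = line.split("\t", 1)
    return ensembl_to_ucsc(seqname, mito_name) + "\t" + rest
```

The `mito_name` parameter is there so the mitochondrial name can be set to whatever the reference index used for alignment calls it.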

        Comment

        • seqfast
          Member
          • Aug 2008
          • 16

          #5
          Script looks great, need help for C. elegans

          Thanks for the script; it looks great and works well for the human GTF. I'm working on C. elegans GTF files (from Ensembl), and the ENSG* strings aren't there. I'm not a regex expert, so I figured I'd ask whether it would be an easy fix to support the C. elegans GTF files. I like this script for its simplicity, though I could use the other one mentioned in this thread if need be. Here is a snippet; I've also attached it in case of formatting issues. Thanks!

          -sf

          I snoRNA exon 3747 3909 . - . gene_id "Y74C9A.6"; transcript_id "Y74C9A.6"; exon_number "1"; gene_name "Y74C9A.6"; transcript_name "NR_001477.2";
          I protein_coding exon 10095 10232 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "1"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
          I protein_coding CDS 10095 10148 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "1"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
          I protein_coding start_codon 10146 10148 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "1"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
          I protein_coding exon 9727 9846 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "2"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
          I protein_coding CDS 9727 9846 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "2"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
          I protein_coding exon 6037 6327 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "3"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
          I protein_coding CDS 6037 6327 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "3"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
          I protein_coding exon 5195 5296 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "4"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
          I protein_coding CDS 5195 5296 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "4"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
          I protein_coding exon 4124 4358 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "5"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
          I protein_coding CDS 4224 4358 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "5"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
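
A small Python sketch of attribute parsing that does not depend on the ENSG prefix, in case it helps with lines like the ones above. The pattern and the function name `parse_attributes` are illustrative assumptions, not taken from the attached script; it expects attributes of the form `key "value";` as in the snippet.

```python
import re

# One attribute looks like: key "value";
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_attributes(column9):
    """Turn a GTF attribute column into a dict of key -> value.

    No assumption is made about what the IDs look like, so both
    ENSG-style IDs and IDs such as Y74C9A.3 are accepted.
    """
    return dict(ATTRIBUTE.findall(column9))


attrs = parse_attributes(
    'gene_id "Y74C9A.6"; transcript_id "Y74C9A.6"; exon_number "1"; '
    'gene_name "Y74C9A.6"; transcript_name "NR_001477.2";'
)
print(attrs["gene_id"], attrs["transcript_name"])
```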
          Attached Files

          Comment

          • genec
            Member
            • Oct 2009
            • 13

            #6
            See the attached updated script. I modified it to work with your C. elegans file. I believe it works, but give the output a good look to make sure that everything is processed correctly.

            Gene
            Attached Files

            Comment

            • seqfast
              Member
              • Aug 2008
              • 16

              #7
              thank you!

              Thanks very much, this works well. I had something similar but was getting hung up on the details. I much appreciate people making these useful scripts available. Thanks, Gene.

              -sf

              Comment

              • mdimon
                Member
                • Jan 2010
                • 10

                #8
                thank you! (and a little bug?)

                Thanks for the script! The C. elegans version also works well for other GTF files downloaded from UCSC.

                I did notice what appears to be a little bug:
                push @trs, [@exons];
                should be added before the final
                process(@trs);

                (I am not a perl expert, I'm more of a python type, so I may be wrong, but until I added this line the last record from the GTF file didn't get printed to the GFF3 file.)
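
The same pattern, sketched in Python under the assumption that records for one transcript are adjacent in the input. The names `group_by_transcript`, `transcripts` and `exons` are illustrative stand-ins for the Perl `@trs` and `@exons`, not a translation of the actual script. The final `append` after the loop is the step that was missing: without it the last group never gets closed.

```python
def group_by_transcript(records):
    """Group (transcript_id, exon) pairs into per-transcript exon lists.

    Assumes records for one transcript are adjacent in the input.
    """
    transcripts = []
    exons = []
    current = None
    for transcript_id, exon in records:
        if current is not None and transcript_id != current:
            transcripts.append(exons)   # close the previous transcript
            exons = []
        current = transcript_id
        exons.append(exon)
    if exons:
        transcripts.append(exons)       # flush the last transcript too
    return transcripts
```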

                -- Michelle

                Comment

                • genec
                  Member
                  • Oct 2009
                  • 13

                  #9
                  Bug fix

                  That was a good catch, Michelle. I'm attaching a fixed gtf_to_gff.pl. The previous version dropped the very last gene in the gtf file.

                  Gene
                  Attached Files

                  Comment

                  • telos
                    Member
                    • Jan 2010
                    • 11

                    #10
                    MT -> chrM

                    You've omitted one thing: for compatibility with TopHat, MT in the Ensembl GTF should be changed to chrM, not chrMT.

                    Comment

                    • genec
                      Member
                      • Oct 2009
                      • 13

                      #11
                      Yeah, the MT/M thing is always an issue. Both MT and M will work, so neither one is the right one; you just have to be consistent from the beginning.

                      Gene

                      Comment

                      • telos
                        Member
                        • Jan 2010
                        • 11

                        #12
                        OK, fair enough. I encountered the problem when comparing the SAM output with the GFF file from your script. Nothing a regexp can't solve, but it would be nice nevertheless if the file produced by your script were entirely consistent with the TopHat SAM output.
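
One way to catch this kind of mismatch early is to compare the sequence names in the two files. A rough Python sketch, assuming the SAM file includes its header (`@SQ` lines with `SN:` fields) and the GFF is tab-separated; the function names are made up for illustration.

```python
def sam_sequence_names(sam_lines):
    """Collect the SN: names from the @SQ header lines of a SAM file."""
    names = set()
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first in a SAM file
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def gff_sequence_names(gff_lines):
    """Collect the column 1 names from the data lines of a GFF file."""
    return {
        line.split("\t", 1)[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }


def unmatched(gff_lines, sam_lines):
    """Return GFF sequence names that do not appear in the SAM header."""
    return sorted(gff_sequence_names(gff_lines) - sam_sequence_names(sam_lines))
```

An empty result from `unmatched` means every sequence name used in the GFF is also declared in the SAM header.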

                        Comment

                        • edge
                          Senior Member
                          • Sep 2009
                          • 199

                          #13
                          Hi telos,

                          Do you know how to make TopHat produce accepted_hits.sam?
                          When I run TopHat, it only generates accepted_hits.bam. Why is that?
                          Thanks for any advice.

                          Comment


                            • chadn737
                              Senior Member
                              • Jan 2009
                              • 392

                              #15
                              It's fairly simple to convert BAM to SAM using samtools.

                              $ samtools view -h -o accepted_hits.sam accepted_hits.bam

                              Comment
