  • How to convert GTF to RefFlat?

    My use case is as follows:

    I have aligned RNA-Seq data using tophat2 and a GTF file I got from Ensembl. Now I would like to produce alignment stats (number of reads mapping to exons, introns, and intergenic regions), and the CollectRnaSeqMetrics tool from the Picard toolset seems to be the most straightforward to use. However, it requires annotation in RefFlat format (why on Earth do there have to be so many different annotation formats??). So now I would like to convert GTF to RefFlat format. There is, apparently, a tool called gtfToGenePred which is supposed to do this, but I could not find a manual or documentation for it. Can anyone provide info on converting GTF files to RefFlat format? Alternatively, can you suggest a way I can get the alignment stats using a GTF annotation file?

  • #2
    If you just run gtfToGenePred without any arguments then it prints its usage instructions. This is by far the simplest method to generate the file you want.


    • #3
      Originally posted by dpryan
      If you just run gtfToGenePred without any arguments then it prints its usage instructions. This is by far the simplest method to generate the file you want.
      Thanks. I tried it but it does not seem to work:

      Code:
      localhost:Downloads nnikolo$ ls -lt gtf*
      -rwxr-xr-x@ 1 nnikolo  staff  12944032 19 Feb 12:18 gtfToGenePred
      localhost:Downloads nnikolo$ ./gtfToGenePred
      -bash: ./gtfToGenePred: cannot execute binary file
      localhost:Downloads nnikolo$ gtfToGenePred
      -bash: /Users/nnikolo/Downloads/gtfToGenePred: cannot execute binary file


      • #4
        chmod a+x gtfToGenePred


        • #5
          @feralBiologist: Looks like you may not have downloaded the correct binary (or did you compile this yourself?). What OS are you using?


          • #6
            Originally posted by dpryan
            chmod a+x gtfToGenePred
            Code:
            -rwxr-xr-x@ 1 nnikolo  staff  12944032 19 Feb 12:18 gtfToGenePred
            The execute bits are already set, so that doesn't seem to be the problem.


            • #7
              Yeah, I expect that GenoMax is correct and the wrong executable was downloaded. I should note that I just downloaded and tested that program on a GTF file on my workstation (running Ubuntu) and it worked fine.
              Last edited by dpryan; 02-19-2015, 06:24 AM.
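One way to test the "wrong executable" theory is to look at the first four bytes of the downloaded file: Linux executables start with the ELF magic number, while macOS executables start with one of the Mach-O magic numbers. The following is a minimal Python sketch of that check. The function names `classify_magic` and `binary_format` are made up for illustration and are not part of any UCSC tooling.

```python
# Illustrative helpers: guess the target platform of an executable from its
# leading "magic" bytes. The names are hypothetical, chosen for this sketch.
ELF_MAGIC = b"\x7fELF"
MACHO_MAGICS = {
    b"\xfe\xed\xfa\xce", b"\xce\xfa\xed\xfe",  # 32-bit Mach-O (both byte orders)
    b"\xfe\xed\xfa\xcf", b"\xcf\xfa\xed\xfe",  # 64-bit Mach-O (both byte orders)
    b"\xca\xfe\xba\xbe",                       # universal binary (also used by Java class files)
}


def classify_magic(header):
    """Classify a file from its first bytes (only the first four are used)."""
    magic = header[:4]
    if magic == ELF_MAGIC:
        return "ELF (Linux)"
    if magic in MACHO_MAGICS:
        return "Mach-O (macOS)"
    return "unknown"


def binary_format(path):
    """Read the first four bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return classify_magic(handle.read(4))
```

Calling `binary_format("gtfToGenePred")` in the download directory and getting "ELF (Linux)" on a Mac would be consistent with the "cannot execute binary file" message above. The standard `file` command reports the same information.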


              • #8
                Originally posted by GenoMax
                @feralBiologist: Looks like you may not have downloaded the correct binary (or did you compile this yourself?). What OS are you using?
                I am on MacOS X. Actually, I could not find a proper download page - just opening one of the Google search results ended up automatically downloading the binary.


                • #9
                  What about this one? http://hgdownload.cse.ucsc.edu/admin.../gtfToGenePred


                  • #10
                    Originally posted by sarvidsson
                    Thanks - I think it's the robots.txt that UCSC put in place which prevented Google from displaying that site.


                    • #11
                      I ran into trouble again. I was able to run gtfToGenePred and it produced a refFlat file from the Ensembl GTF file. However, when I tried using it with Picard's CollectRnaSeqMetrics I got:

                      Code:
                      Exception in thread "main" picard.annotation.AnnotationException: Wrong number of fields in refFlat file mm75.genePred at line 1
                              at picard.annotation.RefFlatReader.load(RefFlatReader.java:80)
                              at picard.annotation.RefFlatReader.load(RefFlatReader.java:66)
                              at picard.annotation.GeneAnnotationReader.loadRefFlat(GeneAnnotationReader.java:37)
                              at picard.analysis.CollectRnaSeqMetrics.setup(CollectRnaSeqMetrics.java:105)
                              at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:98)
                              at picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:53)
                              at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
                              at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
                              at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
                      So, gtfToGenePred has, apparently, produced an invalid file. Has anyone, actually, used gtfToGenePred successfully?
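The exception complains about the number of columns, so a first diagnostic is to count the tab-separated fields per line. According to the UCSC table schemas, a plain genePred line has 10 columns, whereas refFlat has 11 (a gene name followed by the same 10). Below is a small Python sketch of such a check. The function names are hypothetical, and reading Picard's check as "exactly 11 fields" is an assumption based on the error message rather than something confirmed in this thread.

```python
from collections import Counter

# Column counts taken from the UCSC table schemas.
GENEPRED_FIELDS = 10  # name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd,
                      # exonCount, exonStarts, exonEnds
REFFLAT_FIELDS = 11   # geneName followed by the ten genePred columns


def field_counts(lines):
    """Count how many non-empty lines have each number of tab-separated fields."""
    return Counter(
        len(line.rstrip("\r\n").split("\t")) for line in lines if line.strip()
    )


def looks_like_refflat(lines):
    """True if there is at least one line and every line has exactly 11 fields."""
    counts = field_counts(lines)
    return bool(counts) and set(counts) == {REFFLAT_FIELDS}
```

For example, opening `mm75.genePred` and passing the file handle to `field_counts` shows the column counts found in the file. If every line has 10 fields, the file is plain genePred rather than refFlat, which would explain the failure at line 1.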


                      • #12
                        While you wait for someone to sort out the refFlat issue you could try this tool: http://rseqc.sourceforge.net/

                        Hopefully it will get you stats similar to what you are looking for.


                        • #13
                          Originally posted by feralBiologist
                          Has anyone, actually, used gtfToGenePred successfully?
                          If anyone is interested - here is the GTF I used, the refFlat produced, and here are the command line arguments I used for CollectRnaSeqMetrics:

                          Code:
                          java -jar picard.jar CollectRnaSeqMetrics INPUT=sample.sam REF_FLAT=mm75.genePred STRAND=NONE MINIMUM_LENGTH=100 CHART=chart.out OUTPUT=accepted_hits_sam.stat
                          I'd be interested if anyone can make CollectRnaSeqMetrics work using these input files.
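A possible workaround, sketched here under assumptions and not tested against the files above: gtfToGenePred lists a `-genePredExt` option in its usage message (please verify this against your version), which writes extended genePred output where column 12 (`name2`) holds a gene identifier. Moving that column in front of the first ten columns gives the 11-column refFlat layout. The function names and the file names in the usage note are hypothetical.

```python
def genepredext_to_refflat(lines):
    """Yield refFlat lines (name2 + first 10 columns) from genePredExt lines."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 12:
            raise ValueError(
                f"line {number}: expected at least 12 columns, found {len(fields)}"
            )
        # Column 12 (index 11) is name2; columns 1-10 are the genePred core.
        yield "\t".join([fields[11]] + fields[:10]) + "\n"


def convert_file(source_path, target_path):
    """Convert a genePredExt file on disk into a refFlat file."""
    with open(source_path) as source, open(target_path, "w") as target:
        target.writelines(genepredext_to_refflat(source))
```

With these definitions, `convert_file("mm75.genePredExt", "mm75.refFlat")` would write a file that can be passed to `REF_FLAT=`. Whether CollectRnaSeqMetrics then accepts it also depends on the chromosome names matching those in the SAM header.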


                          • #14
                            Based on the answer Pierre provided: https://www.biostars.org/p/107779/

                            Try your luck with this refFlat.txt file: http://hgdownload.cse.ucsc.edu/golde...refFlat.txt.gz

                            Note: If you need Ensembl IDs then this would not help. At least this file does not produce an error with CollectRnaSeqMetrics.

                            Comment


                            • #15
                              Originally posted by GenoMax
                              Based on the answer Pierre provided: https://www.biostars.org/p/107779/

                              Try your luck with this refFlat.txt file: http://hgdownload.cse.ucsc.edu/golde...refFlat.txt.gz

                              Note: If you need Ensembl IDs then this would not help. At least this file does not produce an error with CollectRnaSeqMetrics.
                              Unfortunately, I did quite a bit of work already using the Ensembl GTF - I would have to start afresh and discard all the work done so far as I can't be certain that the Ensembl version I am using (75) has an exact UCSC equivalent. Seems to me to be too much of a hassle if the only bit missing is the stats on intergenic regions and introns.
