Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GTF reference files that work with TopHat/Cufflinks

    Maybe it's just me, but the only GTF reference file that seems to work in TopHat (when using the -G option that is) is refFlat (RefSeq) downloaded from the UCSC Table Browser. Any other GTF file I download from there (e.g., ensGene) or ENSEMBL (following the recommendation in the TopHat website) just doesn't work (i.e., I get the "TopHat did not find any junctions in GTF file" warning) .

    I know the problem has popped up in this forum multiple times. Therefore, I would like to get as much feedback as possible as to which GTF files people are using for their RNAseq analysis. And whether anybody has managed to get the ENSEMBL gene annotation to work.

    Thank you in advance for your help and consideration.
    Last edited by marcora; 12-06-2010, 01:09 PM.

  • #2
    There seem to be many conditions under which various GTF files will cause both tophat and cufflinks to unceremoniously segfault when analyzing stranded RNA-seq data. I've been bisecting my sequence files to try to find the reads that are making these applications unhappy, but nothing has completely helped me other than moving from ENSEMBL GTF to UCSC refFlat as you suggested. Some official guidelines on GTF files and more meaningful error messages would be very useful.

    Comment


    • #3
      Originally posted by bcalder View Post
      There seem to be many conditions under which various GTF files will cause both tophat and cufflinks to unceremoniously segfault when analyzing stranded RNA-seq data. I've been bisecting my sequence files to try to find the reads that are making these applications unhappy, but nothing has completely helped me other than moving from ENSEMBL GTF to UCSC refFlat as you suggested. Some official guidelines on GTF files and more meaningful error messages would be very useful.
      I don't quite understand yet why (even with the correct naming of chromosomes etc in the bowtie index) tophat and cufflinks only seem to work with refFlat against mm9, instead of with any well-formed GTF file (such as the Ensembl one).

      Feedback from tophat/cufflinks authors would be very welcome!

      Comment


      • #4
        I have been struggling with the same issue and only refflat is working. I have even posted several messages on this forum. I will join in requesting authors of Tophat and Cufflinker to please share some information even if it is some issue which cannot be corrected, so that other researchers should not waste time in trying to which is not possible and may seek some alternatives. Alternatively in the best interest of scientific community authors may like to share, if some improvement is being made which can take care of this problem made to address this issue.
        Thanks.

        Comment


        • #5
          Have you tried renaming the chromosomes? If you are aligning against the UCSC hg19 genome chromosomes will have names such as chr1, chr2, ..., chrX, chrY, but Ensembl denotes them by 1, 2, ..., X, Y. Only exception is chrM which Ensembl denotes MT.

          Try runnings this on your Ensembl GTF:
          $ awk '{print "chr"$0}' Homo_sapiens.GRCh37.56.gtf | sed 's/chrMT/chrM/g' > hg19.ensembl-for-tophat.gtf

          The UCSC gtf files work out of the box if you are using a UCSC fasta reference genome.

          EDIT
          Just saw that you had already tried with renaming the chromosomes (although in the bowtie index). Just be sure you have used the same chromosome names right from the beginning. Simply fixing the Ensembl GTF chromosome names with the above command was enough for it to work on my station.
          Last edited by Thomas Doktor; 01-06-2011, 07:15 AM.

          Comment


          • #6
            Originally posted by Thomas Doktor View Post
            Have you tried renaming the chromosomes? If you are aligning against the UCSC hg19 genome chromosomes will have names such as chr1, chr2, ..., chrX, chrY, but Ensembl denotes them by 1, 2, ..., X, Y. Only exception is chrM which Ensembl denotes MT.

            Try runnings this on your Ensembl GTF:
            $ awk '{print "chr"$0}' Homo_sapiens.GRCh37.56.gtf | sed 's/chrMT/chrM/g' > hg19.ensembl-for-tophat.gtf

            The UCSC gtf files work out of the box if you are using a UCSC fasta reference genome.

            EDIT
            Just saw that you had already tried with renaming the chromosomes (although in the bowtie index). Just be sure you have used the same chromosome names right from the beginning. Simply fixing the Ensembl GTF chromosome names with the above command was enough for it to work on my station.
            I have already tried that before, but just in case I re-tried today.

            1) I downloaded the latest gtf file for mus musculus from ftp://ftp.ensembl.org/pub/current/gtf/

            2) I ran your awk one-liner on it

            3) I bowtie-inspected my mm9 index to make sure chromosome names match and I get the following output:
            chr1
            chr2
            chr3
            chr4
            chr5
            chr6
            chr7
            chr8
            chr9
            chr10
            chr11
            chr12
            chr13
            chr14
            chr15
            chr16
            chr17
            chr18
            chr19
            chrX
            chrY
            chrM

            4) I ran tophat (with -G and --no-novel-juncs options) using the mm9 index and the modified gtf file and got the same error as before!!!

            "Reading known junctions from GTF file
            Warning: TopHat did not find any junctions in GTF file"


            Has anybody had any luck running tophat using mm9 and ensembl gtf???

            Thanx in advance for your help and consideration,

            Dado

            p.s.: I run bowtie v0.12.7.0 and tophat v1.1.4, the mm9 bowtie index is the one from the bowtie home page

            I have also run both the original and modified GTF file through this GTF validator and obtained the following output, identical for both files:

            Code:
            Transcript ENSMUST00000146893 contains no CDS features.
            Missing stop_codon for transcript "ENSMUST00000132028".
            Transcript ENSMUST00000125766 contains no CDS features.
            Transcript ENSMUST00000123080 contains no CDS features.
            Transcript ENSMUST00000134441 contains no CDS features.
            Transcript ENSMUST00000119451 contains no CDS features.
            Missing start_codon for transcript "ENSMUST00000110970".
            Missing stop_codon for transcript "ENSMUST00000046878".
            Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000156610".
            Missing start_codon for transcript "ENSMUST00000057060".
            Missing stop_codon for transcript "ENSMUST00000130267".
            Missing stop_codon for transcript "ENSMUST00000125833".
            Missing stop_codon for transcript "ENSMUST00000136539".
            Missing start_codon for transcript "ENSMUST00000126611".
            Missing start_codon for transcript "ENSMUST00000150269".
            2 short (<20bp) intron(s) found on in transcript "ENSMUST00000090094".
            Missing start_codon for transcript "ENSMUST00000027440".
            1 short (<20bp) intron(s) found on in transcript "ENSMUST00000100381".
            1 short (<20bp) intron(s) found on in transcript "ENSMUST00000111825".
            1 short (<20bp) intron(s) found on in transcript "ENSMUST00000112718".
            1 short (<20bp) intron(s) found on in transcript "ENSMUST00000055026".
            Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000112673".
            Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000121280".
            Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000103276".
            Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000115127".
            Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000112382".
            Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000120549".
            Complete transcript length not a multiple of three on transcript_id = "ENSMUST00000117830".
            Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000093816".
            Start Codon length is not three in transcript "ENSMUST00000144976".
            Inconsistent frame accross transcript_id = "ENSMUST00000144976".
            Stop codon location not consistent with terminal exon locaiton in transcript_id = "ENSMUST00000133425".
            Start Codon length is not three in transcript "ENSMUST00000139210".
            Inconsistent frame accross transcript_id = "ENSMUST00000139210".
            Start Codon length is not three in transcript "ENSMUST00000156223".
            Inconsistent frame accross transcript_id = "ENSMUST00000156223".
            Start Codon length is not three in transcript "ENSMUST00000127994".
            Inconsistent frame accross transcript_id = "ENSMUST00000127994".
            Start Codon length is not three in transcript "ENSMUST00000123108".
            Inconsistent frame accross transcript_id = "ENSMUST00000123108".
            
            Warnings encountered:
            Count	Description
            37227	Transcript contains no CDS features.
            23	Start Codon length is not three.
            4929	Transcript has no start codon.
            7599	Transcript has no stop codon.
            92	Stop codon location inconsistent with terminal exon location.
            192	Complete transcript length not multiple of 3.
            23	Inconsistant phase value accross transcript.
            916	Transcript contains short (<20bp) intron(s).
            
            Statistics:
            	36536 genes with 88186 transcripts containing 436860 cds.
            Bad Genes:
            Last edited by marcora; 01-06-2011, 07:53 AM.

            Comment


            • #7
              I tried one more thing:

              1) I loaded the awk-modified GTF file as a custom track into the UCSC genome browser against NCBI37/mm9. The browser complained about annotations starting with chrNT and did not load.

              2) I cleaned the GTF file using
              Code:
              awk '/^chr[1-9XYM]|^chr1[0-9]/' mm.ncbim37.60.gtf > mm.ncbim37.60.clean.gtf
              3) Reloaded the GTF file into the UCSC genome browser... everything works fine with the "cleaned" GTF file

              ... too bad that even with the "cleaned" GTF file tophat still gives the same error!


              Originally posted by marcora View Post
              I have already tried that before, but just in case I re-tried today.

              1) I downloaded the latest gtf file for mus musculus from ftp://ftp.ensembl.org/pub/current/gtf/

              2) I ran your awk one-liner on it

              3) I bowtie-inspected my mm9 index to make sure chromosome names match and I get the following output:
              chr1
              chr2
              chr3
              chr4
              chr5
              chr6
              chr7
              chr8
              chr9
              chr10
              chr11
              chr12
              chr13
              chr14
              chr15
              chr16
              chr17
              chr18
              chr19
              chrX
              chrY
              chrM

              4) I ran tophat (with -G and --no-novel-juncs options) using the mm9 index and the modified gtf file and got the same error as before!!!

              "Reading known junctions from GTF file
              Warning: TopHat did not find any junctions in GTF file"


              Has anybody had any luck running tophat using mm9 and ensembl gtf???

              Thanx in advance for your help and consideration,

              Dado
              Last edited by marcora; 01-06-2011, 08:42 AM.

              Comment


              • #8
                There seems to be an error (possibly a duplication) with the entries for the ENSMUST00000127664 transcript in the Ensembl mm9 gtf file.

                You can correct your Ensembl file this way (I have succesfully used this with Tophat 1.1.4):
                $ awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | grep -v "ENSMUST00000127664" > mm9.Ensembl.gtf

                Note that this will of course remove the information about the ENSMUST00000127664 transcript, so it's just a quick fix, you should probably look more into the gtf file to try and correct it.

                EDIT
                TopHat itself is a bit silent about the error causing this, but if you run the junction finding script (gtf_juncs, located in the TopHat bin folder), it will report a much more meaningful error message:
                gtf_juncs v1.1.4 (1709)
                ---------------------------
                Error: duplicate GFF ID 'ENSMUST00000127664' (or exons too far apart)!
                Last edited by Thomas Doktor; 01-06-2011, 09:22 AM. Reason: How to find the error

                Comment


                • #9
                  Originally posted by Thomas Doktor View Post
                  There seems to be an error (possibly a duplication) with the entries for the ENSMUST00000127664 transcript in the Ensembl mm9 gtf file.

                  You can correct your Ensembl file this way (I have succesfully used this with Tophat 1.1.4):
                  $ awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | grep -v "ENSMUST00000127664" > mm9.Ensembl.gtf

                  Note that this will of course remove the information about the ENSMUST00000127664 transcript, so it's just a quick fix, you should probably look more into the gtf file to try and correct it.

                  EDIT
                  TopHat itself is a bit silent about the error causing this, but if you run the junction finding script (gtf_juncs, located in the TopHat bin folder), it will report a much more meaningful error message:
                  First and foremost, thank you for your help and patience.

                  I removed the offending lines as you indicated and now my GTF file passes the gtf_juncs test without warnings or errors.

                  However, tophat still gives the same warning!

                  Comment


                  • #10
                    TopHat did not produce any errors here. What commands did you use?

                    Comment


                    • #11
                      Originally posted by Thomas Doktor View Post
                      TopHat did not produce any errors here. What commands did you use?
                      Code:
                      #!/bin/bash
                      
                      GENOME=mm9
                      GENES=mm9.ensembl
                      
                      for f in $( ls *.fq ); do
                          if [ -f $f ]; then
                              mkdir $f.tophat
                              tophat -p 8 --library-type fr-unstranded --solexa1.3-quals -G $BOWTIE_INDEXES/$GENES.gtf --no-novel-juncs -o $f.tophat $BOWTIE_INDEXES/$GENOME $f
                          fi
                      done

                      Comment


                      • #12
                        And you are sure the mm9.ensembl.gtf is the corrected one?

                        Comment


                        • #13
                          Originally posted by marcora View Post
                          First and foremost, thank you for your help and patience.

                          I removed the offending lines as you indicated and now my GTF file passes the gtf_juncs test without warnings or errors.

                          However, tophat still gives the same warning!
                          My bad! This last time when I ran tophat I didn't rename the cleaned GTF file to "mm9.ensembl" and therefore tophat couldn't find it. Surprisingly, instead of reporting a missing file, tophat gave me the same exact warning as before.

                          In conclusion, with the squeaky clean GTF file obtained from Mus_musculus.NCBIM37.60.gtf as such:

                          Code:
                          awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | awk '/^chr[1-9XYM]|^chr1[0-9]/' | grep -v "ENSMUST00000127664" > mm9.ensembl.gtf
                          I am finally able to run tophat against the ENSEMBL annotation.

                          You are my hero!

                          Thank you very much for your help.

                          Comment


                          • #14
                            Thanks macrora and Thomas,
                            I was able to use Enseml GTF file finally. One question (may be trivial) in the HG18 build there are several entries in the chromosome column like c6_COX, c6_QBL (haplotypes assembly) as well as NT_113944, 113923, 112884 and so on. How we deal with them in RNA seq. Should we just skip them in RNA seq analysis or there is any other suggestion please.

                            Comment


                            • #15
                              Suggestion to moderator

                              I would suggest making this thread sticky. This problem of using Ensembl file in RNA seq analysis has been raised several times here and in other forums. So that other fellow researchers should not waste too much time and are benefited.
                              Last edited by honey; 01-09-2011, 10:59 PM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              47 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X