  • tophat -G gene model annotations GTF format?

    Hi

    I use -G to supply a GTF file, but TopHat shows:

    Warning: TopHat did not find any junctions in GTF file

    I wonder what is wrong with my GTF file...

    This is from my GTF file:

    Chr1 SZ gene 1903 9817 . + . gene_id "Os01g01010";
    Chr1 SZ transcript 1903 9817 . + . gene_id "Os01g01010"; transcript_id "Os01g01010.1";
    Chr1 SZ exon 1903 2268 . + . gene_id "Os01g01010"; transcript_id "Os01g01010.1";
    Chr1 SZ exon 2354 2448 . + . gene_id "Os01g01010"; transcript_id "Os01g01010.1";
    Chr1 SZ exon 2449 2616 . + 0 gene_id "Os01g01010"; transcript_id "Os01g01010.1";

  • #2
    Hi,

    I don't know if it matters, but the exon lines in your GTF file don't have the 'exon_number' attribute in the attributes column (rightmost). I'm not sure if TopHat needs 'exon_number' to determine where the splice junctions are. The GTF I use looks like this:

    Code:
    5	protein_coding	exon	60680	60854	.	-	.	 gene_id "ENSSSCG00000000001"; transcript_id "ENSSSCT00000000001"; exon_number "1";
    5	protein_coding	CDS	60680	60854	.	-	0	 gene_id "ENSSSCG00000000001"; transcript_id "ENSSSCT00000000001"; exon_number "1"; protein_id "ENSSSCP00000000001";
    5	protein_coding	exon	59106	59218	.	-	.	 gene_id "ENSSSCG00000000001"; transcript_id "ENSSSCT00000000001"; exon_number "2";
    5	protein_coding	CDS	59106	59218	.	-	2	 gene_id "ENSSSCG00000000001"; transcript_id "ENSSSCT00000000001"; exon_number "2"; protein_id "ENSSSCP00000000001";
    Where did you get your GTF from?

    All the best
    Dario
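On the 'exon_number' question: going by coordinates alone, splice junctions can be worked out from the exon records of each transcript without that attribute. The following is a minimal Python sketch of that idea, not TopHat's actual code. The helper names `transcript_id_of` and `implied_junctions` are made up for illustration, and the input is assumed to be tab-separated.

```python
from collections import defaultdict


def transcript_id_of(attributes):
    """Pull the transcript_id value out of a GTF attributes column, or None."""
    for part in attributes.split(";"):
        tokens = part.strip().split(None, 1)
        if len(tokens) == 2 and tokens[0] == "transcript_id":
            return tokens[1].strip().strip('"')
    return None


def implied_junctions(gtf_lines):
    """Return sorted (seqname, intron_start, intron_end) tuples implied by exon records."""
    exons = defaultdict(list)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9 or fields[2] != "exon":
            continue
        tid = transcript_id_of(fields[8])
        if tid is not None:
            exons[(fields[0], tid)].append((int(fields[3]), int(fields[4])))

    junctions = set()
    for (seqname, _tid), spans in exons.items():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start > prev_end + 1:  # a gap between consecutive exons is an intron
                junctions.add((seqname, prev_end + 1, next_start - 1))
    return sorted(junctions)


# The three exon records from the first post, rebuilt with tab separators
# (an assumption: the forum shows them with spaces).
attrs = 'gene_id "Os01g01010"; transcript_id "Os01g01010.1";'
example = [
    "\t".join(["Chr1", "SZ", "exon", start, end, ".", "+", ".", attrs])
    for start, end in [("1903", "2268"), ("2354", "2448"), ("2449", "2616")]
]
print(implied_junctions(example))  # [('Chr1', 2269, 2353)]
```

Only one junction comes out because the second and third exons (2354-2448 and 2449-2616) are directly adjacent, so there is no gap between them.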


    • #3
      Thanks dariober,

      It seems the exon number is not the problem.

      My GTF has genes on chromosome0 (unassembled sequence), but the reference genome (bowtie index) does not include it. Removing the chromosome0 genes from the GTF, or adding chromosome0 to the reference genome, solved the problem.
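A mismatch like this can be spotted before running TopHat by listing the sequence names used in the GTF that are absent from the reference. Below is a rough Python sketch. The function names are hypothetical, the reference names are assumed to have been collected separately (for example from the output of bowtie-inspect -n), and "Chr0" and the gene IDs are placeholders standing in for the real names.

```python
def gtf_seqnames(gtf_lines):
    """Collect the first column (sequence name) of every non-comment GTF line."""
    names = set()
    for line in gtf_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def seqnames_missing_from_reference(gtf_lines, reference_names):
    """Return GTF sequence names that do not appear among the reference names."""
    return sorted(gtf_seqnames(gtf_lines) - set(reference_names))


# Placeholder data: "Chr0" stands in for the unassembled chromosome0 records.
gtf = [
    'Chr0\tSZ\texon\t100\t200\t.\t+\t.\tgene_id "g0"; transcript_id "g0.1";\n',
    'Chr1\tSZ\texon\t1903\t2268\t.\t+\t.\tgene_id "Os01g01010"; transcript_id "Os01g01010.1";\n',
]
reference = ["Chr1", "Chr2"]  # e.g. one name per line from bowtie-inspect -n
print(seqnames_missing_from_reference(gtf, reference))  # ['Chr0']
```

An empty list means every sequence name in the GTF has a counterpart in the reference. The comparison is exact, so "1" and "chr1" count as different names.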


      • #4
        Hi Dario,

        it looks like you are using the ENSEMBL gtf file from here, is that correct?

        I am trying to make it work with mm9 or m_musculus_ncbi37 bowtie indexes from the bowtie website without any luck (I am still getting the "TopHat did not find any junctions in GTF file" warning).

        What bowtie index are you using? If you made your own, could you share how?

        Thank you very much!


        • #5
          chromosome name issue?

          The ENSEMBL gtf is missing the "chr" in front of the chromosome number that is present in the bowtie indexes and the reference genome (fasta format). Try adding "chr" and see if it works then.
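For anyone who wants to try this, here is a small Python sketch that prepends "chr" to the first column of each record. The function name `add_chr_prefix` is made up, and the input is assumed to be tab-separated. Note that some names differ by more than the prefix (Ensembl calls the mitochondrial sequence MT while UCSC uses chrM), so a plain prefix may not be enough for every record.

```python
def add_chr_prefix(gtf_lines, prefix="chr"):
    """Yield GTF lines with the prefix added to the sequence name column."""
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            yield line  # pass blank lines and comments through unchanged
            continue
        seqname, sep, rest = line.partition("\t")
        if not seqname.startswith(prefix):
            seqname = prefix + seqname
        yield seqname + sep + rest


# One record in the style of the Ensembl GTF quoted earlier in the thread.
record = '5\tprotein_coding\texon\t60680\t60854\t.\t-\t.\tgene_id "ENSSSCG00000000001"; transcript_id "ENSSSCT00000000001"; exon_number "1";\n'
converted = list(add_chr_prefix([record]))
print(converted[0].split("\t")[0])  # chr5
```

Records that already carry the prefix are left alone, so running the function twice does not produce "chrchr5".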


          • #6
            Originally posted by epigen:
            The ENSEMBL gtf is missing the "chr" in front of the chromosome number that is present in the bowtie indexes and the reference genome (fasta format). Try adding "chr" and see if it works then.
            This worked for me when I was trying to use a gtf from Ensembl.


            • #7
              Originally posted by epigen:
              The ENSEMBL gtf is missing the "chr" in front of the chromosome number that is present in the bowtie indexes and the reference genome (fasta format). Try adding "chr" and see if it works then.
              Does that mean that you are using the mm9 prepackaged bowtie index which contains chr1,chr2,etc?

              Thank you for your suggestion.


              • #8
                Originally posted by marcora:
                Does that mean that you are using the mm9 prepackaged bowtie index which contains chr1,chr2,etc?
                I don't use it, I built my own, but the Bowtie homepage says "M. musculus, UCSC mm9", which is the same genome I'm using, with chr1,chr2,etc. NCBI has the same format as far as I know, only Ensembl makes an exception.


                • #9
                  Originally posted by epigen:
                  I don't use it, I built my own, but the Bowtie homepage says "M. musculus, UCSC mm9", which is the same genome I'm using, with chr1,chr2,etc. NCBI has the same format as far as I know, only Ensembl makes an exception.
                  Adding chr in front of each line of the ENSEMBL GTF file doesn't fix the problem.

                  Any other idea?


                  • #10
                    I have the same problem. I made my own index using the GRCh37 genome downloaded from Ensembl. The chromosome names, when I check with bowtie-inspect -n, are 1,2,3...X,Y, and the names in the Ensembl GTF file are the same, but I get the same message (Warning: TopHat did not find any junctions in GTF file). I have also used the UCSC index and GTF file, and that works. This is the Ensembl GTF file:


                    11 pseudogene exon 75780 76143 . + . gene_id "ENSG00000253826"; transcript_id "ENST00000519787"; exon_number "1"; gene_name "RP11-304M2.1"; transcript_name "RP11-304M2.1-001";
                    11 processed_transcript exon 86612 87605 . - . gene_id "ENSG00000224777"; transcript_id "ENST00000521196"; exon_number "1"; gene_name "AC069287.4"; transcript_name "AC069287.4-002";
                    11 processed_transcript exon 86649 87586 . - . gene_id "ENSG00000224777"; transcript_id "ENST00000424047"; exon_number "1"; gene_name "AC069287.4"; transcript_name "AC069287.4-001";
                    11 protein_coding exon 129060 129388 . - . gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "1"; gene_name "AC069287.3"; transcript_name "AC069287.3-201";
                    11 protein_coding CDS 129060 129388 . - 0 gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "1"; gene_name "AC069287.3"; transcript_name "AC069287.3-201"; protein_id "ENSP00000372234";
                    11 protein_coding start_codon 129386 129388 . - 0 gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "1"; gene_name "AC069287.3"; transcript_name "AC069287.3-201";
                    11 protein_coding exon 127926 128376 . - . gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "2"; gene_name "AC069287.3"; transcript_name "AC069287.3-201";
                    11 protein_coding CDS 127929 128376 . - 1 gene_id "ENSG00000230724"; transcript_id "ENST0
                    and this is the UCSC:

                    chr1 hg19_ensGene exon 66999066 66999090 0.000000 + . gene_id "ENST00000237247"; transcript_id "ENST00000237247";
                    chr1 hg19_ensGene start_codon 67000042 67000044 0.000000 + . gene_id "ENST00000237247"; transcript_id "ENST00000237247";
                    chr1 hg19_ensGene CDS 67000042 67000051 0.000000 + 0 gene_id "ENST00000237247"; transcript_id "ENST00000237247";
                    chr1 hg19_ensGene exon 66999929 67000051 0.000000 + . gene_id "ENST00000237247"; transcript_id "ENST00000237247";
                    chr1 hg19_ensGene CDS 67091530 67091593 0.000000 + 2 gene_id "ENST00000237247"; transcript_id "ENST00000237247";
                    chr1 hg19_ensGene exon 67091530 67091593 0.000000 + . gene_id "ENST00000237247"; transcript_id "ENST00000237247";
                    chr1 hg19_ensGene CDS 67098753 67098777 0.000000 + 1 gene_id "ENST00000237247"; transcript_id "ENST00000237247";
                    chr1 hg19_ensGene exon 67098753 67098777 0.000000 + . gene_id "ENST00000237247"; transcript_id "ENST00000237247";
                    Apart from the chromosome names and the attributes in the rightmost column, all fields are the same, except for the 6th column, which is a dot in the Ensembl GTF and "0.000000" in the UCSC one. I do not know whether this field is important or not.

                    Does anyone use Ensembl GTF file with success?

                    Thanks
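On the 6th column: in the GTF format that column is the score, and a dot is the usual placeholder for "no score". Whether TopHat looks at it is not something this thread establishes. To compare the two files field by field rather than by eye, a small parser like the Python sketch below can help. The names `GTF_COLUMNS` and `parse_gtf_line` are made up for illustration, the records are assumed to be tab-separated, and the Ensembl attributes are shortened.

```python
GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes")


def parse_gtf_line(line):
    """Split one GTF record into a dict, with the attributes column as a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GTF_COLUMNS, fields))
    attributes = {}
    for part in record["attributes"].split(";"):
        key, _, value = part.strip().partition(" ")
        if key:
            attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record


# One record from each file in the post above, rebuilt with tab separators.
ensembl = "\t".join(["11", "protein_coding", "exon", "129060", "129388", ".", "-", ".",
                     'gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "1";'])
ucsc = "\t".join(["chr1", "hg19_ensGene", "exon", "66999066", "66999090", "0.000000", "+", ".",
                  'gene_id "ENST00000237247"; transcript_id "ENST00000237247";'])

for name, line in (("Ensembl", ensembl), ("UCSC", ucsc)):
    rec = parse_gtf_line(line)
    print(name, rec["seqname"], repr(rec["score"]), sorted(rec["attributes"]))
```

A record that is separated by spaces instead of tabs raises a ValueError here, which is a quick way to catch that kind of formatting problem as well.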


                    • #11
                      @Bacilo:

                      I'm not sure if I understand, but did you try changing the chromosome field in the Ensembl GTF to "chrX"?


                      • #12
                        The index and the GTF file have the same chromosome names, both without "chr", but I am going to try changing both.

                        thanks


                        • #13
                          For me, adding "chr" to the chromosome field definitely fixed the problem.


                          • #14
                            I will tell you if that works. thanks


                            • #15
                              Originally posted by Bacilo:
                              Does anyone use Ensembl GTF file with success?
                              After much struggling and with the help of a member of this forum I have finally been able to use Ensembl GTF files with TopHat.

                              Please find a detailed answer to your problem here!

                              Good luck!
