SEQanswers

Old 11-30-2010, 08:43 PM   #1
silin284
Member
 
Location: ny

Join Date: Jul 2009
Posts: 23
Default tophat -G gene model annotations GTF format?

Hi

I use -G to supply a GTF file, but TopHat shows:

Warning: TopHat did not find any junctions in GTF file

I wonder what is wrong with my GTF file...

This is from my GTF file:

Chr1 SZ gene 1903 9817 . + . gene_id "Os01g01010";
Chr1 SZ transcript 1903 9817 . + . gene_id "Os01g01010"; transcript_id "Os01g01010.1";
Chr1 SZ exon 1903 2268 . + . gene_id "Os01g01010"; transcript_id "Os01g01010.1";
Chr1 SZ exon 2354 2448 . + . gene_id "Os01g01010"; transcript_id "Os01g01010.1";
Chr1 SZ exon 2449 2616 . + 0 gene_id "Os01g01010"; transcript_id "Os01g01010.1";
Old 12-01-2010, 01:03 AM   #2
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Hi,

I don't know if it matters, but the lines with exon feature in your GTF file don't have the attribute 'exon_number' in the attributes column (rightmost). I'm not sure if Tophat needs the 'exon_number' to determine where the splice junctions are. The GTF I use looks like this:

Code:
5	protein_coding	exon	60680	60854	.	-	.	 gene_id "ENSSSCG00000000001"; transcript_id "ENSSSCT00000000001"; exon_number "1";
5	protein_coding	CDS	60680	60854	.	-	0	 gene_id "ENSSSCG00000000001"; transcript_id "ENSSSCT00000000001"; exon_number "1"; protein_id "ENSSSCP00000000001";
5	protein_coding	exon	59106	59218	.	-	.	 gene_id "ENSSSCG00000000001"; transcript_id "ENSSSCT00000000001"; exon_number "2";
5	protein_coding	CDS	59106	59218	.	-	2	 gene_id "ENSSSCG00000000001"; transcript_id "ENSSSCT00000000001"; exon_number "2"; protein_id "ENSSSCP00000000001";
Where did you get your GTF from?

All the best
Dario
Old 12-03-2010, 10:40 AM   #3
silin284
Member
 
Location: ny

Join Date: Jul 2009
Posts: 23
Default

Thanks dariober,

It seems the exon number is not the problem.
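
To illustrate why exon_number need not matter: junctions can in principle be derived from the exon coordinates alone, by grouping exon lines on transcript_id and sorting them. The sketch below is only that, a sketch. It assumes tab-separated GTF lines, the function name is made up, and it is not a description of what TopHat does internally.

```python
from collections import defaultdict


def junctions_from_exons(gtf_lines):
    """Derive intron (junction) intervals from GTF exon lines.

    Returns {(seqname, transcript_id): [(intron_start, intron_end), ...]}
    using 1-based inclusive coordinates. exon_number is never consulted.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        # Pull transcript_id "..." out of the attribute column.
        transcript_id = None
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("transcript_id"):
                parts = attr.split('"')
                if len(parts) >= 2:
                    transcript_id = parts[1]
        if transcript_id is None:
            continue
        exons[(fields[0], transcript_id)].append((int(fields[3]), int(fields[4])))

    junctions = {}
    for key, intervals in exons.items():
        intervals.sort()
        introns = []
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            # Directly adjacent exons leave no gap, so no junction.
            if next_start - prev_end > 1:
                introns.append((prev_end + 1, next_start - 1))
        junctions[key] = introns
    return junctions
```

On the three exon lines posted above, this yields a single junction, because the second and third exons are directly adjacent.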

My GTF has genes on chromosome0 (unassembled sequences), but the reference genome (Bowtie index) does not contain chromosome0. Removing the chromosome0 genes from the GTF, or adding chromosome0 to the reference genome, solved the problem.
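
A minimal sketch of how one might check for this kind of mismatch before running TopHat. It assumes tab-separated GTF lines and a list of reference sequence names obtained separately (for example, from the output of `bowtie-inspect -n`). The function names and the "Chr0" name in the usage example are hypothetical.

```python
def gtf_seqnames(gtf_lines):
    """Collect the distinct values of column 1 (sequence name) from GTF lines."""
    names = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def missing_from_reference(gtf_lines, reference_names):
    """Sequence names used in the GTF that the reference does not have."""
    return sorted(gtf_seqnames(gtf_lines) - set(reference_names))


def drop_unknown_seqnames(gtf_lines, reference_names):
    """Keep comment lines and records whose sequence name is in the reference."""
    ref = set(reference_names)
    return [
        line
        for line in gtf_lines
        if line.startswith("#") or line.split("\t", 1)[0] in ref
    ]


# Usage with made-up records: "Chr0" is absent from the reference.
example = [
    'Chr0\tSZ\texon\t100\t200\t.\t+\t.\tgene_id "g0"; transcript_id "g0.1";',
    'Chr1\tSZ\texon\t1903\t2268\t.\t+\t.\tgene_id "Os01g01010"; transcript_id "Os01g01010.1";',
]
print(missing_from_reference(example, ["Chr1"]))
```

Filtering the GTF is the less invasive of the two fixes, since it leaves the index untouched.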
Old 12-06-2010, 12:14 PM   #4
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Default

Hi Dario,

it looks like you are using the ENSEMBL gtf file from here, is that correct?

I am trying to make it work with mm9 or m_musculus_ncbi37 bowtie indexes from the bowtie website without any luck (I am still getting the "TopHat did not find any junctions in GTF file" warning).

What bowtie index are you using? If you made your own, could you share how?

Thank you very much!
Old 12-07-2010, 04:23 AM   #5
epigen
Senior Member
 
Location: Germany

Join Date: May 2010
Posts: 101
Default chromosome name issue?

Quote:
Originally Posted by marcora View Post
Hi Dario,

it looks like you are using the ENSEMBL gtf file from here, is that correct?

I am trying to make it work with mm9 or m_musculus_ncbi37 bowtie indexes from the bowtie website without any luck (I am still getting the "TopHat did not find any junctions in GTF file" warning).

What bowtie index are you using? If you made your own, could you share how?

Thank you very much!
The ENSEMBL gtf is missing the "chr" in front of the chromosome number that is present in the bowtie indexes and the reference genome (fasta format). Try adding "chr" and see if it works then.
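
A rough sketch of the renaming, assuming a tab-separated GTF whose first column holds Ensembl-style names (1, 2, ..., X, Y, MT). The function name and the `special` mapping are my own. Note that Ensembl calls the mitochondrial sequence "MT" while UCSC uses "chrM", so a blind prefix would produce "chrMT". Scaffold names may need their own mapping and are not handled here.

```python
def to_ucsc_style(gtf_lines, special=None):
    """Rewrite column 1 of GTF lines from Ensembl-style to UCSC-style names.

    `special` maps names that need more than a "chr" prefix; the default
    covers only the mitochondrial sequence and is an assumption.
    """
    if special is None:
        special = {"MT": "chrM"}
    out = []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            out.append(line)  # leave blank lines and comments untouched
            continue
        name, sep, rest = line.partition("\t")
        if name in special:
            name = special[name]
        elif not name.startswith("chr"):
            name = "chr" + name
        out.append(name + sep + rest)
    return out
```

Whether this alone is enough depends on the index: the names have to match whatever the Bowtie index actually reports.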
Old 12-07-2010, 05:34 AM   #6
AdamB
Member
 
Location: uk

Join Date: Apr 2010
Posts: 43
Default

Quote:
Originally Posted by epigen View Post
The ENSEMBL gtf is missing the "chr" in front of the chromosome number that is present in the bowtie indexes and the reference genome (fasta format). Try adding "chr" and see if it works then.
This worked for me when I was trying to use a gtf from Ensembl.
Old 12-07-2010, 06:34 AM   #7
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Default

Quote:
Originally Posted by epigen View Post
The ENSEMBL gtf is missing the "chr" in front of the chromosome number that is present in the bowtie indexes and the reference genome (fasta format). Try adding "chr" and see if it works then.
Does that mean that you are using the mm9 prepackaged bowtie index which contains chr1,chr2,etc?

Thank you for your suggestion.
Old 12-07-2010, 09:32 AM   #8
epigen
Senior Member
 
Location: Germany

Join Date: May 2010
Posts: 101
Default

Quote:
Originally Posted by marcora View Post
Does that mean that you are using the mm9 prepackaged bowtie index which contains chr1,chr2,etc?
I don't use it, I built my own, but the Bowtie homepage says "M. musculus, UCSC mm9", which is the same genome I'm using, with chr1,chr2,etc. NCBI has the same format as far as I know, only Ensembl makes an exception.
Old 12-07-2010, 02:20 PM   #9
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Default

Quote:
Originally Posted by epigen View Post
I don't use it, I built my own, but the Bowtie homepage says "M. musculus, UCSC mm9", which is the same genome I'm using, with chr1,chr2,etc. NCBI has the same format as far as I know, only Ensembl makes an exception.
Adding chr in front of each line of the ENSEMBL GTF file doesn't fix the problem.

Any other idea?
Old 01-10-2011, 03:05 AM   #10
Bacilo
Junior Member
 
Location: Madrid

Join Date: May 2010
Posts: 5
Default

I have the same problem. I built my own index from the GRCh37 genome downloaded from Ensembl. The chromosome names, when I check with bowtie-inspect -n, are 1,2,3...X,Y, and the names in the Ensembl GTF file are the same, but I get the same message (Warning: TopHat did not find any junctions in GTF file). I have also tried the UCSC index and GTF file, and that combination works. This is the Ensembl GTF file:


Quote:
11 pseudogene exon 75780 76143 . + . gene_id "ENSG00000253826"; transcript_id "ENST00000519787"; exon_number "1"; gene_name "RP11-304M2.1"; transcript_name "RP11-304M2.1-001";
11 processed_transcript exon 86612 87605 . - . gene_id "ENSG00000224777"; transcript_id "ENST00000521196"; exon_number "1"; gene_name "AC069287.4"; transcript_name "AC069287.4-002";
11 processed_transcript exon 86649 87586 . - . gene_id "ENSG00000224777"; transcript_id "ENST00000424047"; exon_number "1"; gene_name "AC069287.4"; transcript_name "AC069287.4-001";
11 protein_coding exon 129060 129388 . - . gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "1"; gene_name "AC069287.3"; transcript_name "AC069287.3-201";
11 protein_coding CDS 129060 129388 . - 0 gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "1"; gene_name "AC069287.3"; transcript_name "AC069287.3-201"; protein_id "ENSP00000372234";
11 protein_coding start_codon 129386 129388 . - 0 gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "1"; gene_name "AC069287.3"; transcript_name "AC069287.3-201";
11 protein_coding exon 127926 128376 . - . gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "2"; gene_name "AC069287.3"; transcript_name "AC069287.3-201";
11 protein_coding CDS 127929 128376 . - 1 gene_id "ENSG00000230724"; transcript_id "ENST0
and this is the UCSC:

Quote:
chr1 hg19_ensGene exon 66999066 66999090 0.000000 + . gene_id "ENST00000237247"; transcript_id "ENST00000237247";
chr1 hg19_ensGene start_codon 67000042 67000044 0.000000 + . gene_id "ENST00000237247"; transcript_id "ENST00000237247";
chr1 hg19_ensGene CDS 67000042 67000051 0.000000 + 0 gene_id "ENST00000237247"; transcript_id "ENST00000237247";
chr1 hg19_ensGene exon 66999929 67000051 0.000000 + . gene_id "ENST00000237247"; transcript_id "ENST00000237247";
chr1 hg19_ensGene CDS 67091530 67091593 0.000000 + 2 gene_id "ENST00000237247"; transcript_id "ENST00000237247";
chr1 hg19_ensGene exon 67091530 67091593 0.000000 + . gene_id "ENST00000237247"; transcript_id "ENST00000237247";
chr1 hg19_ensGene CDS 67098753 67098777 0.000000 + 1 gene_id "ENST00000237247"; transcript_id "ENST00000237247";
chr1 hg19_ensGene exon 67098753 67098777 0.000000 + . gene_id "ENST00000237247"; transcript_id "ENST00000237247";
Apart from the chromosome names and the attributes in the rightmost column, all fields have the same form, except the 6th column, which is a dot in the Ensembl GTF and "0.000000" in the UCSC one. I do not know whether this field is important.
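
For comparing two files field by field, here is a small sketch that splits a GTF line into its nine standard columns (seqname, source, feature, start, end, score, strand, frame, attribute) and turns the attribute column into a dictionary. It assumes tab-separated input, and the function name is hypothetical.

```python
GTF_COLUMNS = (
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
)


def parse_gtf_line(line):
    """Split one tab-separated GTF line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes look like: key "value"; key "value";
    attributes = {}
    for item in record["attribute"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attribute"] = attributes
    return record


# Usage: inspect the score column (column 6) of an Ensembl-style line.
ensembl = parse_gtf_line(
    '11\tpseudogene\texon\t75780\t76143\t.\t+\t.\t'
    'gene_id "ENSG00000253826"; transcript_id "ENST00000519787"; exon_number "1";'
)
print(ensembl["score"], ensembl["attribute"]["transcript_id"])
```

Parsing both files this way makes it easy to see which columns really differ, rather than comparing wrapped lines by eye.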

Does anyone use Ensembl GTF file with success?

Thanks
Old 01-10-2011, 03:18 AM   #11
AdamB
Member
 
Location: uk

Join Date: Apr 2010
Posts: 43
Default

@Bacilo:

I'm not sure I understand, but did you try changing the chromosome field in the Ensembl GTF to the "chrX" form?
Old 01-10-2011, 03:23 AM   #12
Bacilo
Junior Member
 
Location: Madrid

Join Date: May 2010
Posts: 5
Default

The index and the GTF file have the same chromosome names, both without "chr", but I am going to try changing both.

thanks
Old 01-10-2011, 03:29 AM   #13
AdamB
Member
 
Location: uk

Join Date: Apr 2010
Posts: 43
Default

For me, adding "chr" to the chromosome field definitely fixed the problem.
Old 01-10-2011, 03:30 AM   #14
Bacilo
Junior Member
 
Location: Madrid

Join Date: May 2010
Posts: 5
Default

I will let you know if that works. Thanks
Old 01-10-2011, 04:02 AM   #15
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Smile

Quote:
Originally Posted by Bacilo View Post
Does anyone use Ensembl GTF file with success?
After much struggling and with the help of a member of this forum I have finally been able to use Ensembl GTF files with TopHat.

Please find a detailed answer to your problem here!

Good luck!
Old 04-21-2011, 06:26 AM   #16
Littlema
Junior Member
 
Location: France

Join Date: Apr 2011
Posts: 1
Default

Hi everybody,
I apologize in advance if my question is stupid, and maybe this is not the right place to ask it, but it would be nice if you could take a few minutes to answer.
In a GTF file, what is column 2 (protein_coding, processed_transcript, ...)?

Thanks