SEQanswers

10-28-2010, 11:27 AM   #1
KeithD
Junior Member

Location: Davis

Join Date: Oct 2010
Posts: 3
Looking for Tophat GFF file (mm9)

Hello,

Does anyone know where I can download a GTF file that will work using Tophat and their provided mm9 build? I downloaded the version from ftp://ftp.ensembl.org/pub/current/gtf/mus_musculus/ and keep getting the following error:

[Thu Oct 28 12:08:01 2010] Reading known junctions from GFF file
Warning: TopHat did not find any junctions in GFF file

I have even tried reformatting the file by adding "chr" in front of everything in the first column of each line (this changes the notation of X or 18 to chrX or chr18). At this point I would prefer to download a GTF build that works with Tophat v1.1.1, but I can also try to modify the file I have now if someone knows what needs to be changed.

A sample of one line of the GTF file:
18 protein_coding CDS 30483176 30483260 . + 0 gene_id "ENSMUSG00000033628"; transcript_id "ENSMUST00000115811"; exon_number "20"; gene_name "Pik3c3"; transcript_name "Pik3c3-004"; protein_id "ENSMUSP00000111478";

-Keith
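
One quick check for a naming mismatch like the one described above is to list the sequence names the GTF actually uses in its first column and compare them with the names in the Bowtie index. The following is only a minimal sketch, assuming a tab-delimited GTF on standard input; `gtf_seqnames` is a hypothetical helper name, not part of TopHat or Bowtie.

```shell
# List the distinct sequence names (column 1) of a GTF read from stdin.
# Comment lines and lines with fewer than 9 tab-separated fields are skipped.
gtf_seqnames() {
  awk -F'\t' '!/^#/ && NF >= 9 { print $1 }' | LC_ALL=C sort -u
}
```

It could be run as `gtf_seqnames < /home/lab/Downloads/ENSEMBLE.genes.gtf`. The output can then be compared by eye with the reference names that `bowtie-inspect -n /home/lab/Tools/bowtie-0.12.7/indexes/mm9` should print for the index.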
10-28-2010, 01:38 PM   #2
RockChalkJayhawk
Senior Member

Location: Rochester, MN

Join Date: Mar 2009
Posts: 191

1) Go to UCSC table browser
http://genome.ucsc.edu/cgi-bin/hgTab...a_doMainPage=1

2) Select mouse genome assembly mm9

3) Select Genes and Gene Prediction Tracks in the Group section

4) Select the Ensembl Genes track

5) Under output format select GTF

6) Give the output file a name

7) Get output
10-28-2010, 02:02 PM   #3
KeithD
Junior Member

Location: Davis

Join Date: Oct 2010
Posts: 3

I did this and got exactly the same error in output. The file I downloaded had this format:
chr1 mm9_ensGene start_codon 134212807 134212809 0.000000 + . gene_id "ENSMUST00000072177"; transcript_id "ENSMUST00000072177";


Other information that might be helpful, versions of programs I am using:
Tophat: 1.1.1
Bowtie: 0.12.7
cufflinks: 0.9.1
myrna: 1.0.9
samtools: 0.1.8
10-28-2010, 02:27 PM   #4
RockChalkJayhawk
Senior Member

Location: Rochester, MN

Join Date: Mar 2009
Posts: 191

Can you post 10 lines of the GTF and the command you are putting into TopHat?

I just followed those instructions and it worked fine.
10-28-2010, 02:49 PM   #5
KeithD
Junior Member

Location: Davis

Join Date: Oct 2010
Posts: 3

The tophat command I used was:

tophat -p 4 -o DMSO_tophat_test -G /home/lab/Downloads/ENSEMBLE.genes.gtf --no-novel-juncs /home/lab/Tools/bowtie-0.12.7/indexes/mm9 /home/lab/Data/DMSO_Run/s_6_sequence.fq

and the first 10 lines of the GTF file are:

chr1 mm9_ensGene start_codon 134212807 134212809 0.000000 + . gene_id "ENSMUST00000072177"; transcript_id "ENSMUST00000072177";
chr1 mm9_ensGene CDS 134212807 134213049 0.000000 + 0 gene_id "ENSMUST00000072177"; transcript_id "ENSMUST00000072177";
chr1 mm9_ensGene exon 134212703 134213049 0.000000 + . gene_id "ENSMUST00000072177"; transcript_id "ENSMUST00000072177";
chr1 mm9_ensGene CDS 134221530 134221650 0.000000 + 0 gene_id "ENSMUST00000072177"; transcript_id "ENSMUST00000072177";
chr1 mm9_ensGene exon 134221530 134221650 0.000000 + . gene_id "ENSMUST00000072177"; transcript_id "ENSMUST00000072177";
chr1 mm9_ensGene CDS 134222783 134222806 0.000000 + 2 gene_id "ENSMUST00000072177"; transcript_id "ENSMUST00000072177";
chr1 mm9_ensGene exon 134222783 134222806 0.000000 + . gene_id "ENSMUST00000072177"; transcript_id "ENSMUST00000072177";
chr1 mm9_ensGene CDS 134224274 134224425 0.000000 + 2 gene_id "ENSMUST00000072177"; transcript_id "ENSMUST00000072177";
chr1 mm9_ensGene exon 134224274 134224425 0.000000 + . gene_id "ENSMUST00000072177"; transcript_id "ENSMUST00000072177";
chr1 mm9_ensGene CDS 134224708 134224773 0.000000 + 0 gene_id "ENSMUST00000072177"; transcript_id "ENSMUST00000072177";
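
When a GTF is accepted by the parser but no junctions come out, it can help to see which feature types the file contains and how many records of each there are. As far as I understand, TopHat derives its known junctions from the exon records, so a file with no `exon` rows would be one possible cause. This is a minimal sketch under that assumption; `gtf_feature_counts` is a hypothetical helper name.

```shell
# Count GTF records per feature type (column 3), reading the GTF from stdin.
# Output is one "feature<TAB>count" line per type, sorted by feature name.
gtf_feature_counts() {
  awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
              END { for (f in n) print f "\t" n[f] }' | LC_ALL=C sort
}
```

For example: `gtf_feature_counts < /home/lab/Downloads/ENSEMBLE.genes.gtf`.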
10-29-2010, 06:33 AM   #6
RockChalkJayhawk
Senior Member

Location: Rochester, MN

Join Date: Mar 2009
Posts: 191

Keith,

I was able to reproduce your error using the GTF lines that you supplied.
However, using my human data, the process I described above works just fine. Something else you can try is to use this GTF instead.

Try it and post back your results.
11-01-2010, 04:51 AM   #7
RockChalkJayhawk
Senior Member

Location: Rochester, MN

Join Date: Mar 2009
Posts: 191

Also, you will need to run this command to make it match your bowtie index:
Code:
awk '{print "chr"$0}' Homo_sapiens.GRCh37.59.gtf > ENSEMBLE.gtf
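
The one-liner above prepends "chr" to every line, including any comment lines, and it turns the Ensembl mitochondrial name MT into chrMT rather than the UCSC-style chrM. A slightly more careful variant might look like the sketch below. It assumes a tab-delimited GTF on standard input and a Bowtie index that uses UCSC-style names; `add_chr_prefix` is a hypothetical helper name, and unplaced scaffolds would still need their own mapping.

```shell
# Prefix the sequence name (column 1) with "chr", reading a GTF from stdin.
# Comment and empty lines pass through unchanged; MT becomes chrM
# (assumption: the index follows UCSC naming).
add_chr_prefix() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ || NF == 0 { print; next }
       { $1 = ($1 == "MT") ? "chrM" : "chr" $1; print }'
}
```

Usage would be along the lines of `add_chr_prefix < Homo_sapiens.GRCh37.59.gtf > ENSEMBLE.gtf`.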
11-11-2010, 03:35 AM   #8
nkwuji
Member

Location: Dublin

Join Date: Mar 2010
Posts: 19

I found that GTF files built by the UCSC genome browser tend to have errors in the stop_codon coordinates and are then rejected by cufflinks. The errors were caused by some spliced stop_codons. I had to write my own script to transform the UCSC data table into a GTF file.
11-12-2010, 04:49 PM   #9
GKM
Member

Location: Pasadena, CA

Join Date: May 2009
Posts: 45

You are probably better off just giving it a junctions file of the simple chr / left / right / strand variety. Those always work, and it is relatively trivial to generate them from any annotation format. I have had GTF files rejected too, so I have switched to that format completely for all genomes I work with when mapping with TopHat.
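
A chr / left / right / strand junctions file can be sketched from the exon records of a GTF roughly as follows. This is not GKM's script, only a minimal illustration. It assumes a tab-delimited GTF with a `transcript_id` attribute on each exon line, and it assumes the convention I recall from the TopHat manual: zero-based coordinates, with left being the last base of the upstream exon and right the first base of the downstream exon. Please verify that against the manual for your TopHat version. `gtf_to_juncs` is a hypothetical helper name.

```shell
# Turn GTF exon records (stdin) into "chrom<TAB>left<TAB>right<TAB>strand".
# GTF coordinates are 1-based inclusive; left/right are written 0-based
# (assumed TopHat raw junction convention, see the note above).
gtf_to_juncs() {
  awk -F'\t' '
    $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
      # keep only the ID between the quotes
      id = substr($9, RSTART + 15, RLENGTH - 16)
      print id "\t" $1 "\t" $7 "\t" $4 "\t" $5
    }' |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k4,4n |
  awk -F'\t' '
    # consecutive exons of one transcript on one chromosome flank an intron
    $1 == prev_id && $2 == prev_chr && $4 > prev_end {
      print $2 "\t" (prev_end - 1) "\t" ($4 - 1) "\t" $3
    }
    { prev_id = $1; prev_chr = $2; prev_end = $5 }' |
  LC_ALL=C sort -u
}
```

For example, `gtf_to_juncs < ENSEMBLE.genes.gtf > mm9.juncs`, with the result passed to TopHat through its `-j/--raw-juncs` option instead of `-G`. Junctions shared by several transcripts are written once.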
11-15-2010, 02:47 AM   #10
nkwuji
Member

Location: Dublin

Join Date: Mar 2010
Posts: 19

It is good practice to use a juncs file instead. But the last step in the cufflinks pipeline, cuffdiff, also requires a good GTF to calculate the geneexp.diff, so it is hard to get around a bad GTF.
11-16-2010, 03:08 AM   #11
Pawan Noel
Junior Member

Location: Paris

Join Date: Nov 2010
Posts: 4

Does anyone have a quick list of the most used samtools command lines?

I'm really new to using UNIX/Linux and I would greatly appreciate it if someone could share a PDF/doc of the samtools commands.

Thank you very much and have a nice day.

Pawan
11-23-2010, 02:00 PM   #12
mbom777
Junior Member

Location: Tucson, AZ

Join Date: Oct 2010
Posts: 4

I had the same error message as the original post. In the logs subdirectory I found a file called "gtf_juncs.log" with the contents:

Code:
gtf_juncs v1.1.4 (1709)
---------------------------
Error: duplicate GFF ID 'ENSMUST00000127664' (or exons too far apart)!
Removing the corresponding line in the Ensembl GTF file fixed it.
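
Once gtf_juncs.log names an offending ID, the matching records can be filtered out of the GTF without editing it by hand. The sketch below drops every record of the named transcript, which may be more than the single line removed above, so treat it as one possible approach. `drop_transcript` is a hypothetical helper name.

```shell
# Remove all GTF records (stdin) whose attributes contain the given
# transcript_id. The closing quote in the pattern keeps an ID such as
# "T1" from also matching "T10".
drop_transcript() {
  grep -v -F "transcript_id \"$1\""
}
```

For the ID in the log above this would be something like `drop_transcript ENSMUST00000127664 < ensembl.gtf > ensembl.filtered.gtf`, where both file names are placeholders.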
05-29-2011, 04:23 PM   #13
edge
Senior Member

Location: China

Join Date: Sep 2009
Posts: 199

hi KeithD,

Did you figure out the cause of the "Warning: TopHat did not find any junctions in GFF file" message?
I'm facing the same error message as well.
Thanks for any advice you can share.
05-29-2011, 06:47 PM   #14
edge
Senior Member

Location: China

Join Date: Sep 2009
Posts: 199

Hi nkwuji,
Would you mind sharing the script that you wrote to transform the UCSC data table into a GTF file?
I'm facing the same error message in Cufflinks as well.
Thanks in advance.
05-29-2011, 06:50 PM   #15
edge
Senior Member

Location: China

Join Date: Sep 2009
Posts: 199

Hi nkwuji,
Do we need to prepare the junction file based on the annotated GTF file from Ensembl or UCSC?
Thanks.
07-31-2014, 07:16 AM   #16
Graeme
Junior Member

Location: UK

Join Date: Jul 2014
Posts: 1

In Linux, the one-liner below will create a new file with only the rows that have CDS or exon in the third column.

awk '$3=="CDS" || $3=="exon"' myFile.gff3 > new_myFile.gff3