SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DEG analysis without gff/gtf file syintel87 Bioinformatics 6 01-11-2013 01:06 AM

Reply
 
Thread Tools
Old 06-19-2013, 07:29 AM   #1
jmwhitha
Senior Member
 
Location: NC State, Raleigh, NC

Join Date: Mar 2013
Posts: 107
Default How do I get a genes.gtf file for RNA Seq Analysis

Hello all,

I am interested in using the TopHat Cufflinks workflow to analyze my RNA seq data.

The first step's command line is:
tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1.fq

From the NCBI website, I can get a multi-fasta file of the coding sequence, but how can I get a genes.gtf?

God bless,
Jason
jmwhitha is offline   Reply With Quote
Old 06-19-2013, 07:36 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

If you are working with a genome that is available from iGenomes site you will find the gtf file in the big tarball there. Also bundled in are various sequence and corresponding index files (bwa, bowtie etc) so you may want to stick with this for entire analysis.

http://cufflinks.cbcb.umd.edu/igenomes.html

What genome are you working with?

Last edited by GenoMax; 06-19-2013 at 07:40 AM.
GenoMax is offline   Reply With Quote
Old 06-19-2013, 05:47 PM   #3
jmwhitha
Senior Member
 
Location: NC State, Raleigh, NC

Join Date: Mar 2013
Posts: 107
Default

I ended up converting the genbank file to a GTF one with the python script from the link below. It worked great! Thanks so much GenoMax for all your extremely helpful advice!!

http://www.biostars.org/p/15138/
jmwhitha is offline   Reply With Quote
Old 06-20-2013, 10:30 AM   #4
jmwhitha
Senior Member
 
Location: NC State, Raleigh, NC

Join Date: Mar 2013
Posts: 107
Default

Well, I thought it worked...

A GTF file was created and looks similar to the one from the example data (FruitFly). The first line of the head command for my gtf file gives:
CP001666.1 gb2gtf gene 101 1453 . + . gene_id "<firstgene>"; transcript_id "<firsttranscript>"; gene_id "dnaA"
<The gene_id and transcript_id were left out above.>
Notice that the value of the first column was "CP001666.1", and gb2gtf is the program used to do the conversion.

The first line of the head of the example .gtf file is:
2L protein_coding exon 7529 8116 . + . exon_number "1"; gene_id "FBgn0031208"; gene_name "CG11023"; p_id "P9062"; transcript_id "FBtr0300689"; transcript_name "CG11023-RB"; tss_id "TSS8382";
In this file, "2L" is the first column for each line of data. Here they use "protein_coding" instead of "gene".

Now transitioning from the example .gtf file to the example .fa file.
Notice that the example .fa file has a "2L" as the first line.
jmwhitha@jmwhitha-OptiPlex-755:~/my_rnaseq_exp$ head genome.fa
>2L
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
ATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTT
GATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCC
GCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGC
GAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCT
ATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGAC
TGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAG
CAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGAT

My .fa file had a long string of identification so I manually changed it to CP001666.1.

Why is this important?
According to the TopHat manual, "the values in the first column of the provided GTF/GFF file (column which indicates the chromosome or contig on which the feature is located), must match the name of the reference sequence in the Bowtie index you are using with TopHat."
So did I do that correctly? I'm not sure.

What I know is that before I made that change the tophat command would give the following warning. Then, it would give an error and quit after a while of processing.

tophat -p 8 -G <mygtf>.gtf -o 0111-01_1_thout genome *0111-01_1.fq
# Bowtie version: 1.0.0.0
#[2013-06-19 22:22:06] Checking for Samtools
# Samtools version: 0.1.19.0
#[2013-06-19 22:22:06] Checking for Bowtie index files
#[2013-06-19 22:22:06] Checking for reference FASTA file
#[2013-06-19 22:22:06] Generating SAM header for genome
# format: fastq
# quality scale: phred33 (default)
#[2013-06-19 22:22:06] Reading known junctions from GTF file
# Warning: TopHat did not find any junctions in GTF file

Though this warning still showed up after I altered my .fa file, instead of quitting, it continued on with the following processing:

#[2013-06-20 08:59:15] Preparing reads
# left reads: min. length=100, max. length=100, 24223083 kept reads (10522 discarded)
#[2013-06-20 09:08:52] Creating transcriptome data files..
#[2013-06-20 09:08:52] Building Bowtie index from <myfasta>.fa
#[2013-06-20 09:08:59] Mapping left_kept_reads to transcriptome <mybase> with Bowtie
#[2013-06-20 10:14:48] Resuming TopHat pipeline with unmapped reads
#[2013-06-20 10:14:48] Mapping left_kept_reads.m2g_um to genome genome with Bowtie
#[2013-06-20 10:33:05] Mapping left_kept_reads.m2g_um_seg1 to genome genome with Bowtie (1/4)
#[2013-06-20 10:37:18] Mapping left_kept_reads.m2g_um_seg2 to genome genome with Bowtie (2/4)
#[2013-06-20 10:42:02] Mapping left_kept_reads.m2g_um_seg3 to genome genome with Bowtie (3/4)
#[2013-06-20 10:46:34] Mapping left_kept_reads.m2g_um_seg4 to genome genome with Bowtie (4/4)
#[2013-06-20 10:50:18] Searching for junctions via segment mapping
#[2013-06-20 10:55:31] Retrieving sequences for splices
#[2013-06-20 10:55:32] Indexing splices
#[2013-06-20 10:55:35] Mapping left_kept_reads.m2g_um_seg1 to genome segment_juncs with Bowtie (1/4)
#[2013-06-20 10:57:48] Mapping left_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie (2/4)
#[2013-06-20 11:00:36] Mapping left_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie (3/4)
#[2013-06-20 11:03:20] Mapping left_kept_reads.m2g_um_seg4 to genome segment_juncs with Bowtie (4/4)
#[2013-06-20 11:05:31] Joining segment hits
#[2013-06-20 11:07:08] Reporting output tracks
#-----------------------------------------------
#[2013-06-20 11:35:01] Run complete: 02:35:45 elapsed

After this process was done, I had a new .thout file and a new transcriptone_data directory with known.3.ebwt, known.4.ebwt, known.fa, and known.gff. I was surprised there was no known.1.ebwt or known.2.ebwt. I'm not sure, but this may be the reason why I had trouble with the following step.

According to the TopHat manual:
"When providing TopHat with a known transcript file (-G/--GTF option above), a transcriptome sequence file is built and a Bowtie index has to be created for it in order to align the reads to the known transcripts. Creating this Bowtie index can be time consuming and in many cases the same transcriptome data is being used for aligning multiple samples with TopHat. A transcriptome index and the associated data files (the original GFF file) can be thus reused for multiple TopHat runs with this option, so these files are only created for the first run with a given set of transcripts. If multiple TopHat runs are planned with the same transcriptome data, TopHat should be first run with the -G option and with the --transcriptome-index option pointing to a directory and a name prefix which will indicate where the transcriptome data files will be stored. Then subsequent TopHat runs using the same --transcriptome-index option value will directly use the transcriptome data created in the first run (no -G option needed for subsequent runs).
For example the first TopHat run could look like this:
tophat -o out_sample1 -G known_genes.gtf \
--transcriptome-index=transcriptome_data/known \
hg19 sample1_1.fq.z
In this example the first run will create the transcriptome_data directory if it doesn't exist, and files known.fa, known.gff and known.*ebwt (Bowtie index files) will be generated in that directory. Then for subsequent runs with the same genome and known transcripts but different reads (e.g. sample2_2.fq.z etc.), TopHat will no longer spend time building the transcriptome index because it can directly use the previously built transcriptome index, so the -G option can be discarded for subsequent runs (however using it again will not force TopHat to build the transcriptome index files again if they are already present)
tophat -o out_sample2 --transcriptome-index=transcriptome_data/known hg19 sample2_1.fq.z"

Therefore, instead of the command I used for the first one I tried:
(original command) tophat -p 8 -G <mygtf>.gtf -o 0111-01_1_thout genome *0111-01_1.fq
# <mygtf> is just a place holder
(new one) tophat -p 8 --transcriptome-index=transcriptome_data/known -o 0111-02_1_thout genome *0112-01_1.fq
(variant of the new one) tophat -p 8 transcriptome-index=transcriptome_data/known -o 0111-02_1_thout genome *0112-01_1.fq
(variant of the new one) tophat -p 8 transcriptome_data/known -o 0111-02_1_thout genome *0112-01_1.fq
(variant of the new one) tophat -p 8 -o 0111-02_1_thout transcriptome_data/known *0112-01_1.fq
(variant of the new one) tophat -p 8 -o 0111-02_1_thout transcriptome-index=transcriptome_data/known *0112-01_1.fq
(variant of the new one) tophat -p 8 -o 0111-02_1_thout ./transcriptome_data/known *0111-02_1.fq

A similar error message keeps comes up:
Error: Could not find Bowtie index files (transcriptome-index=transcriptome_data/known.*.ebwt)

When I use the original command that worked, as far as I can tell, Bowtie is building another transcriptome_data directory (or writing over it). Each time it takes several hours.

Any ideas on why this is happening?

Thanks and God bless,
Jason
jmwhitha is offline   Reply With Quote
Old 06-20-2013, 10:56 AM   #5
jmwhitha
Senior Member
 
Location: NC State, Raleigh, NC

Join Date: Mar 2013
Posts: 107
Default

By the way all,

I am working with a bacterial genome.
jmwhitha is offline   Reply With Quote
Old 06-20-2013, 11:07 AM   #6
jmwhitha
Senior Member
 
Location: NC State, Raleigh, NC

Join Date: Mar 2013
Posts: 107
Default

Wait, I think I'm figuring out the problem...

I noticed that when making a transcriptome index, TopHat only used "left reads". This might be why only a known.3.ebwt and a known.4.ebwt.

When I searched known.3.ebwt and known.4.ebwt, I discovered the bowtie-build options. I had to use bowtie-build since I don't have ebwt files. But I should have used the -r option.

"No reference indexes. Do not build the NAME.3.ebwt and NAME.4.ebwt portions of the index. Used only for paired-end alignment. [off]"

This is probably why known.ebwt files were made. I am therefore going to redo the bowtie-build and start over.

Let you know what happens!
jmwhitha is offline   Reply With Quote
Old 06-20-2013, 11:07 AM   #7
jmwhitha
Senior Member
 
Location: NC State, Raleigh, NC

Join Date: Mar 2013
Posts: 107
Default

To be clear. I have single reads, not paired end.
jmwhitha is offline   Reply With Quote
Old 06-20-2013, 11:35 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Did you use a fasta formatted genome sequence file for the bowtie-build?

Last edited by GenoMax; 06-20-2013 at 11:39 AM.
GenoMax is offline   Reply With Quote
Old 06-20-2013, 12:13 PM   #9
jmwhitha
Senior Member
 
Location: NC State, Raleigh, NC

Join Date: Mar 2013
Posts: 107
Default

Yes, it is the whole genome fasta from NCBI (not multifasta coding sequence).
jmwhitha is offline   Reply With Quote
Reply

Tags
cufflinks, genes.gtf, gtf, rna seq, tophat

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:37 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO