Seqanswers Leaderboard Ad

**GenoMax** · 06-19-2013, 07:36 AM

If you are working with a genome that is available from iGenomes site you will find the gtf file in the big tarball there. Also bundled in are various sequence and corresponding index files (bwa, bowtie etc) so you may want to stick with this for entire analysis.

404 Not Found

http://cufflinks.cbcb.umd.edu/igenomes.html

What genome are you working with?

**jmwhitha** · 06-19-2013, 05:47 PM

I ended up converting the genbank file to a GTF one with the python script from the link below. It worked great! Thanks so much GenoMax for all your extremely helpful advice!!

Genbank To Gtf Converter

http://www.biostars.org/p/15138/

**jmwhitha** · 06-20-2013, 10:30 AM

Well, I thought it worked...

A GTF file was created and looks similar to the one from the example data (FruitFly). The first line of the head command for my gtf file gives:
CP001666.1 gb2gtf gene 101 1453 . + . gene_id "<firstgene>"; transcript_id "<firsttranscript>"; gene_id "dnaA"
<The gene_id and transcript_id were left out above.>
Notice that the value of the first column was "CP001666.1", and gb2gtf is the program used to do the conversion.

The first line of the head of the example .gtf file is:
2L protein_coding exon 7529 8116 . + . exon_number "1"; gene_id "FBgn0031208"; gene_name "CG11023"; p_id "P9062"; transcript_id "FBtr0300689"; transcript_name "CG11023-RB"; tss_id "TSS8382";
In this file, "2L" is the first column for each line of data. Here they use "protein_coding" instead of "gene".

Now transitioning from the example .gtf file to the example .fa file.
Notice that the example .fa file has a "2L" as the first line.
jmwhitha@jmwhitha-OptiPlex-755:~/my_rnaseq_exp$ head genome.fa
>2L
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
ATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTT
GATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCC
GCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGC
GAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCT
ATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGAC
TGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAG
CAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGAT

My .fa file had a long string of identification so I manually changed it to CP001666.1.

Why is this important?
According to the TopHat manual, "the values in the first column of the provided GTF/GFF file (column which indicates the chromosome or contig on which the feature is located), must match the name of the reference sequence in the Bowtie index you are using with TopHat."
So did I do that correctly? I'm not sure.

What I know is that before I made that change the tophat command would give the following warning. Then, it would give an error and quit after a while of processing.

tophat -p 8 -G <mygtf>.gtf -o 0111-01_1_thout genome *0111-01_1.fq
# Bowtie version: 1.0.0.0
#[2013-06-19 22:22:06] Checking for Samtools
# Samtools version: 0.1.19.0
#[2013-06-19 22:22:06] Checking for Bowtie index files
#[2013-06-19 22:22:06] Checking for reference FASTA file
#[2013-06-19 22:22:06] Generating SAM header for genome
# format: fastq
# quality scale: phred33 (default)
#[2013-06-19 22:22:06] Reading known junctions from GTF file
# Warning: TopHat did not find any junctions in GTF file

Though this warning still showed up after I altered my .fa file, instead of quitting, it continued on with the following processing:

#[2013-06-20 08:59:15] Preparing reads
# left reads: min. length=100, max. length=100, 24223083 kept reads (10522 discarded)
#[2013-06-20 09:08:52] Creating transcriptome data files..
#[2013-06-20 09:08:52] Building Bowtie index from <myfasta>.fa
#[2013-06-20 09:08:59] Mapping left_kept_reads to transcriptome <mybase> with Bowtie
#[2013-06-20 10:14:48] Resuming TopHat pipeline with unmapped reads
#[2013-06-20 10:14:48] Mapping left_kept_reads.m2g_um to genome genome with Bowtie
#[2013-06-20 10:33:05] Mapping left_kept_reads.m2g_um_seg1 to genome genome with Bowtie (1/4)
#[2013-06-20 10:37:18] Mapping left_kept_reads.m2g_um_seg2 to genome genome with Bowtie (2/4)
#[2013-06-20 10:42:02] Mapping left_kept_reads.m2g_um_seg3 to genome genome with Bowtie (3/4)
#[2013-06-20 10:46:34] Mapping left_kept_reads.m2g_um_seg4 to genome genome with Bowtie (4/4)
#[2013-06-20 10:50:18] Searching for junctions via segment mapping
#[2013-06-20 10:55:31] Retrieving sequences for splices
#[2013-06-20 10:55:32] Indexing splices
#[2013-06-20 10:55:35] Mapping left_kept_reads.m2g_um_seg1 to genome segment_juncs with Bowtie (1/4)
#[2013-06-20 10:57:48] Mapping left_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie (2/4)
#[2013-06-20 11:00:36] Mapping left_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie (3/4)
#[2013-06-20 11:03:20] Mapping left_kept_reads.m2g_um_seg4 to genome segment_juncs with Bowtie (4/4)
#[2013-06-20 11:05:31] Joining segment hits
#[2013-06-20 11:07:08] Reporting output tracks
#-----------------------------------------------
#[2013-06-20 11:35:01] Run complete: 02:35:45 elapsed

After this process was done, I had a new .thout file and a new transcriptone_data directory with known.3.ebwt, known.4.ebwt, known.fa, and known.gff. I was surprised there was no known.1.ebwt or known.2.ebwt. I'm not sure, but this may be the reason why I had trouble with the following step.

According to the TopHat manual:
"When providing TopHat with a known transcript file (-G/--GTF option above), a transcriptome sequence file is built and a Bowtie index has to be created for it in order to align the reads to the known transcripts. Creating this Bowtie index can be time consuming and in many cases the same transcriptome data is being used for aligning multiple samples with TopHat. A transcriptome index and the associated data files (the original GFF file) can be thus reused for multiple TopHat runs with this option, so these files are only created for the first run with a given set of transcripts. If multiple TopHat runs are planned with the same transcriptome data, TopHat should be first run with the -G option and with the --transcriptome-index option pointing to a directory and a name prefix which will indicate where the transcriptome data files will be stored. Then subsequent TopHat runs using the same --transcriptome-index option value will directly use the transcriptome data created in the first run (no -G option needed for subsequent runs).
For example the first TopHat run could look like this:
tophat -o out_sample1 -G known_genes.gtf \
--transcriptome-index=transcriptome_data/known \
hg19 sample1_1.fq.z
In this example the first run will create the transcriptome_data directory if it doesn't exist, and files known.fa, known.gff and known.*ebwt (Bowtie index files) will be generated in that directory. Then for subsequent runs with the same genome and known transcripts but different reads (e.g. sample2_2.fq.z etc.), TopHat will no longer spend time building the transcriptome index because it can directly use the previously built transcriptome index, so the -G option can be discarded for subsequent runs (however using it again will not force TopHat to build the transcriptome index files again if they are already present)
tophat -o out_sample2 --transcriptome-index=transcriptome_data/known hg19 sample2_1.fq.z"

Therefore, instead of the command I used for the first one I tried:
(original command) tophat -p 8 -G <mygtf>.gtf -o 0111-01_1_thout genome *0111-01_1.fq
# <mygtf> is just a place holder
(new one) tophat -p 8 --transcriptome-index=transcriptome_data/known -o 0111-02_1_thout genome *0112-01_1.fq
(variant of the new one) tophat -p 8 transcriptome-index=transcriptome_data/known -o 0111-02_1_thout genome *0112-01_1.fq
(variant of the new one) tophat -p 8 transcriptome_data/known -o 0111-02_1_thout genome *0112-01_1.fq
(variant of the new one) tophat -p 8 -o 0111-02_1_thout transcriptome_data/known *0112-01_1.fq
(variant of the new one) tophat -p 8 -o 0111-02_1_thout transcriptome-index=transcriptome_data/known *0112-01_1.fq
(variant of the new one) tophat -p 8 -o 0111-02_1_thout ./transcriptome_data/known *0111-02_1.fq

A similar error message keeps comes up:
Error: Could not find Bowtie index files (transcriptome-index=transcriptome_data/known.*.ebwt)

When I use the original command that worked, as far as I can tell, Bowtie is building another transcriptome_data directory (or writing over it). Each time it takes several hours.

Any ideas on why this is happening?

Thanks and God bless,
Jason

**jmwhitha** · 06-20-2013, 10:56 AM

By the way all,

I am working with a bacterial genome.

**jmwhitha** · 06-20-2013, 11:07 AM

Wait, I think I'm figuring out the problem...

I noticed that when making a transcriptome index, TopHat only used "left reads". This might be why only a known.3.ebwt and a known.4.ebwt.

When I searched known.3.ebwt and known.4.ebwt, I discovered the bowtie-build options. I had to use bowtie-build since I don't have ebwt files. But I should have used the -r option.

"No reference indexes. Do not build the NAME.3.ebwt and NAME.4.ebwt portions of the index. Used only for paired-end alignment. [off]"

This is probably why known.ebwt files were made. I am therefore going to redo the bowtie-build and start over.

Let you know what happens!

**jmwhitha** · 06-20-2013, 11:07 AM

To be clear. I have single reads, not paired end.

**GenoMax** · 06-20-2013, 11:35 AM

Did you use a fasta formatted genome sequence file for the bowtie-build?

**jmwhitha** · 06-20-2013, 12:13 PM

Yes, it is the whole genome fasta from NCBI (not multifasta coding sequence).

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 14 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

How do I get a genes.gtf file for RNA Seq Analysis

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News