SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
indexing tophat bam files stormin Bioinformatics 13 12-08-2014 12:38 PM
tophat: Indexing splices crh Bioinformatics 0 07-04-2014 10:45 AM
TopHat Error: Splice sequence indexing failed with err =255 EugeneID Bioinformatics 1 09-27-2012 10:29 AM
Tophat v1.4.0 (Transcriptome Alignment) Jon_Keats Bioinformatics 6 02-21-2012 05:31 PM
tophat indexing madsaan Bioinformatics 0 01-17-2011 01:13 PM

Reply
 
Thread Tools
Old 03-03-2016, 09:32 AM   #1
aprice67
Member
 
Location: New York

Join Date: Nov 2012
Posts: 49
Default Tophat transcriptome indexing

Hi,

I'm trying to use tophat on some data and am having an issue. I want to make a transcriptome mapping file like this:

Code:
 tophat -G CMCP6_adjusted.gff3 --transcriptome-index=CMCP6_adjusted CMCP6_adjusted
The output from this command, through many variations, looks like this:
Code:
[2016-03-03 12:58:49] Building transcriptome files with TopHat v2.1.1
-----------------------------------------------
[2016-03-03 12:58:49] Checking for Bowtie
                  Bowtie version:        2.1.0.0
[2016-03-03 12:58:49] Checking for Bowtie index files (genome)..
[2016-03-03 12:58:49] Checking for reference FASTA file
[2016-03-03 12:58:49] Building transcriptome data files CMCP6_adjusted/CMCP6_adjusted
        [FAILED]
 Error: gtf_to_fasta returned an error.
So I've read the manual for tophat and many similar issues on these forums and others. I'm guessing there's an issue with the chromosome names. So I started working on that. Here are the steps I've taken and what I've tried to do to address them so far.

Build the bowtie2 index:
Code:
bowtie2-build CMCP6.fa CMCP6
Here is the head of the fasta file I'm using:
Code:
>gi|326423644|ref|NC_004459.3| Vibrio vulnificus CMCP6 chromosome I, complete sequence
TTACTTTAAATAATAAAAATCGTATAACGTTCGGGTAATTTCTGCA...
I have both gff3 and a gtf I made from it available. Here are the first couple of lines from each in case it's important.
GFF3:
Code:
NC_004459       GenBank chromosome      1       3281866 .       +       .       ID=NC_004459;Name=NC_004459;Alias=I;Dbxref=taxon:216895;Note=Vibrio vulnificus CMCP6 chromosome I%2C complete sequence.;chromosome=I;comment1=REFSEQ INFORMATION: The reference sequence was derived from AE016795. On Mar 18%2C 2011 this sequence version replaced gi:117956319. Annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (released 2013). Information about the Pipeline can be found here: http://www.ncbi.nlm.nih.gov/genome/annotation_prok/ ##Genome-Annotation-Data-START## Annotation Provider :: NCBI Annotation Date :: 08/18/2015 01:03:26 Annotation Pipeline :: NCBI Prokaryotic Genome Annotation Pipeline Annotation Method :: Best-placed reference protein set%3B GeneMarkS+ Annotation Software revision :: 3.0 Features Annotated :: Gene%3B CDS%3B rRNA%3B tRNA%3B ncRNA%3B repeat_region Genes :: 4%2C578 CDS 4%2C374 Pseudo Genes :: 64 rRNAs :: 10%2C 9%2C 9 (5S%2C 16S%2C 23S) complete rRNAs :: 10%2C 9%2C 9 (5S%2C 16S%2C 23S) tRNAs :: 111 ncRNA :: 1 Frameshifted Genes :: 43 Frameshifted Genes On Monomer Runs :: 8 Frameshifted Genes Not On Monomer Runs :: 1 ##Genome-Annotation-Data-END## COMPLETENESS: full length. ;date=18-AUG-2015;mol_type=genomic DNA;organism=Vibrio vulnificus CMCP6;strain=CMCP6
NC_004459       GenBank gene    1       1002    .       -       .       ID=VV1_RS00005;Name=VV1_RS00005;locus_tag=VV1_RS00005;old_locus_t
GTF:
Code:
NC_004459       GenBank chromosome      1       3281866 .       +       .       ID "NC_004459"; Name "NC_004459"; Alias "I"; Dbxref "taxon:216895"; Note "Vibrio vulnificus CMCP6 chromosome I, complete sequence."; chromosome "I"; comment1 "REFSEQ INFORMATION: The reference sequence was derived from AE016795. On Mar 18, 2011 this sequence version replaced gi:117956319. Annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (released 2013). Information about the Pipeline can be found here: http://www.ncbi.nlm.nih.gov/genome/annotation_prok/ ##Genome-Annotation-Data-START## Annotation Provider :: NCBI Annotation Date :: 08/18/2015 01:03:26 Annotation Pipeline :: NCBI Prokaryotic Genome Annotation Pipeline Annotation Method :: Best-placed reference protein set; GeneMarkS+ Annotation Software revision :: 3.0 Features Annotated :: Gene; CDS; rRNA; tRNA; ncRNA; repeat_region Genes :: 4,578 CDS 4,374 Pseudo Genes :: 64 rRNAs :: 10, 9, 9 (5S, 16S, 23S) complete rRNAs :: 10, 9, 9 (5S, 16S, 23S) tRNAs :: 111 ncRNA :: 1 Frameshifted Genes :: 43 Frameshifted Genes On Monomer Runs :: 8 Frameshifted Genes Not On Monomer Runs :: 1 ##Genome-Annotation-Data-END## COMPLETENESS: full length."; date "18-AUG-2015"; mol_type "genomic DNA"; organism "Vibrio vulnificus CMCP6"; strain "CMCP6";
NC_004459       GenBank gene    1       1002    .       -       .       ID "VV1_RS00005"; Name "VV1_RS00005"; locus_tag "VV1_RS00005"; old_locus_tag "VV1_0001";
So according to the tophat manual, when you use the --transcriptome-index, "...please make sure that the 1st column in the annotation file uses the exact same chromosome/contig names (case sensitive) as shown by the bowtie-inspect command..."

Here is the result of that command:
Code:
bowtie2-inspect --names CMCP6
gi|326423644|ref|NC_004459.3| Vibrio vulnificus CMCP6 chromosome I, complete sequence
Using some command line tools, here's what I have in the gtf file for chromosome:
Code:
cut -f1 CMCP6.gtf | sort -u
##date 2016-02-26
##gff-version 2
NC_004459
So I see the fasta file has the NC_004459.3 and the gtf file has just NC_004459. So I change the fasta file to NC_004459, rebuild the index, and try again.
Code:
bowtie2-inspect --names CMCP6
gi|326423644|ref|NC_004459| Vibrio vulnificus CMCP6 chromosome I, complete sequence
Now, I try to build the transcriptome file:
Code:
tophat -G CMCP6.gtf --transcriptome-index=transcriptome/CMCP6 CMCP6

[2016-03-03 13:21:48] Building transcriptome files with TopHat v2.1.1
-----------------------------------------------
[2016-03-03 13:21:48] Checking for Bowtie
                  Bowtie version:        2.1.0.0
[2016-03-03 13:21:48] Checking for Bowtie index files (genome)..
[2016-03-03 13:21:48] Checking for reference FASTA file
[2016-03-03 13:21:48] Building transcriptome data files transcriptome/CMCP6
[2016-03-03 13:21:48] Building Bowtie index from CMCP6.fa
        [FAILED]
Error: Couldn't build bowtie index with err = 1
So no change there, something still doesn't match but I don't know what. The tophat manual said the names should match exactly, so I've tried several variations in both files for the chromosome name, but no matter what I do I end up with the same thing.

The g2f.err file shows the following:
Code:
Warning: invalid GTF record, transcript_id not found:
NC_004459       GenBank mRNA    1       1002    .       -       .       ID "VV1_RS00005.t01"; Name "VV1_RS00005.t01"; Parent "VV1_RS00005";
Warning: invalid GTF record, transcript_id not found:
NC_004459       GenBank CDS     1       1002    .       -       0       ID "VV1_RS00005.p01"; Name "VV1_RS00005.p01"; Dbxref "GI:499390662"; Note "Der$
Warning: invalid GTF record, transcript_id not found:
NC_004459       GenBank exon    1       1002    .       -       .        Parent "VV1_RS00005.t01";
This makes me think there is some id matching issue going on here, but I'm unsure how to correct it. A folder called 'transcriptome' is also produced by my command, and it contains a file, CMCP6.fa, that is empty, if that says anything about the point at which tophat is failing.

I've read many threads here and elsewhere, as well as the manual for tophat, but am still unsure how to correct this. Also, in other places people often suggest getting different versions of my files, but I don't think that's going to be an option for me for this bacteria, and the next one I have to do the same thing to. I would appreciate any help anyone can offer on this. Thanks very much in advance!
aprice67 is offline   Reply With Quote
Old 03-03-2016, 10:11 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,784
Default

Why not simplify the headers for the fasta file so only the accession number remains: NC_004459.3?

Since this is bacterial RNAseq data you don't need to worry about splicing. You could just map using an aligner of your choice (BBMap would be easiest) and then go to featurecounts/htseq-count for counting.
GenoMax is offline   Reply With Quote
Old 03-03-2016, 10:45 AM   #3
aprice67
Member
 
Location: New York

Join Date: Nov 2012
Posts: 49
Default

Code:
Why not simplify the headers for the fasta file so only the accession number remains: NC_004459.3?
I've actually tried this. It doesn't change anything as it turns out.


Code:
 You could just map using an aligner of your choice (BBMap would be easiest) and then go to featurecounts/htseq-count for counting.
I'm actually trying to reproduce some methodology from a paper and so I wanted to use the same methods. I was planning to do htseq-count/DESeq if I can't get this working as a last resort though. I may end up doing this though.
aprice67 is offline   Reply With Quote
Old 03-03-2016, 10:47 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,784
Default

Quote:
Originally Posted by aprice67 View Post
I've actually tried this. It doesn't change anything as it turns out.
Did you re-do the entire pipeline since you would need to re-index the fasta, remap etc.
GenoMax is offline   Reply With Quote
Old 03-03-2016, 10:50 AM   #5
aprice67
Member
 
Location: New York

Join Date: Nov 2012
Posts: 49
Default

Code:
Did you re-do the entire pipeline since you would need to re-index the fasta, remap etc.
I did. I double checked with 'bowtie2-inspect --names CMCP6' too to confirm.
aprice67 is offline   Reply With Quote
Reply

Tags
fasta, gff, gtf, tophat, transcriptome

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:22 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO