Hi,
I'm trying to use tophat on some data and am having an issue. I want to make a transcriptome mapping file like this:
The output from this command, through many variations, looks like this:
So I've read the manual for tophat and many similar issues on these forums and others. I'm guessing there's an issue with the chromosome names. So I started working on that. Here are the steps I've taken and what I've tried to do to address them so far.
Build the bowtie2 index:
Here is the head of the fasta file I'm using:
I have both gff3 and a gtf I made from it available. Here are the first couple of lines from each in case it's important.
GFF3:
GTF:
So according to the tophat manual, when you use the --transcriptome-index, "...please make sure that the 1st column in the annotation file uses the exact same chromosome/contig names (case sensitive) as shown by the bowtie-inspect command..."
Here is the result of that command:
Using some command line tools, here's what I have in the gtf file for chromosome:
So I see the fasta file has the NC_004459.3 and the gtf file has just NC_004459. So I change the fasta file to NC_004459, rebuild the index, and try again.
Now, I try to build the transcriptome file:
So no change there, something still doesn't match but I don't know what. The tophat manual said the names should match exactly, so I've tried several variations in both files for the chromosome name, but no matter what I do I end up with the same thing.
The g2f.err file shows the following:
This makes me think there is some id matching issue going on here, but I'm unsure how to correct it. A folder called 'transcriptome' is also produced by my command, and it contains a file, CMCP6.fa, that is empty, if that says anything about the point at which tophat is failing.
I've read many threads here and elsewhere, as well as the manual for tophat, but am still unsure how to correct this. Also, in other places people often suggest getting different versions of my files, but I don't think that's going to be an option for me for this bacteria, and the next one I have to do the same thing to. I would appreciate any help anyone can offer on this. Thanks very much in advance!
I'm trying to use tophat on some data and am having an issue. I want to make a transcriptome mapping file like this:
Code:
tophat -G CMCP6_adjusted.gff3 --transcriptome-index=CMCP6_adjusted CMCP6_adjusted
Code:
[2016-03-03 12:58:49] Building transcriptome files with TopHat v2.1.1 ----------------------------------------------- [2016-03-03 12:58:49] Checking for Bowtie Bowtie version: 2.1.0.0 [2016-03-03 12:58:49] Checking for Bowtie index files (genome).. [2016-03-03 12:58:49] Checking for reference FASTA file [2016-03-03 12:58:49] Building transcriptome data files CMCP6_adjusted/CMCP6_adjusted [FAILED] Error: gtf_to_fasta returned an error.
Build the bowtie2 index:
Code:
bowtie2-build CMCP6.fa CMCP6
Code:
>gi|326423644|ref|NC_004459.3| Vibrio vulnificus CMCP6 chromosome I, complete sequence TTACTTTAAATAATAAAAATCGTATAACGTTCGGGTAATTTCTGCA...
GFF3:
Code:
NC_004459 GenBank chromosome 1 3281866 . + . ID=NC_004459;Name=NC_004459;Alias=I;Dbxref=taxon:216895;Note=Vibrio vulnificus CMCP6 chromosome I%2C complete sequence.;chromosome=I;comment1=REFSEQ INFORMATION: The reference sequence was derived from AE016795. On Mar 18%2C 2011 this sequence version replaced gi:117956319. Annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (released 2013). Information about the Pipeline can be found here: http://www.ncbi.nlm.nih.gov/genome/annotation_prok/ ##Genome-Annotation-Data-START## Annotation Provider :: NCBI Annotation Date :: 08/18/2015 01:03:26 Annotation Pipeline :: NCBI Prokaryotic Genome Annotation Pipeline Annotation Method :: Best-placed reference protein set%3B GeneMarkS+ Annotation Software revision :: 3.0 Features Annotated :: Gene%3B CDS%3B rRNA%3B tRNA%3B ncRNA%3B repeat_region Genes :: 4%2C578 CDS 4%2C374 Pseudo Genes :: 64 rRNAs :: 10%2C 9%2C 9 (5S%2C 16S%2C 23S) complete rRNAs :: 10%2C 9%2C 9 (5S%2C 16S%2C 23S) tRNAs :: 111 ncRNA :: 1 Frameshifted Genes :: 43 Frameshifted Genes On Monomer Runs :: 8 Frameshifted Genes Not On Monomer Runs :: 1 ##Genome-Annotation-Data-END## COMPLETENESS: full length. ;date=18-AUG-2015;mol_type=genomic DNA;organism=Vibrio vulnificus CMCP6;strain=CMCP6 NC_004459 GenBank gene 1 1002 . - . ID=VV1_RS00005;Name=VV1_RS00005;locus_tag=VV1_RS00005;old_locus_t
Code:
NC_004459 GenBank chromosome 1 3281866 . + . ID "NC_004459"; Name "NC_004459"; Alias "I"; Dbxref "taxon:216895"; Note "Vibrio vulnificus CMCP6 chromosome I, complete sequence."; chromosome "I"; comment1 "REFSEQ INFORMATION: The reference sequence was derived from AE016795. On Mar 18, 2011 this sequence version replaced gi:117956319. Annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (released 2013). Information about the Pipeline can be found here: http://www.ncbi.nlm.nih.gov/genome/annotation_prok/ ##Genome-Annotation-Data-START## Annotation Provider :: NCBI Annotation Date :: 08/18/2015 01:03:26 Annotation Pipeline :: NCBI Prokaryotic Genome Annotation Pipeline Annotation Method :: Best-placed reference protein set; GeneMarkS+ Annotation Software revision :: 3.0 Features Annotated :: Gene; CDS; rRNA; tRNA; ncRNA; repeat_region Genes :: 4,578 CDS 4,374 Pseudo Genes :: 64 rRNAs :: 10, 9, 9 (5S, 16S, 23S) complete rRNAs :: 10, 9, 9 (5S, 16S, 23S) tRNAs :: 111 ncRNA :: 1 Frameshifted Genes :: 43 Frameshifted Genes On Monomer Runs :: 8 Frameshifted Genes Not On Monomer Runs :: 1 ##Genome-Annotation-Data-END## COMPLETENESS: full length."; date "18-AUG-2015"; mol_type "genomic DNA"; organism "Vibrio vulnificus CMCP6"; strain "CMCP6"; NC_004459 GenBank gene 1 1002 . - . ID "VV1_RS00005"; Name "VV1_RS00005"; locus_tag "VV1_RS00005"; old_locus_tag "VV1_0001";
Here is the result of that command:
Code:
bowtie2-inspect --names CMCP6 gi|326423644|ref|NC_004459.3| Vibrio vulnificus CMCP6 chromosome I, complete sequence
Code:
cut -f1 CMCP6.gtf | sort -u ##date 2016-02-26 ##gff-version 2 NC_004459
Code:
bowtie2-inspect --names CMCP6 gi|326423644|ref|NC_004459| Vibrio vulnificus CMCP6 chromosome I, complete sequence
Code:
tophat -G CMCP6.gtf --transcriptome-index=transcriptome/CMCP6 CMCP6 [2016-03-03 13:21:48] Building transcriptome files with TopHat v2.1.1 ----------------------------------------------- [2016-03-03 13:21:48] Checking for Bowtie Bowtie version: 2.1.0.0 [2016-03-03 13:21:48] Checking for Bowtie index files (genome).. [2016-03-03 13:21:48] Checking for reference FASTA file [2016-03-03 13:21:48] Building transcriptome data files transcriptome/CMCP6 [2016-03-03 13:21:48] Building Bowtie index from CMCP6.fa [FAILED] Error: Couldn't build bowtie index with err = 1
The g2f.err file shows the following:
Code:
Warning: invalid GTF record, transcript_id not found: NC_004459 GenBank mRNA 1 1002 . - . ID "VV1_RS00005.t01"; Name "VV1_RS00005.t01"; Parent "VV1_RS00005"; Warning: invalid GTF record, transcript_id not found: NC_004459 GenBank CDS 1 1002 . - 0 ID "VV1_RS00005.p01"; Name "VV1_RS00005.p01"; Dbxref "GI:499390662"; Note "Der$ Warning: invalid GTF record, transcript_id not found: NC_004459 GenBank exon 1 1002 . - . Parent "VV1_RS00005.t01";
I've read many threads here and elsewhere, as well as the manual for tophat, but am still unsure how to correct this. Also, in other places people often suggest getting different versions of my files, but I don't think that's going to be an option for me for this bacteria, and the next one I have to do the same thing to. I would appreciate any help anyone can offer on this. Thanks very much in advance!
Comment