SEQanswers

Old 11-21-2013, 07:10 AM   #1
seeker
Member
 
Location: Zurich

Join Date: Jan 2011
Posts: 26
Default Tophat2 with GFF3 annotation fails to produce Bowtie index.

Hi all.

I'm trying to map paired-end Illumina reads to a reference genome, with a GFF3 file for annotation info. I compiled the genome sequence from separate files for each linkage group, plus the scaffolds which couldn't be assigned to linkage groups.

But running tophat2 like so:

Code:
tophat -p 8 -G ~/path/to/annotation.gff3 index_name CAA_l1_1.fq.gz CAA_l1_2.fq.gz
ends up giving me this error:

Code:
[2013-11-21 12:19:21] Building transcriptome data files..
[2013-11-21 12:19:24] Building Bowtie index from annotation.fa
        [FAILED]
Error: Couldn't build bowtie index with err = 1
I thought that maybe the names were off but it all looks like it matches.

Code:
bowtie2-inspect -n 
gi|339751252|ref|NC_015762.1| Bombus terrestris linkage group LG B01, Bter_1.0 chromosome, whole genome shotgun sequence
...
Code:
bowtie2-inspect -s 
Flags   1
Reverse flags   5
Colorspace      0
2.0-compatible  1
SA-Sample       1 in 16
FTab-Chars      10
Sequence-1      gi|339751252|ref|NC_015762.1| Bombus terrestris linkage group LG B01, Bter_1.0 chromosome, whole genome shotgun sequence        17153651
...
and my GFF3 file looks like so:

Code:
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_015762.1 1 17153651
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=30195
NC_015762.1     RefSeq  region  1       17153651        .       +       .       ID=id0;Dbxref=taxon:30195;gbkey=Src;genome=chromosome;linkage-group=LG B01;mol_type=genomic DNA;note=haploid drones;sex=male
NC_015762.1     RefSeq  gene    2279    19877   .       -       .       ID=gene0;Name=LOC100649911;Dbxref=GeneID:100649911;gbkey=Gene;gene=LOC100649911
...
Any suggestions here as to what I'm doing wrong would be most appreciated.
Old 11-21-2013, 07:38 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Have you pre-built the genome index for the genome you are searching against?

This guide is helpful: http://www.nature.com/nprot/journal/....2012.016.html

Last edited by GenoMax; 11-21-2013 at 07:45 AM.
Old 11-21-2013, 07:42 AM   #3
seeker
Member
 
Location: Zurich

Join Date: Jan 2011
Posts: 26
Default

Thanks Geno, I'll take a look at that paper. I did build the genome index ahead of running tophat.

I have a hunch that it might be because the sequence names are all filled with extra info. I'll try cleaning those out and rebuilding the genome index.
Old 11-21-2013, 07:45 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Did you call your genome index "index_name" because that is what your command line is suggesting?

Can you post the command line you used to build the index?
Old 11-21-2013, 07:49 AM   #5
seeker
Member
 
Location: Zurich

Join Date: Jan 2011
Posts: 26
Default

I called it that in the post because it might be unclear otherwise.

The index was created by first concatenating all the .fa files, then running:
#real name of the index
bowtie2-build Bter_gDNA.fa Bter_gDNA

BTW, it seems to do the mapping step if you exclude the GFF part. Also, running bowtie alone works fine. It seems to really be an issue with the GFF matching.
Old 11-21-2013, 08:00 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Got it.

Can you validate your GFF file to make sure it is ok: http://www.raetschlab.org/suppl/gff-tools
Old 11-21-2013, 11:14 PM   #7
seeker
Member
 
Location: Zurich

Join Date: Jan 2011
Posts: 26
Default

Hadn't seen those gff-tools. Will take a good look for future reference. Turns out it was an issue with the long names in the fasta file and short names in the GFF3 file. A silly issue really!
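
For anyone hitting the same mismatch, here is a minimal sketch of one way to shorten the FASTA names so they line up with the GFF3. It assumes NCBI-style headers shaped like the one shown in the first post ('gi|number|ref|accession| description'); the function names `short_name` and `rewrite_headers` are made up for illustration and are not part of TopHat or Bowtie.

```python
def short_name(header: str) -> str:
    """Reduce an NCBI-style FASTA header to the bare accession.

    Assumption: headers look like 'gi|<number>|ref|<accession>| description'.
    Anything else falls back to the first whitespace-delimited token.
    """
    tokens = header.lstrip(">").split()
    token = tokens[0] if tokens else ""
    fields = token.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]
    return token


def rewrite_headers(lines):
    """Yield FASTA lines with every header replaced by its short name."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield ">" + short_name(line)
        else:
            yield line


if __name__ == "__main__":
    example = [
        ">gi|339751252|ref|NC_015762.1| Bombus terrestris linkage group LG B01\n",
        "ACGTACGT\n",
    ]
    for out_line in rewrite_headers(example):
        print(out_line)
```

The renamed FASTA would then need to be written to a new file and the index rebuilt with bowtie2-build, since the names are stored inside the index.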
Old 12-07-2013, 09:12 PM   #8
bw.
Member
 
Location: San Francisco, CA

Join Date: Mar 2012
Posts: 21
Default

I recently ran into the same error:
Code:
Couldn't build bowtie index with err = 1
I ran bowtie2-inspect -n, but the name exactly matched the name in the first column of the GTF. I then ran bowtie2-inspect genome_index > new.fa (without -n) to regenerate the FASTA from the index. This fixed it.

P.S. A diff of the original genome.fa vs. new.fa indicated that the whitespace was different (I had checked previously that there were no spaces after the sequence name, but apparently the newline character was different). Also, the regenerated FASTA had a different number of bases per line and no extra blank line at the end of the file. I'm not sure which of these differences was causing the error.
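
A rough sketch of how those three differences could be normalised directly, without going through the index. It is not known which of them mattered, so this simply evens out all of them: line endings, trailing whitespace and blank lines, and bases per line. The function name `normalize_fasta` and the default width of 60 are assumptions for illustration, and the whole file is held in memory, which may be impractical for a large genome.

```python
def normalize_fasta(text: str, width: int = 60) -> str:
    """Return FASTA text with Unix newlines, no blank lines, and fixed-width sequence lines."""
    records = []  # list of (header, sequence) pairs
    header, chunks = None, []
    for raw in text.splitlines():  # splitlines() handles \n, \r\n and \r
        line = raw.strip()
        if not line:
            continue  # drop blank lines, including a trailing one
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    out = []
    for header, seq in records:
        out.append(header)
        # Re-wrap the sequence so every line has the same number of bases.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return ("\n".join(out) + "\n") if out else ""


if __name__ == "__main__":
    messy = ">chr1 \r\nACGT\r\nAC\r\n\r\n"
    print(normalize_fasta(messy, width=4), end="")
```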
Old 03-28-2014, 11:34 PM   #9
seeker
Member
 
Location: Zurich

Join Date: Jan 2011
Posts: 26
Default

Good to know bw, thanks.
Old 05-15-2014, 09:46 AM   #10
filmonhg
Junior Member
 
Location: Murfreesboro, TN

Join Date: May 2014
Posts: 1
Default

Hello everyone,
I also ran into the same problem:
Error: Couldn't build bowtie index with err = 1
* I created my index from my reference genome "ref_maize.fa", which produced these files:
maize_ebtw.1.bt2 maize_ebtw.3.bt2 maize_ebtw.rev.1.bt2
maize_ebtw.2.bt2 maize_ebtw.4.bt2 maize_ebtw.rev.2.bt2
which are all in one directory "bowtie_build2"
I then ran tophat with the following command:
tophat -p 7 -o /nfshome/fhg2a/nature_maize/tophat_results_base -G ZmB73_5a.59_WGS.gff --no-novel-juncs bowtie_build2/maize_ebtw reads/SRR039501.fastq >tophat_base.log

and got the following error:
[2014-05-06 12:15:54] Building Bowtie index from ZmB73_5a.59_WGS.fa
[FAILED]
Error: Couldn't build bowtie index with err = 1
My question: I don't even have a file called "ZmB73_5a.59_WGS.fa"! Your help is appreciated.

My reference file (ref_maize.fa) looks like this:
head ref_maize.fa
>chr1
GAATTCCAAAGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCTCCTATCATTCTTTCTGATGTTGAAAATGATATTAAG
CCTAGGATTCGTGAATGGGAGAAGGTATTTTTGTTCATGGTAGTCATTGGAACCTGCTAGATTGTACACTTGACAATAAC
ATATATTAATATTAGTGACCCCATTTTTAAATTTCCTAGGCTGGCATTGAACAAGACTATGTTAGTAGGATGTTGTTGAA
GTATCCATGGATTCTTTCAACGAGTGTGATAGAGAACTACAGTCAAATGCTGTTGTTTTTCAACCAAAAAAGGGTAAGTA
AAAAAGAATACTTACTATGCTGTGCCTCAAGTTCATGTTAATTTGTTTGCCGTGTCTTGCCTTCCTTTTGCTGTTGGGAG
TATTCAACTTTTTCCTTTCAGATTTCCAGTACAGTCCTCGCTATTGCTGTGAAAAGTTGGCCTCATATTCTTGGCTCCTC
TTCAAAAAGAATGAATTCAGTTTTGGAGCTGTTTCATGTTCTGGGCATCAGTAAAAAAATGGTGGTTCCAGTCATTACAT
CAAGTCCACAGTTATTACTGAGAAAACCTGATCAGTTTATGCAGGTGTGTAACGATTATTGAGGTTGCATTTATATATTT
AAACTTCATTGGTAATGGATATAAACTATTTTTGGCTGCATATAAGTTTCGAAGCAAATTGGAACCAGAGTTCAGCAAAG

Annotation file:
head ZmB73_5a.59_WGS.gff
9 ensembl chromosome 1 156750706 . . . ID=9;Name=chromosome:AGPv2:9:1:156750706:1
9 ensembl gene 19970 20093 . + . ID=GRMZM2G581216;Name=GRMZM2G581216;biotype=transposable_element
9 ensembl mRNA 19970 20093 . + . ID=GRMZM2G581216_T01;Parent=GRMZM2G581216;Name=GRMZM2G581216_T01;biotype=protein_coding
9 ensembl exon 19970 20093 . + . Parent=GRMZM2G581216_T01;Name=GRMZM2G581216_E01
9 ensembl CDS 19970 20092 . + 0 Parent=GRMZM2G581216_T01;Name=CDS.2
9 ensembl gene 23314 26371 . + . ID=GRMZM2G163722;Name=GRMZM2G163722;biotype=transposable_element
9 ensembl mRNA 23314 26371 . + . ID=GRMZM2G163722_T01;Parent=GRMZM2G163722;Name=GRMZM2G163722_T01;biotype=protein_coding
9 ensembl intron 23496 23939 . + . Parent=GRMZM2G163722_T01;Name=intron.3
9 ensembl intron 24061 24283 . + . Parent=GRMZM2G163722_T01;Name=intron.4
9 ensembl intron 24472 24540 . + . Parent=GRMZM2G163722_T01;Name=intron.5
Old 05-15-2014, 09:59 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

@filmonhg: Do yourself a favor and grab a copy of the maize data from iGenomes (this will work unless your genome is non-standard): https://support.illumina.com/sequenc...e/igenome.ilmn. This way you would have sequence, annotation and indexes that are all coordinated (chromosome names etc.) and will work together.

You are going to run into other issues even if you were to try renaming your GTF file to read maize_ebtw.gtf.

See the following quote from TopHat site.

Quote:
Please note that the values in the first column of the provided GTF/GFF file (column which indicates the chromosome or contig on which the feature is located), must match the name of the reference sequence in the Bowtie index you are using with TopHat. You can get a list of the sequence names in a Bowtie index by typing:

bowtie-inspect --names your_index


So before using a known annotation file with this option please make sure that the 1st column in the annotation file uses the exact same chromosome/contig names (case sensitive) as shown by the bowtie-inspect command above.
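
A small sketch of that check done on the files themselves, assuming the whole FASTA header after '>' is the sequence name (as the bowtie2-inspect -n output earlier in this thread shows it) and that column 1 of the annotation contains no whitespace. The function names here are hypothetical helpers, not part of TopHat or Bowtie.

```python
def fasta_names(lines):
    """Collect sequence names: the full header text after '>'."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            name = line[1:].strip()
            if name:
                names.add(name)
    return names


def annotation_names(lines):
    """Collect column 1 of every feature line in a GTF/GFF file."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and '##' directives
        names.add(line.split()[0])
    return names


def compare(fasta_lines, annotation_lines):
    """Return (names only in the annotation, names only in the FASTA), both sorted."""
    ref = fasta_names(fasta_lines)
    ann = annotation_names(annotation_lines)
    return sorted(ann - ref), sorted(ref - ann)


if __name__ == "__main__":
    fasta = [">chr1\n", "GAATTCCAAAGC\n"]
    gff = ["9\tensembl\tgene\t19970\t20093\t.\t+\t.\tID=GRMZM2G581216\n"]
    only_annotation, only_fasta = compare(fasta, gff)
    print("only in annotation:", only_annotation)
    print("only in FASTA:", only_fasta)
```

With the maize snippets posted above, this would report "9" on the annotation side and "chr1" on the FASTA side, which is the kind of mismatch the TopHat note warns about.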
Old 05-19-2015, 07:17 AM   #12
rodrigo.duarte88
Member
 
Location: London

Join Date: Jan 2015
Posts: 10
Default

Although this is very obvious, it took me a couple of days to realise that I couldn't simply create my own reference file from my genomic region of interest (I was humbly downloading the sequence from the UCSC browser by simply selecting the region of interest and clicking TOOLS - GET DNA haha :P ).

So, just to clarify for newbies like myself: the FASTA file used to create your bowtie index needs to state which chromosome, and which bases on it, the sequence refers to, and this information needs to match the GTF file in both format and coordinates.

P.S.: I gave up trying to minimise the genomic region to which my reads would map. First, it's not straightforward and I think it requires some programming; second, you'll introduce biases in your analysis (reads that shouldn't map there may end up mapping there).

Last edited by rodrigo.duarte88; 05-19-2015 at 07:20 AM.

Tags
bowtie, illumina, mapping, ngs, tophat
