  • Tophat2 with GFF3 annotation fails to produce Bowtie index.

    Hi all.

    I'm trying to map paired-end Illumina reads to a reference genome, with a GFF3 file for annotation info. I compiled the genome sequence from the separate files for each linkage group plus the scaffolds that couldn't be assigned to linkage groups.

    But running tophat2 like so:

    Code:
    tophat -p 8 -G ~/path/to/annotation.gff3 index_name CAA_l1_1.fq.gz CAA_l1_2.fq.gz
    ends up giving me this error:

    Code:
    [2013-11-21 12:19:21] Building transcriptome data files..
    [2013-11-21 12:19:24] Building Bowtie index from annotation.fa
            [FAILED]
    Error: Couldn't build bowtie index with err = 1
    I thought that maybe the names were off but it all looks like it matches.

    Code:
    bowtie2-inspect -n 
    gi|339751252|ref|NC_015762.1| Bombus terrestris linkage group LG B01, Bter_1.0 chromosome, whole genome shotgun sequence
    ...
    Code:
    bowtie2-inspect -s 
    Flags   1
    Reverse flags   5
    Colorspace      0
    2.0-compatible  1
    SA-Sample       1 in 16
    FTab-Chars      10
    Sequence-1      gi|339751252|ref|NC_015762.1| Bombus terrestris linkage group LG B01, Bter_1.0 chromosome, whole genome shotgun sequence        17153651
    ...
    and my GFF3 file looks like so:

    Code:
    #!gff-spec-version 1.20
    #!processor NCBI annotwriter
    ##sequence-region NC_015762.1 1 17153651
    ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=30195
    NC_015762.1     RefSeq  region  1       17153651        .       +       .       ID=id0;Dbxref=taxon:30195;gbkey=Src;genome=chromosome;linkage-group=LG B01;mol_type=genomic DNA;note=haploid drones;sex=male
    NC_015762.1     RefSeq  gene    2279    19877   .       -       .       ID=gene0;Name=LOC100649911;Dbxref=GeneID:100649911;gbkey=Gene;gene=LOC100649911
    ...
    Any suggestions here as to what I'm doing wrong would be most appreciated.

  • #2
    Have you pre-built the genome index for the genome you are searching against?

    This guide is helpful: http://www.nature.com/nprot/journal/....2012.016.html
    Last edited by GenoMax; 11-21-2013, 08:45 AM.



    • #3
      Thanks Geno, I'll take a look at that paper. I did build the genome index ahead of running tophat.

      I have a hunch that it might be because sequence names are all filled with extra info. I'll try cleaning those out and rebuilding the genome index again.



      • #4
        Did you call your genome index "index_name" because that is what your command line is suggesting?

        Can you post the command line you used to build the index?



        • #5
          I called it that in the post because it might be unclear otherwise.

          The index was created by first concatenating all the .fa files, then running:
          #real name of the index
          bowtie2-build Bter_gDNA.fa Bter_gDNA

          BTW, it seems to do the mapping step if you exclude the GFF part. Also, running bowtie alone works fine. It seems to really be an issue with the GFF matching.



          • #6
            Got it.

            Can you validate your GFF file to make sure it is ok: http://www.raetschlab.org/suppl/gff-tools



            • #7
              Hadn't seen those gff-tools. Will take a good look for future reference. Turns out it was an issue with the long names in the fasta file and short names in the GFF3 file. A silly issue really!
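
For anyone hitting the same mismatch, here is a minimal sketch (not necessarily the fix used above) of shortening NCBI-style FASTA headers such as `gi|339751252|ref|NC_015762.1| Bombus terrestris ...` to the bare accession `NC_015762.1`, so that they match column 1 of the GFF3. The function names `short_name` and `rename_headers` are hypothetical, and the header layout `gi|<number>|ref|<accession>|` is an assumption based on the header shown earlier in the thread.

```python
import re

# Assumed NCBI-style header layout: gi|<number>|ref|<accession>| free text
_NCBI_HEADER = re.compile(r"^gi\|\d+\|ref\|([^|\s]+)\|")


def short_name(header):
    """Return the accession from an NCBI-style header, else the first word."""
    match = _NCBI_HEADER.match(header)
    if match:
        return match.group(1)
    parts = header.split()
    return parts[0] if parts else header


def rename_headers(lines):
    """Yield FASTA lines with each '>' header reduced to its short name."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + short_name(line[1:].strip()) + "\n"
        else:
            # Sequence lines pass through unchanged.
            yield line
```

The renamed lines would be written to a new FASTA file, and the genome index would then need to be rebuilt from that file with bowtie2-build before rerunning tophat.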



              • #8
                I recently ran into the same error:
                Code:
                Couldn't build bowtie index with err = 1
                I ran bowtie2-inspect -n but the name exactly matched the name in the 1st column of the GTF. I then ran bowtie2-inspect genome_index > new.fa (without -n) to regenerate the fasta. This fixed it.

                Ps. A diff of the original genome.fa vs new.fa indicated that the white-space was different (I checked previously that there were no spaces after the sequence name, but apparently the new-line character was different). Also, the regenerated fasta had a different number of bases per line and no extra blank line at the end of the file. I'm not sure which of these differences was causing the error.
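
Since it is unclear which of these differences mattered, one option is to normalize all of them at once. The sketch below is an illustration only, not what bowtie2-inspect does internally: it converts line endings to `\n`, strips trailing whitespace, drops blank lines, and rewraps the sequence at a fixed width. The function name `normalize_fasta` and the default width of 60 are assumptions.

```python
def normalize_fasta(text, width=60):
    """Return FASTA text with \\n endings, no blank lines, fixed-width sequence."""
    records = []
    header, chunks = None, []
    # splitlines() treats \n, \r\n and \r as line breaks.
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # drop blank lines, including trailing ones
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            # Any sequence seen before the first header is discarded.
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    out = []
    for header, seq in records:
        out.append(header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n" if out else ""
```

This reads the whole file into memory, so for a large genome a streaming version would be preferable. The index would need to be rebuilt from the normalized file.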



                • #9
                  Good to know bw, thanks.



                  • #10
                    hello ev'one,
                    I also ran into the same problem:
                    Error: Couldn't build bowtie index with err = 1
                    * I created my index from my reference genome "ref_maize.fa", which produced these files:
                    maize_ebtw.1.bt2 maize_ebtw.3.bt2 maize_ebtw.rev.1.bt2
                    maize_ebtw.2.bt2 maize_ebtw.4.bt2 maize_ebtw.rev.2.bt2
                    which are all in one directory "bowtie_build2"
                    I then ran tophat with the following command:
                    tophat -p 7 -o /nfshome/fhg2a/nature_maize/tophat_results_base -G ZmB73_5a.59_WGS.gff --no-novel-juncs bowtie_build2/maize_ebtw reads/SRR039501.fastq >tophat_base.log

                    Now I get the following error:
                    [2014-05-06 12:15:54] Building Bowtie index from ZmB73_5a.59_WGS.fa
                    [FAILED]
                    Error: Couldn't build bowtie index with err = 1
                    MY QUESTION: I DON'T EVEN HAVE A FILE CALLED "ZmB73_5a.59_WGS.fa"? Your help is appreciated.

                    my file looks like this :
                    ref_maize.fa
                    head ref_maize.fa
                    >chr1
                    GAATTCCAAAGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCTCCTATCATTCTTTCTGATGTTGAAAATGATATTAAG
                    CCTAGGATTCGTGAATGGGAGAAGGTATTTTTGTTCATGGTAGTCATTGGAACCTGCTAGATTGTACACTTGACAATAAC
                    ATATATTAATATTAGTGACCCCATTTTTAAATTTCCTAGGCTGGCATTGAACAAGACTATGTTAGTAGGATGTTGTTGAA
                    GTATCCATGGATTCTTTCAACGAGTGTGATAGAGAACTACAGTCAAATGCTGTTGTTTTTCAACCAAAAAAGGGTAAGTA
                    AAAAAGAATACTTACTATGCTGTGCCTCAAGTTCATGTTAATTTGTTTGCCGTGTCTTGCCTTCCTTTTGCTGTTGGGAG
                    TATTCAACTTTTTCCTTTCAGATTTCCAGTACAGTCCTCGCTATTGCTGTGAAAAGTTGGCCTCATATTCTTGGCTCCTC
                    TTCAAAAAGAATGAATTCAGTTTTGGAGCTGTTTCATGTTCTGGGCATCAGTAAAAAAATGGTGGTTCCAGTCATTACAT
                    CAAGTCCACAGTTATTACTGAGAAAACCTGATCAGTTTATGCAGGTGTGTAACGATTATTGAGGTTGCATTTATATATTT
                    AAACTTCATTGGTAATGGATATAAACTATTTTTGGCTGCATATAAGTTTCGAAGCAAATTGGAACCAGAGTTCAGCAAAG

                    ANNOTATION FILE
                    head ZmB73_5a.59_WGS.gff
                    9 ensembl chromosome 1 156750706 . . . ID=9;Name=chromosome:AGPv2:9:1:156750706:1
                    9 ensembl gene 19970 20093 . + . ID=GRMZM2G581216;Name=GRMZM2G581216;biotype=transposable_element
                    9 ensembl mRNA 19970 20093 . + . ID=GRMZM2G581216_T01;Parent=GRMZM2G581216;Name=GRMZM2G581216_T01;biotype=protein_coding
                    9 ensembl exon 19970 20093 . + . Parent=GRMZM2G581216_T01;Name=GRMZM2G581216_E01
                    9 ensembl CDS 19970 20092 . + 0 Parent=GRMZM2G581216_T01;Name=CDS.2
                    9 ensembl gene 23314 26371 . + . ID=GRMZM2G163722;Name=GRMZM2G163722;biotype=transposable_element
                    9 ensembl mRNA 23314 26371 . + . ID=GRMZM2G163722_T01;Parent=GRMZM2G163722;Name=GRMZM2G163722_T01;biotype=protein_coding
                    9 ensembl intron 23496 23939 . + . Parent=GRMZM2G163722_T01;Name=intron.3
                    9 ensembl intron 24061 24283 . + . Parent=GRMZM2G163722_T01;Name=intron.4
                    9 ensembl intron 24472 24540 . + . Parent=GRMZM2G163722_T01;Name=intron.5



                    • #11
                      @filmonhg: Do yourself a favor and grab a copy of maize data from iGenomes (will work unless your genome is non-standard) https://support.illumina.com/sequenc...e/igenome.ilmn. This way you would have sequence, annotation, indexes that are all coordinated (chromosome names etc) and will work together.

                      You are going to run into other issues even if you were to try renaming your GTF file to read maize_ebtw.gtf.

                      See the following quote from TopHat site.

                      Please note that the values in the first column of the provided GTF/GFF file (column which indicates the chromosome or contig on which the feature is located), must match the name of the reference sequence in the Bowtie index you are using with TopHat. You can get a list of the sequence names in a Bowtie index by typing:

                      bowtie-inspect --names your_index


                      So before using a known annotation file with this option please make sure that the 1st column in the annotation file uses the exact same chromosome/contig names (case sensitive) as shown by the bowtie-inspect command above.
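
The check described in the quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `gff_seqids` and `compare_names` are hypothetical helper names, the annotation is assumed to be tab-delimited, and `index_names` is assumed to hold the sequence names exactly as printed by the inspect command, one per line.

```python
def gff_seqids(gff_lines):
    """Collect the distinct values of column 1 from tab-delimited GFF/GTF lines."""
    seqids = set()
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and ## directives
        seqids.add(line.split("\t", 1)[0])
    return seqids


def compare_names(index_names, gff_lines):
    """Return (names only in the GFF, names only in the index), both sorted."""
    index_set = {name.strip() for name in index_names if name.strip()}
    gff_set = gff_seqids(gff_lines)
    return sorted(gff_set - index_set), sorted(index_set - gff_set)
```

For the files posted in #10, the reference head shows `chr1` while the annotation head shows `9` in column 1, so a comparison like this would report `9` as missing from the index. Two empty lists mean the names agree exactly, including case.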



                      • #12
                        Although this is very obvious, it took me a couple of days to realise that I couldn't simply create my own reference file from the genomic region of my interest (I was humbly downloading the sequence from UCSC browser by simply selecting the region of interest and clicking TOOLS - GET DNA haha :P ).

                        So, just to clarify for newbies like myself: the fasta file used to create your bowtie index needs to identify which chromosome, and which bases on it, the sequence refers to, and this information needs to match the GTF file in both naming format and coordinates.

                        P.S.: I gave up on trying to minimise the genomic region to which my reads would map. First, it's not straightforward and I think it requires some programming, and second, you'll introduce biases in your analysis (reads that shouldn't map there may end up mapping there).
                        Last edited by rodrigo.duarte88; 05-19-2015, 07:20 AM.
