  • bowtie reference genome index: help required

    Dear all,

    We are having trouble indexing our reference genome with bowtie-build, as our reference is larger than 4 billion characters. According to the manual, this is not possible. Is there a solution that does not require modifying the source code?

    We would like to treat source code modification as a last resort. Even so, we would appreciate any insight into how the source code could be modified to handle a 6-billion-character genome.

    Regards,
    Kevin

  • #2
    I am guessing it has something to do with 32-bit integers, and so you would have to change the index source code to store 64-bit integers, which would double the index size instantly.

    Could you split your reference and align to each separately and merge the results? This is not as faithful to the bowtie algorithm but seems like a practical solution.
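
The splitting step suggested above could be sketched roughly as follows. This is only an illustration, not part of bowtie: the function name `batch_sequences` and the input format (a list of sequence names with their lengths) are assumptions, and the default limit is the theoretical 2^32-1 ceiling, so a lower value may be safer in practice.

```python
# Theoretical ceiling for a 32-bit index; the usable limit may be lower.
MAX_INDEX_CHARS = 2**32 - 1


def batch_sequences(lengths, limit=MAX_INDEX_CHARS):
    """Greedily group (name, length) pairs into batches.

    Each batch keeps its total length at or below `limit`, so that a
    separate index could be built for every batch.
    """
    batches, current, total = [], [], 0
    for name, length in lengths:
        if length > limit:
            # A single sequence this long would have to be cut into chunks.
            raise ValueError(f"{name} alone exceeds the limit")
        if current and total + length > limit:
            # Close the current batch and start a new one.
            batches.append(current)
            current, total = [], 0
        current.append(name)
        total += length
    if current:
        batches.append(current)
    return batches
```

Each returned batch would then be written to its own FASTA file and indexed separately.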

    • #3
      Hi,

      Thanks for the reply.

      Can anyone guide me as to where the pointers I need to change are located?

      Regards,
      Kevin

      • #4
        Hi Kevin,

        Trying to update the source code could be more trouble than it is worth. If it were simply a matter of changing a few pointers, the author likely would have done that rather than adding this disclaimer to the manual:

        "Because bowtie-build uses 32-bit pointers internally, it can handle up to a theoretical maximum of 2^32-1 (somewhat more than 4 billion) characters in an index, though, with other constraints, the actual ceiling is somewhat less than that. If your reference exceeds 2^32-1 characters, bowtie-build will print an error message and abort. To resolve this, divide your reference sequences into smaller batches and/or chunks and build a separate index for each.

        If your computer has more than 3-4 GB of memory and you would like to exploit that fact to make index building faster, use a 64-bit version of the bowtie-build binary. The 32-bit version of the binary is restricted to using less than 4 GB of memory. If a 64-bit pre-built binary does not yet exist for your platform on the sourceforge download site, you will need to build one from source."

        Have you tried any of the other aligners? I have had good experiences with BWA, although I haven't tried it with a 6 billion base reference sequence.

        If you are committed to Bowtie, splitting your reference sequence into two files will get you up and running, as others have pointed out.
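
To check whether a reference is over the 2^32-1 ceiling before running bowtie-build, a minimal sketch along these lines may help. The helper names are hypothetical, and the count covers only sequence characters (header lines are skipped); since the real ceiling is reported to be somewhat lower, a reference close to the limit may still fail.

```python
import io

# Theoretical maximum number of characters in a 32-bit bowtie index.
MAX_INDEX_CHARS = 2**32 - 1


def reference_length(fasta_handle):
    """Sum the sequence characters in FASTA text, skipping '>' header lines."""
    total = 0
    for line in fasta_handle:
        if not line.startswith(">"):
            total += len(line.strip())
    return total


def fits_in_32bit_index(n_chars, limit=MAX_INDEX_CHARS):
    """Return True if n_chars does not exceed the theoretical ceiling."""
    return n_chars <= limit


# Tiny in-memory example; a real reference would be opened with open(path).
demo = io.StringIO(">chr1\nACGT\nAC\n>chr2\nGG\n")
demo_length = reference_length(demo)
```

A 6-billion-character reference fails this check, while a 4-billion-character one passes it only in theory.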

        • #5
          Yes, we agree that modifying the source code is a cumbersome task.

          However, we want bowtie to find reads that align uniquely to a given reference genome using the "-m 1 --best --strata" parameters. If we split the reference genome in two, we are essentially running bowtie once per split. Even if we had a correct way to merge these result sets to obtain the unique alignments, this would not be the same as running the same parameters on a combined reference, because we are looking for unique alignments at the "best strata" level. With a split reference, bowtie reports alignments that are "best strata" unique only within each subset.

          Hence, we are left with the last resort, which is to modify the source code.

          Any form of help is truly appreciated here. Thanks.

          Regards,
          Kevin

          • #6
            What about using a different aligner?

            • #7
              Hi Kevin,

              Take a look at the ebwt.h file in the bowtie source distribution. This file outlines the ebwt-related classes. Searching for 'int', 'uint32_t', and 'int32_t' should give you an idea of where you can start to modify the code.
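
The search described above can be done with grep or an editor; as a rough sketch, the same idea in Python looks like this. The function name is hypothetical, and the snippet only flags lines that mention the integer types. It does not tell you which of them actually need to become 64-bit.

```python
import re

# Whole-word matches for the integer types mentioned above.
INT_TYPE_PATTERN = re.compile(r"\b(?:uint32_t|int32_t|int)\b")


def find_int_declarations(source_text):
    """Return (line_number, stripped_line) for lines that mention an integer type."""
    hits = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        if INT_TYPE_PATTERN.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Made-up C++ lines for illustration; in practice, read the text of ebwt.h.
sample = 'uint32_t len;\nfloat x;\nint n;\nprintf("hi");\n'
sample_hits = find_int_declarations(sample)
```

Because the pattern uses word boundaries, identifiers such as `printf` that merely contain "int" are not reported.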

              You might also find it useful to compile bowtie using the '-ggdb' flag, and then try invoking bowtie-build with your large reference sequence within gdb to see exactly where things are breaking down.

              -Scott

              Last edited by sperry; 03-01-2010, 08:21 AM.

              • #8
                An old thread, but I am currently in a similar situation. I have a polyploid genome of >10 Gb that I have to work with. Does anybody have any recommendations on altering bowtie for this?

                Alternatively, any good strategies at post-processing data aligned to individual chunks to achieve the same result?

                • #9
                  I think BWA can handle larger genomes, so that would be the easiest solution.

                  BTW, you can split a genome, map all the reads to each of the chunks with bowtie2, and then post-process the output to get results equivalent to aligning against the whole genome with bowtie2, but it's not completely trivial. This is effectively how bisulfite-seq aligners work (see the source code for Bison if you really want to see how to do this).
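
As a rough illustration of the post-processing idea (this is not Bison's code), the merge for "unique in the best stratum" could look like the sketch below. It assumes each chunk was aligned so that all of a read's best-stratum hits were reported, and that every alignment has been reduced to a hypothetical tuple `(read_id, chunk, position, stratum)`, where a lower stratum (for example, fewer mismatches) is better.

```python
from collections import defaultdict


def unique_best_stratum(alignments):
    """Keep reads with exactly one alignment in their best stratum overall.

    `alignments` holds (read_id, chunk, position, stratum) tuples pooled
    from all chunk-wise runs.
    """
    by_read = defaultdict(list)
    for aln in alignments:
        by_read[aln[0]].append(aln)

    unique = {}
    for read_id, alns in by_read.items():
        best = min(a[3] for a in alns)              # best stratum across all chunks
        top = [a for a in alns if a[3] == best]     # every hit in that stratum
        if len(top) == 1:                           # unique genome-wide
            unique[read_id] = top[0]
    return unique


pooled = [
    ("r1", "chunkA", 100, 0),  # best hit for r1
    ("r1", "chunkB", 5, 1),    # worse stratum, ignored
    ("r2", "chunkA", 7, 1),    # r2 ties across chunks, so it is dropped
    ("r2", "chunkB", 9, 1),
    ("r3", "chunkB", 42, 2),   # only hit for r3
]
kept = unique_best_stratum(pooled)
```

Note that running each chunk with "-m 1" would hide reads that are repetitive within a chunk, so the per-chunk runs need to report those hits for this merge to be valid.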

                  • #10
                    This is for bisulphite sequencing. The problem is that my lab uses a specific pipeline for our analysis, and we work closely with its developers. Bowtie is a standard part of that protocol, and I have already used this pipeline to analyze A LOT of data; this is the first time I have run into problems. I would really like to avoid using any other aligner, because reproducing results identical to Bowtie's would be a headache in itself.

                    That being said, I think I have successfully modified bowtie-build. Whether or not this works I can't say until it's finished and I have had a chance to align some data, but it seems to be working.

                    • #11
                      Originally posted by kevlim83
                      We are facing some problems indexing our reference genome with bowtie-index, as our reference size is greater than 4billion characters. According to the manual, this is not possible.

                      I know this is a four-year-old question, but the Bowtie 2 manual says it can now deal with this (the current version is Bowtie2 2.2.4):

                      Small and large indexes

                      bowtie2-build can index reference genomes of any size. For genomes less than about 4 billion nucleotides in length, bowtie2-build builds a "small" index using 32-bit numbers in various parts of the index. When the genome is longer, bowtie2-build builds a "large" index using 64-bit numbers. Small indexes are stored in files with the .bt2 extension, and large indexes are stored in files with the .bt2l extension. The user need not worry about whether a particular index is small or large; the wrapper scripts will automatically build and use the appropriate index.
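
If you ever need to tell the two index types apart in a script, the extensions quoted above are enough. A minimal sketch follows, with a hypothetical helper name and made-up file names:

```python
from pathlib import Path


def index_kind(filename):
    """Classify a Bowtie 2 index file as 'small' or 'large' by its extension."""
    suffix = Path(filename).suffix
    if suffix == ".bt2":
        return "small"   # built with 32-bit numbers
    if suffix == ".bt2l":
        return "large"   # built with 64-bit numbers
    return None          # not a Bowtie 2 index file


# Example file names are illustrative only.
kinds = [index_kind(name) for name in ("genome.1.bt2", "genome.1.bt2l", "genome.fa")]
```

As the manual excerpt says, the wrapper scripts pick the right index automatically, so this is only useful for bookkeeping.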

                      • #12
                        Hi,
                        I have to map to the yeast genome using bowtie2. Where can I download the genome from?

                        The Saccharomyces Genome Database (SGD) provides comprehensive integrated biological information for the budding yeast Saccharomyces cerevisiae.

                        Where can I find the reference genome?

                        Best Regards
                        Zillur
