SEQanswers

Old 02-24-2010, 01:19 AM   #1
kevlim83
Junior Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 9
Default bowtie reference genome index: help required

Dear all,

We are having trouble indexing our reference genome with bowtie-build, because the reference is larger than 4 billion characters. According to the manual, this is not possible. Is there a solution that does not require modifying the source code?

We would prefer to treat source code modification as a last resort. Even so, we would appreciate any insight into how the source could be modified to handle a 6-billion-character genome.

Regards,
Kevin
Old 02-24-2010, 09:00 AM   #2
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

I am guessing it has something to do with 32-bit integers, and so you would have to change the index source code to store 64-bit integers, which would double the index size instantly.
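To make the 32-bit guess concrete, here is a minimal Python sketch of the arithmetic. It is not taken from the bowtie source; the function names are made up for illustration, and it only assumes that offsets into the reference are stored as unsigned 32-bit integers.

```python
# Largest value an unsigned 32-bit integer can hold.
UINT32_MAX = 2**32 - 1  # 4,294,967,295


def fits_in_32_bit_index(reference_length: int) -> bool:
    """Return True if every offset into the reference fits in 32 bits."""
    return 0 <= reference_length <= UINT32_MAX


def offset_table_bytes(n_offsets: int, bytes_per_offset: int) -> int:
    """Size in bytes of a table storing n_offsets integers of a given width."""
    return n_offsets * bytes_per_offset


if __name__ == "__main__":
    # A human-sized reference fits; a 6-billion-character one does not.
    for length in (3_100_000_000, 6_000_000_000):
        print(length, fits_in_32_bit_index(length))
    # Moving from 4-byte to 8-byte offsets doubles any table of offsets.
    print(offset_table_bytes(1_000_000, 8) // offset_table_bytes(1_000_000, 4))
```

This treats 2^32-1 as an upper bound only; the practical ceiling may well be lower.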

Could you split your reference and align to each separately and merge the results? This is not as faithful to the bowtie algorithm but seems like a practical solution.
Old 02-24-2010, 06:04 PM   #3
kevlim83
Junior Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 9
Default

Hi,

Thanks for the reply.

Can anyone point me to where in the source the pointers I would need to change are located?

Regards,
Kevin
Old 02-26-2010, 09:32 AM   #4
sperry
Junior Member
 
Location: Nova Scotia

Join Date: Feb 2010
Posts: 7
Default

Hi Kevin,

Trying to update the source code could be more trouble than it is worth. If it were simply a matter of changing a few pointers, the author would likely have done that rather than adding this disclaimer to the manual:

Quote:
Because bowtie-build uses 32-bit pointers internally, it can handle up to a theoretical maximum of 2^32-1 (somewhat more than 4 billion) characters in an index, though, with other constraints, the actual ceiling is somewhat less than that. If your reference exceeds 2^32-1 characters, bowtie-build will print an error message and abort. To resolve this, divide your reference sequences into smaller batches and/or chunks and build a separate index for each.

If your computer has more than 3-4 GB of memory and you would like to exploit that fact to make index building faster, use a 64-bit version of the bowtie-build binary. The 32-bit version of the binary is restricted to using less than 4 GB of memory. If a 64-bit pre-built binary does not yet exist for your platform on the sourceforge download site, you will need to build one from source.
Have you tried any of the other aligners? I have had good experiences with BWA, although I haven't tried it with a 6 billion base reference sequence.

If you are committed to Bowtie, splitting your reference sequence into two files will get you up and running, as others have pointed out.
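For the splitting step, something like the following Python sketch could work. It is only an illustration: `read_fasta` and `batch_records` are hypothetical helper names, the parser is deliberately simple, and `max_chars` should be set comfortably below the limit quoted from the manual. It packs whole sequences into batches, so a single sequence longer than `max_chars` is reported as an error rather than cut.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def batch_records(records: Iterable[Record], max_chars: int) -> List[List[Record]]:
    """Greedily pack whole sequences into batches of at most max_chars."""
    batches, current, used = [], [], 0
    for name, seq in records:
        if len(seq) > max_chars:
            raise ValueError(f"{name} alone exceeds {max_chars} characters")
        # Start a new batch when adding this sequence would overflow it.
        if current and used + len(seq) > max_chars:
            batches.append(current)
            current, used = [], 0
        current.append((name, seq))
        used += len(seq)
    if current:
        batches.append(current)
    return batches
```

Each batch would then be written to its own FASTA file and indexed separately. The sketch holds all sequences in memory, so for a multi-gigabase reference a streaming variant would be more practical.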
Old 02-28-2010, 05:48 PM   #5
kevlim83
Junior Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 9
Default

Yes, we agree that modifying the source code is a cumbersome task.

However, we want to do so because we need bowtie to report reads that align uniquely to a given reference genome, using the "-m 1 --best --strata" parameters. If we split the reference genome in two, we are essentially running bowtie once per split. Even if we had a correct way to merge the two result sets into one set of unique alignments, this is not the same as running the same parameters on the combined reference, because we are looking for unique alignments at the "best strata" level. With a split reference, bowtie reports alignments that are "best strata" unique only within each subset.

Hence we are left with the last resort, which is to modify the source code.

Any form of help is truly appreciated here. Thanks.

Regards,
Kevin

Old 02-28-2010, 05:59 PM   #6
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

What about using a different aligner?
Old 03-01-2010, 06:58 AM   #7
sperry
Junior Member
 
Location: Nova Scotia

Join Date: Feb 2010
Posts: 7
Default

Hi Kevin,

Take a look at the ebwt.h file in the bowtie source distribution. This file outlines the ebwt-related classes. Searching for 'int', 'uint32_t', and 'int32_t' should give you an idea of where you can start to modify the code.

You might also find it useful to compile bowtie using the '-ggdb' flag, and then try invoking bowtie-build with your large reference sequence within gdb to see exactly where things are breaking down.

-Scott


Last edited by sperry; 03-01-2010 at 07:21 AM.
Old 01-31-2014, 07:53 AM   #8
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

An old thread, but I am currently in a similar situation. I have a polyploid genome of >10 Gb that I have to work with. Does anybody have any recommendations on altering bowtie for this?

Alternatively, are there any good strategies for post-processing data aligned to individual chunks to achieve the same result?
Old 01-31-2014, 10:47 AM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I think BWA can handle larger genomes; that'd be the easiest solution.

BTW, you can split a genome, map all the reads to each of the chunks with bowtie2, and then post-process the output so that it is equivalent to what you would have gotten by aligning to the genome as a whole with bowtie2, but it's not completely trivial. This is effectively how bisulfite-seq aligners work (see the source code for Bison if you really want to see how to do this).
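A rough Python sketch of that post-processing idea follows. It is not how Bison does it, just an illustration under stated assumptions: the `Hit` record and `unique_best_hits` are hypothetical, parsing of the aligner output is left out, and each chunk must have been aligned in a mode that reports at least two alignments per read together with a score or mismatch count (with Bowtie 1, for example, `-k 2 --best --strata` rather than `-m 1`). Here `stratum` is any integer where lower is better.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    read_id: str
    ref_name: str
    position: int
    strand: str
    stratum: int  # lower is better, e.g. number of mismatches


def unique_best_hits(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep reads with exactly one alignment in their best stratum overall."""
    by_read: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        by_read[hit.read_id].append(hit)

    unique: Dict[str, Hit] = {}
    for read_id, read_hits in by_read.items():
        best = min(h.stratum for h in read_hits)
        # Use a set so the same locus reported twice is counted once.
        top = {h for h in read_hits if h.stratum == best}
        if len(top) == 1:
            unique[read_id] = next(iter(top))
    return unique
```

Pooling the hits from all chunks before picking the best stratum is what keeps a read from being called unique just because its competing alignment sits in a different chunk.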
Old 01-31-2014, 10:52 AM   #10
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

This is for bisulphite sequencing. The problem is that my lab uses a specific pipeline for our analysis, and we work closely with its developers. Bowtie is a standard part of that protocol, and I have already used this pipeline to analyze A LOT of data; this is the first time I have run into problems. I would really like to avoid using any other aligner, because the effort of getting results identical to Bowtie's would be a headache in itself.

That being said, I think I have successfully modified bowtie-build. Whether or not this works I can't say until it's finished and I have had a chance to align some data, but it seems to be working.
Old 11-27-2014, 01:24 PM   #11
Timothy Amos
Junior Member
 
Location: Sydney

Join Date: Aug 2014
Posts: 4
Default

Quote:
Originally Posted by kevlim83 View Post
We are having trouble indexing our reference genome with bowtie-build, because the reference is larger than 4 billion characters. According to the manual, this is not possible.
I know this is a four-year-old question, but the Bowtie 2 manual says it can now deal with this (the current version is Bowtie2 2.2.4):

Quote:
Small and large indexes

bowtie2-build can index reference genomes of any size. For genomes less than about 4 billion nucleotides in length, bowtie2-build builds a "small" index using 32-bit numbers in various parts of the index. When the genome is longer, bowtie2-build builds a "large" index using 64-bit numbers. Small indexes are stored in files with the .bt2 extension, and large indexes are stored in files with the .bt2l extension. The user need not worry about whether a particular index is small or large; the wrapper scripts will automatically build and use the appropriate index.
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
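If you want to check which kind of index you ended up with, a small Python sketch like this could do it. It relies only on the .bt2 and .bt2l extensions from the quoted manual text; the function name is made up, and it assumes the first index file is named `<basename>.1.bt2` or `<basename>.1.bt2l`.

```python
from typing import Iterable, Optional


def bowtie2_index_kind(basename: str, filenames: Iterable[str]) -> Optional[str]:
    """Return 'small', 'large', or None depending on which index file is present."""
    names = set(filenames)
    if f"{basename}.1.bt2" in names:
        return "small"  # built with 32-bit numbers
    if f"{basename}.1.bt2l" in names:
        return "large"  # built with 64-bit numbers
    return None
```

In practice you would pass it a directory listing, for example `bowtie2_index_kind("genome", os.listdir(index_dir))`, where `index_dir` is wherever the index was written.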
Old 12-09-2014, 06:36 PM   #12
zillur
Senior Member
 
Location: Puerto Rico

Join Date: Sep 2014
Posts: 106
Default

Hi,
I need to map reads to the yeast genome using bowtie2. Where can I download the genome? I have found these:
http://www.ebi.ac.uk/ena/data/search?query=yeast
http://downloads.yeastgenome.org/seq...nome_releases/
http://www.yeastgenome.org/download-data/sequence
http://www.yeastgenome.org/strain/S288C/overview

Where can I get the reference genome?

Best Regards
Zillur