  • shawn.mek
    Member
    • Feb 2013
    • 12

    Use Bowtie Index to get sequences using locations

    We have the FASTA files for the hg19 genome (obviously), and we used them to build a large Bowtie index.

    I was hoping not to have to keep the FASTA files, and instead just look up sequences in the Bowtie index when I get chromosome locations.

    I know that when an alignment comes back, it tells me where the alignment occurs and which FASTA record (header) it came from. So all the information is there, but I can't figure out how to pull out a sequence given a location.

    Does anyone know if this is possible, or know much about the index format (perhaps I could write a little program to fish out a sequence)?


    Thanks
  • winsettz
    Member
    • Sep 2012
    • 91

    #2
    You should be able to extract that information from the sam output. I've not used bowtie2-inspect before, but it could be what you are looking for.

    Code:
    bowtie2-inspect
    No index name given!
    Bowtie 2 version 2.1.0 by Ben Langmead ([email protected], www.cs.jhu.edu/~langmea)
    Usage: bowtie2-inspect [options]* <bt2_base>
      <bt2_base>         bt2 filename minus trailing .1.bt2/.2.bt2
    
      By default, prints FASTA records of the indexed nucleotide sequences to
      standard out.  With -n, just prints names.  With -s, just prints a summary of
      the index parameters and sequences.  With -e, preserves colors if applicable.
    
    Options:
      -a/--across <int>  Number of characters across in FASTA output (default: 60)
      -n/--names         Print reference sequence names only
      -s/--summary       Print summary incl. ref names, lengths, index properties
      -e/--bt2-ref      Reconstruct reference from .bt2 (slow, preserves colors)
      -v/--verbose       Verbose output (for debugging)
      -h/--help          print detailed description of tool and its options
      --help             print this usage message


    • shawn.mek
      Member
      • Feb 2013
      • 12

      #3
      Just to clarify, I mean using the index - giving it a chromosome name (fasta header) and location numbers, and getting back a sequence.

      I don't want to run an alignment, just pull out the sequence. So no SAM output.

      For this I'm using bowtie, not bowtie2. But if bowtie2 can do this...

      Thanks


      • shawn.mek
        Member
        • Feb 2013
        • 12

        #4
        bowtie-inspect does get all the info out, but that's 3 GB of output, since I can't select a location.
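
        One way around the full dump is to stream the FASTA output and keep only the bases inside the requested window. Below is a minimal Python sketch, assuming the records arrive as an iterable of text lines (for example the stdout of a bowtie-inspect subprocess). The function name `extract_region` and the 0-based, half-open coordinates are choices made for this illustration, not part of bowtie. The index still has to be decoded up to the requested region, so this saves disk space rather than time.

```python
def extract_region(fasta_lines, name, start, end):
    """Return bases [start, end) of record `name` from a stream of FASTA lines.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    The record name is matched against the first word of the header line.
    """
    if end <= start:
        return ""
    pieces = []
    in_record = False
    offset = 0  # bases of the target record consumed so far
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if in_record:
                break  # the target record has ended; stop reading
            fields = line[1:].split()
            in_record = bool(fields) and fields[0] == name
            offset = 0
            continue
        if not in_record:
            continue
        line_start, line_end = offset, offset + len(line)
        offset = line_end
        if line_end <= start:
            continue  # this line lies entirely before the window
        # Keep only the part of this line that overlaps the window.
        pieces.append(line[max(start - line_start, 0):end - line_start])
        if line_end >= end:
            break  # window fully covered
    return "".join(pieces)


if __name__ == "__main__":
    demo = [">chr1 test", "ACGTAC", "GTTTGA", ">chr2", "CCCCGG"]
    print(extract_region(demo, "chr1", 4, 9))
```

        With a real index, the lines could come from something like `subprocess.Popen(["bowtie-inspect", index_base], stdout=subprocess.PIPE, text=True).stdout`, where `index_base` is a placeholder for your own index prefix.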


        • lh3
          Senior Member
          • Feb 2008
          • 686

          #5
          Although bowtie index essentially keeps the genome, I doubt it is optimized or designed for your purpose. Use faidx if you only want to retrieve a few regions.
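
          For anyone curious why an indexed FASTA makes single-region lookups cheap: the index stores, per sequence, its length, the byte offset of its first base, and the line geometry, so a region can be read with one seek. The Python sketch below imitates that idea. It is not samtools code, the function names `build_index` and `fetch` are made up for illustration, and it assumes every line of a record (except the last) has the same length, with 0-based, half-open coordinates.

```python
import os
import tempfile


def build_index(fasta_path):
    """Scan the FASTA once; map name -> (length, offset, line_bases, line_bytes)."""
    index = {}
    name = None
    with open(fasta_path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # handle.tell() is now the byte offset of the first base.
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:  # first sequence line defines the geometry
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return {key: tuple(value) for key, value in index.items()}


def fetch(fasta_path, index, name, start, end):
    """Return bases [start, end) of `name` by seeking instead of scanning."""
    length, offset, line_bases, line_bytes = index[name]
    start, end = max(start, 0), min(end, length)
    if start >= end:
        return ""
    # Convert base coordinates to byte positions, skipping newline bytes.
    first = offset + (start // line_bases) * line_bytes + start % line_bases
    last = offset + (end // line_bases) * line_bytes + end % line_bases
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
        tmp.write(b">chr1 demo\nACGTAC\nGTTTGA\n>chr2\nCCCCGG\n")
    try:
        idx = build_index(tmp.name)
        print(fetch(tmp.name, idx, "chr1", 4, 9))
    finally:
        os.remove(tmp.name)
```

          Each lookup still costs a file open, a seek, and a read, which is fine for a handful of regions but adds up over many queries.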


          • shawn.mek
            Member
            • Feb 2013
            • 12

            #6
            I want to retrieve lots of regions efficiently, but thanks for pointing me to faidx, I'll see how it works.


            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              If you really have a LOT of positions, then it's best to read the genome into memory. samtools faidx is great for a smallish number of sites, but it grabs the sequence from disk, making it a bit slow for a large number of queries.
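
              A minimal Python sketch of the in-memory approach, assuming the genome is available as FASTA lines and that regions are given as (name, start, end) tuples with 0-based, half-open coordinates. The names `load_fasta` and `fetch_many` are hypothetical. A plain Python string uses roughly one byte per base, so expect memory use on the order of the FASTA file size.

```python
def load_fasta(lines):
    """Read FASTA lines into a dict mapping record name -> sequence string."""
    genome = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                genome[name] = "".join(chunks)
            name = line[1:].split()[0]  # first word of the header
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        genome[name] = "".join(chunks)
    return genome


def fetch_many(genome, regions):
    """Slice each (name, start, end) region; 0-based, half-open coordinates."""
    return [genome[name][start:end] for name, start, end in regions]


if __name__ == "__main__":
    demo = [">chr1 test", "ACGTAC", "GTTTGA", ">chr2", "CCCCGG"]
    genome = load_fasta(demo)
    print(fetch_many(genome, [("chr1", 4, 9), ("chr2", 0, 4)]))
```

              In practice `lines` would be an open file handle, e.g. `with open(path) as handle: genome = load_fasta(handle)`. After the one-off load, each query is just a string slice.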


              • shawn.mek
                Member
                • Feb 2013
                • 12

                #8
                Yeah, I'm torn on whether to hold it in memory or not. I'll toy with different workflows.


                • gringer
                  David Eccles (gringer)
                  • May 2011
                  • 845

                  #9
                  Originally posted by lh3
                  Although bowtie index essentially keeps the genome, I doubt it is optimized or designed for your purpose.

                  The bowtie index is optimised for searching, but it is overkill (and inefficient) for retrieving subsequences. If you want compressed, indexed storage purely for DNA sequence retrieval, then the 2bit format is probably best:



                  The code points to a way to retrieve ranges:


                  Code:
                  /* Parse a .2bit file and sequence spec into an object.
                   * The spec is a string in the form:
                   *
                   *    file/path/input.2bit[:seqSpec1][,seqSpec2,...]
                   *
                   * where seqSpec is either
                   *     seqName
                   *  or
                   *     seqName:start-end
                   */
                  So there's probably a program somewhere for getting subsequences out of that file using seqName:start-end notation.
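
                  To make the notation concrete, here is a small Python sketch that splits a spec string of that form into a file path and a list of sequence requests. It is not the UCSC implementation. The names `SeqSpec` and `parse_two_bit_spec` are invented for illustration, and it assumes the file path itself contains no colon. It only parses the numbers and does not interpret whether coordinates are 0-based or 1-based.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class SeqSpec(NamedTuple):
    name: str
    start: Optional[int]  # None means the whole sequence was requested
    end: Optional[int]


# seqName, optionally followed by :start-end
_SEQ_SPEC = re.compile(r"^(?P<name>.+?)(?::(?P<start>\d+)-(?P<end>\d+))?$")


def parse_two_bit_spec(spec: str) -> Tuple[str, List[SeqSpec]]:
    """Split 'path.2bit[:seqSpec1][,seqSpec2,...]' into (path, [SeqSpec, ...])."""
    path, _, rest = spec.partition(":")
    specs = []
    for item in filter(None, rest.split(",")):
        match = _SEQ_SPEC.match(item)
        if match is None:
            raise ValueError(f"cannot parse sequence spec: {item!r}")
        if match["start"] is None:
            specs.append(SeqSpec(match["name"], None, None))
            continue
        start, end = int(match["start"]), int(match["end"])
        if start > end:
            raise ValueError(f"start is after end in: {item!r}")
        specs.append(SeqSpec(match["name"], start, end))
    return path, specs


if __name__ == "__main__":
    print(parse_two_bit_spec("hg19.2bit:chr1:100-200,chrM"))
```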

                  Edit: indeed, BLAT includes such functions. See here for some discussion of 2bit retrieval using Perl:

                  Last edited by gringer; 10-31-2013, 03:43 PM.


                  • ctseto
                    Member
                    • Oct 2013
                    • 44

                    #10
                    If you just need to retrieve known regions.


                    • tatianaorli
                      Junior Member
                      • May 2018
                      • 4

                      #11
                      Originally posted by winsettz

                      You should be able to extract that information from the sam output. I've not used bowtie2-inspect before, but it could be what you are looking for.

                      This was extremely useful to me and my lab mates today, thank you so much for this comment back in 2013! I had a bowtie2 index file from a collaborator and extracted the original FASTA sequences from it using that program from the bowtie2 package. I would never have thought of doing it if I had not seen your answer here.
