  • Extracting Reference Sequence from a bam File

    I'm using the bamtools API in a program that needs to extract and print the reference sequence from a .bam file, but I can't figure out how to get at the reference sequence. Is there any way to obtain it directly from the BAM file, or is there a better way to do this?

    I've searched but haven't come up with anything.

    Thanks for any help,
    Andy

  • #2
    You can't.

    If you read the SAM/BAM file format definition, you'll see that these files don't actually contain the reference sequences. All the BAM header contains is the SAM header (an optional chunk of embedded text), the number of references, and their names and lengths.

    You would normally have a FASTA file to accompany the SAM/BAM file.
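
To illustrate what the header does hold, here is a minimal sketch that lists the reference names and lengths using only the Python standard library. It assumes the binary header layout described in the SAM/BAM specification (magic bytes, header text, reference count, then one name/length record per reference); the function name `read_bam_references` is made up for this example and is not part of bamtools.

```python
import gzip
import struct


def read_bam_references(fileobj):
    """Return [(name, length), ...] from the header of a BAM stream."""
    # BAM is BGZF-compressed, i.e. a series of concatenated gzip members,
    # which the gzip module decompresses transparently.
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as bam:
        if bam.read(4) != b"BAM\x01":
            raise ValueError("not a BAM file")
        (l_text,) = struct.unpack("<I", bam.read(4))
        bam.read(l_text)  # skip the embedded SAM header text
        (n_ref,) = struct.unpack("<I", bam.read(4))
        refs = []
        for _ in range(n_ref):
            (l_name,) = struct.unpack("<I", bam.read(4))
            # Names are stored NUL-terminated; l_name includes the NUL.
            name = bam.read(l_name).rstrip(b"\x00").decode("ascii")
            (l_ref,) = struct.unpack("<I", bam.read(4))
            refs.append((name, l_ref))
        return refs
```

With a real file this would be called as `read_bam_references(open("aln.bam", "rb"))`, where `aln.bam` is a placeholder path. Note that only names and lengths come back; the bases themselves are not there to read.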

    • #3
      Should've seen that, thanks.

      Do you have any advice on how to rapidly extract a subsequence from a fasta file using an index file?
      Last edited by andy11; 12-13-2010, 06:18 AM.

      • #4
        I've never understood the decision not to (optionally) bundle the reference into the BAM/SAM file. It seems that for most second-gen datasets the space the reference would take up would be trivial, while the headaches avoided by not having to chase down the proper reference would be large. But, alas, I was not consulted. :-)

        Getting back to andy11's question, "how to rapidly extract a subsequence from a fasta file using an index file?", we will need a bit more information. Which program created the index file? I am assuming that you are not talking about the BAM/SAM index, since that has nothing to do with the reference file.
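
For one common case, here is a rough sketch of random access into a FASTA file. It assumes the index is a samtools `faidx`-style `.fai` file (tab-separated columns: name, length, byte offset of the first base, bases per line, bytes per line) and that coordinates are 0-based, half-open. The helper names `load_fai` and `fetch` are invented for this example; an index from another program would need different parsing.

```python
def load_fai(fai_path):
    """Parse a .fai index into {name: (length, offset, linebases, linewidth)}."""
    index = {}
    with open(fai_path) as fh:
        for line in fh:
            name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
            index[name] = (int(length), int(offset), int(linebases), int(linewidth))
    return index


def fetch(fasta_path, index, name, start, end):
    """Return bases [start, end) of sequence `name` (0-based, half-open)."""
    length, offset, linebases, linewidth = index[name]
    if not 0 <= start <= end <= length:
        raise ValueError("interval out of range")
    if start == end:
        return ""

    def file_pos(i):
        # Byte position of base i: skip whole lines, then move within the line.
        return offset + (i // linebases) * linewidth + i % linebases

    with open(fasta_path, "rb") as fa:
        fa.seek(file_pos(start))
        raw = fa.read(file_pos(end - 1) + 1 - file_pos(start))
    # The slice may span line breaks, so strip the newline bytes.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

The point of the index is that only the requested bytes are read, so extraction time does not depend on the size of the FASTA file.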

        • #5
          Originally posted by westerman
          I've never understood the decision not to (optionally) bundle the reference into the BAM/SAM file. It seems that for most second-gen datasets the space the reference would take up would be trivial, while the headaches avoided by not having to chase down the proper reference would be large. But, alas, I was not consulted. :-)
          I agree with you that optionally including the reference sequence could be very useful, especially for non-model organisms. The SAM/BAM format is clearly designed more for mapping and resequencing than for de novo assembly.

          • #6
            Many alignment files are far smaller than the compressed human reference genome (e.g. most of the 1000 Genomes alignments).

            Yes, for users it is always good to have more options, but for developers every extra option is a burden that has to be evaluated carefully; it would also have delayed the adoption of SAM.

            • #7
              Originally posted by andy11
              Should've seen that, thanks.

              Do you have any advice on how to rapidly extract a subsequence from a fasta file using an index file?
              I think USEQ has a utility that takes a BED file of coordinates and extracts the corresponding sequences from a larger FASTA file.
              --
              bioinfosm
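
The BED-driven approach can be sketched in a few lines of plain Python. This is not USEQ's code, just an illustration of the idea under some assumptions: the FASTA is small enough to hold in memory, BED intervals are 0-based and half-open, and the function names `read_fasta` and `extract_bed` are made up for the example.

```python
def read_fasta(lines):
    """Build {name: sequence} from an iterable of FASTA lines."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Use the first word of the header line as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def extract_bed(seqs, bed_lines):
    """Yield (label, subsequence) for each BED interval."""
    for line in bed_lines:
        # Skip blank lines, comments, and track/browser header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        yield f"{chrom}:{start}-{end}", seqs[chrom][start:end]
```

Both functions accept open file handles as well as lists of lines, e.g. `extract_bed(read_fasta(open("ref.fa")), open("regions.bed"))`, where both paths are placeholders.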
