Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to build a BGZF

    Hi,

    I would know how to build correctly a BGZF header using bgzf.c or bam.c functions.
    I want to generate a BAM file using BGZF but I always get errors.

    I'm trying this for 1 week now, but no matter how I try to build the header, I always get this error.

    Code:
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).

  • #2
    Are you familiar with hexdump? It is very useful for actually looking at the raw file and (for example) checking the final 28 bytes of the BAM/BGZF file to see if there is an EOF marker.

    Comment


    • #3
      No i'm not, I will give a try. But when i checked with gedit, I didn't saw this marker. I only saw the beginning marker (BAM\001)

      Comment


      • #4
        Have you looked at the bam_header_write function? Remembering that bam_write() is just bgzf_write(), if you want to avoid the bam_* functions for some reason. Have a look at bgzf_close() for how the EOF is added (it's not explicitly mentioned, but it's the empty block that's added there).

        Comment


        • #5
          Originally posted by granzanimo View Post
          No i'm not, I will give a try.
          I find using hexdump -C most helpful, often piped into head or tail.
          Originally posted by granzanimo View Post
          But when i checked with gedit, I didn't saw this marker. I only saw the beginning marker (BAM\001)
          That is a problem - the 'naked' BAM files starts like that once you have removed the BGZF (GZIP) compression. You should see a (special) GZIP header at the start of the file.

          Comment


          • #6
            @granzanimo: In case you're still not getting things to work, this would be the absolute minimum BAM file:

            Code:
            #include "bam.h"
            int main(int argc, char *argv[]) {
                BGZF *fp = bgzf_open("foo.bam", "w");
                bam_header_t *header = bam_header_init();
                bam_header_write(fp,header);
                bgzf_close(fp);
                bam_header_destroy(header);
                return(0);
            }
            You could instead have a bit of a header with something like:
            Code:
            #include "bam.h"
            int main(int argc, char *argv[]) {
                BGZF *fp = bgzf_open("foo.bam", "w");
                bam_header_t *header = bam_header_init();
                char *text = malloc(sizeof(char) * 128);
            
                //Add a chromosome to the header
                header->n_targets=1;
                header->target_name = malloc(sizeof(char *));
                header->target_name[0] = malloc(5*sizeof(char));
                header->target_name[0] = strcpy(header->target_name[0], "chr1");
                header->target_len = malloc(sizeof(uint32_t));
                header->target_len[0] = 1000;
                //The @HD line
                text = strcpy(text, "@HD\tVN:1.0\tSO:unsorted\n");
                header->text = text;
                header->l_text = strlen(text);
            
                bam_header_write(fp,header);
                bgzf_close(fp);
                bam_header_destroy(header);
                return(0);
            }
            You can also add the @SQ lines to header->text, though it's not strictly required.

            Comment


            • #7
              Originally posted by dpryan View Post
              @granzanimo: In case you're still not getting things to work, this would be the absolute minimum BAM file:

              Code:
              #include "bam.h"
              int main(int argc, char *argv[]) {
                  BGZF *fp = bgzf_open("foo.bam", "w");
                  bam_header_t *header = bam_header_init();
                  bam_header_write(fp,header);
                  bgzf_close(fp);
                  bam_header_destroy(header);
                  return(0);
              }
              You could instead have a bit of a header with something like:
              Code:
              #include "bam.h"
              int main(int argc, char *argv[]) {
                  BGZF *fp = bgzf_open("foo.bam", "w");
                  bam_header_t *header = bam_header_init();
                  char *text = malloc(sizeof(char) * 128);
              
                  //Add a chromosome to the header
                  header->n_targets=1;
                  header->target_name = malloc(sizeof(char *));
                  header->target_name[0] = malloc(5*sizeof(char));
                  header->target_name[0] = strcpy(header->target_name[0], "chr1");
                  header->target_len = malloc(sizeof(uint32_t));
                  header->target_len[0] = 1000;
                  //The @HD line
                  text = strcpy(text, "@HD\tVN:1.0\tSO:unsorted\n");
                  header->text = text;
                  header->l_text = strlen(text);
              
                  bam_header_write(fp,header);
                  bgzf_close(fp);
                  bam_header_destroy(header);
                  return(0);
              }
              You can also add the @SQ lines to header->text, though it's not strictly required.

              Just a short tip taken from dpryan earlier post, but somehow not popping up in google results on top how to compile the above code snippets:
              Code:
              gcc -Wall -o  mysnippet  mysnippet.c -lbam -lz -lpthread -I/path/2/samtools_src 
              -L//path/2/samtools_src
              Tested with samtools_0.1.19 and gcc 4.9.2

              Hope it helps.

              Darek

              Comment


              • #8
                It's good that you mentioned that! I have the unfortunate habit of assuming that everyone else knows how to compile C code manually...which is understandably not the case.

                Want to convert that code to work with htslib too (it's probably just modifying the included headers)?

                Comment


                • #9
                  Originally posted by dpryan View Post
                  It's good that you mentioned that! I have the unfortunate habit of assuming that everyone else knows how to compile C code manually...which is understandably not the case.

                  Want to convert that code to work with htslib too (it's probably just modifying the included headers)?
                  Well, catching up with htslib is one thing, but going from test BAM with
                  Code:
                  @HD	VN:1.0	SO:unsorted
                  @SQ	SN:chr1	LN:1000
                  towards one where one can:
                  * create BAM with real world header taken from conventionally created BAM file (samtools view -H real.bam > real.header.txt; read the real.header.txt and create a test1.bam)
                  * add even a tiny chunk with 10 unpaired but mapped reads
                  * if possible, add another minimal chunk and get say test3.bam
                  * create bam with a real world header plus two already sorted SAM files taken from 2 different chromosomes from real.bam

                  If for time being these steps are doable only using old branch of samtools, so be it. As long as the resulting BAMs are readable by the htslib, it is not a software archaeology

                  Anyway, many thanks for posting the code. I think it may be worthy to start putting it in say github repo bam_hacks_for_nonC_ppl

                  Darek

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  30 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  32 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  28 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  52 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X