  • Issue with BGZF and Samtools

    Hi,

    I'm a college student working on a parallel version of Samtools.
    I have to use BGZF to compress blocks of data in order to create a .bam file.

    But I'm facing a problem. When I try to read the BAM file I created with samtools view, I get the following error:
    Code:
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    What did I do wrong?
    How can I fix it?

    I'm happy to answer any further questions.
    Thanks for your help!

  • #2
    If you are streaming/piping the data into samtools, v0.1.18 and v0.1.19 would wrongly give this warning message as a false alarm: https://github.com/samtools/samtools/issues/18

    Have you actually written the 28 byte empty BGZF block at the end of the file? This was only formally added to the specification in Dec 2013 but was in use long before that.
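
For reference, the marker mentioned above is a fixed 28-byte sequence listed in the SAM/BAM specification. The following is a minimal sketch in C of checking whether a buffer holding the tail of a file ends with it. The names `BGZF_EOF_MARKER` and `ends_with_bgzf_eof` are made up for this illustration and are not part of samtools.

```c
#include <stddef.h>
#include <string.h>

/* The 28-byte empty BGZF block used as the end-of-file marker
 * (byte values as listed in the SAM/BAM specification). */
static const unsigned char BGZF_EOF_MARKER[28] = {
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00,
    0x00, 0xff, 0x06, 0x00, 0x42, 0x43, 0x02, 0x00,
    0x1b, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00
};

/* Hypothetical helper: returns 1 if the last 28 bytes of buf
 * match the marker, 0 otherwise (including when buf is too short). */
static int ends_with_bgzf_eof(const unsigned char *buf, size_t len)
{
    const size_t n = sizeof BGZF_EOF_MARKER;

    if (buf == NULL || len < n)
        return 0;
    return memcmp(buf + len - n, BGZF_EOF_MARKER, n) == 0;
}
```

To use it on a real file, read the final 28 bytes (or more) into a buffer and pass that buffer in; a result of 0 means the marker is not the last thing in the file.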



    • #3
      Are you piping the file into samtools when you get that warning? That's a common occurrence and can be ignored (some versions of samtools mistakenly try to check for an end-of-file (EOF) marker when using pipes).

      If not, it's likely your file is truncated for some reason (we'd have to know more about how you created the file in the first place to guess how).

      Edit: I should have refreshed the tab, I see Peter beat me to it!



      • #4
        Hi guys, thanks for your answers.

        Yes, I was piping the file into Samtools.

        I checked bgzf.c and found that the bgzf_close() function adds this 28-byte EOF marker.

        But when I try to use this function, I get a segmentation fault, and I don't know why.



        • #5
          Please clarify: Are you getting a segmentation fault from your own code when writing the EOF marker - or are you getting a segmentation fault from samtools view when reading your BAM file?



          • #6
            It would also be helpful if you posted some of the code that's causing the problem.



            • #7
              Sorry, here is the code

              Code:
              void compressData(MPI_File *in, MPI_File *out, const int rank, const int num_proc, const int overlap,
                             char ***lines, int *nlines) {
              
                  MPI_Offset filesize;
                  MPI_Offset localsize;
                  MPI_Offset start;
                  MPI_Offset end;
                  char *chunk;
                  uint8_t *dunk;
                  BGZF *fp;
                  int *offset_tab;
                  /* figure out who reads what */
              
                  MPI_File_get_size(*in, &filesize);
                  localsize = filesize/num_proc;
                  start = rank * localsize;
                  end   = start + localsize - 1;
              
                  /* add overlap to the end of everyone's chunk... */
                  end += overlap;
              
                  /* except the last processor, of course */
                  if (rank == num_proc-1) end = filesize;
              
                  localsize =  end - start + 1;
                  /* allocate memory */
                  chunk = malloc( (localsize)*sizeof(char));
              
                  /* everyone reads in their part */
                  printf("Rank %d we read data!! \n", rank);
                  MPI_File_read_at_all(*in, start, chunk, localsize, MPI_CHAR, MPI_STATUS_IGNORE);
                  //chunk[localsize] = '\0';
              
                  int dlen;
                  int slen = strlen(chunk);
                  printf("Rank %d size of the data read %d !! \n", rank, slen);
                  printf("Rank %d start compression!! \n", rank);
              
              
              
                  bam_header_t *head = bam_header_init();
                  bam_header_write(fp, head);
              
                  fp = bgzf_write_init(Z_DEFAULT_COMPRESSION);
                  memcpy(fp->uncompressed_block, chunk, localsize);
                  int comp_size = deflate_block(fp, slen);
              	
              	
                  if(!bgzf_close(fp)){
                  	printf("Error for CPU number %d", rank);
                  	exit(2);
                  }
              
              }
              I get a segmentation fault when I call bgzf_close(), but no error at all when I don't call it.

              I used this function to add the EOF marker at the end of the blocks.



              • #8
                Originally posted by granzanimo
                Sorry, here is the code

                Code:
                    bam_header_t *head = bam_header_init();
                    bam_header_write(fp, head);
                You're writing an initialized but otherwise empty struct to an uninitialized file pointer...

                Code:
                    fp = bgzf_write_init(Z_DEFAULT_COMPRESSION);
                Now you have an initialized BGZF struct, though it still has no file association.

                Code:
                    if(!bgzf_close(fp)){
                Since "fp->fp" points to uninitialized memory, this will segfault in the internal fclose(fp->fp) step.

                Firstly, there's usually no reason to manually add the EOF marker to a BAM file, since doing so probably just means lying to yourself that the contents aren't corrupt. Secondly, why are you trying to do this with MPI? The bottleneck here is usually IO, which is often saturated with 4 or so compression threads.



                • #9
                  Thanks for your answer

                  The MPI code is there because I'm trying to run this program on an 800-CPU cluster.
                  I'm trying to have each CPU compress its own block of data.

                  I still don't understand: what do I need to fix in my code?



                  • #10
                    Firstly, you need to

                    Code:
                    fp = bgzf_write_init(Z_DEFAULT_COMPRESSION);
                    before you can
                    Code:
                    bam_header_write(fp, head);
                    Secondly, the above line is problematic. While the header will likely fit into the buffer and not cause a call to bgzf_flush, if it doesn't fit you'll get a segfault, since fp->fp isn't initialized.

                    Code:
                    if(!bgzf_close(fp)){
                    You'll have to write your own close function, since bgzf_close() will first try to flush the buffer, which it can't due to the aforementioned fp->fp issue ... causing the segfault that you saw.
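
One way such a replacement close step could look, as a sketch rather than samtools code: it assumes the already-compressed blocks are written through an ordinary `FILE *` that you opened yourself, so the handle is known to be valid before anything is flushed. The names `EOF_BLOCK`, `write_eof_block` and `finish_bam_output` are hypothetical.

```c
#include <stdio.h>

/* 28-byte empty BGZF block (values from the SAM/BAM specification). */
static const unsigned char EOF_BLOCK[28] = {
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00,
    0x00, 0xff, 0x06, 0x00, 0x42, 0x43, 0x02, 0x00,
    0x1b, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00
};

/* Hypothetical helper: append the EOF block at the current position
 * and flush. Returns 0 on success, -1 on failure. Does not close. */
static int write_eof_block(FILE *out)
{
    if (out == NULL)
        return -1;  /* never touch a handle that was not opened */
    if (fwrite(EOF_BLOCK, 1, sizeof EOF_BLOCK, out) != sizeof EOF_BLOCK)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}

/* Hypothetical stand-in for the close step: write the EOF block,
 * then close the stream. Returns 0 on success, -1 on failure. */
static int finish_bam_output(FILE *out)
{
    int status;

    if (out == NULL)
        return -1;
    status = write_eof_block(out);
    if (fclose(out) != 0)
        status = -1;
    return status;
}
```

Both functions return 0 on success, so the caller should treat a non-zero result as the error case. In a parallel setting the marker belongs only at the very end of the output, so this would be called once, after every other block has been written.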

