  • Parsing error in BAM header

    Hi, I am pretty new to this area and am having a hard time with these huge files.

    I wanted to use IndelGenotyperV2 (from GATK) with my newly built BAM files. When I executed the command, I got the error below:

    java.lang.RuntimeException: net.sf.samtools.SAMFormatException: Error parsing SAM header. Problem parsing @PG key:value pair.

    The @PG line looks like this:

    @PG ID:illumina_export2sam.pl VN:2.0.0 CL:/opt/GOAT/CASAVA_1.7.0a6/bin/illumina_export2sam.pl --read1=s_7_1_export.txt --read2=s_7_2_export.txt

    I can't figure out what the problem is here. All three tags (ID, VN, and CL) are present.

    One hint is that samtools prints a warning when I view the header (samtools view -H myfile.bam):

    The tag '--' present (at least) twice on line [@PG ID:illumina_export2sam.pl VN:2.0.0 CL:/opt/GOAT/CASAVA_1.7.0a6/bin/illumina_export2sam.pl --read1=s_7_1_export.txt --read2=s_7_2_export.txt]

    Is this the cause of the error, or is there some other problem in my file?

    Thanks,

  • #2
    I would recommend that you reheader your BAM file with "samtools reheader" if you would like to isolate the cause of this problem.
    Basically, run "samtools view -H file.bam > header.txt", edit header.txt (perhaps removing the "--" from all your header lines), then run "samtools reheader" with your new header file and the BAM file.
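
The workflow above can be sketched in Python as a thin wrapper around the two samtools commands. This is only an illustration, not part of the original reply: the function names, the `edit` callback, and the output path are made up here, and it assumes `samtools` is on the PATH. Note that `samtools reheader` writes the new BAM to standard output, so the sketch sends that output to a file.

```python
import subprocess

def dump_header_cmd(bam_path):
    # Command that prints only the header of a BAM file.
    return ["samtools", "view", "-H", bam_path]

def reheader_cmd(header_path, bam_path):
    # Command that emits bam_path with its header replaced by header_path.
    return ["samtools", "reheader", header_path, bam_path]

def reheader(bam_path, header_path, out_path, edit=lambda text: text):
    """Dump the header, apply edit() to it, and write a re-headered BAM.

    edit is a hypothetical hook: any function taking the header text
    and returning the corrected text.
    """
    header = subprocess.run(
        dump_header_cmd(bam_path), check=True, capture_output=True, text=True
    ).stdout
    with open(header_path, "w") as fh:
        fh.write(edit(header))
    # The output is binary BAM, so open the destination in binary mode.
    with open(out_path, "wb") as out:
        subprocess.run(reheader_cmd(header_path, bam_path), check=True, stdout=out)
```

A call such as `reheader("file.bam", "header.txt", "fixed.bam", edit=my_fix)` would leave the original BAM untouched and write the result to a new file (all three file names and `my_fix` are placeholders).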

    • #3
      The hint from samtools view is telling you that '--' is being interpreted as a tag, which means that it is immediately following a tab character each time it appears.

      SAM header fields are delimited by tabs, so header field values of course cannot themselves contain tabs. Your CL: value has the words of the command line separated by tabs rather than spaces, leading to parsing confusion.
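
To see which fields a parser would trip over, a small check like the following may help. It is a rough sketch, not how samtools or Picard actually validate headers: it assumes a header field is a two-character tag, a colon, and a non-empty value, and the example line is reconstructed with tabs between the command-line words as diagnosed above. The name `malformed_fields` is made up for illustration.

```python
import re

# Assumed shape of a SAM header field: two-character TAG, ':', then a value.
FIELD_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]:.+$")

def malformed_fields(header_line):
    """Return the tab-separated fields that are not TAG:VALUE pairs."""
    record_type, *fields = header_line.rstrip("\n").split("\t")
    if record_type == "@CO":  # comment lines are free text
        return []
    return [f for f in fields if not FIELD_RE.match(f)]

# The @PG line from the first post, with tabs inside the CL: value.
line = ("@PG\tID:illumina_export2sam.pl\tVN:2.0.0\t"
        "CL:/opt/GOAT/CASAVA_1.7.0a6/bin/illumina_export2sam.pl\t"
        "--read1=s_7_1_export.txt\t--read2=s_7_2_export.txt")

print(malformed_fields(line))
# ['--read1=s_7_1_export.txt', '--read2=s_7_2_export.txt']
```

Both stray fields begin with "--", which matches the samtools warning about the tag '--' appearing twice.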

      If illumina_export2sam.pl is generating this CL: value with tabs inside it, then that is a bug in illumina_export2sam.pl -- it should be replacing tabs with spaces (or doing something similar) to ensure that it is outputting a valid SAM header.

      You may be able to reheader your BAM file so as to replace these spurious tabs with spaces yourself. Or when you produce the SAM file it would be easy to replace them with sed or a text editor. Or it should be easy to fix illumina_export2sam.pl yourself if it is a Perl script -- just search for /@PG/ and/or /CL:/ which most likely appear exactly once in the script.
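
As one possible way to do the tab-to-space replacement on a dumped header, here is a minimal Python sketch. It assumes that any tab-separated piece not starting with a two-character tag and a colon belongs to the previous field, and glues it back on with a space. The function name `repair_header_line` is hypothetical, and the heuristic would misfire on a command-line word that itself looks like "XX:...", so check the result by eye.

```python
import re

# Assumed start of a genuine header field: two-character TAG followed by ':'.
TAG_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]:")

def repair_header_line(line):
    """Join stray tab-separated pieces onto the preceding field with spaces."""
    record_type, *fields = line.rstrip("\n").split("\t")
    if record_type == "@CO" or not fields:
        return line.rstrip("\n")
    repaired = []
    for field in fields:
        if TAG_RE.match(field) or not repaired:
            repaired.append(field)           # a real TAG:VALUE field
        else:
            repaired[-1] += " " + field      # continuation of the previous value
    return "\t".join([record_type] + repaired)

broken = ("@PG\tID:illumina_export2sam.pl\tVN:2.0.0\t"
          "CL:/opt/GOAT/CASAVA_1.7.0a6/bin/illumina_export2sam.pl\t"
          "--read1=s_7_1_export.txt\t--read2=s_7_2_export.txt")
print(repair_header_line(broken).split("\t"))
```

After the repair, the line splits into the record type plus exactly three fields (ID, VN, CL), with the whole command line inside the CL: value.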

      • #4
        Thank you, zee and jmarshall.
        Following your posts, I extracted the original headers from my BAM files using "samtools view -H myfile.bam > output.file", edited them, and then used Picard "ReplaceSamHeader" to successfully replace the old header with the modified one.

        So now IndelGenotyperV2 no longer raises an error about BAM headers.
        Now it is complaining about memory.. haha (although I gave it 2g)

        Thank you both anyway.

        • #5
          Originally posted by zee:
          I would recommend that you "reheader" your sam file with "samtools reheader" if you would like to try and isolate the cause of this problem.
          Basically use "samtools view -H file.bam > header.txt". Edit header.txt and perhaps remove the "--" from all your header lines, then use "samtools reheader" with your new header file and the bam file.
          I also have a problem with a BAM file header:
          "Error parsing SAM header. Problem parsing @PG key:value pair. Line:
          @PG TopHat VN:1.0.13"
          When I run "samtools view -H file.bam > header.txt", edit header.txt, and then run "samtools reheader", the output is gibberish. How should I use "samtools reheader"?
