  • Picard-Tools ValidateSamFile Error Question

    hey all,

    So I've been banging my head against this for a little while now and I hope someone can help. My .bam file was generated by using BWA-MEM to align the data to the reference, sorted and indexed using samtools, and now I'm trying to validate the file using picard-tools.

    The report file contains a VERY large number of errors which (it seems to me) say that the .bam sequence data does not match the reference. Why am I encountering these errors when the reads were aligned to that same reference using BWA?

    I'm not sure if there's been a problem with the actual alignment, or conversion, or with picard itself. Any input would be much appreciated.

    Example of the errors below:

    ---------------
    ERROR: Record 811, Read name HWI-ST0733:209:C0CDKACXX:2:1106:13607:160009, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ERROR: Record 812, Read name HWI-ST0733:209:C0CDKACXX:2:1301:12877:138617, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ERROR: Record 813, Read name HWI-ST0733:209:C0CDKACXX:2:1204:12454:19349, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ---------------

    Command for validation:
    Code:
    picard-tools ValidateSamFile INPUT=C0CDKACXX-2_aln-pe_sorted_backconverted_withreadgroup_AddOrReplace.bam OUTPUT=C0CDKACXX-2_aln-pe_sorted_backconverted_withreadgroup_AddOrReplace_Validated.bam REFERENCE_SEQUENCE=/data/reference/Oarv3.1.alldna.fasta
    Cheers

  • #2
    Some more information concerning the problem:

    I went back and carried out a new alignment using BWA (BWA MEM) with some extra options enabled: -R to insert the read group info in the alignment step rather than using AddOrReplaceReadGroups from picard, and -M for picard compatibility with MarkDuplicates. I then converted from sam to bam, sorted and indexed using picard-tools, and carried out a new validation with picard-tools. I'm receiving the same errors at the same rate.

    Commands below (placeholder filenames used for clarity):

    BWA MEM alignment
    Code:
    bwa mem -t 3 -M -R '@RG\tReadGroupInformationStuff' /PathToFasta/Oarv3.1.alldna.fasta /PathToFastq1.fastq /PathToFastq2.fastq > Samfile.sam
    No errors spit out, seems to have run well.

    Picard-tools conversion, sort and index

    Code:
    picard-tools SortSam SO=coordinate INPUT=Input.sam OUTPUT=Output.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true TMP_DIR=/data/temp
    No errors spit out, seems to have run well.

    Picard-tools validation

    Code:
    picard-tools ValidateSamFile INPUT=Input.bam OUTPUT=Output.txt REFERENCE_SEQUENCE=Oarv3.1.alldna.fasta MAX_OUTPUT=2000 VALIDATION_STRINGENCY=LENIENT
    Error examples (in .txt file, program runs fine)

    ERROR: Record 792, Read name HWI-ST821:140:C0CCBACXX:6:1102:8846:8356, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR: Record 793, Read name HWI-ST821:140:C0CCBACXX:6:1103:14261:92896, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR: Record 794, Read name HWI-ST821:140:C0CCBACXX:6:1103:19946:198992, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR: Record 795, Read name HWI-ST821:140:C0CCBACXX:6:1105:16865:95882, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR: Record 796, Read name HWI-ST821:140:C0CCBACXX:6:1107:6186:154236, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR: Record 797, Read name HWI-ST821:140:C0CCBACXX:6:1108:12468:5005, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ERROR: Record 798, Read name HWI-ST821:140:C0CCBACXX:6:1201:10882:125403, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR: Record 799, Read name HWI-ST821:140:C0CCBACXX:6:1206:14102:101564, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR: Record 800, Read name HWI-ST821:140:C0CCBACXX:6:1208:15077:154380, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR: Record 801, Read name HWI-ST821:140:C0CCBACXX:6:1302:15597:122894, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR: Record 802, Read name HWI-ST821:140:C0CCBACXX:6:1303:18803:55187, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR: Record 803, Read name HWI-ST821:140:C0CCBACXX:6:1305:13626:18399, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR: Record 804, Read name HWI-ST821:140:C0CCBACXX:6:1307:5874:86321, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR: Record 805, Read name HWI-ST821:140:C0CCBACXX:6:1307:9816:28744, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR: Record 806, Read name HWI-ST821:140:C0CCBACXX:6:2101:19656:87203, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR: Record 807, Read name HWI-ST821:140:C0CCBACXX:6:2102:12433:76269, NM tag (nucleotide differences) in file [1] does not match reality [2]
    Now, this error seems to be related to the edit distance. The SAM spec defines the NM tag as "Edit distance to the reference, including ambiguous bases but excluding clipping" (http://samtools.sourceforge.net/SAM1.pdf).
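To make that definition concrete, here is a minimal sketch of how an NM value can be recomputed from a read's CIGAR string, its sequence, and the reference bases starting at its POS. The function name `compute_nm` and its arguments are hypothetical, not part of BWA or Picard. It assumes NM is the number of mismatches plus inserted and deleted bases, and it counts any non-identical base pair (including an N against A/C/G/T) as a mismatch; real tools may treat ambiguous bases differently, which this sketch does not model.

```python
import re

# One CIGAR operation: a length followed by an operator letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def compute_nm(cigar, read_seq, ref_seq):
    """Recompute an NM-style edit distance for one alignment.

    cigar    -- CIGAR string, e.g. "2S4M"
    read_seq -- the read sequence as stored in the SAM SEQ column
    ref_seq  -- reference bases starting at the alignment's POS
    """
    nm = 0
    read_i = 0  # current offset in the read
    ref_i = 0   # current offset in the reference slice
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":
            # Aligned bases: count positions where read and reference differ.
            for k in range(n):
                if read_seq[read_i + k].upper() != ref_seq[ref_i + k].upper():
                    nm += 1
            read_i += n
            ref_i += n
        elif op == "I":
            nm += n        # inserted bases count toward the distance
            read_i += n
        elif op == "D":
            nm += n        # deleted bases count toward the distance
            ref_i += n
        elif op == "S":
            read_i += n    # soft clip: skipped in the read, never counted
        elif op == "N":
            ref_i += n     # skipped reference region, not counted
        # H and P consume neither read nor reference bases.
    return nm


# Two soft-clipped bases are ignored; only the aligned part is compared.
print(compute_nm("2S4M", "TTACGT", "ACGA"))
```

Because clipped bases are skipped entirely, a soft clip on its own should not change NM under this definition.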

    Every read I've grep'ed out of the related sam file for one of these errors has a soft clip associated with it, BUUUT not all of the soft-clipped reads appear in the error report.
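The grep check above can be made systematic. The following is a rough sketch that reads SAM text lines and reports, for each alignment, its CIGAR string, its NM tag, and whether it carries a soft clip. The function name `soft_clip_summary` and the two example records are made up for illustration; only the standard SAM column layout (QNAME, FLAG, ..., CIGAR in column 6, optional tags from column 12) is assumed.

```python
import re


def soft_clip_summary(sam_lines):
    """Return (read_name, flag, cigar, nm, has_soft_clip) per alignment line."""
    rows = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, cigar = fields[0], int(fields[1]), fields[5]
        nm = None
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                nm = int(tag[5:])  # stored edit distance, if present
        has_clip = re.search(r"\d+S", cigar) is not None
        rows.append((name, flag, cigar, nm, has_clip))
    return rows


# Hypothetical records, not taken from the real data set.
example = [
    "@HD\tVN:1.4\tSO:coordinate",
    "readA\t99\tchr1\t100\t60\t5S95M\t=\t300\t300\t*\t*\tNM:i:2",
    "readB\t147\tchr1\t300\t60\t100M\t=\t100\t-300\t*\t*\tNM:i:0",
]
for row in soft_clip_summary(example):
    print(row)
```

Comparing the soft-clipped read names from this summary with the read names in the validation report would show how far the two sets actually overlap.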

    In checking the flags of the reads, it seems that in general they map well to the reference.

    Are there any suggestions on either:
    (a) how to overcome this problem (perhaps with another alignment which excludes soft clips), or
    (b) whether it is safe to ignore these error reports and move further down the GATK pipeline?
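Before choosing between (a) and (b), it may help to summarise the report rather than read it line by line. This is a minimal sketch that tallies the gap between the recomputed value ("reality") and the stored NM tag ("in file") across all error lines. The function name `tally_nm_offsets` is hypothetical, and the regular expression assumes the exact message wording shown in the examples above.

```python
import re
from collections import Counter

# Matches the NM mismatch message as it appears in the report above.
NM_ERROR = re.compile(
    r"ERROR: Record (\d+), Read name ([^,]+), "
    r"NM tag \(nucleotide differences\) in file \[(\d+)\] "
    r"does not match reality \[(\d+)\]"
)


def tally_nm_offsets(report_lines):
    """Count how often each (reality - in_file) difference occurs."""
    offsets = Counter()
    for line in report_lines:
        match = NM_ERROR.search(line)
        if match:
            in_file = int(match.group(3))
            reality = int(match.group(4))
            offsets[reality - in_file] += 1
    return offsets


# Three lines copied from the report quoted earlier in this post.
sample = [
    "ERROR: Record 792, Read name HWI-ST821:140:C0CCBACXX:6:1102:8846:8356, NM tag (nucleotide differences) in file [0] does not match reality [1]",
    "ERROR: Record 795, Read name HWI-ST821:140:C0CCBACXX:6:1105:16865:95882, NM tag (nucleotide differences) in file [1] does not match reality [2]",
    "ERROR: Record 797, Read name HWI-ST821:140:C0CCBACXX:6:1108:12468:5005, NM tag (nucleotide differences) in file [2] does not match reality [3]",
]
print(tally_nm_offsets(sample))
```

In every example quoted in this thread the recomputed value is exactly one higher than the stored tag. Running the tally over the full report (for instance by passing an open file handle on the report .txt) would show whether that holds for all flagged reads.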

    • #3
      Hi,

      Did you ever find an answer to this question? I am using both bwa-mem and bwa-aln to align human genome sequencing data and get similar errors using Picard ValidateSamFile.jar at the end.

      Thanks
      A

      • #4
        I've just gone back and re-read my notes. I did end up resolving the problems. Which specific errors are you getting? 'not matching reality'?
        Last edited by JezSupreme; 08-28-2013, 12:02 AM. Reason: extra information

        • #5
          Hi,
          I am interested in how to solve this problem. Any news?
          Muriel

          • #6
            Hi,
            I am having the same problem
            "NM tag (nucleotide differences) in file [5] does not match reality [6]"

            Could you explain to us how you solved the problem?

            Thanks

            Suleyman

            • #7
              Hi!
              I would also like to know what this error means… Any news?
