
  • Inconsistencies between evidence and SAM files

    Hi everyone,

    The evidence2sam command, which produces the SAM file, outputs records in which every read ID appears twice, each time mapping to a different location, such as:

    GS27657-FS3-L02-8:1660459 179 chr14 19089795 16 12M1I1P4I6M6N5M1I4M = 19090178 383 CCTAATTCTTATTTTTATTTTTTTATTTATTTT 9::::656877887;<<<:::<;6-47737783 RG:Z:NA19238-L2-200-37-ASM-chr14 GC:Z:3S2G28S GS:Z:AAAA GQ:Z:::4-
    GS27657-FS3-L02-8:1660459 115 chr14 19090178 16 10M5N23M = 19089795 -383 TTCATGAGAGGGTCCACTATTTTTCCCTTGTTA .08877587857;*;1<<9778877871;;::7 RG:Z:NA19238-L2-200-37-ASM-chr14 GC:Z:28S2G3S GS:Z:TGTG GQ:Z:77;;
    Here, the read 'GS27657-FS3-L02-8:1660459' maps to the same chromosome (chr14) at two adjacent locations.

    These are my questions about the Complete Genomics (CG) data:
    1) Why is every read ID in the SAM file repeated twice, mapping to different locations? What does this signify, and is there a way to resolve it?
    2) The evidence file contains reads of uniform length (70 bp). After conversion with evidence2sam from cgatools, why does the read length drop to only 33 bp?
    3) Also, when the read sequence is compared between the evidence file and the SAM file, the SAM sequence does not match any part of the 70 bp sequence. Is this an error?

    If anybody could help, it would be great.

    Thanks in advance.

  • #2
    We create mate-pair libraries for sequencing. What you are seeing is each (~35 bp) arm of the mate-pair represented in the SAM file. While the SAM specification (PDF) is not exactly fun reading, it does contain the information needed to figure out what is going on here (from the spec):

    QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME ‘*’ indicates the information is unavailable.
    You can also use the flags for each read to figure out what is going on; the explain SAM flags page is helpful for this. For example, the first read shown has the following properties according to its flag (179): read paired, read mapped in proper pair, read reverse strand, mate reverse strand, second in pair.
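
The flag decoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the explain SAM flags page; the bit meanings follow the FLAG field in the SAM specification, and the function name `explain_flag` is made up for this example.

```python
# Bit meanings of the SAM FLAG field (from the SAM specification).
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the list of properties whose bits are set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


if __name__ == "__main__":
    # The two flags from the records quoted in the first post.
    for flag in (179, 115):
        print(flag, ", ".join(explain_flag(flag)))
```

Applied to the two records in the question, 179 decodes to the second arm of the pair and 115 to the first arm, both on the reverse strand.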

    The reads we generate are not always precisely 35 bp in length. Each read is composed of sub-reads, which may be separated by positive gaps or negative gaps (~2 bp overlap). The overlapping bases are represented in the evidence files, but this feature is not supported by the SAM specification, so the overlap is collapsed during conversion, making the reads in the SAM files slightly shorter. There are additional tags generated by evidence2sam that specify which of the bases in the read were collapsed (GC/GS/GQ).
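
To make the length arithmetic concrete, here is a rough Python sketch that computes the number of read bases implied by a CIGAR string and the pre-collapse length implied by a GC tag. The GC interpretation is an assumption on my part: `S` runs are taken as ordinary bases and each `G` run of n bases as n overlapping positions that were sequenced twice. The function names are hypothetical and not part of cgatools.

```python
import re


def cigar_query_length(cigar):
    """Number of read bases a CIGAR string accounts for (M, I, S, =, X)."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in pairs if op in "MIS=X")


def gc_lengths(gc_tag):
    """Return (collapsed_length, original_length) for a GC tag value.

    Assumption: an 'S' run counts once in both lengths, while a 'G' run
    of n bases counts n in the collapsed read and 2 * n in the original.
    """
    collapsed = original = 0
    for n, op in re.findall(r"(\d+)([SG])", gc_tag):
        n = int(n)
        collapsed += n
        original += 2 * n if op == "G" else n
    return collapsed, original


if __name__ == "__main__":
    print(cigar_query_length("12M1I1P4I6M6N5M1I4M"))
    print(gc_lengths("3S2G28S"))
```

Under that assumption, both records in the question give 33 collapsed bases and 35 original bases, which matches the ~35 bp arm length mentioned above.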

    Using the explain SAM flags page, you can see that the reads are on the reverse strand. SAM stores the sequence of a reverse-strand read reverse-complemented, which is why the sequences are not exactly the same between the evidence and SAM files.
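
When comparing a SAM sequence against another file by eye, it can help to undo the strand flip first. The sketch below assumes the other file stores the read in its as-sequenced orientation, which I have not verified for the evidence format; `original_orientation` is a name chosen for this example.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def original_orientation(seq, qual, flag):
    """Return (seq, qual) in as-sequenced orientation.

    If the reverse-strand bit (0x10) is set, the sequence is
    reverse-complemented and the quality string is reversed;
    otherwise both are returned unchanged.
    """
    if flag & 0x10:
        return reverse_complement(seq), qual[::-1]
    return seq, qual


if __name__ == "__main__":
    seq = "CCTAATTCTTATTTTTATTTTTTTATTTATTTT"
    qual = "9::::656877887;<<<:::<;6-47737783"
    print(original_orientation(seq, qual, 179))
```

Note that the overlap collapse described earlier still applies, so the result will not be an exact substring of the longer evidence sequence where overlapping bases were removed.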

    Your previous post had an error generated by GATK when parsing the SAM files from the evidence2sam command in cgatools:

    ERROR MESSAGE: SAM/BAM file SAMFileReader{CGA_test/originalSAM/output_sorted.bam} is malformed: Adjacent I/D events in read GS27657-FS3-L02-8:1660459

    This appears to be a GATK issue, discussed here (starting at post #23) and reported here. What was the specific GATK command you were using that generated this error? And are you using public data, or is this your own data?

    I haven't done much work on this particular error, but I would guess GATK is complaining about the first read, with CIGAR 12M1I1P4I6M6N5M1I4M, which contains a number of insertions. I'm confident this is not an error with the read, as the local de novo assembly process can generate complex variant calls that may involve adjacent insertion/deletion events.
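
A quick way to find reads that might trigger this complaint is to scan each CIGAR for neighbouring I/D operators. This is only a sketch of one plausible check; I do not know exactly how GATK defines "adjacent", so whether padding (P) operators between two indels are skipped is exposed as the hypothetical `ignore_padding` parameter.

```python
import re


def has_adjacent_indels(cigar, ignore_padding=True):
    """Return True if two I/D operators sit next to each other.

    With ignore_padding=True, P operators are dropped first, so a
    pattern such as 1I1P4I counts as adjacent insertions.
    """
    ops = re.findall(r"\d+([MIDNSHP=X])", cigar)
    if ignore_padding:
        ops = [op for op in ops if op != "P"]
    return any(a in "ID" and b in "ID" for a, b in zip(ops, ops[1:]))


if __name__ == "__main__":
    for cigar in ("12M1I1P4I6M6N5M1I4M", "10M5N23M"):
        print(cigar, has_adjacent_indels(cigar))
```

With padding ignored, the first read from the question is flagged because of the 1I1P4I run, while its mate (10M5N23M) is not.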

    Greg
    Bioinformatics Applications, Europe
    Lifetech Inc. http://www.lifetech.com/



    • #3
      Oh, okay. So what I finally understand is that the SAM file splits each mate-pair read into two records, one for each arm of the mate-pair. So even if the entire SAM file consists of such duplicate entries, that is normal and expected. I hope I am talking sense and correct.

      Thank you for the in-depth information about the SAM file tags; it was really helpful and interesting.

      Moving on to the GATK error: I went to the forums you pointed to and used the -rf BadCigar option with GATK. It still gives me the error. I will not bother you with the GATK error, as it seems to be related to the tool; I will get it solved independently.

      Thanks once again for the prompt responses to my issues.



      • #4
        I hope I am talking sense and correct.
        You are correct.

        It is more appropriate to post issues with GATK to their support forum; however, I would still like to know which GATK command you were using. The reason I ask is that in the forum thread, the error seemed to arise with the DINDEL (indel detection) command. Were you using this command, or simply parsing the BAM files?

        You are welcome to continue asking questions regarding the use of GATK with Complete data. It is not a bother, and we very much welcome feedback on issues encountered with third-party tools. This feedback helps us improve compatibility, which is not perfect at the moment but which we genuinely want to see improve.
        Bioinformatics Applications, Europe
        Lifetech Inc. http://www.lifetech.com/



        • #5
          Following is the GATK command I used:

          java -jar GenomeAnalysisTK-1.4-37-g0b29d54/GenomeAnalysisTK.jar -T UnifiedGenotyper -I output_sorted.bam -R hg1to24.fa -glm BOTH -o GATK_out -log logfile.txt
          I have tried running with -glm SNP, and it works. The problem occurs only with the -glm INDEL and BOTH options. I used ValidateSam from Picard to check for errors; it reported that "Padded characters are not allowed at the start or end of the CIGAR string". I am stuck at this point: should I remove such records from the file and proceed, or is there some other solution? I have already posted this query to the GATK forum and will wait for the answer. If anyone could help, it would be great. Thanks, Greg.



          • #6
            Excellent, thank you for the informative discussion and for letting us know the commands you have been using. As I suspected, the issue is with using the INDEL detection methods in GATK. The "Padded characters are not allowed at the start..." issue was discussed in the previous thread: local de novo assembly of reads can introduce P operators at the beginning of the CIGAR string to deal with indels. My understanding is that the SAM spec does not explicitly disallow P operators at the beginning of a CIGAR string, but most tools (GATK/samtools) do not support this.
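
For anyone who wants to see how many records are affected before deciding what to do with them, the sketch below separates SAM text lines whose CIGAR begins or ends with a P operator. It is an illustration only, not a recommended fix; the function names are hypothetical, and it assumes tab-separated SAM text with the CIGAR in the sixth column, as in the SAM specification.

```python
import re


def padding_at_edge(cigar):
    """Return True if the first or last CIGAR operator is P."""
    ops = re.findall(r"\d+([MIDNSHP=X])", cigar)
    return bool(ops) and (ops[0] == "P" or ops[-1] == "P")


def partition_sam_lines(lines):
    """Split SAM text lines into (kept, flagged).

    Header lines (starting with '@') are always kept. Alignment lines
    are flagged when their CIGAR starts or ends with a P operator.
    """
    kept, flagged = [], []
    for line in lines:
        if line.startswith("@"):
            kept.append(line)
            continue
        cigar = line.rstrip("\n").split("\t")[5]
        (flagged if padding_at_edge(cigar) else kept).append(line)
    return kept, flagged


if __name__ == "__main__":
    demo = [
        "@HD\tVN:1.4\n",
        "r1\t115\tchr14\t100\t16\t10M5N23M\t=\t50\t-83\t*\t*\n",
        "r2\t179\tchr14\t50\t16\t2P10M\t=\t100\t83\t*\t*\n",
    ]
    kept, flagged = partition_sam_lines(demo)
    print(len(kept), len(flagged))
```

Keep in mind that dropping only one arm of a mate-pair would leave the other arm's mate fields pointing at a record that no longer exists, so the flagged list is probably best treated as a report rather than a delete list.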

            I do not expect that processing the evidence reads via GATK will produce better results than the native CG pipeline. Keep in mind that the evidence reads have already been "realigned" (local de novo assembly) and in theory are the optimal set of alignments at a given position in the genome supporting a variant (SNP, SUB, INDEL). Still, I encourage you to try processing these reads via GATK and to report the results of this experiment after comparing them back to our results.
            Bioinformatics Applications, Europe
            Lifetech Inc. http://www.lifetech.com/
