Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Discrepancy between Bismark SAM file and bismark methylation extractor

    Hi,

    I'm flummoxed by a discrepancy between the Bismark SAM file and the results of bismark methylation extractor.

    bismark_methylation_extractory counts 11 reads at the location 20,665,340 on chromosome Y.
    When I view the SAM file in IGV (after conversion to the BAM format and indexing), I can see that there are 14 reads at that location.
    Given that four of the reads are two pairs of overlapping reads, I would expect that the count for bismark methylation extractor would be 12, since bismark_methalytation_extractor only counts the first member of overlapping reads.
    According to my understanding of bismark_methylation_extractor, it should therefore count 12 reads.
    Why does it count 11 reads?
    On what criteria does it exclude another read?
    I've put in attachment a screenshot of IGV for the location of interest.
    I converted the cov file to the bigWig format, but the results are identical if I go back to the original cov file.

    [alexis@lg-1r17-n03 methylation_extractor]$ grep 20665340 AtTneo_Y_sorted_read_name.deduplicated.bismark.cov
    Y 20665340 20665340 90.9090909090909 10 1

    Here is also the original command.

    bismark_methylation_extractor \
    --paired-end \
    --bedGraph \
    --buffer_size 8100M \
    --CX_context \
    --zero_based \
    --merge_non_CpG \
    --comprehensive \
    --output ../../results/bismark/AtTneo/Y/methylation_extractor \
    ../../results/bismark/AtTneo/Y/AtTneo_Y_sorted_read_name.deduplicated.bam

    Thank you for your help,

    Alexis
    Attached Files

  • #2
    I've simplified the problem, to make the troubleshooting easier.
    I've run bismark_methylation_extractor a second time, including overlaps.
    I get 13 bases aligning at position 20665340, whereas I count 14 when I view the SAM generated with Bismark in IGV.
    So, why the discrepancy? Is it the bottom read that appears to be truncated that is not counted? I don't see any mention in the documentation of reads being skipped. Or is it a bug?

    I've put in attachment the IGV screenshot, this time not showing paired reads together, to make the counting easier.

    [blancha@lg-1r17-n04 methylation_extractor_with_overlap]$ grep 20665340 *bismark.cov
    Y 20665340 20665340 92.3076923076923 12 1

    bismark_methylation_extractor \
    --include_overlap \
    --paired-end \
    --bedGraph \
    --buffer_size 8100M \
    --CX_context \
    --zero_based \
    --merge_non_CpG \
    --comprehensive \
    --output ../../results/bismark/AtTneo/Y/methylation_extractor_with_overlap \
    ../../results/bismark/AtTneo/Y/AtTneo_Y_sorted_read_name.deduplicated.bam
    Attached Files

    Comment


    • #3
      We'd need to see the lines from the SAM file to give you a real answer. My guess is that one of the alignments is to the original top strand while the others are to the original bottom.

      Comment


      • #4
        Thank you.
        You are absolutely correct.

        I have put in attachment the IGV screenshot with the reads colored by the XG tag. 13 reads have the XG tag set to GA, while one read has the tag set to CT.
        So, I suppose that one read with the tag set to CT is excluded from the bismark_methylation_extractor count.
        I've put at the botttom the details for the one read that appears not to be counted by bismark_methylation_extractor.
        I'm still not absolutely clear on why the read is excluded from the methylation counts based on the tag, but I'll think about it a a bit more. I suppose that bismark_methylation_extractor has concluded that the original sequenced base was a G and not a C, and therefore could not be methylated?

        It's a bit frustrating to get different results from the Bismark SAM file, the bismark_methylation_extractor, and methylKit, but I've made progress in understanding why. bismark_methylation_extractor excludes overlapping reads by default. methylKit's read.bismark function excludes bases with a Phred quality score below 20 by default. bismark_methylation_extractor appears to exclude certain reads based on their XG and XR tags, according to criteria that I do not fully understand yet (not counted if sequenced base was a G?).

        Read name = HWI-ST915:46:C6ANNACXX:5:2112:8591:53033_1:N:0:
        ----------------------
        Location = chrY:20,665,340
        Alignment start = 20,665,340 (+)
        Cigar = 53M
        Mapped = yes
        Mapping quality = 15
        Secondary = no
        Supplementary = no
        Duplicate = no
        Failed QC = no
        ----------------------
        Base = G
        Base phred quality = 33
        ----------------------
        Mate is mapped = yes
        Mate start = chrY:20665426 (-)
        Insert size = 186
        First in pair
        Pair orientation = F1R2
        ----------------------
        MD = 5C0C1C0C1C4C7C14C0C0C2C0C2C1C1G0
        XG = CT
        NM = 15
        XM = .....hh.hh.x....h.......x..............hhh..hh..h.x..
        XR = CT
        -------------------
        Alignment start position = chrY:20665340
        GTTTTTTTTTTTTGATTTATATAATTGGAAAAATAATAATTTTTTTTTTTTTT
        Attached Files

        Comment


        • #5
          That alignment is to the top strand, so a G will always just be a G (if the read had an A there then it'd be either a SNP or a sequencing error). Alignments to the top strand give information on C->T transitions. Alignments to the bottom strand give information on G->A transition. No alignment (or pair) will give information on both.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          18 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          22 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          17 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          49 views
          0 likes
          Last Post seqadmin  
          Working...
          X