Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Discrepancy between Bismark SAM file and bismark methylation extractor

    Hi,

    I'm flummoxed by a discrepancy between the Bismark SAM file and the results of bismark methylation extractor.

    bismark_methylation_extractory counts 11 reads at the location 20,665,340 on chromosome Y.
    When I view the SAM file in IGV (after conversion to the BAM format and indexing), I can see that there are 14 reads at that location.
    Given that four of the reads are two pairs of overlapping reads, I would expect that the count for bismark methylation extractor would be 12, since bismark_methalytation_extractor only counts the first member of overlapping reads.
    According to my understanding of bismark_methylation_extractor, it should therefore count 12 reads.
    Why does it count 11 reads?
    On what criteria does it exclude another read?
    I've put in attachment a screenshot of IGV for the location of interest.
    I converted the cov file to the bigWig format, but the results are identical if I go back to the original cov file.

    [alexis@lg-1r17-n03 methylation_extractor]$ grep 20665340 AtTneo_Y_sorted_read_name.deduplicated.bismark.cov
    Y 20665340 20665340 90.9090909090909 10 1

    Here is also the original command.

    bismark_methylation_extractor \
    --paired-end \
    --bedGraph \
    --buffer_size 8100M \
    --CX_context \
    --zero_based \
    --merge_non_CpG \
    --comprehensive \
    --output ../../results/bismark/AtTneo/Y/methylation_extractor \
    ../../results/bismark/AtTneo/Y/AtTneo_Y_sorted_read_name.deduplicated.bam

    Thank you for your help,

    Alexis
    Attached Files

  • #2
    I've simplified the problem, to make the troubleshooting easier.
    I've run bismark_methylation_extractor a second time, including overlaps.
    I get 13 bases aligning at position 20665340, whereas I count 14 when I view the SAM generated with Bismark in IGV.
    So, why the discrepancy? Is it the bottom read that appears to be truncated that is not counted? I don't see any mention in the documentation of reads being skipped. Or is it a bug?

    I've put in attachment the IGV screenshot, this time not showing paired reads together, to make the counting easier.

    [blancha@lg-1r17-n04 methylation_extractor_with_overlap]$ grep 20665340 *bismark.cov
    Y 20665340 20665340 92.3076923076923 12 1

    bismark_methylation_extractor \
    --include_overlap \
    --paired-end \
    --bedGraph \
    --buffer_size 8100M \
    --CX_context \
    --zero_based \
    --merge_non_CpG \
    --comprehensive \
    --output ../../results/bismark/AtTneo/Y/methylation_extractor_with_overlap \
    ../../results/bismark/AtTneo/Y/AtTneo_Y_sorted_read_name.deduplicated.bam
    Attached Files

    Comment


    • #3
      We'd need to see the lines from the SAM file to give you a real answer. My guess is that one of the alignments is to the original top strand while the others are to the original bottom.

      Comment


      • #4
        Thank you.
        You are absolutely correct.

        I have put in attachment the IGV screenshot with the reads colored by the XG tag. 13 reads have the XG tag set to GA, while one read has the tag set to CT.
        So, I suppose that one read with the tag set to CT is excluded from the bismark_methylation_extractor count.
        I've put at the botttom the details for the one read that appears not to be counted by bismark_methylation_extractor.
        I'm still not absolutely clear on why the read is excluded from the methylation counts based on the tag, but I'll think about it a a bit more. I suppose that bismark_methylation_extractor has concluded that the original sequenced base was a G and not a C, and therefore could not be methylated?

        It's a bit frustrating to get different results from the Bismark SAM file, the bismark_methylation_extractor, and methylKit, but I've made progress in understanding why. bismark_methylation_extractor excludes overlapping reads by default. methylKit's read.bismark function excludes bases with a Phred quality score below 20 by default. bismark_methylation_extractor appears to exclude certain reads based on their XG and XR tags, according to criteria that I do not fully understand yet (not counted if sequenced base was a G?).

        Read name = HWI-ST915:46:C6ANNACXX:5:2112:8591:53033_1:N:0:
        ----------------------
        Location = chrY:20,665,340
        Alignment start = 20,665,340 (+)
        Cigar = 53M
        Mapped = yes
        Mapping quality = 15
        Secondary = no
        Supplementary = no
        Duplicate = no
        Failed QC = no
        ----------------------
        Base = G
        Base phred quality = 33
        ----------------------
        Mate is mapped = yes
        Mate start = chrY:20665426 (-)
        Insert size = 186
        First in pair
        Pair orientation = F1R2
        ----------------------
        MD = 5C0C1C0C1C4C7C14C0C0C2C0C2C1C1G0
        XG = CT
        NM = 15
        XM = .....hh.hh.x....h.......x..............hhh..hh..h.x..
        XR = CT
        -------------------
        Alignment start position = chrY:20665340
        GTTTTTTTTTTTTGATTTATATAATTGGAAAAATAATAATTTTTTTTTTTTTT
        Attached Files

        Comment


        • #5
          That alignment is to the top strand, so a G will always just be a G (if the read had an A there then it'd be either a SNP or a sequencing error). Alignments to the top strand give information on C->T transitions. Alignments to the bottom strand give information on G->A transition. No alignment (or pair) will give information on both.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM
          • seqadmin
            Techniques and Challenges in Conservation Genomics
            by seqadmin



            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

            Avian Conservation
            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
            03-08-2024, 10:41 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 06:37 PM
          0 responses
          8 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Yesterday, 06:07 PM
          0 responses
          8 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-22-2024, 10:03 AM
          0 responses
          49 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-21-2024, 07:32 AM
          0 responses
          67 views
          0 likes
          Last Post seqadmin  
          Working...
          X