SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Non-C methylation in Bismark .cov files bedward1 Epigenetics 5 02-04-2015 04:42 AM
BISMARK paired-end result sam file contains on read1 info mbk0asis Epigenetics 3 01-08-2015 12:24 AM
Determine different methylation patterns after Bismark cast457 Bioinformatics 4 10-11-2014 05:08 AM
Help with Bismark methylation extractor Dipro Bioinformatics 2 07-23-2014 05:18 AM
Question about sam output of Bismark serenaliao Epigenetics 3 02-12-2013 11:37 AM

Reply
 
Thread Tools
Old 07-01-2015, 02:14 PM   #1
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default Discrepancy between Bismark SAM file and bismark methylation extractor

Hi,

I'm flummoxed by a discrepancy between the Bismark SAM file and the results of bismark methylation extractor.

bismark_methylation_extractory counts 11 reads at the location 20,665,340 on chromosome Y.
When I view the SAM file in IGV (after conversion to the BAM format and indexing), I can see that there are 14 reads at that location.
Given that four of the reads are two pairs of overlapping reads, I would expect that the count for bismark methylation extractor would be 12, since bismark_methalytation_extractor only counts the first member of overlapping reads.
According to my understanding of bismark_methylation_extractor, it should therefore count 12 reads.
Why does it count 11 reads?
On what criteria does it exclude another read?
I've put in attachment a screenshot of IGV for the location of interest.
I converted the cov file to the bigWig format, but the results are identical if I go back to the original cov file.

[alexis@lg-1r17-n03 methylation_extractor]$ grep 20665340 AtTneo_Y_sorted_read_name.deduplicated.bismark.cov
Y 20665340 20665340 90.9090909090909 10 1

Here is also the original command.

bismark_methylation_extractor \
--paired-end \
--bedGraph \
--buffer_size 8100M \
--CX_context \
--zero_based \
--merge_non_CpG \
--comprehensive \
--output ../../results/bismark/AtTneo/Y/methylation_extractor \
../../results/bismark/AtTneo/Y/AtTneo_Y_sorted_read_name.deduplicated.bam

Thank you for your help,

Alexis
Attached Images
File Type: png Screen Shot 2015-07-01 at 5.12.31 PM.png (71.8 KB, 8 views)
blancha is offline   Reply With Quote
Old 07-01-2015, 03:56 PM   #2
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

I've simplified the problem, to make the troubleshooting easier.
I've run bismark_methylation_extractor a second time, including overlaps.
I get 13 bases aligning at position 20665340, whereas I count 14 when I view the SAM generated with Bismark in IGV.
So, why the discrepancy? Is it the bottom read that appears to be truncated that is not counted? I don't see any mention in the documentation of reads being skipped. Or is it a bug?

I've put in attachment the IGV screenshot, this time not showing paired reads together, to make the counting easier.

[blancha@lg-1r17-n04 methylation_extractor_with_overlap]$ grep 20665340 *bismark.cov
Y 20665340 20665340 92.3076923076923 12 1

bismark_methylation_extractor \
--include_overlap \
--paired-end \
--bedGraph \
--buffer_size 8100M \
--CX_context \
--zero_based \
--merge_non_CpG \
--comprehensive \
--output ../../results/bismark/AtTneo/Y/methylation_extractor_with_overlap \
../../results/bismark/AtTneo/Y/AtTneo_Y_sorted_read_name.deduplicated.bam
Attached Images
File Type: png IGV screenshot.png (84.8 KB, 3 views)
blancha is offline   Reply With Quote
Old 07-02-2015, 12:19 AM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

We'd need to see the lines from the SAM file to give you a real answer. My guess is that one of the alignments is to the original top strand while the others are to the original bottom.
dpryan is offline   Reply With Quote
Old 07-02-2015, 07:04 AM   #4
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Thank you.
You are absolutely correct.

I have put in attachment the IGV screenshot with the reads colored by the XG tag. 13 reads have the XG tag set to GA, while one read has the tag set to CT.
So, I suppose that one read with the tag set to CT is excluded from the bismark_methylation_extractor count.
I've put at the botttom the details for the one read that appears not to be counted by bismark_methylation_extractor.
I'm still not absolutely clear on why the read is excluded from the methylation counts based on the tag, but I'll think about it a a bit more. I suppose that bismark_methylation_extractor has concluded that the original sequenced base was a G and not a C, and therefore could not be methylated?

It's a bit frustrating to get different results from the Bismark SAM file, the bismark_methylation_extractor, and methylKit, but I've made progress in understanding why. bismark_methylation_extractor excludes overlapping reads by default. methylKit's read.bismark function excludes bases with a Phred quality score below 20 by default. bismark_methylation_extractor appears to exclude certain reads based on their XG and XR tags, according to criteria that I do not fully understand yet (not counted if sequenced base was a G?).

Read name = HWI-ST915:46:C6ANNACXX:5:2112:8591:53033_1:N:0:
----------------------
Location = chrY:20,665,340
Alignment start = 20,665,340 (+)
Cigar = 53M
Mapped = yes
Mapping quality = 15
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = G
Base phred quality = 33
----------------------
Mate is mapped = yes
Mate start = chrY:20665426 (-)
Insert size = 186
First in pair
Pair orientation = F1R2
----------------------
MD = 5C0C1C0C1C4C7C14C0C0C2C0C2C1C1G0
XG = CT
NM = 15
XM = .....hh.hh.x....h.......x..............hhh..hh..h.x..
XR = CT
-------------------
Alignment start position = chrY:20665340
GTTTTTTTTTTTTGATTTATATAATTGGAAAAATAATAATTTTTTTTTTTTTT
Attached Images
File Type: png Screen Shot 2015-07-02 at 10.01.55 AM.png (93.8 KB, 9 views)
blancha is offline   Reply With Quote
Old 07-02-2015, 07:08 AM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

That alignment is to the top strand, so a G will always just be a G (if the read had an A there then it'd be either a SNP or a sequencing error). Alignments to the top strand give information on C->T transitions. Alignments to the bottom strand give information on G->A transition. No alignment (or pair) will give information on both.
dpryan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:31 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO