SEQanswers

09-19-2019, 05:14 AM   #1
altintasali
Junior Member
Location: Copenhagen
Join Date: Sep 2019
Posts: 1

Paired-end RRBS with weird M-bias on read-2

Dear all,

I have a paired-end RRBS dataset from mouse and I am a bit puzzled, since the M-bias plots show weird peaks, especially on read-2. I would like to ask your opinion on whether I should consider a different approach for my RRBS analysis.

I have also attached a png file of the bismark2report output, which might help you understand my problem in detail. I should also note that I have 12 libraries and they all show the same characteristics.

Questions
The percentage of methylated non-CpG (CHG and CHH) cytosines I observe is ~5-6%. As far as I understand, this can be interpreted as a bisulfite conversion efficiency of at least 94-95%.
[Question 1]: Is this a bad efficiency? Would you rather not proceed with the analysis of a library with this much non-CpG methylation?

Regarding Read-1, the M-bias plot shows a fairly stable distribution of CpG methylation across all positions except the first 3 bases.
[Question 2]: However, there are some weird spikes for CHG (14 bp) and CHH (24, 34 bp) methylation. Why do you think these anomalies exist?

More interestingly, Read-2 has a big spike at the 10th bp for CpG methylation and a huge methylation increase at the 3' end, while still showing spikes at different positions for CHG and CHH methylation.
[Question 3]: Why is there a methylation increase at the 3' end of Read-2? Is it due to the end-repair reaction?
[Question 4]: Do you have an explanation for the methylation spike at the 10th bp of Read-2? Should I trim the reads until I get rid of the spike at the 10th position?
[Question 5]: More importantly, would you confidently use this RRBS dataset? Are there any steps, diagnostics or considerations that you would recommend?


You can find detailed information below about the library and the pipeline I followed:
Library
Sequencing type: Paired-end RRBS (Reduced Representation Bisulfite Sequencing)
Sequencer: Illumina Nextseq 500
Organism: Mouse

Pipeline
1. Reads were trimmed using trim_galore with the "--rrbs" and "--paired" options.
2. Trimmed reads were mapped to the mouse genome with the bismark bisulfite mapper using default settings.
3. Methylation information for individual cytosines was extracted with bismark_methylation_extractor using default settings.
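
For anyone who wants to reproduce the three steps, here is a minimal Python sketch that only assembles and prints the command lines; it does not run them. All file names and the genome folder are hypothetical placeholders, and `build_pipeline` is a helper invented for illustration. Trim Galore's paired-end switch is `--paired`, and `--ignore_3prime_r2` is an optional bismark_methylation_extractor setting that is not part of the default run described above.

```python
import shlex


def build_pipeline(r1, r2, genome_dir, trimmed_r1, trimmed_r2, bam,
                   ignore_3prime_r2=0):
    """Return the three command lines as argument lists (nothing is executed)."""
    # Step 1: adapter/quality trimming in RRBS mode for paired-end reads.
    trim = ["trim_galore", "--rrbs", "--paired", r1, r2]
    # Step 2: bisulfite alignment of the trimmed read pairs.
    align = ["bismark", "--genome", genome_dir, "-1", trimmed_r1, "-2", trimmed_r2]
    # Step 3: per-cytosine methylation extraction from the alignment file.
    extract = ["bismark_methylation_extractor", "--paired-end"]
    if ignore_3prime_r2 > 0:
        # Optional: skip the last N bases of Read 2 during methylation calling.
        extract += ["--ignore_3prime_r2", str(ignore_3prime_r2)]
    extract.append(bam)
    return [trim, align, extract]


# Hypothetical file names, for illustration only.
commands = build_pipeline(
    r1="sample_R1.fastq.gz",
    r2="sample_R2.fastq.gz",
    genome_dir="/path/to/mouse_genome/",
    trimmed_r1="sample_R1_trimmed.fq.gz",
    trimmed_r2="sample_R2_trimmed.fq.gz",
    bam="sample_aligned_pe.bam",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Passing `ignore_3prime_r2=1` would add the option that drops the final Read 2 base from methylation calling, which is one way to test whether the 3' spike matters downstream.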

Thank you so much in advance for your help and time.
Attached Files
File Type: pdf bismark2report_weird.m-bias.pdf (1.06 MB, 5 views)
09-24-2019, 03:51 AM   #2
fkrueger
Senior Member
Location: Cambridge, UK
Join Date: Sep 2009
Posts: 620

Dear Ali,

Thanks for your kind words, and for your thoughtful questions. As you will see below, I will probably not be able to give you a satisfactory answer to all questions you raised, but I will try to share my view on some of the issues nevertheless.

RRBS data has always looked quite ‘funky’ when it comes to M-bias plots. We have so far mostly chosen to simply accept this ‘as is’, especially given that we have more or less not used RRBS ourselves for more than six years…

To Question 1:
I don’t think the bisulfite conversion efficiency should necessarily be judged based on the overall methylation percentage. The report you attached shows that the non-CG methylation is ~2.8% overall (which would mean a conversion efficiency of at least 97.2%), but one can see that the M-bias plots are not at all behaving uniformly. It rather looks like the non-CG methylation is well under 1% for most positions (just mouse over the plot; they are probably mostly ~0.4-0.6%), but there are some positions that show around 16-20% methylation. Such positions (see also Q2) will have a big impact on the average methylation percentage, and therefore get an unfair say in judging the conversion efficiency. We would argue that conversion efficiency should not discriminate by position (or context), so the lowest methylation average you see anywhere in the reads or in the genome can be used as a proxy for conversion efficiency.

This means that if you see a non-CG methylation of ~0.4% for most parts of all reads, that number has to be the combination of i) true non-CG methylation, ii) bisulfite conversion failure and iii) mismapping effects. If you now assume that there are hardly any mismapping effects, and that there is hardly any non-CG methylation in the cell type you are looking at, then virtually all of the 0.4% methylation would be conversion errors (in reality it is probably a combination of all three effects though). So even in the worst case, I would argue that the conversion efficiency must have been at least 99.6%. That is a value I would find perfectly acceptable.
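
As a rough illustration of this arithmetic, here is a small Python sketch. The call counts are invented (44 read positions at 0.4% and 6 "spike" positions at 18%); they are not taken from the actual report, and the function names are hypothetical.

```python
def pooled_methylation(calls):
    """Overall % methylation, pooling all calls across read positions."""
    meth = sum(m for m, _ in calls)
    total = sum(m + u for m, u in calls)
    return 100.0 * meth / total


def per_position_methylation(calls):
    """% methylation computed separately for each read position."""
    return [100.0 * m / (m + u) for m, u in calls if m + u > 0]


def conversion_lower_bound(calls):
    """Proxy for conversion efficiency: 100% minus the lowest per-position value."""
    return 100.0 - min(per_position_methylation(calls))


# Invented (methylated, unmethylated) non-CG call counts per read position.
calls = [(4, 996)] * 44 + [(180, 820)] * 6

print(f"pooled non-CG methylation: {pooled_methylation(calls):.2f}%")
print(f"conversion efficiency, lower bound: {conversion_lower_bound(calls):.1f}%")
```

With these made-up numbers a handful of spiked positions pull the pooled value far above the baseline seen at most positions, which is why the pooled figure understates the conversion efficiency.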

To Questions 2 and 3:
Spikes at individual positions: I would think that such positions come from very repetitive regions in the genome, and are possibly the result of mis-mapping, or suffer from conversion failure because of some kind of higher-order structure (something demonstrated very convincingly by your colleagues for methylation of the mitochondrion, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5671948/).

Spikes at the 3’ end of Read 2 are the result of stringent adapter trimming. The Illumina adapter starts with AGATC…, so reads will never end in A, AG, AGA, etc. Read 1 participates in methylation calling only at C or T positions; however, since Read 2 is the reverse complement of R1, methylation calling there occurs at G and A positions. Since reads may never end in A, the very last position of a Read 2 can never be found in an unmethylated state. While it is true that this in theory introduces a bias for that very last position, one should take into consideration that:

a) it really only ever occurs at the very last position of R2, which is not always the same for every read (although arguably more likely for RRBS),

b) the total number of calls at the very last position is typically quite low (just mouse over for details),

c) the very last position of R2 is also subject to overlap removal if the read overlaps with R1 (fairly likely).

In other words: Yes, this position is biased towards being called methylated, but it will almost certainly not have any impact on your results as a whole whatsoever.
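
To make the mechanism concrete, here is a toy Python model. It is not how Cutadapt actually aligns adapters: it assumes a single pass, exact matches and a minimum overlap of 1 bp, enumerates all possible 6-mers as stand-in "reads", and treats every terminal G or A as a methylation call regardless of genomic context. The function names are hypothetical. In this simplified model a trailing A is strongly depleted rather than completely absent, because a read ending in AA only loses its final base.

```python
from itertools import product

ADAPTER = "AGATCGGAAGAGC"  # start of the Illumina adapter sequence


def trim_3prime(read, adapter=ADAPTER):
    """Remove the longest read suffix that exactly equals a prefix of the adapter."""
    for k in range(min(len(read), len(adapter)), 0, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def methylated_fraction_at_last_base(reads):
    """Toy Read 2 call: terminal G = methylated, terminal A = unmethylated."""
    g = sum(1 for r in reads if r.endswith("G"))
    a = sum(1 for r in reads if r.endswith("A"))
    return g / (g + a)


# All 4096 possible 6-mers, so every terminal base is equally common before trimming.
reads = ["".join(p) for p in product("ACGT", repeat=6)]
# Trim each read once and drop reads that were trimmed away completely.
trimmed = [t for t in (trim_3prime(r) for r in reads) if t]

before = methylated_fraction_at_last_base(reads)
after = methylated_fraction_at_last_base(trimmed)
print(f"methylated fraction at last base, before trimming: {before:.2f}")
print(f"methylated fraction at last base, after trimming:  {after:.2f}")
```

The shift towards "methylated" only affects the final base of each trimmed read, which matches the point that the bias is confined to the very last position.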

Regarding the spiked positions again: You should be able to look at the genomic distribution of alignments. I would predict that there will be certain positions in the genome (e.g. close to the edges of chromosomes or centromeres, the MT etc.) where you will find thousands of reads aligned to the very same position (which could harbour the conversion artefacts). Depending on how you move on with downstream analysis, these positions might be completely irrelevant for your further results. These positions can have quite some influence on the overall numbers and average stats (the ones you find in the Bismark report), but if you call the average methylation over larger regions, you collapse the methylation values of tens of thousands of reads down to a single methylation percentage. In such an analysis, the artefactually high read coverage would have no higher say than any other region of the genome.
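
A minimal Python sketch of both ideas, using invented numbers: first flag positions where an unusually large number of reads start at the very same coordinate, then compare a read-weighted average with an average in which every position counts once. The threshold of 1000 reads, the coordinates and the function names are assumptions for illustration only.

```python
from collections import Counter


def pileup_positions(start_positions, min_reads=1000):
    """Return {position: read count} for positions with at least min_reads starts."""
    counts = Counter(start_positions)
    return {pos: n for pos, n in counts.items() if n >= min_reads}


def read_weighted_methylation(calls):
    """% methylation pooling all calls, so deep positions dominate."""
    meth = sum(m for m, _ in calls.values())
    total = sum(m + u for m, u in calls.values())
    return 100.0 * meth / total


def position_averaged_methylation(calls):
    """% methylation where each position contributes one value, whatever its depth."""
    per_pos = [100.0 * m / (m + u) for m, u in calls.values() if m + u > 0]
    return sum(per_pos) / len(per_pos)


# Invented alignment start positions: one extreme pileup plus two ordinary reads.
starts = [("chr1", 3000)] * 1500 + [("chr1", 5000), ("chr2", 7000)]
print(pileup_positions(starts))

# Invented (methylated, unmethylated) calls: 9 positions at 10% with 20x coverage,
# and one artefactual position at 90% with 10,000x coverage.
calls = {pos: (2, 18) for pos in range(100, 109)}
calls[109] = (9000, 1000)

print(f"read-weighted:     {read_weighted_methylation(calls):.1f}%")
print(f"position-averaged: {position_averaged_methylation(calls):.1f}%")
```

In the position-averaged figure the pileup still contributes, but only as one position among ten, so its artefactually high coverage no longer dominates the result.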

To conclude, I would not hesitate to continue working with the data. And in any case, once you have found potential regions of interest, you should go back to the original data and convince yourself that you trust the underlying signal at that position.

I hope this helps a little.
All the best, Felix

Tags
bismark, m-bias, trim_galore
