  • Paired-end RRBS with weird M-bias on read-2

    Dear all,

    I have a paired-end RRBS dataset from mouse and I am a bit puzzled since the M-bias plots show weird peaks especially on read-2. I would like to ask your opinion whether I should be considering a different approach on my RRBS analysis.

    I have also attached a PNG of the bismark2report output, which might help you understand my problem in detail. I should also note that I have 12 libraries and they all show the same characteristics.

    Questions
    The percentage of methylated non-CpG cytosines (CHG and CHH) I observe is ~5-6%. As far as I understand, this can be interpreted as a bisulfite conversion efficiency of at least 94-95%.
    [Question 1]: Is this a bad efficiency? Would you rather not proceed with the analysis of a library with this much non-CpG methylation?

    Regarding Read-1, the M-bias plot shows a fairly stable distribution of CpG methylation across all positions except the first 3 bases.
    [Question 2]: However, there are some weird spikes in CHG (at 14 bp) and CHH (at 24 and 34 bp) methylation. Why do you think these anomalies exist?

    More interestingly, Read-2 has a big spike in CpG methylation at the 10th bp and a huge methylation increase at the 3' end, while CHG and CHH methylation again show spikes at various positions.
    [Question 3]: Why is there a methylation increase at the 3' end of Read-2? Is it due to the end-repair reaction?
    [Question 4]: Do you have an explanation for the methylation spike at the 10th bp of Read-2? Shall I trim the reads until I get rid of the spike at the 10th position?
    [Question 5]: More importantly, would you confidently use this RRBS dataset? Are there any steps, diagnostics or considerations that you would recommend?


    You can find detailed information below about the library and the pipeline I followed:
    Library
    Sequencing type: Paired-end RRBS (Reduced Representation Bisulfite Sequencing)
    Sequencer: Illumina Nextseq 500
    Organism: Mouse

    Pipeline
    1. Reads were trimmed using trim_galore with the "--rrbs" and "--paired" options.
    2. Trimmed reads were mapped to the mouse genome with the bismark bisulfite mapper using default settings.
    3. Methylation information for individual cytosines was extracted with bismark_methylation_extractor using default settings.

    Thank you so much in advance for your help and time.

  • #2
    Dear Ali,

    Thanks for your kind words, and for your thoughtful questions. As you will see below, I will probably not be able to give you a satisfactory answer to all questions you raised, but I will try to share my view on some of the issues nevertheless.

    RRBS data has always looked quite ‘funky’ when it comes to M-bias plots. We have so far mostly chosen to simply accept this ‘as is’, especially given that we have more or less not used RRBS ourselves for more than six years…

    To Question 1:
    I don’t think the bisulfite conversion efficiency should necessarily be judged from the overall methylation percentage. The report you attached shows that non-CG methylation is ~2.8% overall (which would mean a conversion efficiency of at least 97.2%), but the M-bias plots do not behave uniformly at all. It rather looks like non-CG methylation is well under 1% at most positions (just mouse over the plot; most values are probably ~0.4-0.6%), while a few positions show around 16-20% methylation. Such positions (see also Q2) have a big impact on the average methylation percentage, and therefore get an unfair say in judging the conversion efficiency. We would argue that conversion efficiency should not discriminate by position (or context), so the lowest methylation average you see anywhere in the reads or in the genome can be used as a proxy for conversion efficiency.

    This means that if you see a non-CG methylation of ~0.4% for most parts of all reads, that number has to be the combination of i) true non-CG methylation, ii) bisulfite conversion failure and iii) mismapping effects. If you now assume that there are hardly any mismapping effects, and that there is hardly any non-CG methylation in the cell type you are looking at, then virtually all of the 0.4% methylation would be conversion errors (in reality it is probably a combination of all three effects though). So even in the worst case, I would argue that the conversion efficiency must have been at least 99.6%, a value that I would find perfectly acceptable.
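
To make the arithmetic above concrete, here is a minimal Python sketch. The per-position values are made up for illustration (they only mimic the "mostly ~0.4-0.6% with a few spikes" pattern described above), and the function name is hypothetical. Note that the plain mean over positions is not identical to the overall percentage in the Bismark report, which is weighted by the number of calls per position.

```python
from statistics import median

def conversion_efficiency_bounds(per_position_pct):
    """Lower bounds on conversion efficiency (%) from per-position non-CG methylation (%).

    Each bound assumes that ALL observed non-CG methylation is conversion failure.
    """
    mean_pct = sum(per_position_pct) / len(per_position_pct)
    return {
        "from_mean": 100.0 - mean_pct,                      # dominated by spikes
        "from_median": 100.0 - median(per_position_pct),    # robust to a few spikes
        "from_minimum": 100.0 - min(per_position_pct),      # lowest average seen anywhere
    }

# Hypothetical non-CG methylation (%) at ten read positions, with two spiked positions.
example = [0.5, 0.4, 0.6, 18.0, 0.5, 0.4, 0.5, 16.0, 0.6, 0.5]
bounds = conversion_efficiency_bounds(example)
for name, value in bounds.items():
    print(f"{name}: {value:.1f}%")
```

With these made-up numbers the spike-dominated mean suggests a much poorer efficiency than the median or minimum, which is the point made above about spiked positions getting an unfair say.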

    To Questions 2 and 3:
    Spikes at individual positions: I would think that such positions come from very repetitive regions in the genome, and are possibly the result of mis-mapping, or suffer from conversion failure because of some kind of higher-order structure (something demonstrated very convincingly by your colleagues for methylation of mitochondrial DNA, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5671948/).

    Spikes at the 3’ end of Read 2 are the result of stringent adapter trimming. The Illumina adapter starts with AGATC…, so reads will never end in A, AG, AGA, etc. Read 1 participates in methylation calling only at C or T positions; however, since Read 2 is the reverse complement of R1, its methylation calling occurs at G and A positions. Since reads may never end in A, the very last position of a Read 2 can never be found in an unmethylated state. While this does in theory introduce a bias for that very last position, one should take into consideration that:

    a) it really only ever occurs at the very last position of R2, which is not always the same for every read (though arguably more likely for RRBS),

    b) the total number of calls at the very last position is typically quite low (just mouse over for details),

    c) the very last position of R2 is also subject to overlap removal if the read overlaps with R1 (fairly likely).

    In other words: Yes, this position is biased towards being called methylated, but it will almost certainly not have any impact on your results as a whole whatsoever.
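
The trimming effect described above can be illustrated with a small Python sketch. This is a deliberately simplified, exact-match, single-pass model and not the actual algorithm used by Trim Galore or Cutadapt; the function name is hypothetical, and only the five adapter bases quoted above (AGATC) are used, since the rest of the adapter is not given here.

```python
# Only the adapter bases quoted in the post; the remainder is deliberately not assumed.
ADAPTER_PREFIX = "AGATC"

def trim_partial_adapter(read, adapter=ADAPTER_PREFIX, min_overlap=1):
    """Remove the longest read suffix that exactly matches a prefix of the adapter.

    With min_overlap=1 (a very stringent setting), even a single trailing 'A'
    is treated as the start of the adapter and clipped.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(read), len(adapter))
    # Try the longest possible overlap first, then shorter ones.
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read

# Hypothetical Read 2 sequences: a trailing A, AG or AGA is clipped,
# whereas a trailing G is left untouched.
for read in ["TTGGTA", "TTGGAG", "TTGGAGA", "TTGGTG"]:
    print(read, "->", trim_partial_adapter(read))
```

In Read 2 an A at a cytosine position is the unmethylated call and a G is the methylated one, so clipping trailing A's preferentially removes unmethylated calls at the final position, while methylated calls (G) survive.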

    Regarding the spiked positions again: You should be able to look at the genomic distribution of alignments. I would predict that there will be certain positions in the genome (e.g. close to the edges of chromosomes or centromeres, the MT etc.) where you will find thousands of reads aligned to the very same position (which could harbour the conversion artefacts). Depending on how you move on with downstream analysis, these positions might be completely irrelevant for your further results. These positions can have quite some influence on the overall numbers and average stats (the ones you find in the Bismark report), but if you call the average methylation over larger regions, you collapse the methylation values of tens of thousands of reads down to a single methylation percentage. In such an analysis, the artefactually high read coverage would have no higher say than any other region of the genome.
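
Both checks above can be sketched in a few lines of Python. Everything here is illustrative: the function names, the threshold and the numbers are made up, and the alignment starts are plain (chromosome, position) tuples so that the example runs without any external library.

```python
from collections import Counter

def find_pileups(alignment_starts, min_reads=1000):
    """Return start positions shared by at least `min_reads` alignments."""
    counts = Counter(alignment_starts)
    return {pos: n for pos, n in counts.items() if n >= min_reads}

def pooled_methylation(calls):
    """Methylation % over all calls: high-coverage positions dominate."""
    meth = sum(m for m, _ in calls)
    total = sum(m + u for m, u in calls)
    if total == 0:
        raise ValueError("no methylation calls")
    return 100.0 * meth / total

def per_position_mean_methylation(calls):
    """Mean of per-position methylation %: every covered position counts once."""
    pcts = [100.0 * m / (m + u) for m, u in calls if m + u > 0]
    if not pcts:
        raise ValueError("no covered positions")
    return sum(pcts) / len(pcts)

# Hypothetical alignment starts; with real data these could be taken from the BAM file,
# e.g. (read.reference_name, read.reference_start) for reads from pysam.AlignmentFile.
starts = [("chr1", 100)] * 5 + [("chrM", 42)] * 2000
print(find_pileups(starts))

# Hypothetical (methylated, unmethylated) call counts at ten cytosines in one region:
# nine ordinary positions plus one artefactual pile-up.
calls = [(2, 98)] * 9 + [(9000, 1000)]
print(f"pooled over calls:  {pooled_methylation(calls):.1f}%")
print(f"mean over positions: {per_position_mean_methylation(calls):.1f}%")
```

In the pooled figure the pile-up position dominates, whereas in the per-position mean it counts exactly as much as any other cytosine, which is why such artefacts matter far less once methylation is summarised per position or per region.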

    To conclude, I would not hesitate to continue working with the data. In any case, once you have found potential regions of interest, you should go back to the original data and convince yourself that you trust the underlying signal at that position.

    I hope this helps a little.
    All the best, Felix
