  • BS-Seq mapping efficiency, what can be expected?

    Hi,

    I'm looking for references on expected and acceptable mapping efficiency (i.e. the % of mappable reads) in different BS-Seq scenarios. This is prompted by my own experiments, but it should also be of general interest. So far I have seen few definitive figures on this in BS-Seq papers.

    100% efficiency can't be expected since there's always at least a bit of DNA degradation by bisulfite. Furthermore, there are differences between genome-wide and RRBS data, e.g. due to the amount of repeats and ambiguous reads.

    One recent publication from the Babraham Institute seems to indicate that 80-90% mapping efficiency can be routinely expected in BS-Seq base space (Fig. 2b of "DNA methylome analysis using short bisulfite sequencing data", http://www.nature.com/nmeth/journal/...meth.1828.html). Did I understand that right, or was this based on simulated/ideal data after all?

    But I also found a post by Felix Krueger stating that 68% mapping efficiency is already fair for BS-Seq paired-end (http://seqanswers.com/forums/showthr...?t=8140&page=3).

    About paired-end: as I understand it, mapping quality is usually slightly lower than for single-end because both mates of a pair need to be acceptable. It would also be interesting to find out whether there are BS-Seq specific differences in mapping efficiency between single- and paired-end data.

  • #2
    Hi Mixter,

    I think it is fair to say that mapping efficiency in BS-Seq is a function of the read length, although the gain in mapping efficiency gets smaller with increasing read lengths. The figure you are probably referring to (Fig. 2?) was indeed done with simulated data that did not contain any Ns.

    Real world datasets tend to contain quite a number of sequences that can't be mapped, and this is probably a combination of several factors:
    - reads that come from regions in the genome that are not actually present in the genome assembly (e.g. plenty of sequence in the genome builds around centromeres or towards the ends is simply masked by Ns)
    - reads from repetitive regions that can't be mapped uniquely
    - reads with adapter or primer contamination or other artefacts generated during library generation

    Just to give you some ballpark figures, we regularly see around 60-68% mapping efficiency for 40bp long RRBS (SE) reads. I have seen some high quality (quality and adapter trimmed) longer datasets of 75-100bp that were getting close to the 80% mark, and this is already quite high for standard genomic sequence mapping.
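To make the ballpark figures above concrete, here is a minimal Python sketch of how mapping efficiency could be computed from read counts. It assumes "mapping efficiency" means uniquely mapped reads divided by all reads analysed; the function names and the example counts are hypothetical and are not taken from any particular aligner's report.

```python
def mapping_efficiency(total_reads: int, uniquely_mapped: int) -> float:
    """Percentage of analysed reads that aligned uniquely (assumed definition)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= uniquely_mapped <= total_reads:
        raise ValueError("uniquely_mapped must lie between 0 and total_reads")
    return 100.0 * uniquely_mapped / total_reads


def read_fate_breakdown(unique: int, ambiguous: int, unaligned: int) -> dict:
    """Percentage of reads in each category; the three counts must cover all reads."""
    total = unique + ambiguous + unaligned
    if total <= 0:
        raise ValueError("at least one read is required")
    return {
        "unique": 100.0 * unique / total,
        "ambiguous": 100.0 * ambiguous / total,  # e.g. repetitive regions
        "unaligned": 100.0 * unaligned / total,  # e.g. masked regions, adapter artefacts
    }


# Hypothetical counts for a 40bp single-end RRBS run
efficiency = mapping_efficiency(total_reads=1_000_000, uniquely_mapped=650_000)
print(f"mapping efficiency: {efficiency:.1f}%")
print(read_fate_breakdown(unique=650_000, ambiguous=150_000, unaligned=200_000))
```

Splitting the non-unique reads into ambiguous and unaligned is one way to see which of the factors listed above dominates in a given dataset.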

    We have seen that paired-end reads tend to increase the mapping efficiency by a few percent (up to 3 or 4% for 40bp RRBS reads). However, this increase in mapping efficiency does not necessarily translate into a linear increase in methylation data, because paired-end reads may overlap and such overlaps generate redundant data. I have tried to write up a few more things about this in a brief RRBS guide that is available here. I believe the homepage might currently be experiencing some difficulties, but hopefully it'll be back up soon. If you have any specific queries about your dataset, don't hesitate to send me an email directly.
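The point about overlapping mates can be illustrated with a rough sketch. It assumes both mates have the same length, that no trimming has taken place, and that the fragment length is known; `mate_overlap` and `unique_bases_covered` are hypothetical helper names, and the fragment sizes below are made up for illustration.

```python
def mate_overlap(read_length: int, fragment_length: int) -> int:
    """Number of fragment bases covered by both mates of a pair."""
    if read_length <= 0 or fragment_length <= 0:
        raise ValueError("lengths must be positive")
    # Read 1 covers [0, end_r1); read 2 covers [start_r2, fragment_length)
    end_r1 = min(read_length, fragment_length)
    start_r2 = max(0, fragment_length - read_length)
    return max(0, end_r1 - start_r2)


def unique_bases_covered(read_length: int, fragment_length: int) -> int:
    """Fragment bases covered by at least one mate (overlap counted once)."""
    per_read = min(read_length, fragment_length)
    return 2 * per_read - mate_overlap(read_length, fragment_length)


# Hypothetical fragment sizes for 40bp paired-end reads
for fragment in (30, 60, 100):
    overlap = mate_overlap(40, fragment)
    covered = unique_bases_covered(40, fragment)
    print(f"fragment {fragment}bp: overlap {overlap}bp, unique coverage {covered}bp")
```

For short fragments, as are common in RRBS, the second mate mostly re-reads bases the first mate already covered, which is why a gain in mapped pairs does not add a proportional amount of methylation information.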

    • #3
      Many thanks! For now, we are just looking at public data sets. I just wanted to say that we found this an extremely helpful orientation.
