SEQanswers

Old 05-23-2017, 08:47 AM   #1
dross11
Member
 
Location: uk

Join Date: Mar 2015
Posts: 14
Variable Bismark mapping efficiency of targeted BS data for DNA methylation

I have sequenced numerous multiplexed pools of BS amplicon-seq libraries derived from human samples on a MiSeq over the past few weeks. I have been utilising trim-galore and Bismark for alignment and am finding the mapping efficiency to be highly variable across pools:

pool1 - ~85% - no PhiX spiked in
pool2 - ~70% - no PhiX spiked-in
pool3 - ~55% - 10% PhiX spiked-in
pool4 - ~30% - 10% PhiX spiked-in

How do I go about troubleshooting this? Which factors are likely to affect mapping efficiency?

The MiSeq run metrics were good for all of these pools, so I don't think anything strange happened on the sequencing side of things. The amplicons were around 135bp before being made into sequencing libraries, and those in the latter two pools (pool3 and pool4) are between 100 and 135bp.

If any additional info is needed then please do ask.

Edit:
I forgot to look at the fastqc files for the last two pools. Quickly looked at them and most of the per base sequence content for cytosine hovers around 10-20% throughout the entire read. I assume this means that bisulfite conversion was unsuccessful before library construction? This would likely affect the mapping efficiency?
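
For anyone wanting to check the per-base cytosine content outside FastQC, the tally can be sketched in a few lines of Python. This is only a minimal illustration and not how FastQC itself computes the plot; the function names and the toy reads are made up for the example.

```python
from collections import Counter

def per_position_base_fraction(reads, base="C"):
    """Return the fraction of `base` at each read position (0-based)."""
    counts = []  # counts[i] is a Counter of the bases seen at position i
    for seq in reads:
        for i, b in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][b] += 1
    return [c[base] / sum(c.values()) for c in counts]

def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for n, line in enumerate(lines):
        if n % 4 == 1:
            yield line.strip()

# Toy FASTQ records (hypothetical data, not taken from the thread).
toy = [
    "@r1", "TTGATTGA", "+", "IIIIIIII",
    "@r2", "TCGATTGA", "+", "IIIIIIII",
]
fractions = per_position_base_fraction(read_fastq_sequences(toy))
print([round(f, 2) for f in fractions])
```

For a real file, the lines could come from `gzip.open(path, "rt")`. A well-converted directional Read 1 would be expected to show a C fraction close to zero outside methylated positions.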

Last edited by dross11; 05-23-2017 at 10:03 AM. Reason: Fastqc
Old 05-25-2017, 02:15 AM   #2
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620

There are a couple of things that could affect the mapping efficiency, I'll list a few here:

- depending on the amplification strategy you might have to use --non_directional for mapping. This might also explain why you are seeing cytosine levels of ~10-20%

- very often the spike-in is not present at the amount you were aiming for. We have seen libraries that were supposed to have a 5% Lambda spike-in that actually contained 90% of Lambda.

- There might be other contaminants in the library you weren't expecting (e.g. human, bacteria etc). What would help in this case would be to run Fastq_Screen on the data (also for the PhiX spike-in)

- did you run single-end or paired-end alignments? If you had PE libraries you might want to run Read 1 in SE mode to see if that helps

If you like, I could run a quick screen of your 4 samples. Just send me an email and attach ~100,000 sequences for each sample (gzipped, they should easily fit as an attachment) and I can take a quick look. Cheers, Felix
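
The alignment variants suggested in this post (default directional, --non_directional, and Read 1 alone in single-end mode) could be lined up side by side as below. This is a sketch that only builds and prints the command lines; the paths and file names are hypothetical, and the use of `--genome` to pass the genome folder is an assumption based on recent Bismark documentation (older versions took the folder as a positional argument).

```python
import shlex

def bismark_command(genome_dir, read1, read2=None, non_directional=False):
    """Build (but do not run) a Bismark command line as an argument list."""
    cmd = ["bismark", "--genome", genome_dir]  # assumed option name, see note above
    if non_directional:
        cmd.append("--non_directional")
    if read2 is None:
        cmd.append(read1)                      # single-end: reads given positionally
    else:
        cmd += ["-1", read1, "-2", read2]      # paired-end
    return cmd

# Hypothetical paths, for illustration only.
genome = "/path/to/human_genome"
r1, r2 = "pool4_R1_val_1.fq.gz", "pool4_R2_val_2.fq.gz"

variants = {
    "PE directional (default)": bismark_command(genome, r1, r2),
    "PE non-directional": bismark_command(genome, r1, r2, non_directional=True),
    "Read 1 only, SE": bismark_command(genome, r1),
}
for label, cmd in variants.items():
    print(f"{label}: {shlex.join(cmd)}")
```

Comparing the mapping efficiency reported for each variant on the same input is what separates a directionality problem from contamination or a bad Read 2.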
Old 05-25-2017, 04:04 AM   #3
dross11
Member
 
Location: uk

Join Date: Mar 2015
Posts: 14

I really appreciate you helping out here, Felix.
We bisulfite converted the DNA and then performed PCR with primers designed to either the OT or OB strand. I am brand new to the field so do excuse my ignorance, but if my OT primer amplifies then I expect the resulting amplicons and libraries to be in either the OT or CTOT configuration? If so, then using the --non_directional flag makes sense, but it is quite bizarre that I am getting 85% mapping efficiency with pool1 without the --non_directional flag.

I tried the --non_directional flag with one pair of fastq files and it increased the mapping efficiency from 32% to 83%!

My cytosine levels in the fastqc files are high because the CTOT and CTOB strands would have Cytosines; complementary to Thymine in the respective OT/OB strand?

Someone in our lab is constructing libraries using mouse DNA so Fastq_screen is likely a good idea. I'll try this out soon.

Read 1 in SE mode produced the same mapping efficiency percentages (~30%).
Old 05-25-2017, 11:52 AM   #4
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620

Sorry for the slow reply, it seems that I am not allowed to post anymore when I am at work, trying from home now...

Hi again,

this is where it is getting confusing, I just had to draw this out on a sheet of paper myself... I believe there are basically 2 ways of making bisulfite amplicon libraries:

1) If we assume you are creating an amplicon against the top strand, you use a primer (1) that looks like the top strand (bisulfite converted, OT) and a second primer (2) that is complementary to the top strand (bisulfite converted, CTOT). People here design primer (1) so that it starts with the Illumina PE1 portion followed by the sequence of interest, and primer (2) so that it starts with the Illumina PE2 portion followed by the sequence of interest. After the initial amplification you make the libraries with the Illumina PE primers, and consequently the OT sequence will always carry the PE1 primer and be sequenced first. This means that the alignment strand will always be OT for both single-end and paired-end sequencing (the second read of paired-end libraries taken alone would map to the CTOT strand, but this doesn’t happen during PE mapping).
The same is true for OB amplicons, which I left out here. If your libraries were constructed like this then you should only get alignments to OT and OB, depending on which strand you targeted, and the FastQC plots should show low C content for Read1 and low G content for Read2.

2) You could also design primers to the OT and CTOT strands as above; let's call them (A) and (B). Instead of having the primers carry the Illumina PE portions as well, you could simply amplify the genomic loci, then perform A-tailing and subsequently ligate on the sequencing adapters. In this scenario you might end up getting the PE1 primer on either the OT side or the CTOT side, and thus you would get both OT and CTOT alignments (and also OB and CTOB if you also targeted the bottom strand). In these kinds of libraries G and C should be at a similar level in both Read1 and Read2, and the libraries will be non-directional.
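
The base-composition expectations in the two scenarios above can be reproduced with a small in-silico conversion. This is a simplified sketch that assumes every cytosine is unmethylated and therefore converted; the genomic sequence is made up.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def bisulfite_convert(seq):
    """Convert every C to T, i.e. assume no cytosine is methylated."""
    return seq.replace("C", "T")

def four_strands(top):
    """Return the four possible read sequences for one genomic locus."""
    ot = bisulfite_convert(top)           # original top strand, converted
    ob = bisulfite_convert(revcomp(top))  # original bottom strand, converted
    return {"OT": ot, "CTOT": revcomp(ot), "OB": ob, "CTOB": revcomp(ob)}

top = "ACGTTCGGACCTGA"  # made-up genomic top-strand sequence
strands = four_strands(top)
for name, seq in strands.items():
    print(f"{name:5s} {seq}  C={seq.count('C')} G={seq.count('G')}")
```

OT and OB reads contain no C, while CTOT and CTOB reads contain no G. In a directional library Read 1 only ever looks like OT or OB, so its C content is low. In a non-directional library Read 1 is a mixture of all four, which pulls C and G towards similar levels. Real data will retain some C at methylated positions.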

Given that you are getting both directional results (pools 1 and 2) and non-directional results (pools 3 and 4), is there any chance that you changed the amplification protocol or primers during the course of the experiment?

If you can bring up the mapping efficiency to over 80% I don’t think that Fastq_Screen will find any major contaminants because most of the data is already well… Let me know if you would like me to take a look at some of the data (or do a quick screen) for you.
Old 05-26-2017, 12:42 AM   #5
dross11
Member
 
Location: uk

Join Date: Mar 2015
Posts: 14

So I spent the better part of yesterday reading the Bismark publication and the new version (v0.18) of the Bismark docs to improve my understanding of strand configurations. Between my reading and your reply, my understanding is as follows:

Directional libraries have to be designed in such a way that the OT and OB strands are tagged, so that the primers anneal to these tags and thus only amplify these configurations; post-bisulfite adapter tagging (PBAT) sequencing (the EpiGnome library prep workflow is a good example of this?). I suppose the tagging is the PE1/2 portion of the sequence?

Non-directional libraries have no such tagging, so all configurations (OT, OB, CTOT, CTOB) are amplified.

We designed the forward and reverse primers to the same strand configuration, either OT or OB (not to CTOT or CTOB at all), and they were not designed to partially anneal to the Illumina PE1/2 sequences. Could the forward/reverse primers designed to the OT configuration anneal to CTOT? If so, I believe we have designed the amplicons in a non-directional fashion. Unfortunately our lab people are also confused about what CTOT and CTOB truly are. All four pools contain different samples and primers but the amplification protocol has remained the same. However, pool3 and pool4 contain primers designed by a student, and I cannot confirm that they were designed in the correct configuration.

Nevertheless, I have sent sample post-trimmed fastq pairs from pool1 and pool4 to your email address specified on the bismark webpage.
Old 05-26-2017, 03:01 AM   #6
dross11
Member
 
Location: uk

Join Date: Mar 2015
Posts: 14

Update:

Talking to the lab people, I found that the primers used to generate the amplicon libraries in pool1 and pool2 were produced using BisulfitePrimerSeeker, and these primers seem to have been designed specifically to the OT and OB strand configurations. The primers used to generate the amplicon libraries in pool3 and pool4 were produced using PrimerSuite and were designed to the OT and CTOB strand configurations by accident; they thought they were designing to OB and not CTOB. After aligning the pool4 fastqs with Bismark using the --non_directional flag, I ran the Bismark methylation extractor and found that the output files with OT or CTOB in their names were much larger than their counterparts with OB or CTOT in their filenames. I think this explains why pool1 and pool2 have a high mapping efficiency with --directional, and pool3 and pool4 have a high mapping efficiency with --non_directional. Does this logic seem sound?
Old 05-26-2017, 04:12 AM   #7
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620

I just replied to this via email:

Hi David,

Thanks for the sequences and the other update on SeqAnswers.

I had a look at your sequencing files, and came to the same conclusions. Both files align to the human genome with >90%, which is a good start. Pool 4 has some 5% or so of PhiX, but there are no contaminations worth noting.

Pool 1 aligns to the OT (40%) and OB (60%) strands, but Pool 4 aligns to the OT (35%), CTOB (60%) and OB (5%) strands. I had a look at the non-deduplicated alignments in SeqMonk and the amplicons look fantastically clean, with almost no background whatsoever. Since some of the Pool 1 and Pool 4 reads overlapped perfectly, I also concluded that at least some of the regions must have been designed to the same locus but with different primers and/or using a different protocol somehow (students, ey?).

So overall I believe that you should be just fine looking at OT/OB for pools 1 and 2, and OT/CTOB for pools 3 and 4, the information should be the same (at least theoretically).

All the best, Felix
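
Per-strand percentages like those quoted above can be tallied from the alignment records themselves. The sketch below reads SAM text lines and uses the XR:Z (read conversion) and XG:Z (genome conversion) tags that Bismark writes. The tag-to-strand table is my reading of Bismark's single-end output and should be treated as an assumption to check against the Bismark documentation. For paired-end data only Read 1 should be counted this way, since Read 2 carries the opposite read conversion. The toy records are made up.

```python
from collections import Counter

# Assumed mapping of (XR, XG) tag values to strand for single-end reads / Read 1.
STRAND = {
    ("CT", "CT"): "OT",
    ("CT", "GA"): "OB",
    ("GA", "CT"): "CTOT",
    ("GA", "GA"): "CTOB",
}

def strand_fractions(sam_lines):
    """Tally strand assignments from SAM text lines carrying XR/XG tags."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        tags = {}
        for field in fields[11:]:         # optional fields look like TAG:TYPE:VALUE
            parts = field.split(":", 2)
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        key = (tags.get("XR"), tags.get("XG"))
        if key in STRAND:
            counts[STRAND[key]] += 1
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}

def toy_record(name, xr, xg):
    """Build a made-up SAM line with the two conversion tags."""
    return "\t".join([name, "0", "chr1", "100", "42", "8M", "*", "0", "0",
                      "TTGATTGA", "IIIIIIII", "NM:i:0", f"XR:Z:{xr}", f"XG:Z:{xg}"])

lines = ["@HD\tVN:1.0",
         toy_record("r1", "CT", "CT"), toy_record("r2", "GA", "GA"),
         toy_record("r3", "GA", "GA"), toy_record("r4", "CT", "GA")]
result = strand_fractions(lines)
print(result)
```

A BAM file would first need converting to SAM text, for example by piping it through `samtools view`.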
Old 05-26-2017, 04:35 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

Quote:
Originally Posted by fkrueger View Post
Sorry for the slow reply, it seems that I am not allowed to post anymore when I am at work, trying from home now...
@Felix: @ECO has turned the DDoS filter back on, and the forum software is aggressively marking posts for approval after a recent attack. I try to keep an eye out and approve legitimate posts as soon as I can.
Old 05-26-2017, 04:42 AM   #9
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620

Excellent, sorry for being so ranty! :P

I tried three different computers (PC/Mac) on site yesterday, and tried posting from my phone (on Eduroam, but on site as well), but then had to wait until I was at home in the evening, where the same post went through immediately (on a MacBook). So I am assuming that our Institute's IP range is probably flagged as known spam?

In any case, thanks for your support! Cheers, Felix
Old 05-26-2017, 05:12 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

It is possible that your institute's IP was being flagged temporarily (you seem to be able to post today). Those border filter appliances are a necessary evil we have to live with now.
Old 05-26-2017, 05:15 AM   #11
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620

I couldn't tell why this would be so. In my first message yesterday I included the [ MAIL ] tags, maybe this could be seen as malicious and/or spam and thus flag the institute IP as a potential threat for a day? Just guessing here...

Sorry for taking this thread off-topic (but I think it should have been solved anyhow).

Tags
bismark, illumina, mapping, mapping efficiency
