SEQanswers

Go Back   SEQanswers > Applications Forums > Epigenetics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Need help in mapping Single end and paired-end reads Lizex RNA Sequencing 0 02-17-2013 11:48 PM
Confused about the strand FLAG of Bismark paired-end seq results Jerry_Zhao Bioinformatics 2 09-14-2012 11:45 AM
Can paired-end mapping produce more reads than single-end ? warrenemmett Bioinformatics 13 03-21-2012 12:10 AM
Bismark paired-end read orientation xinyu Epigenetics 4 03-16-2012 06:57 AM
Bismark paired-end positions mixter Epigenetics 1 01-19-2012 11:45 AM

Reply
 
Thread Tools
Old 02-01-2014, 05:09 AM   #1
dideco
Junior Member
 
Location: Belgium

Join Date: Jan 2014
Posts: 4
Default Bismark: paired-end low mapping efficiency

Hi all.

I know this has been posted already quite a few times before, but I've been trying most of the suggestions offered here on the forum and other forums and I'm still stuck with the same problem. So, I'm hoping that you guys still have some other ideas to help me out here :-)

This is my problem:
I am getting really low mapping efficiencies (15 to 18%) with my 90 bp paired-end bisulfite sequencing reads using Bismark/Bowtie2.

I have been quality trimming my sequences to get rid of adapter sequences and low quality base pairs using Trim Galore! at its standard settings, which are quite stringent.
FastQC didn't report any problems, except for those that can be expected for bisulfite treated data (ie. k-mer content). So no over-represented or duplicated sequences were detected.
I have been using different Bowtie2 settings, like increasing the sensitivity function, decreasing the mapping penalties through the scoring function, setting the insert size to a range as wide as 0 to 1500 bp... but nothing got me higher mapping efficiencies than 15 to 18%. Ambiguous mapped reads were each time around 8%, which is to be expected considering the fact that the genome I am working with is known to have a quite high number of gene duplications.

In conclusion, I am really running out of options and ideas and I am hoping that you guys might have some more experience with it so I can get these efficiencies up!

Thanks a ton!
dideco is offline   Reply With Quote
Old 02-01-2014, 06:54 AM   #2
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

Other problems you might have with your library could entail some technical problem with the paired-end nature of your reads or contamination with reads from another species.

To test the technical problem I would suggest running the reads in single-end mode. If this brings up the mapping efficiency drastically it would indicate that something weird is going on with the read pairs. May I ask which kind of library preparation you performed? We found for example that PBAT libraries very rarely align with more than 25-40% in paired-end mode, and this seemed to be mainly caused by chimaeric reads where the two reads of a pair ended up aligning somewhere else in the genome, often even on different chromosomes. We found that as much as 40% of the reads looked something like a bisulfite-Seq Hi-C experiment. As a solution we are first running PE alignments while specifying --unmapped, and then running SE alignments with the left-over reads to get as much information out of the same as possible. In any case, if you find that single-end alignments work much better you can try to narrow the source of errors down by systematically trimming the reads either from the 5' side (e.g. trim 10, 20, 30bp) or the 3' side and see if this drives the efficiency up.

Also it is possible that you've got some contaminating reads in the sample. Usual suspects would of course be human contamination (sample prep?), or any other organism that is typically used or prepared in your lab. To get a quick idea I would maybe only use a subset of reads (use -u 100000) and probably also in single-end mode to align against different genomes. Not sure which library prep you used but to be safe you could run it in --non_directional mode to quickly find out if that is a problem. Feel free to send me an email with your findings, I have spent quite a while trying to identify problems and might be able to assist you further. Cheers, Felix
fkrueger is offline   Reply With Quote
Old 02-06-2014, 01:16 AM   #3
dideco
Junior Member
 
Location: Belgium

Join Date: Jan 2014
Posts: 4
Default

So that this thread can be closed and other people will be helped with it, I'll give the solution to my problem, on which Felix and I have been working over e-mail. Felix, thanks again for your help, it is very much appreciated!

Felix has been running some tests on a subset of my data, which he has been running with both Bowtie(1) and Bowtie2.
To test for potential contamination, alignments were run against the human, mouse and PhiX reference genome. That excluded major contaminants from these organisms in the library.

Using FastQC we confirmed that the data looked absolutely fine quality wise. This was re-confimed by the fact that trimming with TrimGalore also didn’t remove much.

In the end, using a more relaxed scoring function (--score_min L,0,-0.6) as the one that is passed on to Bowtie2 by Bismark by default, did the trick and increased my unique mapping efficiency to ~40%. On top of that, another 24% or so of sequences aligned to multiple places in the genome, so they weren't placed anywhere uniquely. Adding these numbers together we almost reached 65% of reads aligning, which is close to a good mapping efficiency.

However, relaxing the mapping parameters a lot may also increase the chance of getting misalignments, so there is a tradeoff between getting more sequences to align and allowing more mismatches/indels (hence the better performance of Bowtie2 in our case). The relaxation in this case was necessary as the genome assembly of the organism I am using is still a bit rough, so a few more mismatches and possibly also indels should been allowed.

In conclusion, we got about 65% of the total reads mapping by relaxing the Bismark/Bowtie2 scoring function to (--score_min L,0,-0.6), and by setting the fragment insert size boundaries to (-I 0) and (-X 1000). Also, Felix is a great guy

Cheers,
Dieter
dideco is offline   Reply With Quote
Old 03-25-2014, 02:41 AM   #4
nbartonicek
Junior Member
 
Location: Cambridge, UK

Join Date: Jul 2011
Posts: 3
Default

Dieter, I have noticed that I get many more reads mapping when I map them separately (40% for each individual pair) than when I map in a paired fashion (20%). Basically, I double my number of reads when I don't used paired-end mapping. I am using Bismark in --pbat mode for paired (bowtie1).

Did you observe the same?

Cheers,

Nenad Bartonicek
nbartonicek is offline   Reply With Quote
Old 05-13-2014, 02:18 PM   #5
gandalf886
Member
 
Location: Ipswich, MA

Join Date: Feb 2013
Posts: 11
Default paired-end mapping problem

Hi Felix,

I am trying to map some 100-bp paired-end RRBS reads to hg19 using bismark.

I trimmed the adapter sequences from the fastq file using trim_galore and feed the paired-end reads to bismark using the following options:

$<path to bismark>/bismark -u 10000 --bowtie2 --non_directional --path_to_bowtie <path to bowtie>/bowtie2-2.2.2/ <path to hg19>/hg19/ -1 rrbs-mspi_S1_L001_R1_001_trimmed.fq -2 rrbs-mspi_S1_L001_R2_001_trimmed.fq

but get very low mapping efficency like 0.1%. The mapping efficiency of single-end reads are around 60%.

I tried different versions of bismark, v0.11.1, v0.12.1 and the latest test version downloaded from seqanswers, neither of them can map with a higher efficency.

I also tried to adjust the --score-min, and saw a sudden increase of mapping efficiency from 3.4% to 86.7% when i switch from --score-min L,0,-1 to --score-min L,0,-1.9. But the mapping is very loose, a lot of mismatching/insertions were there.

Do you have any clue I can map the paired-end mapping better?

Thank you very much

GZ
gandalf886 is offline   Reply With Quote
Old 05-14-2014, 04:11 AM   #6
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

Hi Gandalf,

This sounds quite odd indeed. A mapping efficiency of 0.1% is normally a sign of using a wrong genome to align against, but the fact that SE reads align with 60% would suggest that it's not. Using a very high percentage of allowed errors will certainly result in quite dodgy results (this would be > 30 mismatches or indels per read...).

Would you mind sending me say 100K reads via email so I can investigate? Best, Felix
fkrueger is offline   Reply With Quote
Old 05-19-2014, 12:32 PM   #7
gandalf886
Member
 
Location: Ipswich, MA

Join Date: Feb 2013
Posts: 11
Default

We found the problem now and it was the wrong way I was using trim_galore. I trimmed the paired-end reads separately and this will cause the read1 and read2 "out of sync" as said by Felix. Using paired-end trimming solved the problem.

Felix is a great guy!

Quote:
Originally Posted by fkrueger View Post
Hi Gandalf,

This sounds quite odd indeed. A mapping efficiency of 0.1% is normally a sign of using a wrong genome to align against, but the fact that SE reads align with 60% would suggest that it's not. Using a very high percentage of allowed errors will certainly result in quite dodgy results (this would be > 30 mismatches or indels per read...).

Would you mind sending me say 100K reads via email so I can investigate? Best, Felix
gandalf886 is offline   Reply With Quote
Old 06-11-2014, 10:05 AM   #8
seekq
Junior Member
 
Location: Warwick

Join Date: Sep 2013
Posts: 1
Default

Hi Felix,

I trimmed my bisulfite sequencing reads and tried mapping using Bismark. I cannot get mapping more than 49% whatever I do.

The following combinations I tried for cleaning:

1) java -jar trimmomatic-0.30.jar PE -phred33 s1.fq s2.fq s1_paired.fq s1_unpaired.fq s2_paired.fq s2_unpaired.fq ILLUMINACLIP:adapters.fa:2:30:10 HEADCROP:6 MINLEN:36
(various lengths of HEADCROP: 6/13/20)

2) java -jar trimmomatic-0.30.jar PE -phred33 s1.fq s2.fq s1_paired.fq s1_unpaired.fq s2_paired.fq s2_unpaired.fq ILLUMINACLIP:adapters.fa:2:30:10 HEADCROP:6 CROP:70 MINLEN:36
(various lengths in CROP: 60/70/80)

3) Also, tried trim_galore to remove adapters and last base from sequences (again a lot of times cropping sequence length)

Tested all these with Bismark with bowtie1 (with all -n values 0 - 3, -X from 500 - 1700) and found that I get a range of mapping from 46 - 49 % and not more than that.

Please help me understand what I am doing wrong?

Thanks very much,
Seekq
seekq is offline   Reply With Quote
Old 06-11-2014, 12:42 PM   #9
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

Hi seekq,

I am not very surprised that changing the -X and -n parameters don't change the results drastically. The reason for that is that the standard fragment length is typically ~70 to 300bp, and hardly anything will be longer than 500bp. Not quite sure if Trimmomatic also trims poor quality off by default, but Trim Galore only leaves residues in the reads if the basecall quality was higher than 20. For this trimmed data you would allow a maximum of 2 good quality mismatches (value 30+), or rarely up to 3 over the entire read (if the values are 20). This behaviour is governed by the mismatch ceiling parameter -e which defaults to 70. If you really wanted to allow more mismatches you would also have to change the -e limit, to e.g. 150 (5 mismatches) or 300 (10 mismatches), but please note that I would not recommend doing this since it would be quite error prone and also take much longer to align.

Here are a couple of other things I would try first;
- you haven't mentioned the read length or the genome you are working with, is it a good assembly or still quite rough? Do you expect many SNPs? Do you expect insertions or deletions in the reads? In any case I would try running Bismark in Bowtie 2 mode (--bowtie2), which allows mapping with InDels and also has a function (--score_min) for the number of mismatches and/or InDels that scales with the read length. This is set very stringently by default (L,0,-0.2), but could be relaxed somewhat to see if this helps your mapping efficiency. As a guideline, --score_min L,0,-0.2 would allow ~3 mismatches for a 100bp read, --score_min L,0,-0.6 would allow up to 10 mismatches and/or Indels. If you just want to test these things you can use the -u option to just use a subset of the reads.
- You could try mapping read 1 and read 2, or a subset thereof, individually to see if the mapping efficiency is much higher in single-end mode. This might give you a clue what you want to look for, e.g. specific QC issues in one of the reads, hybrid reads, etc.

Feel free to run a few tests with Bowtie 2 and send me the results (or FastQC, Bismark reports) via email if you want me to take a look. Best, Felix
fkrueger is offline   Reply With Quote
Old 06-13-2014, 08:00 PM   #10
drunk_coder
Junior Member
 
Location: china

Join Date: Dec 2011
Posts: 9
Default

hi all,

This is my problem:
I Amplified several Genes' fragment(~500bp) as our sequencing library.I am getting really low mapping efficiencies (<1.5%) in pair-end mode but high mapping efficiencies (>90%) with my 300 bp paired-end bisulfite sequencing reads using Bismark/Bowtie2.

I have been quality trimming my sequences to get rid of adapter sequences and low quality base pairs using Trim Galore! at its standard settings, which are quite stringent. Also, FastQC didn't report any problems.

Strange things is that I got very high mapping efficiencies (95%) for each end of paired reads, but in pair-end mode, I only got very low mapping efficiencies(<1.5%).

I am really running out of options and ideas and I am hoping that you guys might have some more experience with it
drunk_coder is offline   Reply With Quote
Old 06-14-2014, 12:21 AM   #11
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

The most obvious point would be to ask whether you specified --paired when running Trim Galore, since not doing so would cause the paired end reads to become out of sync. Other than that it is difficult to give you a diagnosis without more details. Would you mind sending me ~100K reads from each side (before trimming) via email so I can take a look myself (felix.krueger@babraham.ac.uk)?
fkrueger is offline   Reply With Quote
Old 06-20-2014, 09:39 AM   #12
BFM
Member
 
Location: USA

Join Date: Jun 2014
Posts: 10
Default

Hi, i am also getting low mapping efficiency, however i want to check what percentage of total cytosines (from the genome) are covered in the aligned region. Is there any way for that.

any help would be appreciated
BFM is offline   Reply With Quote
Old 08-14-2014, 11:20 PM   #13
eicht021
Junior Member
 
Location: Australia

Join Date: Oct 2011
Posts: 5
Default

All,

I'm also curious about how Bismark handles PBAT PE libraries. I recently performed some PE 100bp reads from the Epignome kit. I also observed very low mapping rates for these reads (bismark v0.10.0). I went messing around on a few different techniques with the following mapping results:



Now, I'm still trying to wrap my head around the whole PBAT setup and exactly what is getting sequenced and what should be mapped where. I get confused however given that if I run each of the read pairs separately through a single end bismark alignment I recover many more mapped reads for R2 run with the -pbat flag and also for R1 run through the 'normal' directional method. If I just aim to perform a non-directional SE alignment on both, I see similar mapping rates.

However, I can't seem to recover a decent mapping rate for the paired end bismark runs. I assume it has something to do with the reads not playing well for each of their four bowtie attempts.

At the moment I'm trying to get around all of this by:

directional PE alignment to start and grabbing the unmapped R1 and R2 reads out the end.
unmappedR1 is then mapped SE directional
unmappedR2 is mapped SE directional under -pbat conditions.

I then combine all the results for downstream analysis.

it's not perfect, however it seems to get my mapping levels back to what I expect. Is there just something else going on with these reads? I start to think that paired-end for bisulfite is more headache than it is worth sometimes.

Last edited by eicht021; 08-14-2014 at 11:25 PM.
eicht021 is offline   Reply With Quote
Old 08-15-2014, 02:54 AM   #14
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

Hi eicht,

First of all, I can feel your pain, we have also spent quite a while trying to find the reason or solutions to poor mapping efficiencies for PE PBAT libraries. Generally, the original PBAT protocol only produces alignments to the complementary strands (CTOT and CTOB), however the EpiGnome kit produces standard, i.e. directional, libraries that align to the OT and OB strands only. This means that R1, and also paired-end alignments, should only align to the OT and OB strands; however R2 needs to be aligned to the CTOT and CTOB strands because it is the reverse complement of R1. This is exactly what the option --pbat does, and in that sense the mapping efficiencies you are describing make perfect sense.

We haven’t worked with data generated by EpiGnome but used a homebrew PBAT PE protocol, but I suppose the mechanisms will be the same since they seem to be caused by the random priming step in the protocol. Since we saw the drastically reduced mapping efficiencies in PE mode we reasoned that the protocol might generate chimeric reads, i.e. reads starting at some position in the genome and ending at another. To test this we mapped the data in SE mode and paired up pairs where both R1 and R2 produced alignments. We then imported this data into SeqMonk as Hi-C data, and looked at the amount of reads that had its partner in trans (on another chromosome) as a fairly crude measure. As you can see on the screenshot attached, when we quantitated where in the genome the partner reads aligned to of reads that aligned to chromosome 1, we found that they seem to be scattered pretty much all over the genome in no discernible pattern! In this case a bit more than 40% of reads had their partner read in trans on another chromosome (partner reads aligning to chr 1 but far away were not even considered here).

Now if you would perform some similar pairing up on your reads you might get a similar answer, and if you add up the numbers of properly aligning PE reads and the number of chimeric reads (which won’t be reported by Bismark since they are no valid PE alignments), you should be very close to the mapping efficiencies you achieve in SE mapping.
Quote:
it's not perfect, however it seems to get my mapping levels back to what I expect. …
We came to a similar conclusion, and we by now also settled on the procedure you described (someone coined the term ‘Dirty Harry mode’ because it doesn’t seem to be the cleanest of methods…):

(1) directional PE alignment to start and grabbing the unmapped R1 and R2 reads out the end using the option ‘--unmapped’. Properly aligned PE reads should be methylation extracted using --no_overlap.
(2) unmappedR1 is then mapped SE directional
(3) unmappedR2 is mapped SE directional under -pbat conditions.
Single-end aligned R1 and R2 can then be methylation extracted normally as they should in theory map to different places in the genome anyway so don’t require attention to overlapping reads.

As another suggestion:
Like all PBAT applications, the first couple of bp (6bp in the case of the EpiGnome kit) produce a noticeable bias in sequence composition as well as later on in the M-bias plots (see an example here http://www.bioinformatics.babraham.a...SE_report.html). EpiCentre recommend ignoring the first 6bp of the methylation call string themselves, but it is probably a good idea to remove these residues even before mapping because we have seen that they may reduce the mapping efficiency somewhat. This could be done with --clip_r1 6 and --clip_r2 6 within Trim Galore.

Quote:
I start to think that paired-end for bisulfite is more headache than it is worth sometimes.
At least for paired-end PBAT protocols I would probably second this notion, however even sequencing longer SE reads (e.g. 150bp SE) will likely start reading into the chimeric portion of the read so also have some mappability problems...
Attached Images
File Type: jpg trans hits in PBAT-Seq data.jpg (82.7 KB, 28 views)
fkrueger is offline   Reply With Quote
Old 08-15-2014, 08:39 PM   #15
eicht021
Junior Member
 
Location: Australia

Join Date: Oct 2011
Posts: 5
Default

Thanks much for the detailed response. That is very interesting regarding the chimeric reads and I imagine could seriously impact certain types of analysis.

I'm glad we agree on the 'Dirty Harry' method! One question regarding this, I would plan to perform methylation extraction with --no_overlap for the PE reads. Is there an easy way to combine methylation extraction outputs between the PE and SE results? I'm sure I could come up with some method, but perhaps there is already an easy way to do this.
eicht021 is offline   Reply With Quote
Old 08-16-2014, 12:29 AM   #16
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

Once you finished the methylation extraction you can simply take the C-context output and simply concatenate it, e.g.

cat CpG*.txt > CpG_context_merged.txt
or
zcat CpG*.txt.gz > CpG_context_merged.txt

and use this merged output as input for bismark2bedGraph. The only thing that is somewhat inconvenient is that you don't get one nice mapping report / html file because the alignment process was split up into 3 processes...
fkrueger is offline   Reply With Quote
Old 08-17-2014, 04:13 PM   #17
eicht021
Junior Member
 
Location: Australia

Join Date: Oct 2011
Posts: 5
Default

Makes sense. I think it's up and running now.

Thank you for the assistance!
eicht021 is offline   Reply With Quote
Old 09-02-2014, 06:30 AM   #18
RRondon
Junior Member
 
Location: Perpignan, France

Join Date: Sep 2014
Posts: 1
Default

Hi all,

I have a little problem with the alignment of a paired end library with Bismark/Bowtie2. We have obtained paired-end bisulfite sequencing fastq files. Files were quality clipped on phredscore 26, and interlaced. When I align with the paired-end option of Bismark I get 0,02% aligned reads. Then I have done the alignment of the single-end files separately, the forward and reverse libraries separately. The forward library alignment efficiency is ~40%, but the reverse library alignment efficiency is ~0.05%. This is reproducible with all our libraries. Do I get something wrong? Has anybody observed the same behaviour?
I would like to know if we need to change alignment parameters for reverse library?

Many thanks for your ideas!
RRondon is offline   Reply With Quote
Old 09-02-2014, 06:34 AM   #19
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

Reverse reads, i.e. Read 2 of paired-end libraries, would align to the strands complementary to the original top or bottom strands (CTOT and CTOB), but these alignments are not carried out in the default mode (only OT and OB strands). If you want to align Reads 2 as single-end files you need to specify --pbat to get what you want. Cheers, Felix
fkrueger is offline   Reply With Quote
Old 09-26-2014, 04:16 AM   #20
tigerxu
Member
 
Location: Wuhan, China

Join Date: Oct 2013
Posts: 20
Default

Hi Felix, Can I ask you a question? I plan to start a RRBS sequencing but I don't know if I should choose single-end or paired-end sequencing. Would you like to give me advice which one is better currently? Thank you in advance! Zhuofei
tigerxu is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:06 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO