SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bismark: paired-end low mapping efficiency dideco Epigenetics 31 02-18-2015 06:01 AM
BISMARK paired-end result sam file contains on read1 info mbk0asis Epigenetics 3 01-07-2015 11:24 PM
Confused about the strand FLAG of Bismark paired-end seq results Jerry_Zhao Bioinformatics 2 09-14-2012 10:45 AM
Bismark paired-end read orientation xinyu Epigenetics 4 03-16-2012 05:57 AM
Bismark paired-end positions mixter Epigenetics 1 01-19-2012 10:45 AM

Reply
 
Thread Tools
Old 07-24-2018, 05:36 AM   #1
Hedi86
Member
 
Location: Norway

Join Date: Oct 2017
Posts: 20
Default strainge mapping efficency for paired end seq using Bismark

although similar issue been reported several times but i tried all the suggestions but problem still exist. Im working with RRBS data obtained from Cow sperm DNA. Standard illumine adapters and low quality sequences were removed from pair reads using –paired option in trim galore. As I can see in fastQC report, read 1 is ok but read 2 is not good as read 1 and for all samples read 2 has a lower quality compared to read 1. In Bismark and using Bowtie 2 alignment, I tried pair end alignment first (PE) with different --Score-min adjustments and surprisingly the mapping efficiency was very low. Following some suggestions in different forums, I tried with –pbat option and furthermore I tried to align the reads separately, as well. But as you can see bellow, there is a huge difference in mapping efficiency between read 1 and 2 and generally pairwise alignment mapping efficiency is very low. In addition, I realized that if in single end alignment I include –pbat, for read 1 and 2 mapping efficiency decreased and increased, respectively. Do you think that my read 2 is causing the problem in pair end alignment? Any suggestion for improving the mapping efficiency greatly appreciated.



--Score-min Default (L, 0, -0.2)
PE: 11.4 % PE with --pbat: 0 % SE, R1: 27.3 % SE, R2: 0 %

--Score-min (L, 0, -0.6)
PE: 19.2 % PE with --pbat: 0 % SE, R1: 29.3 % SE, R2: 0.1 %

--Score-min (L, -0.6, -0.6)
PE: 19.2 % PE with --pbat: 0 % SE, R1: 29.4 % SE, R2: 0.1 %

--Score-min (L, -0.6, -1)
PE: 29.5 %
PE with --pbat: 0.3 %
SE, R1: 30.7 % SE, R1 with --pbat: 16.4 %
SE, R2: 0.7 % SE, R2 with --pbat: 28.7 %
Hedi86 is offline   Reply With Quote
Old 07-25-2018, 06:06 AM   #2
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 618
Default

Hi Hedi,

Could you email some unprocessed reads, e.g. 200000, to me in .gz format? I could then take a look and report back to you with what I find.

Cheers, Felix
fkrueger is offline   Reply With Quote
Old 07-26-2018, 03:07 AM   #3
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 618
Default

Hi Hedi86,

Thanks for providing the data. I have now run a few preliminary tests, here is what I found (in no particular order):

- From the sequence composition plot, the data does not look like (conventional) RRBS

- The data is directional, so the default alignment mode should be fine. If you align R1 on its own it needs default mode, Read2 on its own requires --pbat

- For both reads the first 6bp have a very strong sequence bias that is different to the rest of the reads. For my tests I removed the first 6bp (with Trim Galore --clip_r1 6 --clip_r2 6)

- Even after trimming, both reads start having more C (R1) or G (R2) from ~50bp onwards. I donít know why this would be (slide2)

- Read 1 is heavily affected by read-through Illumina adapter contamination, however R2 does not seem to contain any known adapter! (slide 3)

- The aligned data (UMD3.1) does not look very much like RRBS data, as in it does not seem to accumulate at similar positions, CpG islands or in promoters. It is rather scattered around the entire genome

- From what I tested, around 35% of the data mapped to the Cow genome (UMD3.1)

- ~26% of the data mapped to the Rat genome (Rnor_6.0). So there definitely seems to be contamination with rat data, not sure if there is anything else (we tested for human, mouse, rat, chicken, E coli, Drosphila, C. elegans, Arabidopsis, Yeast, PhiX, wasp, Lambda, MT)

- ~2-5% of the data mapped to PhiX

- The fragment sizes of paired-end aligned are mostly between 100 and 400bp long. This is way longer than would normally expect RRBS, or any BS-Seq data really, to be.

So my conclusions for the moment are:

1) The data doesn't seem to be conventional RRBS data. Maybe it would be worth going back to how the libraries were prepared exactly to see if there is clue how the data needs to be handled?

2) Also, the fact that R1 contains adapter, but R2 doesn't is highly unusual (I don't think I have ever seen anything like it yet). It is not entirely surprising to see that R2 is bringing the overall mapping efficiency of the data down.

3) The data also seems to be contaminated. Cow + Rat + PhiX together explains some 65% of the data, I don't currently know what the rest might be. Maybe you are working with different types of animal in your institute (salmon?) that are worth testing?

I hope this yields a few pointers, all the best, Felix
Attached Files
File Type: pdf Cow RRBS data assessment.pdf (434.7 KB, 8 views)
fkrueger is offline   Reply With Quote
Old 07-26-2018, 03:24 AM   #4
Hedi86
Member
 
Location: Norway

Join Date: Oct 2017
Posts: 20
Default

Dear Felix

thank you so much for your huge help, i really appreciate your time end effort.

i was suspected to contamination too as my very basic Blast experiment of trimmed reads (i removed first 6 bp too) does not revealed any Cow results. However again i appreciate your time and i will keep posting if i found something new.

Best Regards
Hedi86 is offline   Reply With Quote
Old 08-19-2018, 11:27 PM   #5
Hedi86
Member
 
Location: Norway

Join Date: Oct 2017
Posts: 20
Default

Hello again Felix

i just realized that library been prepared using RRBS NuGEN Ovation kit which demands its own method of trimming. this time i got 29% mapping efficiency using default parameters. still reads will map with rat genome, but i think this is because of very short reads after trimming (9 and 11 bp). since this particular trimming will not trim at minimum 20 bp, i did trim the trimmed reads again using trim-galore without using --RRBS and --illumina options. this time for both reads minimum seq length was 18 bp (i dont know why because default is 20 bp) and mapping efficiency is around 32%. do you think that second trim is necessary and if yes should i use --length <INT> option to obtain minimum 20 bp ?

best
Hedi86 is offline   Reply With Quote
Old 09-04-2018, 06:59 AM   #6
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 618
Default

Sorry for the late reply but I was travelling. And no I don't think a second trim is necessary, the 20bp is an arbitrary cutoff anyway because sequences shorter than that will typically not map (uniquely) anyway.
fkrueger is offline   Reply With Quote
Old 05-28-2019, 01:22 AM   #7
Hedi86
Member
 
Location: Norway

Join Date: Oct 2017
Posts: 20
Default

Hello again Felix

after long time and sequencing twice, we never managed to identify the primers on Read 2 in our libraries. therefore we continued with another kit and this time we identified primers in both reads. however, the mapping efficiency is still low and i even run Bismark in SE mode and i noticed that read 2 has a very low mapping efficiency and Read 1 alone has the same mapping efficiency when we run in PE mode. here is the summary

PE mode, Score_min (Default), mapping efficiency is 27.7
PE mode, Score_min (L,0,-0.6), mapping efficiency is 31.9

SE mode, Read 1, Score_min (Default), mapping efficiency is 30.3
SE mode, Read 1, Score_min (L,0,-0.6), mapping efficiency is 32

SE mode, Read 2, Score_min (Default), mapping efficiency is 0.1
SE mode, Read 2, Score_min (L,0,-0.6), mapping efficiency is 0.3

so apparently Read 2 has some problems and I already can see that the quality is not as good as Read 1 from fastQC report.

this time we don't have any contamination ( i tested the sequences against human, mice, rat and zebrafish genome) and I don't have any further clue why Read 2 is so unusual. any comment is highly appreciated.

Best
Hedi86 is offline   Reply With Quote
Old 05-28-2019, 02:02 AM   #8
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 618
Default

If you run paired-end sequencing data on it's own as single-end reads, you need to specify --pbat for Read 2. I am sure this will bring the mapping efficiency up somewhat.

If you wanted you could send some ~200K untrimmed raw sequences to me via mail (gzipped please, fits as an attachment), and I could take a quick look myself?

Cheers, Felix
fkrueger is offline   Reply With Quote
Old 05-28-2019, 03:34 AM   #9
Hedi86
Member
 
Location: Norway

Join Date: Oct 2017
Posts: 20
Default

thanks a lot, I just have sent you the files.

Best
Hedi86 is offline   Reply With Quote
Old 05-28-2019, 04:00 AM   #10
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 618
Default

Thanks for those. I'll be out this afternoon but will try to get back to you tonight or tomorrow.
fkrueger is offline   Reply With Quote
Old 05-29-2019, 12:43 AM   #11
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 618
Default

I am afraid the situation doesn't appear to have changed much:

- the data you sent over appears to have been adapter trimmed already, so I can't really say much about what the really raw data looked like

- Using paired-end or single-end mapping (default for R1, --pbat for R2), I got some 30-35% mapping efficiency agains the Bos taurus genome. Relaxing the mapping parameters to L,0,-0.6 only have a moderate increase in mapping efficiency (~10%).

- some 5-7% of the data appears to be PhiX

- some 40% of the data appears to be coming from Rattus norvegicus again

Taken together, this brings us in a region of >80% uniquely mapped data, which is actually a very decent value for any bisulfite library if you considered mapping efficiency alone... I haven't looked at whether the data looks more like RRBS than last time, but judging from the FastQC profiles I am not very hopeful to be honest. Hope this helps.
fkrueger is offline   Reply With Quote
Old 06-02-2019, 04:23 AM   #12
Hedi86
Member
 
Location: Norway

Join Date: Oct 2017
Posts: 20
Default

thank you for your help

yes and sorry i realized i have sent trimmed file. i have sent the raw file now again and if you have time i would appreciate your effort again.

i managed to get the same results with R2 --pbat, but i guess if we consider downstream analyses we have to use pair mode anyway, because we need to have one SAM file for each sample, right ? is there a way to use --pbat for R2 in PE mode ?

if we want to report mapping efficiency, could we report it as R1 + R2 mapping efficiency or we have to report it based on PE mode ?

Best Regards
Hedi86 is offline   Reply With Quote
Old 06-02-2019, 08:12 AM   #13
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 618
Default

Don't worry, Bismark is taking care of all that for you. Just run the alignments in paired-end mode, and use the BAM file further downstream. There is no obvious reason why you should be running single-end alignments whatsoever.

So yes, just use the paired-end mapping efficiency.

I just had a quick look at the untrimmed files. Apart from a bias at the first 3 bases, and quite some Illumina sequence adapter contamination it all looks pretty normal. After trimming the data is still ~ 40% rat, some PhiX etc. So nothing has changed on that front.
fkrueger is offline   Reply With Quote
Old 06-02-2019, 08:26 AM   #14
Hedi86
Member
 
Location: Norway

Join Date: Oct 2017
Posts: 20
Default

Dear Felix

thank you for your time, we really dont know why we have data alignment with Rat, we never had Rat samples in our lab. the platform used for DNA extraction never been exposed to rat samples and it is fully automated. sequencing center also confirmed that they didn't have any rat samples for quite long time...

thank you again for your help
Hedi86 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:06 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO