SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bismark: paired-end low mapping efficiency dideco Epigenetics 31 02-18-2015 07:01 AM
BISMARK paired-end result sam file contains on read1 info mbk0asis Epigenetics 3 01-08-2015 12:24 AM
Confused about the strand FLAG of Bismark paired-end seq results Jerry_Zhao Bioinformatics 2 09-14-2012 11:45 AM
Bismark paired-end read orientation xinyu Epigenetics 4 03-16-2012 06:57 AM
Bismark paired-end positions mixter Epigenetics 1 01-19-2012 11:45 AM

Reply
 
Thread Tools
Old 07-24-2018, 06:36 AM   #1
Hedi86
Member
 
Location: Norway

Join Date: Oct 2017
Posts: 13
Default strainge mapping efficency for paired end seq using Bismark

although similar issue been reported several times but i tried all the suggestions but problem still exist. Im working with RRBS data obtained from Cow sperm DNA. Standard illumine adapters and low quality sequences were removed from pair reads using –paired option in trim galore. As I can see in fastQC report, read 1 is ok but read 2 is not good as read 1 and for all samples read 2 has a lower quality compared to read 1. In Bismark and using Bowtie 2 alignment, I tried pair end alignment first (PE) with different --Score-min adjustments and surprisingly the mapping efficiency was very low. Following some suggestions in different forums, I tried with –pbat option and furthermore I tried to align the reads separately, as well. But as you can see bellow, there is a huge difference in mapping efficiency between read 1 and 2 and generally pairwise alignment mapping efficiency is very low. In addition, I realized that if in single end alignment I include –pbat, for read 1 and 2 mapping efficiency decreased and increased, respectively. Do you think that my read 2 is causing the problem in pair end alignment? Any suggestion for improving the mapping efficiency greatly appreciated.



--Score-min Default (L, 0, -0.2)
PE: 11.4 % PE with --pbat: 0 % SE, R1: 27.3 % SE, R2: 0 %

--Score-min (L, 0, -0.6)
PE: 19.2 % PE with --pbat: 0 % SE, R1: 29.3 % SE, R2: 0.1 %

--Score-min (L, -0.6, -0.6)
PE: 19.2 % PE with --pbat: 0 % SE, R1: 29.4 % SE, R2: 0.1 %

--Score-min (L, -0.6, -1)
PE: 29.5 %
PE with --pbat: 0.3 %
SE, R1: 30.7 % SE, R1 with --pbat: 16.4 %
SE, R2: 0.7 % SE, R2 with --pbat: 28.7 %
Hedi86 is offline   Reply With Quote
Old 07-25-2018, 07:06 AM   #2
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 611
Default

Hi Hedi,

Could you email some unprocessed reads, e.g. 200000, to me in .gz format? I could then take a look and report back to you with what I find.

Cheers, Felix
fkrueger is offline   Reply With Quote
Old 07-26-2018, 04:07 AM   #3
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 611
Default

Hi Hedi86,

Thanks for providing the data. I have now run a few preliminary tests, here is what I found (in no particular order):

- From the sequence composition plot, the data does not look like (conventional) RRBS

- The data is directional, so the default alignment mode should be fine. If you align R1 on its own it needs default mode, Read2 on its own requires --pbat

- For both reads the first 6bp have a very strong sequence bias that is different to the rest of the reads. For my tests I removed the first 6bp (with Trim Galore --clip_r1 6 --clip_r2 6)

- Even after trimming, both reads start having more C (R1) or G (R2) from ~50bp onwards. I donít know why this would be (slide2)

- Read 1 is heavily affected by read-through Illumina adapter contamination, however R2 does not seem to contain any known adapter! (slide 3)

- The aligned data (UMD3.1) does not look very much like RRBS data, as in it does not seem to accumulate at similar positions, CpG islands or in promoters. It is rather scattered around the entire genome

- From what I tested, around 35% of the data mapped to the Cow genome (UMD3.1)

- ~26% of the data mapped to the Rat genome (Rnor_6.0). So there definitely seems to be contamination with rat data, not sure if there is anything else (we tested for human, mouse, rat, chicken, E coli, Drosphila, C. elegans, Arabidopsis, Yeast, PhiX, wasp, Lambda, MT)

- ~2-5% of the data mapped to PhiX

- The fragment sizes of paired-end aligned are mostly between 100 and 400bp long. This is way longer than would normally expect RRBS, or any BS-Seq data really, to be.

So my conclusions for the moment are:

1) The data doesn't seem to be conventional RRBS data. Maybe it would be worth going back to how the libraries were prepared exactly to see if there is clue how the data needs to be handled?

2) Also, the fact that R1 contains adapter, but R2 doesn't is highly unusual (I don't think I have ever seen anything like it yet). It is not entirely surprising to see that R2 is bringing the overall mapping efficiency of the data down.

3) The data also seems to be contaminated. Cow + Rat + PhiX together explains some 65% of the data, I don't currently know what the rest might be. Maybe you are working with different types of animal in your institute (salmon?) that are worth testing?

I hope this yields a few pointers, all the best, Felix
Attached Files
File Type: pdf Cow RRBS data assessment.pdf (434.7 KB, 7 views)
fkrueger is offline   Reply With Quote
Old 07-26-2018, 04:24 AM   #4
Hedi86
Member
 
Location: Norway

Join Date: Oct 2017
Posts: 13
Default

Dear Felix

thank you so much for your huge help, i really appreciate your time end effort.

i was suspected to contamination too as my very basic Blast experiment of trimmed reads (i removed first 6 bp too) does not revealed any Cow results. However again i appreciate your time and i will keep posting if i found something new.

Best Regards
Hedi86 is offline   Reply With Quote
Old 08-20-2018, 12:27 AM   #5
Hedi86
Member
 
Location: Norway

Join Date: Oct 2017
Posts: 13
Default

Hello again Felix

i just realized that library been prepared using RRBS NuGEN Ovation kit which demands its own method of trimming. this time i got 29% mapping efficiency using default parameters. still reads will map with rat genome, but i think this is because of very short reads after trimming (9 and 11 bp). since this particular trimming will not trim at minimum 20 bp, i did trim the trimmed reads again using trim-galore without using --RRBS and --illumina options. this time for both reads minimum seq length was 18 bp (i dont know why because default is 20 bp) and mapping efficiency is around 32%. do you think that second trim is necessary and if yes should i use --length <INT> option to obtain minimum 20 bp ?

best
Hedi86 is offline   Reply With Quote
Old 09-04-2018, 07:59 AM   #6
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 611
Default

Sorry for the late reply but I was travelling. And no I don't think a second trim is necessary, the 20bp is an arbitrary cutoff anyway because sequences shorter than that will typically not map (uniquely) anyway.
fkrueger is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:28 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO