Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • strainge mapping efficency for paired end seq using Bismark

    although similar issue been reported several times but i tried all the suggestions but problem still exist. Im working with RRBS data obtained from Cow sperm DNA. Standard illumine adapters and low quality sequences were removed from pair reads using –paired option in trim galore. As I can see in fastQC report, read 1 is ok but read 2 is not good as read 1 and for all samples read 2 has a lower quality compared to read 1. In Bismark and using Bowtie 2 alignment, I tried pair end alignment first (PE) with different --Score-min adjustments and surprisingly the mapping efficiency was very low. Following some suggestions in different forums, I tried with –pbat option and furthermore I tried to align the reads separately, as well. But as you can see bellow, there is a huge difference in mapping efficiency between read 1 and 2 and generally pairwise alignment mapping efficiency is very low. In addition, I realized that if in single end alignment I include –pbat, for read 1 and 2 mapping efficiency decreased and increased, respectively. Do you think that my read 2 is causing the problem in pair end alignment? Any suggestion for improving the mapping efficiency greatly appreciated.



    --Score-min Default (L, 0, -0.2)
    PE: 11.4 % PE with --pbat: 0 % SE, R1: 27.3 % SE, R2: 0 %

    --Score-min (L, 0, -0.6)
    PE: 19.2 % PE with --pbat: 0 % SE, R1: 29.3 % SE, R2: 0.1 %

    --Score-min (L, -0.6, -0.6)
    PE: 19.2 % PE with --pbat: 0 % SE, R1: 29.4 % SE, R2: 0.1 %

    --Score-min (L, -0.6, -1)
    PE: 29.5 %
    PE with --pbat: 0.3 %
    SE, R1: 30.7 % SE, R1 with --pbat: 16.4 %
    SE, R2: 0.7 % SE, R2 with --pbat: 28.7 %

  • #2
    Hi Hedi,

    Could you email some unprocessed reads, e.g. 200000, to me in .gz format? I could then take a look and report back to you with what I find.

    Cheers, Felix

    Comment


    • #3
      Hi Hedi86,

      Thanks for providing the data. I have now run a few preliminary tests, here is what I found (in no particular order):

      - From the sequence composition plot, the data does not look like (conventional) RRBS

      - The data is directional, so the default alignment mode should be fine. If you align R1 on its own it needs default mode, Read2 on its own requires --pbat

      - For both reads the first 6bp have a very strong sequence bias that is different to the rest of the reads. For my tests I removed the first 6bp (with Trim Galore --clip_r1 6 --clip_r2 6)

      - Even after trimming, both reads start having more C (R1) or G (R2) from ~50bp onwards. I don’t know why this would be (slide2)

      - Read 1 is heavily affected by read-through Illumina adapter contamination, however R2 does not seem to contain any known adapter! (slide 3)

      - The aligned data (UMD3.1) does not look very much like RRBS data, as in it does not seem to accumulate at similar positions, CpG islands or in promoters. It is rather scattered around the entire genome

      - From what I tested, around 35% of the data mapped to the Cow genome (UMD3.1)

      - ~26% of the data mapped to the Rat genome (Rnor_6.0). So there definitely seems to be contamination with rat data, not sure if there is anything else (we tested for human, mouse, rat, chicken, E coli, Drosphila, C. elegans, Arabidopsis, Yeast, PhiX, wasp, Lambda, MT)

      - ~2-5% of the data mapped to PhiX

      - The fragment sizes of paired-end aligned are mostly between 100 and 400bp long. This is way longer than would normally expect RRBS, or any BS-Seq data really, to be.

      So my conclusions for the moment are:

      1) The data doesn't seem to be conventional RRBS data. Maybe it would be worth going back to how the libraries were prepared exactly to see if there is clue how the data needs to be handled?

      2) Also, the fact that R1 contains adapter, but R2 doesn't is highly unusual (I don't think I have ever seen anything like it yet). It is not entirely surprising to see that R2 is bringing the overall mapping efficiency of the data down.

      3) The data also seems to be contaminated. Cow + Rat + PhiX together explains some 65% of the data, I don't currently know what the rest might be. Maybe you are working with different types of animal in your institute (salmon?) that are worth testing?

      I hope this yields a few pointers, all the best, Felix
      Attached Files

      Comment


      • #4
        Dear Felix

        thank you so much for your huge help, i really appreciate your time end effort.

        i was suspected to contamination too as my very basic Blast experiment of trimmed reads (i removed first 6 bp too) does not revealed any Cow results. However again i appreciate your time and i will keep posting if i found something new.

        Best Regards

        Comment


        • #5
          Hello again Felix

          i just realized that library been prepared using RRBS NuGEN Ovation kit which demands its own method of trimming. this time i got 29% mapping efficiency using default parameters. still reads will map with rat genome, but i think this is because of very short reads after trimming (9 and 11 bp). since this particular trimming will not trim at minimum 20 bp, i did trim the trimmed reads again using trim-galore without using --RRBS and --illumina options. this time for both reads minimum seq length was 18 bp (i dont know why because default is 20 bp) and mapping efficiency is around 32%. do you think that second trim is necessary and if yes should i use --length <INT> option to obtain minimum 20 bp ?

          best

          Comment


          • #6
            Sorry for the late reply but I was travelling. And no I don't think a second trim is necessary, the 20bp is an arbitrary cutoff anyway because sequences shorter than that will typically not map (uniquely) anyway.

            Comment


            • #7
              Hello again Felix

              after long time and sequencing twice, we never managed to identify the primers on Read 2 in our libraries. therefore we continued with another kit and this time we identified primers in both reads. however, the mapping efficiency is still low and i even run Bismark in SE mode and i noticed that read 2 has a very low mapping efficiency and Read 1 alone has the same mapping efficiency when we run in PE mode. here is the summary

              PE mode, Score_min (Default), mapping efficiency is 27.7
              PE mode, Score_min (L,0,-0.6), mapping efficiency is 31.9

              SE mode, Read 1, Score_min (Default), mapping efficiency is 30.3
              SE mode, Read 1, Score_min (L,0,-0.6), mapping efficiency is 32

              SE mode, Read 2, Score_min (Default), mapping efficiency is 0.1
              SE mode, Read 2, Score_min (L,0,-0.6), mapping efficiency is 0.3

              so apparently Read 2 has some problems and I already can see that the quality is not as good as Read 1 from fastQC report.

              this time we don't have any contamination ( i tested the sequences against human, mice, rat and zebrafish genome) and I don't have any further clue why Read 2 is so unusual. any comment is highly appreciated.

              Best

              Comment


              • #8
                If you run paired-end sequencing data on it's own as single-end reads, you need to specify --pbat for Read 2. I am sure this will bring the mapping efficiency up somewhat.

                If you wanted you could send some ~200K untrimmed raw sequences to me via mail (gzipped please, fits as an attachment), and I could take a quick look myself?

                Cheers, Felix

                Comment


                • #9
                  thanks a lot, I just have sent you the files.

                  Best

                  Comment


                  • #10
                    Thanks for those. I'll be out this afternoon but will try to get back to you tonight or tomorrow.

                    Comment


                    • #11
                      I am afraid the situation doesn't appear to have changed much:

                      - the data you sent over appears to have been adapter trimmed already, so I can't really say much about what the really raw data looked like

                      - Using paired-end or single-end mapping (default for R1, --pbat for R2), I got some 30-35% mapping efficiency agains the Bos taurus genome. Relaxing the mapping parameters to L,0,-0.6 only have a moderate increase in mapping efficiency (~10%).

                      - some 5-7% of the data appears to be PhiX

                      - some 40% of the data appears to be coming from Rattus norvegicus again

                      Taken together, this brings us in a region of >80% uniquely mapped data, which is actually a very decent value for any bisulfite library if you considered mapping efficiency alone... I haven't looked at whether the data looks more like RRBS than last time, but judging from the FastQC profiles I am not very hopeful to be honest. Hope this helps.

                      Comment


                      • #12
                        thank you for your help

                        yes and sorry i realized i have sent trimmed file. i have sent the raw file now again and if you have time i would appreciate your effort again.

                        i managed to get the same results with R2 --pbat, but i guess if we consider downstream analyses we have to use pair mode anyway, because we need to have one SAM file for each sample, right ? is there a way to use --pbat for R2 in PE mode ?

                        if we want to report mapping efficiency, could we report it as R1 + R2 mapping efficiency or we have to report it based on PE mode ?

                        Best Regards

                        Comment


                        • #13
                          Don't worry, Bismark is taking care of all that for you. Just run the alignments in paired-end mode, and use the BAM file further downstream. There is no obvious reason why you should be running single-end alignments whatsoever.

                          So yes, just use the paired-end mapping efficiency.

                          I just had a quick look at the untrimmed files. Apart from a bias at the first 3 bases, and quite some Illumina sequence adapter contamination it all looks pretty normal. After trimming the data is still ~ 40% rat, some PhiX etc. So nothing has changed on that front.

                          Comment


                          • #14
                            Dear Felix

                            thank you for your time, we really dont know why we have data alignment with Rat, we never had Rat samples in our lab. the platform used for DNA extraction never been exposed to rat samples and it is fully automated. sequencing center also confirmed that they didn't have any rat samples for quite long time...

                            thank you again for your help

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Yesterday, 06:37 PM
                            0 responses
                            10 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 06:07 PM
                            0 responses
                            9 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            49 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            67 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X