Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Bismark: paired-end low mapping efficiency

    Hi all.

    I know this has been posted already quite a few times before, but I've been trying most of the suggestions offered here on the forum and other forums and I'm still stuck with the same problem. So, I'm hoping that you guys still have some other ideas to help me out here :-)

    This is my problem:
    I am getting really low mapping efficiencies (15 to 18%) with my 90 bp paired-end bisulfite sequencing reads using Bismark/Bowtie2.

    I have been quality trimming my sequences to get rid of adapter sequences and low quality base pairs using Trim Galore! at its standard settings, which are quite stringent.
    FastQC didn't report any problems, except for those that can be expected for bisulfite treated data (ie. k-mer content). So no over-represented or duplicated sequences were detected.
    I have been using different Bowtie2 settings, like increasing the sensitivity function, decreasing the mapping penalties through the scoring function, setting the insert size to a range as wide as 0 to 1500 bp... but nothing got me higher mapping efficiencies than 15 to 18%. Ambiguous mapped reads were each time around 8%, which is to be expected considering the fact that the genome I am working with is known to have a quite high number of gene duplications.

    In conclusion, I am really running out of options and ideas and I am hoping that you guys might have some more experience with it so I can get these efficiencies up!

    Thanks a ton!

  • #2
    Other problems you might have with your library could entail some technical problem with the paired-end nature of your reads or contamination with reads from another species.

    To test the technical problem I would suggest running the reads in single-end mode. If this brings up the mapping efficiency drastically it would indicate that something weird is going on with the read pairs. May I ask which kind of library preparation you performed? We found for example that PBAT libraries very rarely align with more than 25-40% in paired-end mode, and this seemed to be mainly caused by chimaeric reads where the two reads of a pair ended up aligning somewhere else in the genome, often even on different chromosomes. We found that as much as 40% of the reads looked something like a bisulfite-Seq Hi-C experiment. As a solution we are first running PE alignments while specifying --unmapped, and then running SE alignments with the left-over reads to get as much information out of the same as possible. In any case, if you find that single-end alignments work much better you can try to narrow the source of errors down by systematically trimming the reads either from the 5' side (e.g. trim 10, 20, 30bp) or the 3' side and see if this drives the efficiency up.

    Also it is possible that you've got some contaminating reads in the sample. Usual suspects would of course be human contamination (sample prep?), or any other organism that is typically used or prepared in your lab. To get a quick idea I would maybe only use a subset of reads (use -u 100000) and probably also in single-end mode to align against different genomes. Not sure which library prep you used but to be safe you could run it in --non_directional mode to quickly find out if that is a problem. Feel free to send me an email with your findings, I have spent quite a while trying to identify problems and might be able to assist you further. Cheers, Felix

    Comment


    • #3
      So that this thread can be closed and other people will be helped with it, I'll give the solution to my problem, on which Felix and I have been working over e-mail. Felix, thanks again for your help, it is very much appreciated!

      Felix has been running some tests on a subset of my data, which he has been running with both Bowtie(1) and Bowtie2.
      To test for potential contamination, alignments were run against the human, mouse and PhiX reference genome. That excluded major contaminants from these organisms in the library.

      Using FastQC we confirmed that the data looked absolutely fine quality wise. This was re-confimed by the fact that trimming with TrimGalore also didn’t remove much.

      In the end, using a more relaxed scoring function (--score_min L,0,-0.6) as the one that is passed on to Bowtie2 by Bismark by default, did the trick and increased my unique mapping efficiency to ~40%. On top of that, another 24% or so of sequences aligned to multiple places in the genome, so they weren't placed anywhere uniquely. Adding these numbers together we almost reached 65% of reads aligning, which is close to a good mapping efficiency.

      However, relaxing the mapping parameters a lot may also increase the chance of getting misalignments, so there is a tradeoff between getting more sequences to align and allowing more mismatches/indels (hence the better performance of Bowtie2 in our case). The relaxation in this case was necessary as the genome assembly of the organism I am using is still a bit rough, so a few more mismatches and possibly also indels should been allowed.

      In conclusion, we got about 65% of the total reads mapping by relaxing the Bismark/Bowtie2 scoring function to (--score_min L,0,-0.6), and by setting the fragment insert size boundaries to (-I 0) and (-X 1000). Also, Felix is a great guy

      Cheers,
      Dieter

      Comment


      • #4
        Dieter, I have noticed that I get many more reads mapping when I map them separately (40% for each individual pair) than when I map in a paired fashion (20%). Basically, I double my number of reads when I don't used paired-end mapping. I am using Bismark in --pbat mode for paired (bowtie1).

        Did you observe the same?

        Cheers,

        Nenad Bartonicek

        Comment


        • #5
          paired-end mapping problem

          Hi Felix,

          I am trying to map some 100-bp paired-end RRBS reads to hg19 using bismark.

          I trimmed the adapter sequences from the fastq file using trim_galore and feed the paired-end reads to bismark using the following options:

          $<path to bismark>/bismark -u 10000 --bowtie2 --non_directional --path_to_bowtie <path to bowtie>/bowtie2-2.2.2/ <path to hg19>/hg19/ -1 rrbs-mspi_S1_L001_R1_001_trimmed.fq -2 rrbs-mspi_S1_L001_R2_001_trimmed.fq

          but get very low mapping efficency like 0.1%. The mapping efficiency of single-end reads are around 60%.

          I tried different versions of bismark, v0.11.1, v0.12.1 and the latest test version downloaded from seqanswers, neither of them can map with a higher efficency.

          I also tried to adjust the --score-min, and saw a sudden increase of mapping efficiency from 3.4% to 86.7% when i switch from --score-min L,0,-1 to --score-min L,0,-1.9. But the mapping is very loose, a lot of mismatching/insertions were there.

          Do you have any clue I can map the paired-end mapping better?

          Thank you very much

          GZ

          Comment


          • #6
            Hi Gandalf,

            This sounds quite odd indeed. A mapping efficiency of 0.1% is normally a sign of using a wrong genome to align against, but the fact that SE reads align with 60% would suggest that it's not. Using a very high percentage of allowed errors will certainly result in quite dodgy results (this would be > 30 mismatches or indels per read...).

            Would you mind sending me say 100K reads via email so I can investigate? Best, Felix

            Comment


            • #7
              We found the problem now and it was the wrong way I was using trim_galore. I trimmed the paired-end reads separately and this will cause the read1 and read2 "out of sync" as said by Felix. Using paired-end trimming solved the problem.

              Felix is a great guy!

              Originally posted by fkrueger View Post
              Hi Gandalf,

              This sounds quite odd indeed. A mapping efficiency of 0.1% is normally a sign of using a wrong genome to align against, but the fact that SE reads align with 60% would suggest that it's not. Using a very high percentage of allowed errors will certainly result in quite dodgy results (this would be > 30 mismatches or indels per read...).

              Would you mind sending me say 100K reads via email so I can investigate? Best, Felix

              Comment


              • #8
                Hi Felix,

                I trimmed my bisulfite sequencing reads and tried mapping using Bismark. I cannot get mapping more than 49% whatever I do.

                The following combinations I tried for cleaning:

                1) java -jar trimmomatic-0.30.jar PE -phred33 s1.fq s2.fq s1_paired.fq s1_unpaired.fq s2_paired.fq s2_unpaired.fq ILLUMINACLIP:adapters.fa:2:30:10 HEADCROP:6 MINLEN:36
                (various lengths of HEADCROP: 6/13/20)

                2) java -jar trimmomatic-0.30.jar PE -phred33 s1.fq s2.fq s1_paired.fq s1_unpaired.fq s2_paired.fq s2_unpaired.fq ILLUMINACLIP:adapters.fa:2:30:10 HEADCROP:6 CROP:70 MINLEN:36
                (various lengths in CROP: 60/70/80)

                3) Also, tried trim_galore to remove adapters and last base from sequences (again a lot of times cropping sequence length)

                Tested all these with Bismark with bowtie1 (with all -n values 0 - 3, -X from 500 - 1700) and found that I get a range of mapping from 46 - 49 % and not more than that.

                Please help me understand what I am doing wrong?

                Thanks very much,
                Seekq

                Comment


                • #9
                  Hi seekq,

                  I am not very surprised that changing the -X and -n parameters don't change the results drastically. The reason for that is that the standard fragment length is typically ~70 to 300bp, and hardly anything will be longer than 500bp. Not quite sure if Trimmomatic also trims poor quality off by default, but Trim Galore only leaves residues in the reads if the basecall quality was higher than 20. For this trimmed data you would allow a maximum of 2 good quality mismatches (value 30+), or rarely up to 3 over the entire read (if the values are 20). This behaviour is governed by the mismatch ceiling parameter -e which defaults to 70. If you really wanted to allow more mismatches you would also have to change the -e limit, to e.g. 150 (5 mismatches) or 300 (10 mismatches), but please note that I would not recommend doing this since it would be quite error prone and also take much longer to align.

                  Here are a couple of other things I would try first;
                  - you haven't mentioned the read length or the genome you are working with, is it a good assembly or still quite rough? Do you expect many SNPs? Do you expect insertions or deletions in the reads? In any case I would try running Bismark in Bowtie 2 mode (--bowtie2), which allows mapping with InDels and also has a function (--score_min) for the number of mismatches and/or InDels that scales with the read length. This is set very stringently by default (L,0,-0.2), but could be relaxed somewhat to see if this helps your mapping efficiency. As a guideline, --score_min L,0,-0.2 would allow ~3 mismatches for a 100bp read, --score_min L,0,-0.6 would allow up to 10 mismatches and/or Indels. If you just want to test these things you can use the -u option to just use a subset of the reads.
                  - You could try mapping read 1 and read 2, or a subset thereof, individually to see if the mapping efficiency is much higher in single-end mode. This might give you a clue what you want to look for, e.g. specific QC issues in one of the reads, hybrid reads, etc.

                  Feel free to run a few tests with Bowtie 2 and send me the results (or FastQC, Bismark reports) via email if you want me to take a look. Best, Felix

                  Comment


                  • #10
                    hi all,

                    This is my problem:
                    I Amplified several Genes' fragment(~500bp) as our sequencing library.I am getting really low mapping efficiencies (<1.5%) in pair-end mode but high mapping efficiencies (>90%) with my 300 bp paired-end bisulfite sequencing reads using Bismark/Bowtie2.

                    I have been quality trimming my sequences to get rid of adapter sequences and low quality base pairs using Trim Galore! at its standard settings, which are quite stringent. Also, FastQC didn't report any problems.

                    Strange things is that I got very high mapping efficiencies (95%) for each end of paired reads, but in pair-end mode, I only got very low mapping efficiencies(<1.5%).

                    I am really running out of options and ideas and I am hoping that you guys might have some more experience with it

                    Comment


                    • #11
                      The most obvious point would be to ask whether you specified --paired when running Trim Galore, since not doing so would cause the paired end reads to become out of sync. Other than that it is difficult to give you a diagnosis without more details. Would you mind sending me ~100K reads from each side (before trimming) via email so I can take a look myself ([email protected])?

                      Comment


                      • #12
                        Hi, i am also getting low mapping efficiency, however i want to check what percentage of total cytosines (from the genome) are covered in the aligned region. Is there any way for that.

                        any help would be appreciated

                        Comment


                        • #13
                          All,

                          I'm also curious about how Bismark handles PBAT PE libraries. I recently performed some PE 100bp reads from the Epignome kit. I also observed very low mapping rates for these reads (bismark v0.10.0). I went messing around on a few different techniques with the following mapping results:



                          Now, I'm still trying to wrap my head around the whole PBAT setup and exactly what is getting sequenced and what should be mapped where. I get confused however given that if I run each of the read pairs separately through a single end bismark alignment I recover many more mapped reads for R2 run with the -pbat flag and also for R1 run through the 'normal' directional method. If I just aim to perform a non-directional SE alignment on both, I see similar mapping rates.

                          However, I can't seem to recover a decent mapping rate for the paired end bismark runs. I assume it has something to do with the reads not playing well for each of their four bowtie attempts.

                          At the moment I'm trying to get around all of this by:

                          directional PE alignment to start and grabbing the unmapped R1 and R2 reads out the end.
                          unmappedR1 is then mapped SE directional
                          unmappedR2 is mapped SE directional under -pbat conditions.

                          I then combine all the results for downstream analysis.

                          it's not perfect, however it seems to get my mapping levels back to what I expect. Is there just something else going on with these reads? I start to think that paired-end for bisulfite is more headache than it is worth sometimes.
                          Last edited by eicht021; 08-14-2014, 10:25 PM.

                          Comment


                          • #14
                            Hi eicht,

                            First of all, I can feel your pain, we have also spent quite a while trying to find the reason or solutions to poor mapping efficiencies for PE PBAT libraries. Generally, the original PBAT protocol only produces alignments to the complementary strands (CTOT and CTOB), however the EpiGnome kit produces standard, i.e. directional, libraries that align to the OT and OB strands only. This means that R1, and also paired-end alignments, should only align to the OT and OB strands; however R2 needs to be aligned to the CTOT and CTOB strands because it is the reverse complement of R1. This is exactly what the option --pbat does, and in that sense the mapping efficiencies you are describing make perfect sense.

                            We haven’t worked with data generated by EpiGnome but used a homebrew PBAT PE protocol, but I suppose the mechanisms will be the same since they seem to be caused by the random priming step in the protocol. Since we saw the drastically reduced mapping efficiencies in PE mode we reasoned that the protocol might generate chimeric reads, i.e. reads starting at some position in the genome and ending at another. To test this we mapped the data in SE mode and paired up pairs where both R1 and R2 produced alignments. We then imported this data into SeqMonk as Hi-C data, and looked at the amount of reads that had its partner in trans (on another chromosome) as a fairly crude measure. As you can see on the screenshot attached, when we quantitated where in the genome the partner reads aligned to of reads that aligned to chromosome 1, we found that they seem to be scattered pretty much all over the genome in no discernible pattern! In this case a bit more than 40% of reads had their partner read in trans on another chromosome (partner reads aligning to chr 1 but far away were not even considered here).

                            Now if you would perform some similar pairing up on your reads you might get a similar answer, and if you add up the numbers of properly aligning PE reads and the number of chimeric reads (which won’t be reported by Bismark since they are no valid PE alignments), you should be very close to the mapping efficiencies you achieve in SE mapping.
                            it's not perfect, however it seems to get my mapping levels back to what I expect. …
                            We came to a similar conclusion, and we by now also settled on the procedure you described (someone coined the term ‘Dirty Harry mode’ because it doesn’t seem to be the cleanest of methods…):

                            (1) directional PE alignment to start and grabbing the unmapped R1 and R2 reads out the end using the option ‘--unmapped’. Properly aligned PE reads should be methylation extracted using --no_overlap.
                            (2) unmappedR1 is then mapped SE directional
                            (3) unmappedR2 is mapped SE directional under -pbat conditions.
                            Single-end aligned R1 and R2 can then be methylation extracted normally as they should in theory map to different places in the genome anyway so don’t require attention to overlapping reads.

                            As another suggestion:
                            Like all PBAT applications, the first couple of bp (6bp in the case of the EpiGnome kit) produce a noticeable bias in sequence composition as well as later on in the M-bias plots (see an example here http://www.bioinformatics.babraham.a...SE_report.html). EpiCentre recommend ignoring the first 6bp of the methylation call string themselves, but it is probably a good idea to remove these residues even before mapping because we have seen that they may reduce the mapping efficiency somewhat. This could be done with --clip_r1 6 and --clip_r2 6 within Trim Galore.

                            I start to think that paired-end for bisulfite is more headache than it is worth sometimes.
                            At least for paired-end PBAT protocols I would probably second this notion, however even sequencing longer SE reads (e.g. 150bp SE) will likely start reading into the chimeric portion of the read so also have some mappability problems...
                            Attached Files

                            Comment


                            • #15
                              Thanks much for the detailed response. That is very interesting regarding the chimeric reads and I imagine could seriously impact certain types of analysis.

                              I'm glad we agree on the 'Dirty Harry' method! One question regarding this, I would plan to perform methylation extraction with --no_overlap for the PE reads. Is there an easy way to combine methylation extraction outputs between the PE and SE results? I'm sure I could come up with some method, but perhaps there is already an easy way to do this.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              13 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              69 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X