Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • bwa-meth and bs-seq comparison

    Hi All, I have written a simple BS-Seq aligner that wraps bwa mem. It is available from here: https://github.com/brentp/bwa-meth

    There is a writeup of it here: http://arxiv.org/abs/1401.1129

    In the paper, I did a comparison with bismark, Last, bsmap and gsnap. The calls that I used are here: https://github.com/brentp/bwa-meth/tree/master/compare in their respective run-*.sh files.

    I will add gnumap-bs and further test bismark with bowtie 2 in the coming weeks.

    As the intent is to find the best-case for each aligner, I welcome comments on the parameters I've chosen for the other aligners--in all cases I report the set of parameters that performed best of the ones I tested. Though, I'm sure there is room for improvement. I'd also appreciate feedback on what bismark parameters work well for bowtie2.

    Finally, of course, I encourage use of and feedback on bwa-meth on paired-end BS-Seq reads, it works on both python2.7+ and python3.

  • #2
    Looks very promising! I'll have to give it a test.

    Regarding gnumaps-bs, I'd be curious to hear how your test with it turns out. I tested gnumaps-bs last month and was very disappointed in both how slow and inaccurate it was. In my hands at least, it took about as long as bismark to run and produced fewer correct alignments (at least if you filter at a decent MAPQ threshold). I was also displeased that it divides methylation scores across multiple bases when a read is a multimapper (I would argue that one should simply ignore such reads, but that's me).

    Comment


    • #3
      Hi Brent,

      The aligner looks interesting indeed. Thanks also for opening this up to the community, I think this is a very helpful approach. I agree it would be a good idea to rerun Bismark using Bowtie 2 since that it also geared towards aligning longer reads. If you follow the treatment suggested below you might possibly also want to rerun the other tools.

      A quick comment as to why I feel the Bismark-Bowtie 1 comparison is not adequate (by the way the parameters in the manuscript and in the shell script do not agree with each other):
      (a) the way you ran it allowed probably only 2 mismatches per entire 100bp read (not very much for 100bp),
      (b) does not allow indel mapping,
      (c) does not allow fully overlapping alignments (quirk of Bowtie) and
      (d) the reads were not adapter trimmed and therefore this is an unrealistic and unfair comparison (this would more show the point that BWA-mem performs local alignments while Bowtie doesn't, but this can be read in the tools' manuals).

      Couple of comments:

      - Adapter trimming: Not sure if I got this right, but you seem to have used Sickle to only trim poor qualities but you didn't trim adapters? In case you didn't then you absolutely should, because comparing tools that do local alignment (bwa-meth) can only be compared to adapter trimmed data with end-to-end alignment.

      - Did you deposit the data somewhere on GEO where one could have a look?

      - comparable number of mismatches/indels: I am not very familiar with the details of BWA and how many mismatches and/or indels it allows per 100bp read. This can be set in Bowtie 2 with the --score_min option, which is set to L,0,-0.2 by default in Bismark. This is quite stringent and allows a penalty of up to -20 for 100bp reads, which is roughly equivalent of 3 mismatches, or 2 mismatches plus one 2bp indel. Selecting --score_min L,0,-0.3 would be up to 5 mismatches, --score_min L,0,-0.4 up to 6 etc.

      - simulated data: When using Sherman you seem to have used --snp 1 as well as -e 0.01, however --snp forces an error rate of 0%. This means that you are looking at reads containing a single mismatch somewhere in the read with constant high qualities, so I do not really understand why there is a difference between Suppl Fig 3 and 4 which says some reads with poor qualities got removed (there shouldn't be any). Also in this scenario Bowtie 1 will fail to align fully overlapping reads, but Bowtie 2 shouldn't mind.

      So if you wanted a fair comparison you need to

      1) trim the data, I suggest just running Trim Galore on the data in default mode, i.e.
      trim_galore --paired file1.fq file2.fq

      2) Run Bismark as (possibly adjusting --score_min to something similar to what all tools are using)
      bismark --bowtie2 -X 800 /path/to/genome/ -1 file1.trimmed.fq -2 file2.trimmed.fq

      I daresay I shall be very surprised if the % on target rate is still so poor... Let me know if you've got any questions. Cheers, Felix

      Comment


      • #4
        Devon, that's interesting to hear about gnumap-bs, from their paper, it looks promising, though I hadn't heard about it until recently.

        Felix, thanks for the very thorough reply. I will:
        1. use sherman without --snp (I've updated the script on github but not re-done the analysis)
        2. try trim_galore in place of sickle
        3. run with bowtie2 and adjust the --score-min as you suggest.

        I suspect you are right about overlapping reads and I didn't consider that previously.

        I have updated the page to include the link to the real data: https://github.com/brentp/bwa-meth/tree/master/compare

        Comment


        • #5
          Hmm, I'll have to give the real data a try. I tried bwa-meth on a test dataset generated by Sherman and got some rather strange results. Perhaps it's due to the shorter reads (50bp paired-end), but for some reason most (~91%) of the alignments are to chr1, rather than being distributed correctly over the genome, and there's a huge over-abundance of indels. I have similar Sherman-generated datasets with longer reads that I'll try next.

          Regarding gnumaps-bs, yeah, the data in their paper looked really promising. I've never come up with an entirely satisfying explanation for the discrepancy.

          Comment


          • #6
            Hi Devon, since it's using bwa mem, bwa-meth is suited for longer reads. I added a warning in case the user tries to send shorter reads. Let me know if you have problems with 2x100 reads.

            Comment


            • #7
              I have updated a few of the aligners, including bismark with bowtie2 for the data simulated with the Sherman data with a 1% error rate. I am continuing to check parameter values for other aligners so these will change but bismark has improved quite a bit already.

              This image is here for un-trimmed data:

              https://gist.github.com/brentp/bf7d3.../qual-plot.png

              Trimming moves the last line to the right and doesn't change the others much.

              More info about the comparison is here: https://github.com/brentp/bwa-meth/tree/master/compare

              Comment


              • #8
                FYI, it turns out that the earlier problem I ran into was due to either a bug in bwa or a corrupted index (bwa was itself producing non-sensical alignments, updating my local copy and reindexing solved the problem).

                Since I was curious, I downloaded and trimmed the real dataset that you posted. Not having the bait regions, I just did some comparisons by looking at whether alignments overlapped. I used bison, since I can then use local alignments and more easily tweak the options. Below, "Same mapping" occurs whenever there's an overlap in the alignment produced by the two programs (since differences in soft-clipping might otherwise skew things a bit). The top number at the bottom are the total number and the lower number that sans alignments with MAPQ==0 (there are a lot fewer of those produced by bison, for purely algorithmic reasons). MAPQ scores are those calculated by BWA, except in the bison column (since it's not using bwa).

                So if one tweaks the score-min and computes MAPQ scores then something like the following would be produced by bismark. Alignments are different or unique to bwa-meth in around 1.5% of cases when ignoring alignments with MAPQ==0, which is pretty reasonable given that bowtie2 defaults to end-to-end alignments and bison (like bismark) doesn't allow discordant alignments or those where each read of a pair is mapped alone (mixed alignments in the bowtie nomenclature).


                My local development version of bison allows discordant and mixed alignments, which make things a bit more similar:


                Using local alignment increases the alignment rate with this dataset for bison (and presumably would for bismark as well, were it to allow that) and continues to increase the similarity a bit:


                Since there are scoring differences between bowtie2 and bwa mem, those are likely causing most of the remainder of the differences. Some of the others are due to my development version of bison still not quite dealing with discordant/mixed alignments exactly correctly.

                In any case, bwa-meth looks like a quite nice aligner. BTW, I would be hesitant to include the time/memory resources in the paper, since they're not really apples-to-apples comparisons (e.g., bismark is making methylation calls while it aligns, bwa-meth is doing that separately, though I would expect bwa-meth to still be faster).

                Comment


                • #9
                  Devon, that's a nice analysis and a good way to visualize the differences, thanks for trying it on the real data. Would you mind sharing the bison parameters you recommend?

                  I updated the git repo to contain the regions (bwa-meth/compare/data/mm10.capture-regions.bed.gz).

                  I agree about time and memory constraints. I will add a note of warning to the paper. I think it does say in the paper that we are most interested in accuracy and less so in run-time and memory as long as both of those are reasonable.

                  Comment


                  • #10
                    It looks like "-N 1 --very-sensitive" is good enough ("--very-sensitive" results in a similar seed size being used by bowtie2 as is used by bwa mem, so this is unsurprising). Differences between this and the results you found are due to differences in trimming.



                    I also included just "-N 1" to show how changing the seed size has its benefits. I also included a version adding "--maxins 1000", mostly as a cautionary note since I noticed that your initial tests with bismark used that (the default is almost always fine there). The "-N 1 --very-sensitive" options are probably good general options for real data. I didn't include any singletons or discordant alignments in this since the production version of bison doesn't currently allow them (presumably that'd increase the on target percentage a bit).

                    BTW, targeted BS-seq data is always a bit strange, particularly since it tends to be stranded. You can actually exploit that to get even better results:



                    That's due to aligning against just the original bottom strand. A more correct implementation that I thought of after seeing a similar dataset to yours previously was to simply penalize alignments to the non-target strand (similar to how bwa mem penalizes paired-end reads aligned as singletons). I never implemented that, but it'd be easy to do in bison (bismark as well) and would probably produce superior results for datasets like this.

                    Comment


                    • #11
                      BTW, there is bias in the 5' and 3' end of the mappings (likely in bwa-meth's results as well, I haven't checked). I haven't checked, but I hope your methylation extractor can be told about that.

                      Comment


                      • #12
                        Hi Brent,

                        Thank you for contributing a promising aligner. I have been looking for a bisulfite aligner that provides MAPQ values, and bwa-meth seems to fit the bill. (Thanks also to Devon Ryan for bringing bwa-meth to my attention in a separate thread.)

                        I have a couple of questions:
                        1. I will be receiving a sizable WGBS Illumina dataset with 2 x 75-bp reads, and I notice that bwa-meth issues a warning for read lengths < 80 bp. Do you think I should consider using bwa-meth for these data, or do you think 75-bp reads are too short? According to the BWA man page, BWA-MEM is recommended for reads as short as 70-bp. Since sequence complexity is lower with 3-base alignment, it seems reasonable that the minimum read length might be higher than for non-bisulfite alignment. I am curious to know how you settled on 80 bp as the warning threshold.
                        2. On another dataset (with 2 x 100-bp reads), I have performed mapping with bwa-meth to produce reasonable-looking BAM files. However, I am having problems extracting methylation calls. Here is the traceback:
                          $ bwa-meth-0.09/bwameth.py tabulate --reference hg19_lambda.fa --threads 2 --prefix out --bissnp BisSNP/BisSNP-0.69.jar 2_221_1_NA_005.bam
                          java -Xmx15g -jar BisSNP/BisSNP-0.69.jar \
                          -R hg19_lambda.fa \
                          -I 2_221_1_NA_005.bam \
                          -T BisulfiteGenotyper \
                          --trim_5_end_bp 2 \
                          --trim_3_end_bp 2 \
                          -vfn1 out.meth.vcf -vfn2 out.snp.vcf \
                          -mbq 20 \
                          -mmq 25 \
                          -nt 2
                          out.meth.vcf
                          Traceback (most recent call last):
                          File "/mnt/ngs/analysis/software/bwa-meth/bwa-meth-0.09/bwameth.py", line 549, in <module>
                          main(sys.argv[1:])
                          File "/mnt/ngs/analysis/software/bwa-meth/bwa-meth-0.09/bwameth.py", line 511, in main
                          sys.exit(tabulate_main(args[1:]))
                          File "/mnt/ngs/analysis/software/bwa-meth/bwa-meth-0.09/bwameth.py", line 468, in tabulate_main
                          header="ordered")):
                          TypeError: reader() got an unexpected keyword argument 'skip_while'

                          It looks like the reader() function is confused by the "skip_while=lambda toks: ..." argument. Any ideas?


                        Thanks in advance,
                        Andrew

                        Comment


                        • #13
                          I would have to do some tests on 75 base reads but indeed, bwa mem will work fine on them, I just wanted to be conservative in the recommendation. I assume that it would be fine. For you problem with `skip_while` just do:

                          pip install -U toolshed

                          to get a newer version of toolshed. If that still doesn't work, let me know.

                          Comment


                          • #14
                            by the way, in the bwa-meth repository, there is a script for extracting methylation call from any sorted bam file regardless of it's origin (aligner). It is here:



                            It may need more testing and does some unfriendly stuff like dumping the output to files in the current directory, but does give sane results when compared to, e.g. bismark_methylation_extractor.

                            Comment


                            • #15
                              Originally posted by brentp View Post
                              ... but does give sane results when compared to, e.g. bismark_methylation_extractor.
                              Can you please elaborate on this, Brent?
                              Cheers,
                              Pete

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              47 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X