SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Normalize coverage in Meth-Seq data crackerjack.tej SOLiD 0 05-23-2013 10:23 PM
RRBS library generation-meth dCTP? shawpa Sample Prep / Library Generation 1 03-22-2012 06:00 AM
Roundup of next-next-gen in Nat Meth sci_guy The Pipeline 1 07-12-2008 01:47 AM

Reply
 
Thread Tools
Old 02-04-2014, 04:36 AM   #1
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72
Arrow bwa-meth and bs-seq comparison

Hi All, I have written a simple BS-Seq aligner that wraps bwa mem. It is available from here: https://github.com/brentp/bwa-meth

There is a writeup of it here: http://arxiv.org/abs/1401.1129

In the paper, I did a comparison with bismark, Last, bsmap and gsnap. The calls that I used are here: https://github.com/brentp/bwa-meth/tree/master/compare in their respective run-*.sh files.

I will add gnumap-bs and further test bismark with bowtie 2 in the coming weeks.

As the intent is to find the best-case for each aligner, I welcome comments on the parameters I've chosen for the other aligners--in all cases I report the set of parameters that performed best of the ones I tested. Though, I'm sure there is room for improvement. I'd also appreciate feedback on what bismark parameters work well for bowtie2.

Finally, of course, I encourage use of and feedback on bwa-meth on paired-end BS-Seq reads, it works on both python2.7+ and python3.
brentp is offline   Reply With Quote
Old 02-04-2014, 05:17 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Looks very promising! I'll have to give it a test.

Regarding gnumaps-bs, I'd be curious to hear how your test with it turns out. I tested gnumaps-bs last month and was very disappointed in both how slow and inaccurate it was. In my hands at least, it took about as long as bismark to run and produced fewer correct alignments (at least if you filter at a decent MAPQ threshold). I was also displeased that it divides methylation scores across multiple bases when a read is a multimapper (I would argue that one should simply ignore such reads, but that's me).
dpryan is offline   Reply With Quote
Old 02-04-2014, 08:52 AM   #3
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 618
Default

Hi Brent,

The aligner looks interesting indeed. Thanks also for opening this up to the community, I think this is a very helpful approach. I agree it would be a good idea to rerun Bismark using Bowtie 2 since that it also geared towards aligning longer reads. If you follow the treatment suggested below you might possibly also want to rerun the other tools.

A quick comment as to why I feel the Bismark-Bowtie 1 comparison is not adequate (by the way the parameters in the manuscript and in the shell script do not agree with each other):
(a) the way you ran it allowed probably only 2 mismatches per entire 100bp read (not very much for 100bp),
(b) does not allow indel mapping,
(c) does not allow fully overlapping alignments (quirk of Bowtie) and
(d) the reads were not adapter trimmed and therefore this is an unrealistic and unfair comparison (this would more show the point that BWA-mem performs local alignments while Bowtie doesn't, but this can be read in the tools' manuals).

Couple of comments:

- Adapter trimming: Not sure if I got this right, but you seem to have used Sickle to only trim poor qualities but you didn't trim adapters? In case you didn't then you absolutely should, because comparing tools that do local alignment (bwa-meth) can only be compared to adapter trimmed data with end-to-end alignment.

- Did you deposit the data somewhere on GEO where one could have a look?

- comparable number of mismatches/indels: I am not very familiar with the details of BWA and how many mismatches and/or indels it allows per 100bp read. This can be set in Bowtie 2 with the --score_min option, which is set to L,0,-0.2 by default in Bismark. This is quite stringent and allows a penalty of up to -20 for 100bp reads, which is roughly equivalent of 3 mismatches, or 2 mismatches plus one 2bp indel. Selecting --score_min L,0,-0.3 would be up to 5 mismatches, --score_min L,0,-0.4 up to 6 etc.

- simulated data: When using Sherman you seem to have used --snp 1 as well as -e 0.01, however --snp forces an error rate of 0%. This means that you are looking at reads containing a single mismatch somewhere in the read with constant high qualities, so I do not really understand why there is a difference between Suppl Fig 3 and 4 which says some reads with poor qualities got removed (there shouldn't be any). Also in this scenario Bowtie 1 will fail to align fully overlapping reads, but Bowtie 2 shouldn't mind.

So if you wanted a fair comparison you need to

1) trim the data, I suggest just running Trim Galore on the data in default mode, i.e.
trim_galore --paired file1.fq file2.fq

2) Run Bismark as (possibly adjusting --score_min to something similar to what all tools are using)
bismark --bowtie2 -X 800 /path/to/genome/ -1 file1.trimmed.fq -2 file2.trimmed.fq

I daresay I shall be very surprised if the % on target rate is still so poor... Let me know if you've got any questions. Cheers, Felix
fkrueger is offline   Reply With Quote
Old 02-04-2014, 09:34 AM   #4
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72
Default

Devon, that's interesting to hear about gnumap-bs, from their paper, it looks promising, though I hadn't heard about it until recently.

Felix, thanks for the very thorough reply. I will:
1. use sherman without --snp (I've updated the script on github but not re-done the analysis)
2. try trim_galore in place of sickle
3. run with bowtie2 and adjust the --score-min as you suggest.

I suspect you are right about overlapping reads and I didn't consider that previously.

I have updated the page to include the link to the real data: https://github.com/brentp/bwa-meth/tree/master/compare
brentp is offline   Reply With Quote
Old 02-05-2014, 01:27 AM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Hmm, I'll have to give the real data a try. I tried bwa-meth on a test dataset generated by Sherman and got some rather strange results. Perhaps it's due to the shorter reads (50bp paired-end), but for some reason most (~91%) of the alignments are to chr1, rather than being distributed correctly over the genome, and there's a huge over-abundance of indels. I have similar Sherman-generated datasets with longer reads that I'll try next.

Regarding gnumaps-bs, yeah, the data in their paper looked really promising. I've never come up with an entirely satisfying explanation for the discrepancy.
dpryan is offline   Reply With Quote
Old 02-05-2014, 06:46 AM   #6
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72
Default

Hi Devon, since it's using bwa mem, bwa-meth is suited for longer reads. I added a warning in case the user tries to send shorter reads. Let me know if you have problems with 2x100 reads.
brentp is offline   Reply With Quote
Old 02-08-2014, 06:07 AM   #7
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72
Default

I have updated a few of the aligners, including bismark with bowtie2 for the data simulated with the Sherman data with a 1% error rate. I am continuing to check parameter values for other aligners so these will change but bismark has improved quite a bit already.

This image is here for un-trimmed data:

https://gist.github.com/brentp/bf7d3.../qual-plot.png

Trimming moves the last line to the right and doesn't change the others much.

More info about the comparison is here: https://github.com/brentp/bwa-meth/tree/master/compare
brentp is offline   Reply With Quote
Old 02-11-2014, 07:18 AM   #8
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

FYI, it turns out that the earlier problem I ran into was due to either a bug in bwa or a corrupted index (bwa was itself producing non-sensical alignments, updating my local copy and reindexing solved the problem).

Since I was curious, I downloaded and trimmed the real dataset that you posted. Not having the bait regions, I just did some comparisons by looking at whether alignments overlapped. I used bison, since I can then use local alignments and more easily tweak the options. Below, "Same mapping" occurs whenever there's an overlap in the alignment produced by the two programs (since differences in soft-clipping might otherwise skew things a bit). The top number at the bottom are the total number and the lower number that sans alignments with MAPQ==0 (there are a lot fewer of those produced by bison, for purely algorithmic reasons). MAPQ scores are those calculated by BWA, except in the bison column (since it's not using bwa).

So if one tweaks the score-min and computes MAPQ scores then something like the following would be produced by bismark. Alignments are different or unique to bwa-meth in around 1.5% of cases when ignoring alignments with MAPQ==0, which is pretty reasonable given that bowtie2 defaults to end-to-end alignments and bison (like bismark) doesn't allow discordant alignments or those where each read of a pair is mapped alone (mixed alignments in the bowtie nomenclature).


My local development version of bison allows discordant and mixed alignments, which make things a bit more similar:


Using local alignment increases the alignment rate with this dataset for bison (and presumably would for bismark as well, were it to allow that) and continues to increase the similarity a bit:


Since there are scoring differences between bowtie2 and bwa mem, those are likely causing most of the remainder of the differences. Some of the others are due to my development version of bison still not quite dealing with discordant/mixed alignments exactly correctly.

In any case, bwa-meth looks like a quite nice aligner. BTW, I would be hesitant to include the time/memory resources in the paper, since they're not really apples-to-apples comparisons (e.g., bismark is making methylation calls while it aligns, bwa-meth is doing that separately, though I would expect bwa-meth to still be faster).
dpryan is offline   Reply With Quote
Old 02-11-2014, 08:58 AM   #9
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72
Default

Devon, that's a nice analysis and a good way to visualize the differences, thanks for trying it on the real data. Would you mind sharing the bison parameters you recommend?

I updated the git repo to contain the regions (bwa-meth/compare/data/mm10.capture-regions.bed.gz).

I agree about time and memory constraints. I will add a note of warning to the paper. I think it does say in the paper that we are most interested in accuracy and less so in run-time and memory as long as both of those are reasonable.
brentp is offline   Reply With Quote
Old 02-12-2014, 12:00 AM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

It looks like "-N 1 --very-sensitive" is good enough ("--very-sensitive" results in a similar seed size being used by bowtie2 as is used by bwa mem, so this is unsurprising). Differences between this and the results you found are due to differences in trimming.



I also included just "-N 1" to show how changing the seed size has its benefits. I also included a version adding "--maxins 1000", mostly as a cautionary note since I noticed that your initial tests with bismark used that (the default is almost always fine there). The "-N 1 --very-sensitive" options are probably good general options for real data. I didn't include any singletons or discordant alignments in this since the production version of bison doesn't currently allow them (presumably that'd increase the on target percentage a bit).

BTW, targeted BS-seq data is always a bit strange, particularly since it tends to be stranded. You can actually exploit that to get even better results:



That's due to aligning against just the original bottom strand. A more correct implementation that I thought of after seeing a similar dataset to yours previously was to simply penalize alignments to the non-target strand (similar to how bwa mem penalizes paired-end reads aligned as singletons). I never implemented that, but it'd be easy to do in bison (bismark as well) and would probably produce superior results for datasets like this.
dpryan is offline   Reply With Quote
Old 02-12-2014, 03:17 AM   #11
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

BTW, there is bias in the 5' and 3' end of the mappings (likely in bwa-meth's results as well, I haven't checked). I haven't checked, but I hope your methylation extractor can be told about that.
dpryan is offline   Reply With Quote
Old 04-16-2014, 05:39 PM   #12
adeirossi
Member
 
Location: California

Join Date: Jan 2012
Posts: 11
Default

Hi Brent,

Thank you for contributing a promising aligner. I have been looking for a bisulfite aligner that provides MAPQ values, and bwa-meth seems to fit the bill. (Thanks also to Devon Ryan for bringing bwa-meth to my attention in a separate thread.)

I have a couple of questions:
  1. I will be receiving a sizable WGBS Illumina dataset with 2 x 75-bp reads, and I notice that bwa-meth issues a warning for read lengths < 80 bp. Do you think I should consider using bwa-meth for these data, or do you think 75-bp reads are too short? According to the BWA man page, BWA-MEM is recommended for reads as short as 70-bp. Since sequence complexity is lower with 3-base alignment, it seems reasonable that the minimum read length might be higher than for non-bisulfite alignment. I am curious to know how you settled on 80 bp as the warning threshold.
  2. On another dataset (with 2 x 100-bp reads), I have performed mapping with bwa-meth to produce reasonable-looking BAM files. However, I am having problems extracting methylation calls. Here is the traceback:
    $ bwa-meth-0.09/bwameth.py tabulate --reference hg19_lambda.fa --threads 2 --prefix out --bissnp BisSNP/BisSNP-0.69.jar 2_221_1_NA_005.bam
    java -Xmx15g -jar BisSNP/BisSNP-0.69.jar \
    -R hg19_lambda.fa \
    -I 2_221_1_NA_005.bam \
    -T BisulfiteGenotyper \
    --trim_5_end_bp 2 \
    --trim_3_end_bp 2 \
    -vfn1 out.meth.vcf -vfn2 out.snp.vcf \
    -mbq 20 \
    -mmq 25 \
    -nt 2
    out.meth.vcf
    Traceback (most recent call last):
    File "/mnt/ngs/analysis/software/bwa-meth/bwa-meth-0.09/bwameth.py", line 549, in <module>
    main(sys.argv[1:])
    File "/mnt/ngs/analysis/software/bwa-meth/bwa-meth-0.09/bwameth.py", line 511, in main
    sys.exit(tabulate_main(args[1:]))
    File "/mnt/ngs/analysis/software/bwa-meth/bwa-meth-0.09/bwameth.py", line 468, in tabulate_main
    header="ordered")):
    TypeError: reader() got an unexpected keyword argument 'skip_while'
    It looks like the reader() function is confused by the "skip_while=lambda toks: ..." argument. Any ideas?

Thanks in advance,
Andrew
adeirossi is offline   Reply With Quote
Old 04-17-2014, 08:03 AM   #13
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72
Default

I would have to do some tests on 75 base reads but indeed, bwa mem will work fine on them, I just wanted to be conservative in the recommendation. I assume that it would be fine. For you problem with `skip_while` just do:

pip install -U toolshed

to get a newer version of toolshed. If that still doesn't work, let me know.
brentp is offline   Reply With Quote
Old 04-17-2014, 08:06 AM   #14
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72
Default

by the way, in the bwa-meth repository, there is a script for extracting methylation call from any sorted bam file regardless of it's origin (aligner). It is here:

https://github.com/brentp/bwa-meth/b...methylation.py

It may need more testing and does some unfriendly stuff like dumping the output to files in the current directory, but does give sane results when compared to, e.g. bismark_methylation_extractor.
brentp is offline   Reply With Quote
Old 04-17-2014, 03:31 PM   #15
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

Quote:
Originally Posted by brentp View Post
... but does give sane results when compared to, e.g. bismark_methylation_extractor.
Can you please elaborate on this, Brent?
Cheers,
Pete
PeteH is offline   Reply With Quote
Old 04-17-2014, 04:01 PM   #16
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72
Default

I mean it gives similar and therefore I assume sane/correct output.
brentp is offline   Reply With Quote
Old 04-17-2014, 08:31 PM   #17
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

Thanks, Brent. Sorry, I'd misinterpreted your statement as saying that you thought bismark_methylation_extractor didn't give sane results. All good
PeteH is offline   Reply With Quote
Reply

Tags
bs-seq, bwa, methylation

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:57 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO