SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
CisGenome -- an integrated tool for ChIP-seq data analysis hji Bioinformatics 66 12-30-2014 01:55 PM
Bismark Bisulfite Aligner - Now supporting CpG, CHG and CHH context fkrueger Bioinformatics 27 10-11-2013 05:40 AM
Bismark v0.6.beta1: Now supporting gapped Bisulfite-Seq alignments fkrueger Bioinformatics 6 03-19-2012 05:06 AM
Apparent duplication levels incongruence between bismark and fastqc with BS-Seq data gcarbajosa Bioinformatics 2 12-13-2011 08:43 AM

Reply
 
Thread Tools
Old 06-14-2010, 01:40 AM   #1
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 586
Post Bismark - A Tool for Mapping and Analysis of Bisulfite-Seq Data

We have tested a number of the publically available bisulfite mapping programs, however we found that most of these software packages suffered from being slow, poorly documented or not very flexible regarding alignment parameters; in addition, the tested programs only performed mapping to a reference genome but none of them generated an output that can be readily used by wet lab scientists to look at actual methylation levels in their experiments (apologies but we did not yet look at the new tool MethylCoder which was posted here on SEQanswers only last week).

We have therefore developed a new tool, called Bismark, for the time-efficient analysis of BS-seq data. Bismark is a software package written in Perl that is based on the short read aligner Bowtie. All associated files can be downloaded at:

http://www.bioinformatics.bbsrc.ac.uk/projects/


How does it work?

Bismark supports both single-end and paired-end read mapping in either fastA or fastQ format produced by e.g. the Illumina Genome Analyser IIx, while retaining much of the flexibility of Bowtie (adjust the seed length, number of mismatches, --best...). Sequence reads are first transformed into fully bisulfite-converted forward (C > T) and reverse reads (G > A conversion of the forward strand),
before aligning them to similarly converted versions of the genome (also C>T and G>A). Sequence reads that produce a unique best alignment from the four alignment processes against the bisulfite genome (which are running in parallel), are then compared to the normal genomic sequence and the methylation state of all cytosine positions in the read is determined.


What does the output look like?

In the first instance, Bismark produces a comprehensive output file which can either be imported into a genome viewer (e.g. SeqMonk) or be further mined by downstream scripts. The output includes (amongst others) the following information:

(1) Seq ID
(2) strand
(2) chromosome
(3) start
(4) end
(5) observed bisulfite sequence
(6) equivalent genomic sequence
(7) methylation call string
(8) read conversion
(9) genome conversion

More details about the output format can be found at: http://seqanswers.com/wiki/Custom_Bismark_output_format.


In the current version, Bismark discriminates between cytosines in CpG context or in any other context.

Bismark comes with a supplementary methylation extractor script that operates on Bismark result files and extracts the methylation call for every single C analysed. It comes with a few options, such as ignoring the first x number of positions in the methylation call string, e.g. to remove a restriction enzyme site. The position of every single C will be written out to a new output file, depending on context (CpG or any other), whereby methylated Cs will be labelled as forward reads, non-methylated Cs as reverse reads. These positions can then be imported into a genome viewer (e.g. SeqMonk) and data analysis can commence. This output is regularly analysed by the researchers at our institute themselves!

Because the output files can become very large (an Illumina lane with 35 million reads can potentially contain a LOT of Cs!), one can either specify to get a large, comprehensive output or obtain strand-specific outputs (every read can potentially originate from each of the four strands generated by a bisulfite PCR. Please be aware that it does not make any sense to look at strand-specific methylation if the bisulfite experiment did not conserve strand-specificity in the first place. In these cases, the individual files can and should later on in SeqMonk be merged into a data group again).


Some stats about Bismark

So far we have successfully used Bismark for bisulfite mapping against various genomes (mouse NCBIM37, human NCBI36 and GRCh37, or Yeast SGD1.01) and confirmed its functionality by re-analysing a number of published results. Depending on the type of experiment we typically see alignment rates of 55-65%.

Bismark holds the reference genome in memory, in addition the 4 parallel instances of bowtie need some memory as well. Therefore I estimate the memory usage to be around 10GB (we can run 2 instances of Bismark on a 2 year old quad-core, 24GB RAM Linux server at the same time). Alignment speed depends largely the read length and on the specified bowtie parameters; allowing many mismatches and specifying --best is rather slow, whereas only looking for perfect matches can easily reach 5-10 million sequences per hour.

We have initially designed Bismark to support the kinds of analyses we require, thus if you have some ideas or suggestions which should be implemented please let us know.

If you have any comments about Bismark we would like to hear them. You can either enter them in our bug tracking system at: http://www.bioinformatics.bbsrc.ac.uk/bugzilla/ or send me a message directly.


Bismark at ISMB

I will also be presenting Bismark with a poster at ISMB in Boston soon (July 11-13) where I am happy to answer any questions.

Last edited by fkrueger; 05-12-2014 at 12:27 PM. Reason: Changed the title - not that new a tool anymore ...
fkrueger is offline   Reply With Quote
Old 06-17-2010, 07:41 AM   #2
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 586
Default

Thanks to some feedback which I got so far I fixed some of the initial release bugs.

I also changed the output format for both single-end and paired-end alignments slightly so that it is more concise and smaller. More details can be found at http://seqanswers.com/wiki/Custom_Bismark_output_format.


I'd like to mention that in order to get the most reliable output I recommend specifying the "--best" option for bowtie alignments. This will however take more time to complete. (For most applications we used Bismark so far we found the difference of specifying "--best" to be only marginal (less than 1% difference in mapped reads), therefore running it with the default parameters means a 2-3 x faster alignment altogether).


I am still happy for any kind of feedback!
fkrueger is offline   Reply With Quote
Old 06-17-2010, 07:24 PM   #3
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

Is there support for SAM/BAM alignment format in Bismark. If so you could just plug and play with other bisulfite aligners.
zee is offline   Reply With Quote
Old 06-18-2010, 02:24 AM   #4
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 586
Default

Bismark is not a mere bisulfite alignment application, but it is also a methylation caller at the same time. To perform the methylation calls correctly it needs to know the conversion state of both the bisulfite read and the genome, as this will determine whether we have to look for C->T or G->A substitutions and whether we need to look up- or downstream to determine if the C was in CpG context or not.

To our knowledge most methylation mappers do only align reads in a C->T converted form and do not perform all four possible alignments in parallel (CT converted reads to CT converted genome, GA converted reads to CT converted genome, CT converted reads to GA converted genome and GA converted reads to GA converted genome), making this 4-alignment output unique to Bismark. As the SAM/BAM output of other mapping programs can't supply this information I don't see an easy way to mix in the output of other aligners and perform the methylation calls in Bismark.

Conversely, even though it might be possible to modify the code to get the four instances of Bowtie to report SAM output instead (or convert it to SAM/BAM format later on), this would possibly not make any difference to the downstream methylation call as it is not based on the alignment format but on read/genome conversion state. The Bismark output format does now contain only essential information, and as far as we are aware SAM format doesn't offer to store methylation information as such.

We experienced that wet lab scientists want to know the methylation state of Cytosines at base pair-resolution rather than getting read alignments which do not necessarily tell them much about the underlying methylation levels. Thus, the final format which will ultimately be handed out to the researchers themselves just contains information about the chromosome/position of an individual C, and whether it was methylated or not.

In summary , Bismark uses a unique logic for its alignments and methylation calling procedure which won't allow a lot of plug and play with other aligner outputs.
fkrueger is offline   Reply With Quote
Old 06-18-2010, 07:53 AM   #5
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Although I agree that methylation states can't be utilized in SAM/BAM, we (in our insubstantial experience) have found that people wish to visualize both the mappings and methylation status on a system such as the UCSC genome browser. While I'm aware of your program (SeqMonk) and fully agree with your approach in providing visualization through that means, it might be worthwhile to provide the mapping results obtained by Bismark as a SAM/BAM format through a simple conversion script to generate the necessary information (including the CIGAR string showing conversion relative to genome), thus giving an option for people to visualize their results on a system of their choice (which can handle BAM, for example).

On an unrelated note:
How do you present methylation status to the wet-lab biologists? One of the ways we analyze our data is to calculate bisulfite conversion rate by identifying C --> T conversion in a non-CpG context to get an idea of the quality of bisulfite conversion. We also like to give methylation "levels" per CpG based on the genomic position, and I'm wondering if you have a good way of converting your output into these variables, or if it's completely embedded into SeqMonk. Our ultimate goal is to show methylation as a BEDgraph-like track over the full genome.

Thanks for a great program. Looking forward to seeing the results.
olivertam is offline   Reply With Quote
Old 06-18-2010, 08:23 AM   #6
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 586
Default

Hi Oliver,

thanks for your comments. I am aware that many people are using the UCSC genome browser, therefore I am going look into the option of generating a more universal output in SAM/BAM format.

After an analysis is completed, Bismark will give a methylation call summary which will provides a very general overview of the methylation state in either CpG context or non-CpG context which might look like this:

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 170715821
Total methylated C's in non-CpG context: 10243954
Total methylated C's in CpG context: 8576784
Total C to T conversions in non-CpG context: 116666488
Total C to T conversions in CpG context: 35228595

C methylated but not in CpG context: 8.1%
C methylated in CpG context: 19.6%

This can be used to estimate the bisulfite conversion rate.

But as you guessed correctly, SeqMonk provides us with all the data quantitation and calculation (we) can currently think of. This includes filtering for a certain amount of reads per C, calculating methylation levels e.g. as % methylated, trends over certain features (e.g. CpG islands) etc.

I'll probably put up a quick example file of a methylation analysis to illustrate some of these aspects next week. Our goal is that the wet-lab scientists can - after the initial Bismark analysis - toy around with their data in SeqMonk as much as they like, which takes off work from our shoulders and gives them the feeling that they are analysing their data themselves rather than just handing it over to the Bioinformatics department.

Last edited by fkrueger; 06-20-2010 at 11:08 AM.
fkrueger is offline   Reply With Quote
Old 06-20-2010, 11:21 AM   #7
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 586
Default

I have made a few screenshots of different ways to analyse and present methylation data. We basically use SeqMonk for most of our graphical as well as quantitative tasks as it offers a large repertoire of useful tools to handle this kind of data.

All graphical results can be exported into annotated report files which can be used for plotting purposes or used in further analyses. Just have a look at the attached example file. (The data shown was taken from Meissner et al, Nature, 2008, PMID: 18600261; GSE11034)
Attached Files
File Type: ppt presentation of methylation.ppt (151.0 KB, 466 views)

Last edited by fkrueger; 06-21-2010 at 12:56 AM. Reason: added reference to the data used.
fkrueger is offline   Reply With Quote
Old 06-24-2010, 06:38 AM   #8
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Hi,

I have attached a beta version of a perl script that can convert Bismark single-end mapping to SAM format.

Usage:
bismark_to_SAM.pl -c [chrom sizes file] -i [bismark mapping output] -o [SAM output]

-c [chrom sizes file] - file containing length of chromosomes/sequences used for Bowtie mapping
-i [bismark mapping output] - file containing bismark mapping output
-o [SAM output] - name for output file in SAM format (default: [input].sam)

The chrom sizes file should be in the format of:

<chr> TAB <length> TAB <2bit file>

This file is typically obtained from UCSC genome browser download if you're using a model organism genome. However, as long as you have the file with the chromosome/sequence name in column 1 and length in column 2, the program should work. Please ensure the name of the chromosome in the chrom sizes file is the same as the name that bismark outputs (i.e. The name of the sequences you mapped against).

Please let me know if there's an issue.

Cheers,
Oliver
Attached Files
File Type: pl bismark_to_SAM.pl (6.3 KB, 74 views)

Last edited by olivertam; 06-24-2010 at 09:24 AM. Reason: Updated perl script to generate a SAM file sorted by chromosome and start co-ordinate (allows indexing)
olivertam is offline   Reply With Quote
Old 08-03-2010, 07:27 AM   #9
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 586
Default

Thanks to feedback we got both via email and at ISMB in Boston we worked on some bugfixes and suggestions for Bismark, which are now available for download from http://www.bioinformatics.bbsrc.ac.uk/projects/.

The following features received some attention:


Bismark Genome Preparation

- If the specified genome directory does already contain a bisulfite genome folder, all contents of this directory will be removed before creating and indexing a new bisulfite genome

- The genome indexer will now convert DNA ambiguity code into N's before making the bisulfite genomes (anything other than C, A, T or G will appear as N afterwards)

- The indexer will now also handle fastA files with mutltiple sequence entries in addition to (a list of) fastA files in the specified genome folder


Methylation Extractor

- Fixed a bug whereby the single-end strand-specific output got two of the four possible strands mixed up. Also, the --ignore <int> option did previously offset some of the positions of the methylation calls by the <int> specified. Both features should now work as intended

- For paired-end alignments with rather short fragment length it is theoretically possible to read stretches of overlapping sequence with both read 1 and read 2. In order not to score the methylation calls for overlapping sequence twice, we added an option (--no_overlap) to score overlapping methylation calls only from the first read of a given alignment


Further comments and suggestions are most welcome!
fkrueger is offline   Reply With Quote
Old 10-06-2010, 10:13 AM   #10
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83
Default

I'm interested to know if anyone looked further into getting data from bismark into different genome browsers. I have used bismark with paired end reads and I need to visualise the data in GBrowse. Usually I do this using a BAM file. The bismark2sam perl script attached above in this thread was designed for single end reads and I also had a problem when using it if I treat my data as single end because it didn't like the non ACGTN characters in my reference genome.

As well as needing to get the mapping results into GBrowse I also need to supply the methylated sequence of regions found to be differentially methylated. For non BS data I generate a pileup file and call a consensus but that relies on having a SAM file. How are other people handling this is give the lab people the BS converted sequence for subsequent PCR confirmation etc?

I also really like the suggestion made previously to have a bedgraph (or I would prefer BigWig) file for showing % conversion/methylation across chromosomes.
natstreet is offline   Reply With Quote
Old 10-06-2010, 10:34 AM   #11
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Hi Natstreet,

I have included an updated version of the script that will handle any degenerate nucleotide in the reference. I'm still assuming the input (from Illumina output) is still A, C, T, G or N.
As for making the tool to handle paired end Bismark output, it's yet to be done. I'm afraid that I don't have much experience with paired-ends output, but if you have some Bismark paired-ends output that you don't mind using as a test dataset, I'd be happy to try and make it work.

Please let me know if there are problems with the new script

Cheers,
Oliver
Attached Files
File Type: pl bismark_to_SAM.pl (6.5 KB, 33 views)
olivertam is offline   Reply With Quote
Old 10-06-2010, 11:31 AM   #12
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

To follow up on the BEDGraph/BigWig idea, we have developed a workaround for this. Again, this is based on single-end analysis, so I haven't tested paired-ends

1) Use methylation-extractor on the Bismark output, with '--comprehensive' option ('--merge_non_CpG' is optional)
2) Run the following script (genome_methylation_bismark2bedGraph.pl). This script sorts the methylation extractor output, then parses the results to generate an "overall methylation level" as a BEDGraph file, with one sampled cytosine site per line.
3) Use the bedGraphToBigWig program (available online, I believe) to convert the BEDGraph to BigWig.

Here's the usage for the genome_methylation_bismark2bedGraph.pl

Usage: genome_methylation_count.pl (--cutoff [threshold] ) [Bismark methylation caller output] > [output]

--cutoff [threshold] - The minimum number of times a methylation state was
seen for that nucleotide before its methylation
percentage is reported.
Default is no threshold

The output file is a tab-delimited BedGraph file with the following information:

<Chromosome> <Start Position> <End Position> <Methylation Percentage>

Bismark methylation caller (v0.2.0 or later) should produce three output files
(CpG, CHG and CHH) when using the "--comprehensive" option
(Two files if using the "--merge_non_CpG" option).
To count both CpG and Non-CpG, combine the output files.

Bismark methylation caller (v0.1.5 or earlier) should produce two output files
(CpG and Non-CpG) when using the "--comprehensive" option.
To count both CpG and Non-CpG, combine the two output files.


Let me know if you have any questions, issues or bugs (since this is a workaround)

Cheers,
Oliver
Attached Files
File Type: pl genome_methylation_bismark2bedGraph.pl (4.4 KB, 50 views)
olivertam is offline   Reply With Quote
Old 10-07-2010, 02:21 AM   #13
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83
Default

Thanks for the replies, it's very much appreciated (and shows the power of this forum!). I'll test both scripts today. I can re-map the data as single ends anyway because I make no use of the fact that it's paired - this was just the easiest way to increase the sequencing output.

[QUOTE=olivertam;26595I'm afraid that I don't have much experience with paired-ends output, but if you have some Bismark paired-ends output that you don't mind using as a test dataset, I'd be happy to try and make it work.[/QUOTE]

I've put an example bismark paired end output file here. If you have a chance to take a look it would be great.
natstreet is offline   Reply With Quote
Old 10-07-2010, 06:57 AM   #14
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

What organism is this?
If possible, could you provide all the chromosome names that you used for mapping, plus their length?

Thanks heaps.

Cheers,
Oliver
olivertam is offline   Reply With Quote
Old 10-07-2010, 09:18 AM   #15
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83
Default

Sorry - I should have realised some basic info would help!

The data is form Arabidopsis thaliana. I've just added a file called length.tab to the same ftp location that has then chrom lengths.

I've tried the bismark2sam and genome_methylation2bed scripts and both seem to be working perfectly. I used bedGraph2BigWig and the track looks great in GBrowse.

Do you also have any advice about the best way to extract the consensus BS-treated sequence from the bismark results files? I was thinking to use samtools after making a pileup file but I haven't actually tried it yet.

Again, all your help is much appreciated.
natstreet is offline   Reply With Quote
Old 10-07-2010, 09:26 AM   #16
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Thanks for the info.
I'm afraid I have never really tried to get a consensus BS-converted sequence, so I would be curious to hear your successes with samtools. I think that might be the best option right now, but I'll let you know if I hear something better.

Cheers,
Oliver
olivertam is offline   Reply With Quote
Old 10-07-2010, 11:04 AM   #17
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Hi natstreet,

With regards to pair-end Bismark, the methodology I suggested to generate bigWig is the same as single-end (i.e. methylation-extract, perl script, BEDGraph to bigWig).

I have attached a VERY preliminary version of the bismark_to_SAM_v2.pl. Theoretically, this should convert pair-end bismark to SAM. However, I have yet to test the resulting SAM/BAM file on Gbrowse, so I don't know if it works properly. Feel free to try it, and let me know if it crashes badly.

Sorry if I couldn't be more help.

Cheers,
Oliver
Attached Files
File Type: pl bismark_to_SAM_v2.pl (12.4 KB, 66 views)
olivertam is offline   Reply With Quote
Old 10-07-2010, 12:06 PM   #18
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83
Default

I just tried to generate a BAM file from the SAM output I have from bismark2sam for single end mapped reads and there is a problem. I get the error

Code:
Parse error at line 9: sequence and quality are inconsistent
Aborted
My samtools command is

Code:
samtools import arab.fa.fai Bismark_mapping_results_col-0_lib1PE120_1.fq.sam col-0_lib1PE120_1.bam
Do you know the cause of the error message?
natstreet is offline   Reply With Quote
Old 10-07-2010, 01:02 PM   #19
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Hi natstreet,

I apologize for the error. Can I have a look at the bismark mapping results you used as input and the SAM file that was generated? You don't need to send the entire file, just the first 2000 lines would be sufficient.

Cheers,
Oliver
olivertam is offline   Reply With Quote
Old 10-07-2010, 09:06 PM   #20
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83
Default

I've put some more files on the ftp site. The updated script you posted for paired end reads created sam files that samtools seems happy with - I just need to check the bam file in GBrowse to be certain everything worked ok (which I will do very soon).

I also found a problem with the genome_methylation_bismark2bedGraph.pl script that is specific to processing the CHH files. All CpG and CHG output files I've tested so far have been fine. All CHH files produce an error for every line of this type

Code:
Methylation state of sequence (FC61KG1AAXX:5:1:1101:10101#0/1) in file (exampleCHH) on line 2000 is inconsistent (meth_state is -, meth_state2 = h)
It makes the output file and this seems fine so I'm not sure if the error should concern me or not. I've also put the first 2000 lines from an example CHH file on the ftp site.
natstreet is offline   Reply With Quote
Reply

Tags
alignment, bisulfite, bisulphite, methylation, sequencing

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:47 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO