SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
CisGenome -- an integrated tool for ChIP-seq data analysis hji Bioinformatics 66 12-30-2014 02:55 PM
Bismark Bisulfite Aligner - Now supporting CpG, CHG and CHH context fkrueger Bioinformatics 27 10-11-2013 06:40 AM
Bismark v0.6.beta1: Now supporting gapped Bisulfite-Seq alignments fkrueger Bioinformatics 6 03-19-2012 06:06 AM
Apparent duplication levels incongruence between bismark and fastqc with BS-Seq data gcarbajosa Bioinformatics 2 12-13-2011 09:43 AM

Reply
 
Thread Tools
Old 07-23-2014, 04:49 AM   #381
sunggong
Junior Member
 
Location: Cambridge, UK

Join Date: May 2014
Posts: 6
Default bismark_methylation_extractor

Hi,
I am wondering whether bismark_methylation_extractor is written to use multi threads? Any plan, if not?
Thanks,
Sung
sunggong is offline   Reply With Quote
Old 07-23-2014, 06:33 AM   #382
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Quote:
Originally Posted by sunggong View Post
Hi,
I am wondering whether bismark_methylation_extractor is written to use multi threads? Any plan, if not?
Thanks,
Sung
As it stands the extractor is only using a single core. There have been plans to extend this to multi-core usage for quite some time however I never quite got round to doing it. We were thinking however to lock ourselves away for a day or two in the not too distant future where we would all have a go at doing something that never gets done due to the day-to-day business, and multi-threading of Bismark is very high up on that list!

For the time being you might want to split up the BAM file into several smaller chunks and launch them individually to gain some speed. Cheers, Felix
fkrueger is offline   Reply With Quote
Old 07-25-2014, 03:10 AM   #383
sunggong
Junior Member
 
Location: Cambridge, UK

Join Date: May 2014
Posts: 6
Default

Quote:
Originally Posted by fkrueger View Post
As it stands the extractor is only using a single core. There have been plans to extend this to multi-core usage for quite some time however I never quite got round to doing it. We were thinking however to lock ourselves away for a day or two in the not too distant future where we would all have a go at doing something that never gets done due to the day-to-day business, and multi-threading of Bismark is very high up on that list!

For the time being you might want to split up the BAM file into several smaller chunks and launch them individually to gain some speed. Cheers, Felix
I got this error while trying to run bismark_methylation_extractor by chromosome-wise (BAM files splitted by chromosome):
"This might be a result of sorting the paired-end SAM/BAM files by chromosomal position which is not compatible with correct methylation extraction. Please use an unsorted file instead".

It seems I need to provide unsorted BAM files, but I thought it would be great if bismark could deal with a sorted BAM file as this makes splitting the original BAM file by chromosome easier using samtools.
sunggong is offline   Reply With Quote
Old 07-25-2014, 03:30 AM   #384
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Sorting the BAM file into smaller files by chromosome shouldn't matter, you only need to take care that you don't sort by position as well so that you do not disturb the Read 1/ Read 2 order of the file. Since read pairs should always align to the same chromosome you could probably resort the reads by name (samtools sort -n) in case they were sorted by position.
It is very unlikely that we will change the methylation extractor so that it will be able to cope with sorted or otherwise random R1/R2 order since this may be a huge (t)ask memory-wise because BAM files can easily reach several tens of GB in size.
fkrueger is offline   Reply With Quote
Old 07-28-2014, 05:11 PM   #385
adeirossi
Member
 
Location: California

Join Date: Jan 2012
Posts: 11
Default

Hi Felix,

Sorry to perpetuate my reputation for breaking the methylation extractor, but I recently noticed an issue with the patch I submitted that was incorporated in the v0.12.2 release.

Background on the old patch: Running v0.12.1 of bismark_methylation_extractor, I experienced a failure when I used high --ignore or --ignore_r2 values that equaled or exceeded the alignment length. The old patch I came up with simply called "next" when it encountered this condition, and this indeed prevented the fatal error.

However, I recently noticed an unfortunate behavior: with --ignore set to 0 and --ignore_r2 set to a high value, when a read 2 alignment is encountered that is shorter than --ignore_r2, the read 1 alignment is discarded as well. Scrutinizing the code, the behavior is caused by the "next" call from the old patch. Since the while loop iterates over each read pair (not each read), calling "next" discards the whole read pair though only one of the two reads may be problematic.

I came up with a new patch that corrects this behavior of the old one (see attached): now when one read is too short for its "ignore" value, its mate is not automatically discarded. If you could take a look when you get the chance, that would be much appreciated. My apologies for not catching this the first time.

Andrew

Last edited by adeirossi; 08-01-2014 at 04:37 PM. Reason: Fix attached script to not spew Perl warnings
adeirossi is offline   Reply With Quote
Old 07-29-2014, 12:11 PM   #386
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Hi Andrew,

I am currently away on holiday but I will definitely take a look once I am back. Thanks for your continued support of the methylationXtractor ;D
fkrueger is offline   Reply With Quote
Old 08-06-2014, 07:00 PM   #387
wilson90
Member
 
Location: Singapore

Join Date: May 2012
Posts: 48
Default

Hi. I am just wondering if it is advisable to merge bam files produced by Bismark using samtools merge, and then run methyl_extrator?

The reason why I am asking this question is because I have hundreds of gzipped single-end fastq files and I am very reluctant to merge all the fastq files. The most convenient way is to run bismark on every fastq files, and merge the bamfiles before the running methyl_extrator.

Is it "legal"?
wilson90 is offline   Reply With Quote
Old 08-07-2014, 01:43 AM   #388
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Hi Wilson,

If you've got many small FastQ files it should be absolutely fine to merge the BAM files and then run the methylation extractor. This is at least true for single-end files, however we have observed that samtools merge might interfere with the Read1/Read2 order of paired-end files, and they might need to be sorted by read name afterwards (e.g. with samtools sort -n). Again, since you seem to have single-end files it should be fine.

The only (minor) inconvenience is that you won't be able to generate the swish html report for the entire sample with bismark2report because the sample is split into many small reports.
fkrueger is offline   Reply With Quote
Old 08-14-2014, 05:48 AM   #389
sunggong
Junior Member
 
Location: Cambridge, UK

Join Date: May 2014
Posts: 6
Default

Hi Felix,

I thought it would be really nice if 'methyl_extrator' have an option to search a specified interval only (e.g. -L chr2, -L chr2:1000-2000, or -L <target_region.bed>).

Thanks,
Sung
sunggong is offline   Reply With Quote
Old 08-14-2014, 06:32 AM   #390
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Hi Sung,

This might be a useful feature but I think it would be more sensible to apply this on e.g. the coverage report which has all values sorted by position with percentage methylation and count methylated/unmethyalted already, rather than resticting the methylation extractor itself. It should be relatively straight forward to filter the .cov file with a small script and should only run for a few seconds, but I'll be off on holiday for a while now...
fkrueger is offline   Reply With Quote
Old 08-22-2014, 04:21 PM   #391
christophe
Junior Member
 
Location: Phoenix

Join Date: May 2010
Posts: 7
Arrow M-bias: CHH and CHG total calls drop off in Read2

Hi Everyone,

I have questions regarding M-bias plot and CHH,CHG methylation calls generated with bismark. I hope it is the right thread to ask otherwise let me know where should have I posted it.
Has anyone seen a drop off of CHH and CHG total calls within a M-bias plot generated by bismark tool?
I have observed this drop off only in the M-bias read2 plot.
(See Attached png plots)
The data are from Whole Genome BiSulphite with reads of 101 bases
Library Prep was performed using Ultralow Ovation MS kit.

How to interpret the falling number of calls after 50 bases? and ....
How this can impact the alignment of read2 in bismark? and the Methylation calls?


Thanks for your help,
Christophe
christophe is offline   Reply With Quote
Old 08-22-2014, 10:42 PM   #392
adeirossi
Member
 
Location: California

Join Date: Jan 2012
Posts: 11
Default

Hi Christophe,

I have also seen a falling number of methylation calls for read 2 in my data. I have always assumed this was because I ran bismark_methylation_extractor with the --no_overlap option and many of the read pairs had overlaps (i.e. insert length less than twice the read length). With --no_overlap, when the extractor encounters an overlap between read 1 and read 2, I believe it uses the read 1 calls only and discards the read 2 calls, which would explain why the number of read 2 calls decreases as you go toward its 3' end. Because this "no overlap" filtering occurs after mapping, I assume there is no impact on the alignment of read 2.

If, on the other hand, you are not using the --no_overlap option or you don't have many reads pairs with inserts <200 bp, then I have no idea what's going on.

Andrew
adeirossi is offline   Reply With Quote
Old 08-23-2014, 03:09 AM   #393
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Quote:
Originally Posted by adeirossi View Post
Hi Christophe,

I have also seen a falling number of methylation calls for read 2 in my data. I have always assumed this was because I ran bismark_methylation_extractor with the --no_overlap option and many of the read pairs had overlaps (i.e. insert length less than twice the read length). With --no_overlap, when the extractor encounters an overlap between read 1 and read 2, I believe it uses the read 1 calls only and discards the read 2 calls, which would explain why the number of read 2 calls decreases as you go toward its 3' end. Because this "no overlap" filtering occurs after mapping, I assume there is no impact on the alignment of read 2.

If, on the other hand, you are not using the --no_overlap option or you don't have many reads pairs with inserts <200 bp, then I have no idea what's going on.

Andrew
I can only second Andrew's conclusions. Just for the record, you should also see a similar drop in CpG calls, but since these are less frequent than CHH or CHG calls it might be less obvious. If you mouse-over the HTML report you will see the actual number of calls involved. Thanks Andrew.
fkrueger is offline   Reply With Quote
Old 09-01-2014, 04:42 AM   #394
kevinrue
Member
 
Location: London

Join Date: Jan 2013
Posts: 25
Default

Dear all,

Apologies if this was answered previously, but I couldn't find the information by Googling and browsing the thread. I am a postgraduate bioinformatics student dealing with the first methylation dataset in our group.

Issue
My issue is that I see strikingly low alignment rate by bismark (28-45% for bisulfite libraries, 49-55% for control non-bisulfite libraries).

Details
I was given a dataset of 2 x 150bp reads of whole-genome bisulfite sequencing libraries (WGBS) using the Epignome Methyl-Seq kit, sequencing using Illumina platform. I have clipped adapter sequences and trimmed low-quality bases with Trimmomatic, as suggested in the Epicentre guide (http://www.epibio.com/applications/b...eq-kit?details). This left over 90% of the reads in each sample with some workable sequence.

From reading I have done so far, I feel that the read length is much longer than necessary, and might even cause more problems than it brings information. It seem 50-75 read length is better suited to this type of analysis.
I would like to know if anyone could advise me appropriate non-default alignment parameters suited to my data. Do I have to control the number of mismatches allowed in the seed alignments? The length of the seeds? The overall alignment score? etc etc.
I prefer asking experienced users rather than test and overfit parameters of my own.

Other

In the meantime, I truncated the forward read file of the sample with worst alignment rate (28.4%) to 75 bp, and try to align that as single-end to see if I get a better alignment rate.
It turned out to align 63.9% !

Rather than clip all my reads, I'd be very grateful to learn how to salvage information from the full length reads

Thanks in advance!
kevinrue is offline   Reply With Quote
Old 09-01-2014, 04:58 AM   #395
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

kevinrun - what does fastqc look like on your forward and reverse reads?

We've recently switched over to the same kit, and now only do 71bp PE reads (we can get that out of 2 50 cycle SBS kits). I similarly got not great alignment using bismark (even after trimming the 6bp from each end) - however I get good alignment using bwameth, and methylation % obtained on the same dataset run through either pipeline were very similar (r=0.97 IIRC) - so perhaps bwa-meth is worth a look? I get about 80% of reads aligning with a mapping quality >=40.
frozenlyse is offline   Reply With Quote
Old 09-01-2014, 05:17 AM   #396
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

A recent post in this thread was asking something quite similar, maybe the reply is helpful in some way:

Quote:
Originally Posted by fkrueger View Post
Hi eicht,

First of all, I can feel your pain, we have also spent quite a while trying to find the reason or solutions to poor mapping efficiencies for PE PBAT libraries. Generally, the original PBAT protocol only produces alignments to the complementary strands (CTOT and CTOB), however the EpiGnome kit produces standard, i.e. directional, libraries that align to the OT and OB strands only. This means that R1, and also paired-end alignments, should only align to the OT and OB strands; however R2 needs to be aligned to the CTOT and CTOB strands because it is the reverse complement of R1. This is exactly what the option --pbat does, and in that sense the mapping efficiencies you are describing make perfect sense.

We haven’t worked with data generated by EpiGnome but used a homebrew PBAT PE protocol, but I suppose the mechanisms will be the same since they seem to be caused by the random priming step in the protocol. Since we saw the drastically reduced mapping efficiencies in PE mode we reasoned that the protocol might generate chimeric reads, i.e. reads starting at some position in the genome and ending at another. To test this we mapped the data in SE mode and paired up pairs where both R1 and R2 produced alignments. We then imported this data into SeqMonk as Hi-C data, and looked at the amount of reads that had its partner in trans (on another chromosome) as a fairly crude measure. As you can see on the screenshot attached, when we quantitated where in the genome the partner reads aligned to of reads that aligned to chromosome 1, we found that they seem to be scattered pretty much all over the genome in no discernible pattern! In this case a bit more than 40% of reads had their partner read in trans on another chromosome (partner reads aligning to chr 1 but far away were not even considered here).

Now if you would perform some similar pairing up on your reads you might get a similar answer, and if you add up the numbers of properly aligning PE reads and the number of chimeric reads (which won’t be reported by Bismark since they are no valid PE alignments), you should be very close to the mapping efficiencies you achieve in SE mapping.

We came to a similar conclusion, and we by now also settled on the procedure you described (someone coined the term ‘Dirty Harry mode’ because it doesn’t seem to be the cleanest of methods…):

(1) directional PE alignment to start and grabbing the unmapped R1 and R2 reads out the end using the option ‘--unmapped’. Properly aligned PE reads should be methylation extracted using --no_overlap.
(2) unmappedR1 is then mapped SE directional
(3) unmappedR2 is mapped SE directional under -pbat conditions.
Single-end aligned R1 and R2 can then be methylation extracted normally as they should in theory map to different places in the genome anyway so don’t require attention to overlapping reads.

As another suggestion:
Like all PBAT applications, the first couple of bp (6bp in the case of the EpiGnome kit) produce a noticeable bias in sequence composition as well as later on in the M-bias plots (see an example here http://www.bioinformatics.babraham.a...SE_report.html). EpiCentre recommend ignoring the first 6bp of the methylation call string themselves, but it is probably a good idea to remove these residues even before mapping because we have seen that they may reduce the mapping efficiency somewhat. This could be done with --clip_r1 6 and --clip_r2 6 within Trim Galore.

Quote:
I start to think that paired-end for bisulfite is more headache than it is worth sometimes.
At least for paired-end PBAT protocols I would probably second this notion, however even sequencing longer SE reads (e.g. 150bp SE) will likely start reading into the chimeric portion of the read so also have some mappability problems...
Just as another comment, we ran a couple of tests on 2x76bp EpiGnome paired end data as well and achieved mapping rates of >75% in all cases (using --bowtie2 default mode). So maybe the problem of chimaeric reads starts a little later which might explain the drop the efficiency for long reads? A tool supporting local alignment such as bwa-meth might indeed help in this case...
fkrueger is offline   Reply With Quote
Old 09-01-2014, 07:17 AM   #397
kevinrue
Member
 
Location: London

Join Date: Jan 2013
Posts: 25
Default

Hi,

Thank you both for the very quick answers. Much appreciated. It took me a bit of time to get this answer together. I hope it's clear.

I didn't mention yet that I am working with the Bos taurus UMD3.1 genome. 3 billion base pairs on 29 autosomes + 2 sex chromosomes: comparable to human.

frozenlyse
I attached the fastQC archives (recompressed in tar format to satisfy the upload requirements of SeqAnswer), as I am not sure which part of it you were interested in. Due to the C->T conversion caused by the bisulfite treatment, I am not particularly concerned by:
  • the enriched proportion of T through the forward read
  • The enriched proportion of A in the reverse read
  • The low GC content in both reads
  • the enriched proportion of k-mers containing C and A in the forward read
  • the enriched proportion of k-mers containing G and T in the reverse read

bwameth sounds like a good alternative. I will look for a standard pipeline of analysis from alignment to differential methylation calls.. except if you have a link to offer

Felix

Sorry I missed that post, thanks for forwarding. I started reading it to understand the entire procedure, although it sounds convoluted. I suppose that after all those alignments, I would have to aggregate the methylation calls for each cytosine (from the paired alignment + single end forward + pbat single end reverse), and then feed it into an analytic software package.
What tools do you recommend for statistical analysis of differential methylation by the way?

Info
I just tried to align 1 million non-truncated forward reads in single end mode and got 60.7% again.
Looks like the problem of chimeric reads indeed.

Here is a copy paste of the full bismark report if it is any use:
Bismark report for: /workspace/scratch/krue/Methylation/2014-08-27_analysis/Trimmomatic/Paired/C1_ATCACG_1P.fastq (version: v0.12.1)
Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)
Bismark was run with Bowtie 2 against the bisulfite genome of /workspace/storage/genomes/bostaurus/UMD3.1.75/source_file/ with the specified options: -q --score-min L,0,-0.2 --ignore-quals

Final Alignment report
======================
Sequences analysed in total: 1000000
Number of alignments with a unique best hit from the different alignments: 606793
Mapping efficiency: 60.7%
Sequences with no alignments under any condition: 179520
Sequences did not map uniquely: 213687
Sequences which were discarded because genomic sequence could not be extracted: 5

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT: 303497 ((converted) top strand)
CT/GA: 303291 ((converted) bottom strand)
GA/CT: 0 (complementary to (converted) top strand)
GA/GA: 0 (complementary to (converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 14776304

Total methylated C's in CpG context: 644197
Total methylated C's in CHG context: 32496
Total methylated C's in CHH context: 56579
Total methylated C's in Unknown context: 756

Total unmethylated C's in CpG context: 249170
Total unmethylated C's in CHG context: 3536727
Total unmethylated C's in CHH context: 10257135
Total unmethylated C's in Unknown context: 8798

C methylated in CpG context: 72.1%
C methylated in CHG context: 0.9%
C methylated in CHH context: 0.5%
C methylated in Unknown context (CN or CHN): 7.9%


C1_ATCACG_1P.fastq_bismark_bt2_SE_report.txt (END)
Attached Files
File Type: tar C1_ATCACG_1P_fastqc.tar (478.0 KB, 5 views)
File Type: tar C1_ATCACG_2P_fastqc.tar (421.5 KB, 1 views)
kevinrue is offline   Reply With Quote
Old 09-01-2014, 08:37 AM   #398
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Hi Kevin,

The FastQC reports look completely normal, and the results you posted look quite decent as well. So for the record, can we just state that it's not really Bismark's fault but the PBAT protocol itself is generating wild artefacts .

Regarding DMR analysis: I don't think there anything is accepted as gold standard just yet, new tools are being published almost every week. Promising and widely used tools are: bsseq, BiSeq, methylKit, MOABS, RADMeth, DMAP, RnBeads and a few more, most of them interface rather nicely with the output of Bismark. Please find attached a review of the underlying concepts of some of these tools by Mark Robinson: http://biorxiv.org/content/biorxiv/e...07120.full.pdf, I hope it may help.
fkrueger is offline   Reply With Quote
Old 09-01-2014, 09:38 AM   #399
kevinrue
Member
 
Location: London

Join Date: Jan 2013
Posts: 25
Default

Hi Felix,

Thanks for the feedback on the fastQC and alignment report. Plus the list of DMR tools. I only tested BiSeq so far during my initial attempt to an analytic pipeline, and didn't find any significant DMR, but Katja told me the pacakge was meant for RRBS analyses. Anyway, as you can see the proper alignment of my reads might have been the limiting factor.

Indeed, as far as I can see, bismark does indeed a good job aligning reads, with good quality reads and adequate parameters
Given my recent arrival in the bisulfite-sequencing field (I have only dealt with transcriptomics data for the past two years, microarray and RNA-seq), I can only strive to understand all the complexity of the bisulfite data, the sources during their obtention and their subsequent analysis which you are much more familiar with than I am.

New result

I have a little update which surprised me. I have attached a screenshot of the alignment rate for SingleEnd/PE reads mode, with or without truncating to 75bp. Two of them were only performed on the first million read/pairs using "-u 1000000" to post you this answer today

What surprised me, as I believed chimeric read/pairs were being discarded during the alignment, is that truncating both mates to 75 bp (or fewer if quality trimming left less than that) restored an alignment rate of 56.8% (of 1 million pair), compared to 28.4% (of 8.4 million pairs, in) for full-length read pairs.
If inserts were indeed chimeric, truncating the read pairs should identified those reads as chimeric and left them out again, shouldn't it? Or am I missing something?

Now, I think I will repeat tonight the alignment of truncated read pairs with the entire 8.4 million read pairs to satisfy my OCDs and give you a table with fully comparable numbers, but I am fairly certain that these figures are representative enough.

Many thanks for your help and time
Kevin
Attached Images
File Type: png t1.png (4.7 KB, 10 views)

Last edited by kevinrue; 09-01-2014 at 01:48 PM. Reason: missing image upload?!
kevinrue is offline   Reply With Quote
Old 09-01-2014, 01:17 PM   #400
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

I am keen to hear more about what you discover!
fkrueger is offline   Reply With Quote
Reply

Tags
alignment, bisulfite, bisulphite, methylation, sequencing

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:50 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO