Seqanswers Leaderboard Ad

**fkrueger** · 05-07-2015, 02:50 AM

Originally posted by bruce01 View Post

Hi Felix,

firstly, thanks for Bismark, it is a great tool. And secondly, I have read the entire thread here and searched elsewhere, so apologies if this is discussed elsewhere.

My issue is understanding the XM:Z tag in Bismark output which I believe indicates methylation status (from a line in the change log). I would like to be able to interpret this tag (i.e. what do x, h indicate?), to the end that possibly the sequence could be 'reconverted' to pre-bisulfite conversion (PI is trying to 'add value' to our methylseq...). Your thoughts on how useful this would be would also be of interest, specifically for copy number analysis (using off-target sequence with CopywriteR package, for example).

Thanks,

Bruce.

Hi Bruce,

First allow me to paste a passage from the User Guide:

The methylation call string contains a dot ‘.’ for every position in the BS-read not involving a cytosine, or contains one of the following letters for the three different cytosine methylation contexts (UPPER CASE = METHYLATED, lower case = unmethylated):

Code:

z	unmethylated C in CpG context
Z	methylated C in CpG context 
x	unmethylated C in CHG context
X	methylated C in CHG context
h	unmethylated C in CHH context
H	methylated C in CHH context
u	unmethylated C in Unknown context (CN or CHN)
U	methylated C in Unknown context (CN or CHN)

I am not quite sure I understand your other question fully though, what exactly do you mean with a pre-bisulfite state? In theory it is of course possible to do various things with the read, for which you need to take the reference sequence, the CIGAR string as well as the mismatch string into consideration.

Can't you just take the BAM file and use this to do copy number analyses? Using the coordinates of the read alignmnents you can relatively quickly extract the fully unconverted sequence from the reference genome (is this the pre-bisulfite state?), or go back and extract the bisulfite sequence as it comes from the sequencer out from the FastQ file (e.g. based on the read ID). Does this answer your questions?

**bruce01** · 05-07-2015, 03:12 AM

Hi Felix,

thanks for the reply, sorry I missed the section in the manual.

I have tried using methylation data for CNV, and comparing this to shallow WGS and also exome for same samples. Exome, WGS are consistent, but methylseq is not (it is whole genome, from a MiSeq QC run to determine conversion efficiency). We also have captured methylseq (CpGiant) which is also inconsistent with WGS, exome.

The 'pre-bisulfite state' I refer to is before the DNA is bisulfite-treated, so yes, essentially the reference genome, albeit with mutations etc. My boss is thinking this can be used in SNP, indel calling, as well as CNV. Essentially instead of sequencing exome. I am unconvinced due to bisulfite treatment damaging DNA, but said I would try.

Anyway, thanks for your help and sorry if this is obscure, and my nomenclature incorrect/confusing,

Bruce.

**fkrueger** · 05-07-2015, 05:20 AM

It appears that the coverage in BS-Seq can be affected by several factors in addition it simply being a harsh chemical treatment, e.g. the conditions used for the bisulfite-treatment, polymerase bias during amplification and the sequence composition of the single-stranded DNA. So yes maybe this is not the ideal way of looking at CNVs in these samples.

There are to my knowledge currently two tools out there that attempt to look at SNVs in bisulfite data and correct for this. One is called Bis-SNP (http://epigenome.usc.edu/publicationdata/bissnp2011/), the other one MethylExtract (http://bioinfo2.ugr.es/MethylExtract/). Maybe it would be worth looking into them?

**bruce01** · 05-07-2015, 05:38 AM

Thanks Felix, I have tried BisSNP but it runs very slowly (so slow as to be unusable, trying to figure out why currently) and have read the MethylExtract paper but not yet gotten around to running it.

**ajaffe** · 05-11-2015, 04:44 PM

--multicore and --temp_dir still crashing

Hi,

We are still having similar issues when specifying "--multicore" and "--temp_dir" in the same run. Note this is not a problem when specifying "--multicore" but not "--temp_dir" or "--temp_dir" but not "--multicore". Eg:

Code:

> bismark --temp_dir Pilot/tmp --non_directional --bowtie2 --multicore 4 -f -o Pilot/tmp lambda/ -1 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta -2 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_2.fasta

...
The provided filenames for paired-end alignments are /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta and /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_2.fasta
The provided filenames for paired-end alignments are /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta and /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_2.fasta
Failed to write output /dcs01/ajaffe/Brain/DNAm/WGBS/Pilot/tmp//dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta.temp.1: No such file or directory
Failed to write output /dcs01/ajaffe/Brain/DNAm/WGBS/Pilot/tmp//dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta.temp.4: No such file or directory
Failed to write output /dcs01/ajaffe/Brain/DNAm/WGBS/Pilot/tmp//dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta.temp.2: No such file or directory
Failed to write output /dcs01/ajaffe/Brain/DNAm/WGBS/Pilot/tmp//dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta.temp.3: No such file or directory

> bismark --non_directional --bowtie2 --multicore 4 -f -o Pilot/tmp lambda/ -1 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta -2 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_2.fasta

...

====================
Bismark run complete
====================

> bismark --non_directional --bowtie2 --temp_dir Pilot/tmp/ -f -o Pilot/tmp lambda/ -1 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta -2 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_2.fasta

...

====================
Bismark run complete
====================


>

I was also wondering how --multicore differed from the -p flag with bowtie2, e.g. if I specify -p 4 keeping bismark itself on 1 core, will this have the same linear speed up as specifying --multicore 4 and -p 1?

Thanks!!

**fkrueger** · 05-12-2015, 02:41 AM

Again the problem here was that the filenames were specified with /full/path/information/ that broke things. I had fixed this for FastQ files but you were using FastA files... Please find attached a version that should be working now.

Regarding your other question about parallelisation: I haven't done extensive testing on this but the -p parameter only allows Bowtie 2 to produce the alignments somewhat faster, whereas --multicore will should almost fully parallel. The -p speed increase seems to be plateauing fairly quickly, possibly also because of the unique alignment determination and methylation calling processes need some time, too. If you've got enough system resources I would certainly recommend using --multi over -p (or potentially a combination thereof) in terms of speed increase.

Attached Files

bismark_v0.14.4_devel.zip (72.4 KB, 22 views)

**nivanov** · 05-15-2015, 10:46 AM

Hi! I work in the Jaffe Lab, and I'm using the new version of bismark that was attached in the message above. I'm having problems with it. I sampled the first million reads, and I'm aligning them to the human genome. I'm using --multicore 4, and submitted the job to 20 cores, 2Gb each (max allowable memory per core -- 3Gb). The program has been running for over 19 hours, its using 61Gb of scratch space, and has been on the following stage for the past 16 hours:

Now starting a Bowtie paired-end alignment for GAread1CTread2GAgenome (reading in sequences from /scratch/temp/4549202.1.shared.q/WGC025593L_combined_R1.fastq.gz.temp.4.gz.GA_plus_CT.fastq.gz, with the options: -q -n 1 -k 2 --best --maxins 500 --chunkmbs 512 --norc)

This seems slow. I was wondering if you could help. Thanks in advance!

**fkrueger** · 05-15-2015, 11:39 AM

Originally posted by nivanov View Post

Hi! I work in the Jaffe Lab, and I'm using the new version of bismark that was attached in the message above. I'm having problems with it. I sampled the first million reads, and I'm aligning them to the human genome. I'm using --multicore 4, and submitted the job to 20 cores, 2Gb each (max allowable memory per core -- 3Gb). The program has been running for over 19 hours, its using 61Gb of scratch space, and has been on the following stage for the past 16 hours:

Now starting a Bowtie paired-end alignment for GAread1CTread2GAgenome (reading in sequences from /scratch/temp/4549202.1.shared.q/WGC025593L_combined_R1.fastq.gz.temp.4.gz.GA_plus_CT.fastq.gz, with the options: -q -n 1 -k 2 --best --maxins 500 --chunkmbs 512 --norc)

This seems slow. I was wondering if you could help. Thanks in advance!

Yes this is unacceptably slow, this process shouldn't take more than 20 seconds or so... I am not quite sure if the memory allocation is the problem, but I wouldn't be surprised. Just as a guideline: Each instance of Bismark will need ~3GB to hold the genome in memory, and around another 3-4GB per alignment thread, so 2 times this for default directional, and 4 times this for non-directional mapping. The other processes (i.e. Samtools or gzip streams etc) shouldn't need much extra memory. Since you specified --multi 4 you will need this memory times four again. so roughly 40-50GB for directional or ~60GB for non-directional alignments. Is there are a chance that your cluster just won't execute the commands properly?
As a solution you could either try to ask for enough resources or go down with --multi, but each instance of Bismark will need ~10-16GB either way. Cheers, Felix

**jcrow06** · 06-16-2015, 03:17 AM

Hi there,

So I have recieved some Bismark output files:

Bismark (SAM/BAM)

Mbias files & Plots

Methylation Extractor reports

The documentation is very good and I have a good understanding of all the files however in my data from the Methylation extractor there are several sites that are not marked with the chromosome header.

The report files according to methylationExractor should have the following headers:

(1) seq-ID (2) methylation state (3) chromosome (4) start position (= end position) (5) methylation call

In my files I have as follows:

HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC - chr9 23913539 z
HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC + chr9 23913531 Z
HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC + chr9 23913529 Z
HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC - chr9 23913485 z
HWI-ST571:431:C43FMACXX:8:2316:9928:85484_1:N:0:GTCCGC + chr2 28594224 Z
HWI-ST571:431:C43FMACXX:8:2316:9928:85484_1:N:0:GTCCGC - chr2 28594341 z
HWI-ST571:431:C43FMACXX:8:2316:9928:85757_1:N:0:GTCCGC - 30126 6594 z
HWI-ST571:431:C43FMACXX:8:2316:9928:85757_1:N:0:GTCCGC - chr2 223724995 z
HWI-ST571:431:C43FMACXX:8:2316:9929:12206_1:N:0:GTCCGC - chr12 64474195 z
HWI-ST571:431:C43FMACXX:8:2316:9929:59562_1:N:0:GTCCGC + chr20 55837341 Z
HWI-ST571:431:C43FMACXX:8:2316:9929:59562_1:N:0:GTCCGC - chr20 55837591 z
HWI-ST571:431:C43FMACXX:8:2316:9929:59562_1:N:0:GTCCGC + chr20 55837557 Z
HWI-ST571:431:C43FMACXX:8:2316:9929:59562_1:N:0:GTCCGC + chr20 55837519 Z
HWI-ST571:431:C43FMACXX:8:2316:9929:89987_1:N:0:GTCCGC + chr5 92603043 Z
HWI-ST571:431:C43FMACXX:8:2316:9929:89987_1:N:0:GTCCGC + 26965 45121 Z
HWI-ST571:431:C43FMACXX:8:2316:9929:89987_1:N:0:GTCCGC + 26965 301 Z
HWI-ST571:431:C43FMACXX:8:2316:9929:89987_1:N:0:GTCCGC + chr5 92603089 Z
HWI-ST571:431:C43FMACXX:8:2316:9929:89987_1:N:0:GTCCGC + chr5 92603122 Z

I am not sure about this as they seem to be a strange format. Initially I though maybe there was a formatting issue and it skipped the "chr" but none of the numbers correspond to a CG site.

When I run bismark2bedgraph the job completes and these files are also listed but without the chromosomal location I cannot locate the methylation site.

Have any of you observed this before or know what has happened? Or could it be a small bug?

Cheers

**fkrueger** · 06-16-2015, 03:23 AM

Hmm not quite sure what the problem is but to me it looks like all the information is right there:

Code:

HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC - chr9 23913539 z

HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC is the Read ID
- is the methylation state (+ is methylated, - is unmethylated)
chr9 is the chromosome, here chromosome 9
23913539 is the mapping position of the C in question
z is the context, so here CpG context

Or am I missing something?

**dpryan** · 06-16-2015, 03:27 AM

Note the 7th, 15th and 16th lines, which lack a chromosome column.

Edit: Rather, they likely have incorrect chromosomes.

**jcrow06** · 06-16-2015, 03:42 AM

Indeed as noted by dpryan
some of the hits do not have associated chr, looks like some are lost.
I looked through one of the sample files and found that there was ~ 6700 that had this missing Chr from ~6.2 Million.

sorry if my original post was not clear.
thanks for clearing it up dpryan

**fkrueger** · 06-16-2015, 04:25 AM

Sorry about not getting that in the first place. I would tend to agree with Devon that someone has tampered with the chromosome names (the ones that appear as the first line in the .fa files used for the genome preparation >...). There is no place within Bismark that would affect or remove chr from a chromosome name. Maybe you could get back to the people who gave you the data and see if they could clarify this point?

**jcrow06** · 06-16-2015, 04:34 AM

Hi fkrueger,

Honestly I had a feeling that this is an in-house issue.
When I approached them I was told "there is no annotation for chromosome etc and that it is per read based so I should summarize the read information to get results...."
From reading the well documented details of Bismark I did not feel this was the case

I wanted to double check before going back.
Thanks Fkrueger & dpryan for the help.
Much appreciated!
jcrow

**lancelothk** · 08-17-2015, 12:14 PM

Hi fkrueger,

I am currently trying to use bismark to analyse a huge BS-seq dataset in HPC environment. I am thinking to split the big fastq into smaller pieces, run bismark with each of them in one node, then merge back the BAM results. Do you have any suggestion how can I merge bismark reports? Do you have existing script to do this?

Thanks.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 29 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 25 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News