Seqanswers Leaderboard Ad

**fkrueger** · 12-05-2013, 02:27 PM

Oh, too late...

**cjsifuen** · 01-02-2014, 06:21 PM

Hi all,

Sorry if this question is a bit simple, I am very new to all of this. I am trying to reanalyze some RRBS data, following methods from a paper that was recently published. I used bismark to align paired-end RRBS reads, then used the bismark_methylation_extractor to make methylation calls. Now, I am trying to use methylKit to perform further analysis, but I cannot figure out how to get the methylation call file with the format that methylKit asks for. Is there a way to get the output file with only with the format below?

chrBase chr base strand coverage freqC freqT
chr21.9764539 chr21 9764539 R 12 25.00 75.00
chr21.9764513 chr21 9764513 R 12 0.00 100.00
chr21.9820622 chr21 9820622 F 13 0.00 100.00
chr21.9837545 chr21 9837545 F 11 0.00 100.00
chr21.9849022 chr21 9849022 F 124 72.58 27.42
chr21.9853326 chr21 9853326 F 17 70.59 29.41

I am trying to repeat what the group in the paper did, so I would prefer to read in the bismark methylation call files and not the bismark alignment files.

Thanks,

Chris

**kalyankpy** · 01-02-2014, 10:13 PM

Dear Chris,

Easiest way to move from Bismark to Methylkit is to use the .sam file from alignment as input for Methylkit.

In the other case one has to extract methylation data from .sam file and convert it into bed or bedgraph file. One has to process further if you intend to merge the strand coverage on either strands. Lastly one has to convert the merged strand coverage data into methylkit format. I have been following this method for few samples. I have a script written to do this job. This script was written for an earlier version of Bismark (0.7.5) and all the paths are hardcoded (works only on my computer). Contact me if you need some help in using this.

So, among the two methods, easiest one would be to give .sam file as input for methylkit. Somehow, I felt the methylkit tutorial/vignette is not upto the standard of other bioconductor packages. In case you are familiar with R, it will be an easy ride!

Kalyan

**dpryan** · 01-03-2014, 01:09 AM

For Bison, I wrote a bedGraph2Methylkit program. You'd actually need to download the full bison source code (as well as samtools) for compilation, but that's probably the fastest route if you've already used the methylation extractor to generate bedGraph files (I've never tried this on Bismark-generated bedGraph files, so YMMV).

**linnet** · 01-14-2014, 06:15 AM

deduplicate_bismark

Hello,

Does anyone knows input file size to memory footprint of deduplicate_bismark ? I have 156 GB WGBS bam file (10^9 PE reads) and I suspect 48 GB workstation is getting into HDD swap usage, which is not great... Any ideas or solutions? Alternatively I have tried samtools sort; rmdup; sort by name, but then methylation extractor is not happy, probably due to some pair-mates removed.

Thanks,
Skirmantas

**fkrueger** · 01-14-2014, 06:40 AM

Hi Skirmantas,

Deduplicating such a large file with only 48GB of memory is probably not going to work... Admittedly, Perl is probably not the best option to store and quickly look up such an extraordinary number of entries, but the deduplication procedure is storing a composite string of chromosome:start:end:strand which is sort of essential to the process. One could probably reduce the memory footprint somewhat by not including the chromosome into the string but using a different hash for each chromosome, and replacing the end coordinate by the length of the fragment, but I don't know how much this would reduce the footprint.
Alternatively I could offer you to set up an FTP site for your file and deduplicate it here overnight ...

**dariober** · 01-14-2014, 07:50 AM

Originally posted by linnet View Post

Alternatively I have tried samtools sort; rmdup; sort by name, but then methylation extractor is not happy, probably due to some pair-mates removed.

Hi- I think samtools rmdup doesn't handle paired reads correctly. I would try to use picard MarkDuplicates instead.

Good luck!
Dario

**linnet** · 01-17-2014, 07:48 AM

duplicates

Dear fkrueger and Dario,

Thank you both for lighting replies and suggestions! I will try picard, since I want to have more permanent solution rather than bothering fkruager every time I have a big dataset. If I am completely stuck, I may be in touch though.

Regarding picard I was under a probably false impression that samtools and picard are using the same algorithm to eliminate duplicates... I had a difficulty to find how picard defines PE duplicate (which is clearly explained in deduplicate_bismark)

Thanks again,
Skirmantas

**dariober** · 01-17-2014, 08:23 AM

Originally posted by linnet View Post

Regarding picard I was under a probably false impression that samtools and picard are using the same algorithm to eliminate duplicates... I had a difficulty to find how picard defines PE duplicate (which is clearly explained in deduplicate_bismark)

Hi- See this link for an explanation of MarkDuplicates. if you haven't already seen it http://sourceforge.net/apps/mediawik...icates_work.3F

Q: How does MarkDuplicates work?

A: Essentially what it does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15.

If your reads have been divided into separate BAMs by chromosome, inter-chromosomal pairs will not be identified, but MarkDuplicates will not fail due to inability to find the mate pair for a read.

And here http://www.biostars.org/p/3917/ for some comments about rmdup vs MarkDuplicates.

Dario

**PeteH** · 03-04-2014, 08:27 PM

Hi Felix,

Currently the SAM header produced by Bismark has the @SQ fields in a more-or-less random order. This means that the @SQ fields will almost certainly not be in the same order as the they are in the reference genome/index. It's not a huge deal but it means I have to re-header my SAM files because I rely on the order of the @SQ fields in the SAM header in downstream tools. Is it possible to change this behaviour so that the @SQ fields in the SAM header are in the same order as in the reference genome/index?

My understanding is that the @SQ information is stored in a hash ("chromosomes") and so there is no guarantee on the order that these will be printed in when called by the function generate_SAM_header(). I'm not sure how hard that is to do this in the current framework. My idea was to store an index variable as an additional value in that hash, where the index is the position of that chromosome in the reference genome/index, but my limited Perl skills couldn't get it working.

I understand if you don't consider it an issue and I will just continue to re-header my SAM files.

Thanks,
Pete

**fkrueger** · 03-05-2014, 03:30 AM

Originally posted by PeteH View Post

Hi Felix,

Currently the SAM header produced by Bismark has the @SQ fields in a more-or-less random order. This means that the @SQ fields will almost certainly not be in the same order as the they are in the reference genome/index. It's not a huge deal but it means I have to re-header my SAM files because I rely on the order of the @SQ fields in the SAM header in downstream tools. Is it possible to change this behaviour so that the @SQ fields in the SAM header are in the same order as in the reference genome/index?

My understanding is that the @SQ information is stored in a hash ("chromosomes") and so there is no guarantee on the order that these will be printed in when called by the function generate_SAM_header(). I'm not sure how hard that is to do this in the current framework. My idea was to store an index variable as an additional value in that hash, where the index is the position of that chromosome in the reference genome/index, but my limited Perl skills couldn't get it working.

I understand if you don't consider it an issue and I will just continue to re-header my SAM files.

Thanks,
Pete

Hi Pete,
I have changed the order of the @SQ header lines to reflect the order in which the reference files are read in. This is normally the same order the files are sorted in the reference genome directory, or for multi-fasta files it should be the order the chromosomes are listed inside that file. So instead of random ordering it would now look like this for the mouse genome:

Code:

@HD     VN:1.0  SO:unsorted
@SQ     SN:1    LN:197195432
@SQ     SN:10   LN:129993255
@SQ     SN:11   LN:121843856
@SQ     SN:12   LN:121257530
@SQ     SN:13   LN:120284312
@SQ     SN:14   LN:125194864
@SQ     SN:15   LN:103494974
@SQ     SN:16   LN:98319150
@SQ     SN:17   LN:95272651
@SQ     SN:18   LN:90772031
@SQ     SN:19   LN:61342430
@SQ     SN:2    LN:181748087
@SQ     SN:3    LN:159599783
@SQ     SN:4    LN:155630120
@SQ     SN:5    LN:152537259
@SQ     SN:6    LN:149517037
@SQ     SN:7    LN:152524553
@SQ     SN:8    LN:131738871
@SQ     SN:9    LN:124076172
@SQ     SN:MT   LN:16299
@SQ     SN:X    LN:166650296
@SQ     SN:Y    LN:15902555
@PG     ID:Bismark      VN:v0.10.2      CL:"bismark /data/public/Genomes/Mouse/NCBIM37/ simulated.fastq"

Does this help? The latest version (v0.10.2) is attached.

Attached Files

bismark.zip (60.5 KB, 32 views)

**PeteH** · 03-05-2014, 04:22 PM

Fantastic, thanks Felix! Simply having a second hash storing the SQ_order was so much simpler than my idea.

**adeirossi** · 03-10-2014, 09:00 AM

MAPQ with Bowtie2?

Hi Felix,

First and foremost, thank you for a wonderful tool.

I am curious to know why Bismark always reports 255 as the MAPQ (mapping quality), even when running Bowtie2, which reports a continuous range of mapping qualities. It would be useful to have some idea of relative mapping quality among the unique alignments. Are there reasons not to simply report the MAPQ values generated by Bowtie2?

Thanks in advance,
Andrew

**fkrueger** · 03-10-2014, 09:58 AM

Originally posted by adeirossi View Post

Hi Felix,

First and foremost, thank you for a wonderful tool.

I am curious to know why Bismark always reports 255 as the MAPQ (mapping quality), even when running Bowtie2, which reports a continuous range of mapping qualities. It would be useful to have some idea of relative mapping quality among the unique alignments. Are there reasons not to simply report the MAPQ values generated by Bowtie2?

Thanks in advance,
Andrew

Hi Andrew,

To be perfectly honest we have never really considered the MAPQ value of alignments very much so far. As I understand it the mapping quality is a measure of the probability that an alignment is the true origin of a read over its next best alignment(s). Historically at least, when we started with Bismark and reads were only 28bp long or so, the C to T conversion really meant a big problem when it comes to sequence complexity (i.e. a sequence that would never align using a 4-letter alphabet might only have a single mismatch in BS-Seq). On top of that one would not only have to deal with a sequence and its second best alignment, but also with the best alignment and all other alignments coming from the other 1 (or 3 alignment) threads, depending on the type of experiment/alignment mode. We therefore decided that a sequence either has a unique alignment (as in an alignment that is better than all potentially other alignments), in which case it is given MAPQ value of 255, or it doesn't, in which case it would be discarded.

If you think that it would make sense to feed the MAPQ value of Bowtie 2 through to the actual Bismark output we might have a look into this again (I suppose one would have to look at the MAPQ values from all alignment threads in and choose the lowest one then or something like that).

**adeirossi** · 03-12-2014, 05:27 PM

MAPQ with Bowtie 2?

Originally posted by fkrueger View Post

If you think that it would make sense to feed the MAPQ value of Bowtie 2 through to the actual Bismark output we might have a look into this again (I suppose one would have to look at the MAPQ values from all alignment threads in and choose the lowest one then or something like that).

Thanks, Felix, for your reply. As you mention, things are complicated when you need to integrate information across alignment threads. If you'll bear with me, I will try to illustrate my thought process with a contrived example.

Consider a read that aligns well to exactly two sites in the genome -- one on the OT (original top) strand, the other on the OB (original bottom) strand. Bowtie 2 aligns the read uniquely to the OT strand of chr1 with a high MAPQ of 40, but in a separate thread, Bowtie 2 aligns the same read uniquely to the OB strand on chr2 with an almost-as-high MAPQ value of 38. I understand that Bismark reports the alignment with the highest alignment score (AS tag); let's say this is the first alignment, so the alignment to chr1 is printed to the SAM output.

What to report as the MAPQ value for this SAM entry? A few options:

Report 40, the MAPQ value from Bowtie 2 for the printed alignment. This would be simplest, but it ignores the presence of an almost-as-good alignment from the other thread.
Report 38, the minimum of MAPQ values from all alignment threads. This at least acknowledges the existence of another alignment from the other thread, but it seems like an inflated MAPQ value for a read that aligns quite well to two different sites.
Report a value considerably less than 38 to reflect that, when both OT and OB alignments are considered, the best alignment is not so unique.

Intuitively, option 3 seems the best; a high MAPQ value should be reported only when Bowtie 2 gives a high MAPQ in exactly one of the alignment threads. But this option is clearly ill-defined as I have not suggested an actual formula. (Maybe report [MAPQ of printed alignment] - [highest MAPQ among non-printed alignments] = 40 - 38 = 2, or something similar?)

Anyway, it seems to me that even an imperfect MAPQ value would be better than none at all. But I'm still new to bisulfite sequencing, so I will defer to your wisdom.

Andrew

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News