Seqanswers Leaderboard Ad

**christophe** · 08-22-2014, 03:21 PM

M-bias: CHH and CHG total calls drop off in Read2

Hi Everyone,

I have questions regarding M-bias plot and CHH,CHG methylation calls generated with bismark. I hope it is the right thread to ask otherwise let me know where should have I posted it.
Has anyone seen a drop off of CHH and CHG total calls within a M-bias plot generated by bismark tool?
I have observed this drop off only in the M-bias read2 plot.
(See Attached png plots)
The data are from Whole Genome BiSulphite with reads of 101 bases
Library Prep was performed using Ultralow Ovation MS kit.

How to interpret the falling number of calls after 50 bases? and ....
How this can impact the alignment of read2 in bismark? and the Methylation calls?

Thanks for your help,
Christophe

Attached Files

**adeirossi** · 08-22-2014, 09:42 PM

Hi Christophe,

I have also seen a falling number of methylation calls for read 2 in my data. I have always assumed this was because I ran bismark_methylation_extractor with the --no_overlap option and many of the read pairs had overlaps (i.e. insert length less than twice the read length). With --no_overlap, when the extractor encounters an overlap between read 1 and read 2, I believe it uses the read 1 calls only and discards the read 2 calls, which would explain why the number of read 2 calls decreases as you go toward its 3' end. Because this "no overlap" filtering occurs after mapping, I assume there is no impact on the alignment of read 2.

If, on the other hand, you are not using the --no_overlap option or you don't have many reads pairs with inserts <200 bp, then I have no idea what's going on.

Andrew

**fkrueger** · 08-23-2014, 02:09 AM

Originally posted by adeirossi View Post

Hi Christophe,

I have also seen a falling number of methylation calls for read 2 in my data. I have always assumed this was because I ran bismark_methylation_extractor with the --no_overlap option and many of the read pairs had overlaps (i.e. insert length less than twice the read length). With --no_overlap, when the extractor encounters an overlap between read 1 and read 2, I believe it uses the read 1 calls only and discards the read 2 calls, which would explain why the number of read 2 calls decreases as you go toward its 3' end. Because this "no overlap" filtering occurs after mapping, I assume there is no impact on the alignment of read 2.

If, on the other hand, you are not using the --no_overlap option or you don't have many reads pairs with inserts <200 bp, then I have no idea what's going on.

Andrew

I can only second Andrew's conclusions. Just for the record, you should also see a similar drop in CpG calls, but since these are less frequent than CHH or CHG calls it might be less obvious. If you mouse-over the HTML report you will see the actual number of calls involved. Thanks Andrew.

**kevinrue** · 09-01-2014, 03:42 AM

Dear all,

Apologies if this was answered previously, but I couldn't find the information by Googling and browsing the thread. I am a postgraduate bioinformatics student dealing with the first methylation dataset in our group.

Issue
My issue is that I see strikingly low alignment rate by bismark (28-45% for bisulfite libraries, 49-55% for control non-bisulfite libraries).

Details
I was given a dataset of 2 x 150bp reads of whole-genome bisulfite sequencing libraries (WGBS) using the Epignome Methyl-Seq kit, sequencing using Illumina platform. I have clipped adapter sequences and trimmed low-quality bases with Trimmomatic, as suggested in the Epicentre guide (http://www.epibio.com/applications/b...eq-kit?details). This left over 90% of the reads in each sample with some workable sequence.

From reading I have done so far, I feel that the read length is much longer than necessary, and might even cause more problems than it brings information. It seem 50-75 read length is better suited to this type of analysis.
I would like to know if anyone could advise me appropriate non-default alignment parameters suited to my data. Do I have to control the number of mismatches allowed in the seed alignments? The length of the seeds? The overall alignment score? etc etc.
I prefer asking experienced users rather than test and overfit parameters of my own.

Other

In the meantime, I truncated the forward read file of the sample with worst alignment rate (28.4%) to 75 bp, and try to align that as single-end to see if I get a better alignment rate.
It turned out to align 63.9% !

Rather than clip all my reads, I'd be very grateful to learn how to salvage information from the full length reads

Thanks in advance!

**frozenlyse** · 09-01-2014, 03:58 AM

kevinrun - what does fastqc look like on your forward and reverse reads?

We've recently switched over to the same kit, and now only do 71bp PE reads (we can get that out of 2 50 cycle SBS kits). I similarly got not great alignment using bismark (even after trimming the 6bp from each end) - however I get good alignment using bwameth, and methylation % obtained on the same dataset run through either pipeline were very similar (r=0.97 IIRC) - so perhaps bwa-meth is worth a look? I get about 80% of reads aligning with a mapping quality >=40.

**fkrueger** · 09-01-2014, 04:17 AM

A recent post in this thread was asking something quite similar, maybe the reply is helpful in some way:

Originally posted by fkrueger View Post

Hi eicht,

First of all, I can feel your pain, we have also spent quite a while trying to find the reason or solutions to poor mapping efficiencies for PE PBAT libraries. Generally, the original PBAT protocol only produces alignments to the complementary strands (CTOT and CTOB), however the EpiGnome kit produces standard, i.e. directional, libraries that align to the OT and OB strands only. This means that R1, and also paired-end alignments, should only align to the OT and OB strands; however R2 needs to be aligned to the CTOT and CTOB strands because it is the reverse complement of R1. This is exactly what the option --pbat does, and in that sense the mapping efficiencies you are describing make perfect sense.

We haven’t worked with data generated by EpiGnome but used a homebrew PBAT PE protocol, but I suppose the mechanisms will be the same since they seem to be caused by the random priming step in the protocol. Since we saw the drastically reduced mapping efficiencies in PE mode we reasoned that the protocol might generate chimeric reads, i.e. reads starting at some position in the genome and ending at another. To test this we mapped the data in SE mode and paired up pairs where both R1 and R2 produced alignments. We then imported this data into SeqMonk as Hi-C data, and looked at the amount of reads that had its partner in trans (on another chromosome) as a fairly crude measure. As you can see on the screenshot attached, when we quantitated where in the genome the partner reads aligned to of reads that aligned to chromosome 1, we found that they seem to be scattered pretty much all over the genome in no discernible pattern! In this case a bit more than 40% of reads had their partner read in trans on another chromosome (partner reads aligning to chr 1 but far away were not even considered here).

Now if you would perform some similar pairing up on your reads you might get a similar answer, and if you add up the numbers of properly aligning PE reads and the number of chimeric reads (which won’t be reported by Bismark since they are no valid PE alignments), you should be very close to the mapping efficiencies you achieve in SE mapping.

We came to a similar conclusion, and we by now also settled on the procedure you described (someone coined the term ‘Dirty Harry mode’ because it doesn’t seem to be the cleanest of methods…):

(1) directional PE alignment to start and grabbing the unmapped R1 and R2 reads out the end using the option ‘--unmapped’. Properly aligned PE reads should be methylation extracted using --no_overlap.
(2) unmappedR1 is then mapped SE directional
(3) unmappedR2 is mapped SE directional under -pbat conditions.
Single-end aligned R1 and R2 can then be methylation extracted normally as they should in theory map to different places in the genome anyway so don’t require attention to overlapping reads.

As another suggestion:
Like all PBAT applications, the first couple of bp (6bp in the case of the EpiGnome kit) produce a noticeable bias in sequence composition as well as later on in the M-bias plots (see an example here http://www.bioinformatics.babraham.a...SE_report.html). EpiCentre recommend ignoring the first 6bp of the methylation call string themselves, but it is probably a good idea to remove these residues even before mapping because we have seen that they may reduce the mapping efficiency somewhat. This could be done with --clip_r1 6 and --clip_r2 6 within Trim Galore.

I start to think that paired-end for bisulfite is more headache than it is worth sometimes.

At least for paired-end PBAT protocols I would probably second this notion, however even sequencing longer SE reads (e.g. 150bp SE) will likely start reading into the chimeric portion of the read so also have some mappability problems...

Just as another comment, we ran a couple of tests on 2x76bp EpiGnome paired end data as well and achieved mapping rates of >75% in all cases (using --bowtie2 default mode). So maybe the problem of chimaeric reads starts a little later which might explain the drop the efficiency for long reads? A tool supporting local alignment such as bwa-meth might indeed help in this case...

**kevinrue** · 09-01-2014, 06:17 AM

Hi,

Thank you both for the very quick answers. Much appreciated. It took me a bit of time to get this answer together. I hope it's clear.

I didn't mention yet that I am working with the Bos taurus UMD3.1 genome. 3 billion base pairs on 29 autosomes + 2 sex chromosomes: comparable to human.

frozenlyse
I attached the fastQC archives (recompressed in tar format to satisfy the upload requirements of SeqAnswer), as I am not sure which part of it you were interested in. Due to the C->T conversion caused by the bisulfite treatment, I am not particularly concerned by:

the enriched proportion of T through the forward read
The enriched proportion of A in the reverse read
The low GC content in both reads
the enriched proportion of k-mers containing C and A in the forward read
the enriched proportion of k-mers containing G and T in the reverse read

bwameth sounds like a good alternative. I will look for a standard pipeline of analysis from alignment to differential methylation calls.. except if you have a link to offer

Felix

Sorry I missed that post, thanks for forwarding. I started reading it to understand the entire procedure, although it sounds convoluted. I suppose that after all those alignments, I would have to aggregate the methylation calls for each cytosine (from the paired alignment + single end forward + pbat single end reverse), and then feed it into an analytic software package.
What tools do you recommend for statistical analysis of differential methylation by the way?

Info
I just tried to align 1 million non-truncated forward reads in single end mode and got 60.7% again.
Looks like the problem of chimeric reads indeed.

Here is a copy paste of the full bismark report if it is any use:
Bismark report for: /workspace/scratch/krue/Methylation/2014-08-27_analysis/Trimmomatic/Paired/C1_ATCACG_1P.fastq (version: v0.12.1)
Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)
Bismark was run with Bowtie 2 against the bisulfite genome of /workspace/storage/genomes/bostaurus/UMD3.1.75/source_file/ with the specified options: -q --score-min L,0,-0.2 --ignore-quals

Final Alignment report
======================
Sequences analysed in total: 1000000
Number of alignments with a unique best hit from the different alignments: 606793
Mapping efficiency: 60.7%
Sequences with no alignments under any condition: 179520
Sequences did not map uniquely: 213687
Sequences which were discarded because genomic sequence could not be extracted: 5

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT: 303497 ((converted) top strand)
CT/GA: 303291 ((converted) bottom strand)
GA/CT: 0 (complementary to (converted) top strand)
GA/GA: 0 (complementary to (converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 14776304

Total methylated C's in CpG context: 644197
Total methylated C's in CHG context: 32496
Total methylated C's in CHH context: 56579
Total methylated C's in Unknown context: 756

Total unmethylated C's in CpG context: 249170
Total unmethylated C's in CHG context: 3536727
Total unmethylated C's in CHH context: 10257135
Total unmethylated C's in Unknown context: 8798

C methylated in CpG context: 72.1%
C methylated in CHG context: 0.9%
C methylated in CHH context: 0.5%
C methylated in Unknown context (CN or CHN): 7.9%

C1_ATCACG_1P.fastq_bismark_bt2_SE_report.txt (END)

Attached Files

**fkrueger** · 09-01-2014, 07:37 AM

Hi Kevin,

The FastQC reports look completely normal, and the results you posted look quite decent as well. So for the record, can we just state that it's not really Bismark's fault but the PBAT protocol itself is generating wild artefacts

.

Regarding DMR analysis: I don't think there anything is accepted as gold standard just yet, new tools are being published almost every week. Promising and widely used tools are: bsseq, BiSeq, methylKit, MOABS, RADMeth, DMAP, RnBeads and a few more, most of them interface rather nicely with the output of Bismark. Please find attached a review of the underlying concepts of some of these tools by Mark Robinson: http://biorxiv.org/content/biorxiv/e...07120.full.pdf, I hope it may help.

**kevinrue** · 09-01-2014, 08:38 AM

Hi Felix,

Thanks for the feedback on the fastQC and alignment report. Plus the list of DMR tools. I only tested BiSeq so far during my initial attempt to an analytic pipeline, and didn't find any significant DMR, but Katja told me the pacakge was meant for RRBS analyses. Anyway, as you can see the proper alignment of my reads might have been the limiting factor.

Indeed, as far as I can see, bismark does indeed a good job aligning reads, with good quality reads and adequate parameters

Given my recent arrival in the bisulfite-sequencing field (I have only dealt with transcriptomics data for the past two years, microarray and RNA-seq), I can only strive to understand all the complexity of the bisulfite data, the sources during their obtention and their subsequent analysis which you are much more familiar with than I am.

New result

I have a little update which surprised me. I have attached a screenshot of the alignment rate for SingleEnd/PE reads mode, with or without truncating to 75bp. Two of them were only performed on the first million read/pairs using "-u 1000000" to post you this answer today

What surprised me, as I believed chimeric read/pairs were being discarded during the alignment, is that truncating both mates to 75 bp (or fewer if quality trimming left less than that) restored an alignment rate of 56.8% (of 1 million pair), compared to 28.4% (of 8.4 million pairs, in) for full-length read pairs.
If inserts were indeed chimeric, truncating the read pairs should identified those reads as chimeric and left them out again, shouldn't it? Or am I missing something?

Now, I think I will repeat tonight the alignment of truncated read pairs with the entire 8.4 million read pairs to satisfy my OCDs and give you a table with fully comparable numbers, but I am fairly certain that these figures are representative enough.

Many thanks for your help and time
Kevin

Attached Files

t1.png (4.7 KB, 36 views)

**fkrueger** · 09-01-2014, 12:17 PM

I am keen to hear more about what you discover!

**kevinrue** · 09-02-2014, 05:43 AM

Hi again Felix,

(I am wondering whether to continue the conversation I started in this thread, or move to the better suited "paired end alignment rate issue" that you linked earlier.)

bismark
As promised, I have completed running the tests and here are the statistics of unique, multiple, and unaligned reads/pairs:

Can I do anything else to help you help me explain what is going on with long paired-end reads? Given that I am seeking biological inisight between infected and control samples, I'd rather avoid over-fitting parameters which give me my favorite answer, and instead follow your advice on how to correctly use your tool.

bwa-meth
I also report the alignment rate of bwa-meth in the screenshot above, which seem much better (although the numbers don't add up to the same number of input read pairs?!), but you can read below how I am stuck with that tool as well. Here are the detailed stats for bwa-meth:
$samtools flagstat bwa-meth.bam
17570633 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
17493812 + 0 mapped (99.56%:-nan%)
17570633 + 0 paired in sequencing
8787708 + 0 read1
8782925 + 0 read2
16950242 + 0 properly paired (96.47%:-nan%)
17470980 + 0 with itself and mate mapped
22832 + 0 singletons (0.13%:-nan%)
209437 + 0 with mate mapped to a different chr
66124 + 0 with mate mapped to a different chr (mapQ>=5)

However, I am blocked one step further with that one. The "tabulate" step using Bis-SNP to call methylation for CpG's crashes due to a read in the following configuration. I know I am going out of topic here, but here is the error message (I will contact the developer there to see about that):
ERROR MESSAGE: SAM/BAM file SAMFileReader{/workspace/scratch/krue/Methylation/2014-08-27_analysis/bismark/bwa_meth/bwa-meth.bam} is malformed: Read starting with deletion. Cigar: 19D78M1D66M

Any advice welcome. Really feeling lost about what to do now.
Thanks very much

**dpryan** · 09-02-2014, 05:51 AM

You might send brentp a message and ask him about that error. It's really really strange for a read to start with a deletion (what would that even mean?). I don't recall bwa-meth messing with the CIGAR string, but it's been a while since I looked so my memory might be failing me.

The increased alignment rate by bwa-meth is presumably due to heavily soft-clipping things. It's be interesting to extract some of the longer soft-clipped sequences and try to see where they go. That might provide some insight into what's going on here.

**fkrueger** · 09-02-2014, 06:00 AM

Kevin,

We don't tend to play around with the mapping parameters very much normally (what you called over-fitting), but tend to run PE alignments first (with --unmapped) and then re-run the left-over single-end alignments. Having said that, we only do this if we've got very precious samples with limited availability and we really want to squeeze as much out of the data as possible (e.g. oocytes that have been collected by mouth-pipetting for half a year...). I guess if you've got enough material it would be a bit 'cleaner' to just run paired-end mode to make use of the --no_overlap feature. Cheers, Felix

**kevinrue** · 09-02-2014, 06:08 AM

Hi Ryan,

I am in touch with Brent. Just waiting for his answer to that surprising issue.

I don't know if I could find and visualise that particular read/pair somehow (bash command looking for that CIGAR string?)

So far I could only imagine a read aligning to the start of a chromosome/contig maybe? In which case, shouldn't the start of the read be clipped off?
Or is this an expected bahavior of tools supporting local alignment such as bwa-meth? If it was soft-clipped (= kept in the sequence, but not aligned, right?), then I suppose the CIGAR string would need to describe this part of the sequence as deleted from the 5', right?

How would you suggest I "extract some of the longer soft-clipped sequences"? Do you define those as the ones which may start/end with a deletion? Is there a samtools command for that, or will that be a bash script? Then, I suppose you suggest I create a subset bam file containing those sequences and look at their locations?

Sorry for the many questions. I never had to dig into SAM/BAM files so far :/

Thanks!
K.

**fkrueger** · 09-02-2014, 06:13 AM

very quick and dirty would be something like:
samtools view mappedfile.bam | grep 19D78M1D66M > where_is_this_read.sam

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 17 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 48 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News