Seqanswers Leaderboard Ad

**dpryan** · 09-02-2014, 06:36 AM

What Felix said regarding finding the read.

I don't know if there's anything prebuilt that will extract soft-clipped sequences. It'd be an easy* enough thing to write though.

Anything aligning to the end should get hard clipped off, though I recall there being an issue with at least earlier versions of bwa concatenating contigs and then allowing alignments to flow of the end of one onto the next. Having said that, even that shouldn't cause an alignment to start with a deletion.

*Only for someone experienced with programming.

**kevinrue** · 09-02-2014, 06:39 AM

Hi Felix,

Thanks, I see what you mean about making the most of your data when more sequencing is not an option. I am not sure how much more sequencing we can get out of our samples either, as those were used for RNA-seq as well. I will discuss the different options with colleagues.

In the meantime, I think I will simultaneously try:
- to complete the "Dirty Harry" solution to salvage as much information using bismark, and see what I can get from downstream analyses this way.
- to see if we can work around the issue described above with bwa-meth to see if the higher alignment rate (for my particular dataset of long reads) can be used to obtain higher information content. (this utopic alignment rate makes me nervous about false positives).

Here is the strange read (thanks for the command, I am always concerned about reinventing the wheel when things like samtools exist!):
HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 99 1 147356530 60 19D78M1D66M = 147356709 323 TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C NM:i:23 AS:i:82 XS:i:55 RG:Z:C1_ATCACG_P YC:Z:CT YD:Z:f

Still reading the SAM format description to spot where the problem is, if any..

Thanks
Kevin

**dpryan** · 09-02-2014, 06:51 AM

I don't think the SAM spec explicitly says it, but it's non-sensical to start or end an alignment with a deletion. By definition, there'd be no bases in the alignment actually supporting said deletion. The CIGAR for that should be 78M1D66M, for what it's worth. My guess is that this is actually a BWA bug, rather than one in bwa-meth itself.

BTW, if it's local alignment that's saving the day on this one, then there are other bowtie2-based aligners that support it (Bison and BS-Seeker2). You generally need to use the --very-sensitive-local preset with bowtie2 to get it to align as well as BWA (at least that was the case when Brent and I compared bwa-meth against Bison). Having said that, if this were my data I'd prefer to get to the bottom of this before resorting to local alignment.

**fkrueger** · 09-02-2014, 07:24 AM

Originally posted by dpryan View Post

Having said that, if this were my data I'd prefer to get to the bottom of this before resorting to local alignment.

I can only second this, and let's not kid ourselves, you can't possibly achieve a 99.56% mapping rate while including only sensible reads.... (not even simulated data from well known parts of the genome would come close, and your library likely contains a lot of repeats and other unmappable things that are not in the assembly, in addition to possible crossovers and what-ever-else-you are-going-to-dig-up)

**brentp** · 09-02-2014, 07:24 AM

as far as I can guess, dpryan is correct that what you're seeing is more likely a bwa bug than a bwa-meth bug since I don't mess with the cigar string.
As I told you, you can simply use the scripts/tabulate-methylation.py script rather than relying on BisSNP. But, yes, it'd be good to know what is going on.
I am downloading chromosome 1 of bos taurus to see.

**kevinrue** · 09-02-2014, 08:11 AM

Hi ,

Felix, I agree: as I wrote earlier, this utopic alignment rate by bwa makes me a bit nervous too. I am also curious to get to the bottom of things here, and really appreciate the help from everyone involved in this conversation.

Brent, dpryan, as I said in the email, I was a few bwa versions late, so I just updated it on our server and I am aligning again to see if that solves the problem. I had a look at their changelog, and didn't spot recent changes regarding the CIGAR string,so I don't have high hopes on this side of things.

As soon as the updated bwa aligner has fnished realigning those reads, I will try both the "bwameth.py tabulate" and the separate "scripts/tabulate-methylation.py" script to see if I can get either working.

Please let me know if you need any additional piece of information from me.
I can also discuss with my suprvisor the possibility to share some data if you'd prefer having a look at certain aspects of the data yourselves.
For the record, we have two groups of 6 samples, paired between infected and control treatments. Since I spotted that low alignment rate with bismark, I have been doing all the testing on the sample which returned the lowestalignment rate in paired-end mode using bismark (~28%).

By the way,speaking of "someone experienced with programming", I didn't want to write me CV in the first post, but a little detail might help you understand my situation at the minute. I am a PhD student/bioinformatician starting 4th year and who stayed close to the lab side (= use tool to derive biological interpretation), rather than the software development side of aligners and statistical frameworks. I know my way around several programming languages - when I know what I want to program. Here I am stuck mainly because I am in that twilight zone where I don't know exactly what I want/need to program to do things right with this type of data.
Methylation has very different issues to the RNA-seq that I am used to!

Cheers,
Kevin

**brentp** · 09-02-2014, 08:59 AM

kevinrue, can you send the output of

grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 $FQ

for both ends of your input fastqs?

Also, note that the percentage of output that you are looking at for bwa-meth is high because it will map some reads as secondary (-M flag to bwa mem) and you should also filter by quality score. Without that, samtools flagstat is only a summary overview of what happened. I filter by mapping quality score > 15 (at least ) in downstream analyses. You can get a more realistic mapping number by:

samtools view -f2 -F0x200 $BAM | awk '$5 > 15' | wc -l

**kevinrue** · 09-02-2014, 09:27 AM

Careful about the header of the next read which got extracted in the "next four lines".

Forward mate
$grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 /workspace/scratch/krue/Methylation/2014-08-27_analysis/Trimmomatic/Paired/C1_ATCACG_1P.fastq
@HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 1:N:0:ATCACG
TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG
+
FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C
@HWI-D00238R:175:H8RA2ADXX:1:1216:1653:36566 1:N:0:ATCACG

Reverse mate
$grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 /workspace/scratch/krue/Methylation/2014-08-27_analysis/Trimmomatic/Paired/C1_ATCACG_2P.fastq
@HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 2:N:0:ATCACG
ACTTATATTTTCCAAATAAAAAAACTTTATTAAAAAACTACTAAAACCATCCCAAAAATAAACTTCCTCTTAAAAATCGCAAACACTACATCTCCCACACAAATCCCAAATACCGAAAAACGTATAACGCATCCTCAAAATAAA
+
FFHHHHHJJJJJJJJJJJJJIJJJJJJJIJJJJJJJJJJJJJJJJJJJJIJJJJHHHHFFFFFFEEEDCDEDDDDDDCDDDDDBDDDDDDDCDDD?CBBBDDDDDDDDDBDDDCDDBDBDBDDBCDECDDDB>CCDDDDDCCDC
@HWI-D00238R:175:H8RA2ADXX:1:1216:1653:36566 2:N:0:ATCACG

Also:
$ samtools view -f2 -F0x200 $BAM | awk '$5 > 15' | wc -l
12807890
which is = 6,403,945 * 2
hence ~73% of 8,691,655 input filtered/trimmed read pairs.

Thanks for looking into this
Kévin

**brentp** · 09-02-2014, 12:06 PM

here are the alignments that I get with bwa-meth for that read-pair that you extracted:

Code:

HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545	97	1	147356549	55	78M1D66M	=	147356709	304	TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG	FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C	NM:i:4	MD:Z:15A62^T18A20A26	AS:i:82	XS:i:27	RG:Z:a_R	YC:Z:CT	YD:Z:f
HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545	145	1	147356709	60	144M	=	147356549	-304	TTTATTTTGAGGATGCGTTATACGTTTTTCGGTATTTGGGATTTGTGTGGGAGATGTAGTGTTTGCGATTTTTAAGAGGAAGTTTATTTTTGGGATGGTTTTAGTAGTTTTTTAATAAAGTTTTTTTATTTGGAAAATATAAGT	CDCCDDDDDCC>BDDDCEDCBDDBDBDBDDCDDDBDDDDDDDDDBBBC?DDDCDDDDDDDBDDDDDCDDDDDDEDCDEEEFFFFFFHHHHJJJJIJJJJJJJJJJJJJJJJJJJJIJJJJJJJIJJJJJJJJJJJJJHHHHHFF	NM:i:2	MD:Z:30A23G89	AS:i:138	XS:i:22	RG:Z:a_R	YC:Z:GA	YD:Z:f

They are mapped to the same vicinity as the alignment you pasted, but there is no odd cigar string. Maybe you mapped trimmed reads?

**kevinrue** · 09-02-2014, 12:41 PM

Hi Brent,

It seems the CIGAR string was due to an old version of bwa (0.7.2). I have installed 0.7.10 today and found the same good CIGAR and alignment positions as you posted above.
(By the way, the reads I posted are the ones I aligned for this pair. I trimmed reads based on quality and adapter sequence priori to alignment).

Now, I am running your tabulate-methylation.py script (using Python 2.7 as per our separate email). I will let you know about the results. Anything you'd like me to check in particular?

Best,
Kevin

**gene_x** · 09-03-2014, 07:55 PM

no_overlap

hi, Felix,

I just tried bismark on PE bisulfite seq data and I initially didn't use the no_overlap option and the M-bias plot looks like OK. However, with no_overlap option, there is a drop of CHH total calls at the end of the reads. I'm not sure why this would happen and how to interpret the pattern. Do you think there is something wrong with the library, any ideas?

Without no_overlap

With no_overlap

**fkrueger** · 09-04-2014, 01:01 AM

It is perfectly normal to see a decline of total calls towards the end of read 2, see an example here. This should affect all C contexts and only CHH, if you click the CHH and CHG away in the linked graph you will see it much better because the higher number of CHH calls obscures the other contexts somewhat.

The reason for this is simply that the --no_overlap function looks for overlaps only for Read 2 (since historically the basecall qualities for Read 1 were often better), so as soon as read 2 overlaps with already covered positions its methylation calls are being ignored. Again this is fine, because overlapping positions have already been extracted for Read 1 and are thus not lost.

**dpryan** · 09-04-2014, 01:12 AM

Out of curiosity, is there a reason that --no_overlap is an option rather than just being the default behaviour? Not using it would result in incorrect metrics/statistics, so I would think that defaulting to not doing that would make it rather easy for people to shoot themselves in the foot.

**fkrueger** · 09-04-2014, 01:14 AM

Good idea, shall have that for the next release.

**gene_x** · 09-04-2014, 08:56 AM

Originally posted by fkrueger View Post

It is perfectly normal to see a decline of total calls towards the end of read 2, see an example here. This should affect all C contexts and only CHH, if you click the CHH and CHG away in the linked graph you will see it much better because the higher number of CHH calls obscures the other contexts somewhat.

You mean "This should affect all C contexts and not only CHH?
How do you click the CHH and CHG away? I can only zoom in certain bases...

The reason for this is simply that the --no_overlap function looks for overlaps only for Read 2 (since historically the basecall qualities for Read 1 were often better), so as soon as read 2 overlaps with already covered positions its methylation calls are being ignored. Again this is fine, because overlapping positions have already been extracted for Read 1 and are thus not lost.

Can you help me clarify one thing? In the M-bias plot for read2, is position 1 the start of the read2? I'm a bit confused about the orientation of the read2. I thought in M-bias plot, position 1 is the the end of read2 (after read 2 is reverse complemented)? probably my understanding was wrong...

Also, in M-bias plot for read2, the first 2 bases have much lower average CpG methylation, that suggests they should be ignored in the downstream analysis? I could use the -ignore_r2 option in methylation extractor to do it?

Thanks!

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News