SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
CisGenome -- an integrated tool for ChIP-seq data analysis hji Bioinformatics 66 12-30-2014 01:55 PM
Bismark Bisulfite Aligner - Now supporting CpG, CHG and CHH context fkrueger Bioinformatics 27 10-11-2013 05:40 AM
Bismark v0.6.beta1: Now supporting gapped Bisulfite-Seq alignments fkrueger Bioinformatics 6 03-19-2012 05:06 AM
Apparent duplication levels incongruence between bismark and fastqc with BS-Seq data gcarbajosa Bioinformatics 2 12-13-2011 08:43 AM

Reply
 
Thread Tools
Old 07-18-2013, 01:50 AM   #261
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

I agree that one should be careful when merging reads of different length into the same file and then analysing them jointly. Fig. 2B in the Hansen paper works nicely for 5 read lengths and just CpG context, but imagine the mess when plotting out CpG, CHH and CHG context as well as their coverages for every single read length encountered in the sample :P

I guess the coverage plot could also be used as an indication that something weird is going, e.g. if you see some sort of bias around bp 75 when you also see a huge spike in general methylation call coverage for this position. Thanks for sharing your considerations though.
fkrueger is offline   Reply With Quote
Old 07-18-2013, 02:02 AM   #262
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

Agreed. I'm having a closer look at M-bias as a colleague suggested most of the 3' M-bias he had observed in his data was simply due to the inherent decline in quality of base calls at the 3' end of reads. This would suggest that the 3' end of some reads could be "saved", provided the base calls were of a high enough quality. Now, whether all that extra work is worth the effort is another matter... But I've been dealing with datasets where I'm effectively throwing away > 15% of each aligned read due to extensive M-bias. In such cases it would be desirable to get back some of those read positions if their quality was high enough for me to be confident they were unlikely to be affected by M-bias.

One more request: could you please add an --mbias option to bismark_methylation_extractor so as to just produce the M-bias output without doing the methylation calling. Thanks!
PeteH is offline   Reply With Quote
Old 07-18-2013, 02:21 AM   #263
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Quote:
Originally Posted by PeteH View Post
One more request: could you please add an --mbias option to bismark_methylation_extractor so as to just produce the M-bias output without doing the methylation calling. Thanks!
That seems fair enough, I'll put it on my To-Do list!
fkrueger is offline   Reply With Quote
Old 07-19-2013, 11:17 AM   #264
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Quote:
Originally Posted by PeteH View Post
One more request: could you please add an --mbias option to bismark_methylation_extractor so as to just produce the M-bias output without doing the methylation calling. Thanks!
Hi Pete,

I had a go at implementing a new function '--mbias_only' which you can find attached. If this option is specified, only a report and the M-bias plot(s) and their data are being written out. It will not extract any methylation data, also the bedGraph and cytosine report conversion are not allowed.

The new version of the extractor (v0.8.1a) is attached; if you find it working alright I will include it into a new release once I am back from ISMB.
Attached Files
File Type: pl bismark_methylation_extractor.pl (170.9 KB, 2 views)
fkrueger is offline   Reply With Quote
Old 07-23-2013, 01:55 AM   #265
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

Hi Felix - never mind about read groups, I've pipelined adding them during postprocessing!

One small bug I've just seen is that the TLEN of some reads in the output bam file is sometimes 0 (this is with v0.7.12 in case this has been fixed in the new 0.8 versions)

eg on a paired test paired end alignment of 250,000 150bp PE reads
Code:
samtools view test_reads.rmdup.bam | awk '{if ($9=="0") print $0}' | wc -l
2518
samtools view test_reads.rmdup.bam | awk '{if ($9=="0") print $0}' | head
HWI-D00119:19:H092KADXX:1:1101:19147:6616_1:N:0:	115	chr1	1248010	255	136M	=	1248010	0	CCCTACAAAAATACTAAAACTTATAAACAAGGTCACCGTCTCGCCATTAACCAACATATAACAATTAACCCCTAAACCCCGAAAAAAAAAAAAACCTCAACCCAAACTACCCGTACTAACCCGAAATAAACCCCAC	'BB<B7BFFBBB07BBB<0000<<BB0000'0'7'70''0'07BBBBBB7B<<B<B<BBB<<FFBBB<BBBB7B77<BB7B<BBBBBFFIIFFFIIFFFFIIIFFBB00FFBB7FFBFFFFFFFFFFFFFFFF<BB	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:45	XM:Z:......zxhh..h..xhhh...h.hhh..x.Z.....Z....Z.....hh..zx...h..h..x..h......xh.....Zxhhh..h.h.h.h.....x....xh..x...Z.h..x....Z.hh.hhh......	XR:Z:CT	XX:Z:G5GGGG2G2GGGG3G1GGG2GC17GG2GG3G2G2G2G6GG6GGGG2G1G1G1G5G4GG2G5G2G6GG1GGG6
HWI-D00119:19:H092KADXX:1:1101:19147:6616_1:N:0:	179	chr1	1248010	255	96M	=	1248010	0	ACCTACAAAAATACTAAAACTTATAAACAACGTCACCGTCTCCCCATTAACCAACATATAACAATTAACCCCTAAACCCCGAAAAAAAAAAAAACC	BBBFFFFFFFFFFFFFFIFFFFFFFFFFFFF00BBFF0<B0B0BB7''BBFFBBFF0<0<7BB<B<B<BFFF<7<BBFFF'00777B7<B<B707B	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:34	XM:Z:h.....zxhh..h..xhhh...h.hhh..x.Z.....Z..........hh..zx...h..h..x..h......xh.....Zxhhh..h.h.h.h..	XR:Z:GA	XX:Z:G5GGGG2G2GGGG3G1GGG2G12G5GG2GG3G2G2G2G6GG6GGGG2G1G1G1G2
HWI-D00119:19:H092KADXX:1:1101:20264:11719_1:N:0:	115	chr1	2628067	255	150M	=	2628067	0	CAACACCCACATCGCCAAACAAACATCCGCCAACCTAAAACAACACCCACACCCCCAAATAAACATCCGACAACCTAAAACAAAACCCACACCCCTAGATAAACATCCGACACCGTAAAACAACACCCCACACCCACAAATAAACATCTA	7BB70BFFBBBBBBBBFBBBFBBBB70'07BBBBBBFBBB<BBB<BBBB<7FFFBBFFFBFBBBB7<7<FBFBFBBFFFBBFFBFFFFFB<FFFFBFFFFFFFFF<<7<FB<BFFFIIIIIFFFBIIIIFFFIFFFIFFFFFFFFFFBBB	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:32	XM:Z:..x..........Z...xh.z.h.....Z...x...xh....x..............xh.h.h.....Z...x...xh.h..xh.............Hh.h.......Z.....Z.hh.h..x...............hh.h.h.....x	XR:Z:CT	XX:Z:2G14GG1G1G9G3GG4G14GG1G1G9G3GG1G2GG14G1G11T3GG1G2G13T1GG1G1G5G
HWI-D00119:19:H092KADXX:1:1101:20264:11719_1:N:0:	179	chr1	2628067	255	149M	=	2628067	0	CAACACCCACATCGCCAAACAAACATCCGCCAACCTAAAACAACACCCACACCCCCAAATAAACATCCGACAACCTAAAACAAAACCCACACCCCTAGATAAACATCCGACACCGTAAAACAACACCCCACACCCACAAATAAACATCT	BBBFFFFFFFFFF0BFFIIIIIFBFFII0BFFIIIIIIIIIIIIIIIIFIBFFFFB<BBBBBBBBBBB0007<BBBBBBBB07BBBBBBB7BBBB77'0BB<<<B<<B'077<B'077BBBBBB7BB<707BBBB7<<BBBBBBBBBB<	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:31	XM:Z:..x..........Z...xh.z.h.....Z...x...xh....x..............xh.h.h.....Z...x...xh.h..xh.............Hh.h.......Z.....Z.hh.h..x...............hh.h.h.....	XR:Z:GA	XX:Z:2G14GG1G1G9G3GG4G14GG1G1G9G3GG1G2GG14G1G11T3GG1G2G13T1GG1G1G5
HWI-D00119:19:H092KADXX:1:1101:4148:2705_1:N:0:	115	chr1	3104936	255	143M	=	3104936	0	TTTTTTTTTAAAAAAATAACCCGTATCACCATATCCGGGTAATACCAAACCCAACGCAAACTACGCCACTACCTACCGATACCTATTAACTTTATAATATATTTATCCCGAATATATATATACGTAAAAAAACGCGCTCTCCT	BBBBBBB<BBBBBBB<B<'''00'0'0''0<07'''0'0'<'<BBBB770<77000B<00'0''0<BB<'B70'007<700'0<700B<<<<<BBFF<<0<7BB<77007<BBFBBBFBBB77BFIIIIIFBBBBBFFFFBBB	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:34	XM:Z:............h....hh...Z.........h...Z.Z.hh.h...xh....x.Z..x...x.Z.....x...x..Zx.h...x..h.....h.hh.h.h........Zxh.h.h.h.h...Z.h.hhh...Z.Z.......	XR:Z:CT	XX:Z:12G4GG13G4C2GG1G3GG4G4G3G7G3G3G1G3G2G5G1GG1G1G9GG1G1G1G1G5G1GGG13
HWI-D00119:19:H092KADXX:1:1101:4148:2705_1:N:0:	179	chr1	3104936	255	22M	=	3104936	0	TTTTTTTTTAAAAAAATAACCC	<<<BBBBBB<<BBBBB07B'<B	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:3	XM:Z:............h....hh...	XR:Z:GA	XX:Z:12G4GG3
HWI-D00119:19:H092KADXX:1:1101:12448:3866_1:N:0:	115	chr1	5254199	255	150M	=	5254199	0	CAAAAAATCACCCAATATAAAACTAAAAAACAAAAACCAAAATATCATCCTATCAAATTTCTCTTCACATAAAAATCTAAAAACAAAAAAAATTCCCTAACTTCCTATCTATTCAATAAACAAAACAAACAAAAATTAACTAAATTCCCA	7<BB<FB<77'<'<<000<<B<077B<<B<0<BBB<<0BFB<0<'<<0'<0<'0<BB<7'7<<7''B<<7BBB<<700<<B77FFFB77'70'7B'77FB'07BFBB'<BB0<0FF<BBB0<IIIFF<F7F<BFF<FFFFB<B00B0B<B	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:33	XM:Z:...h..h...........hh....x.hhhh....hh...xhh.........x...xh..............h.........hh.....h.h...............x........x.h.........xh..xh.h..h...x.h......	XR:Z:CT	XX:Z:3G2G11GG4G1GGGG4GG3GGG9G3GG14G9GG5G1G15G8G1G9GG2GG1G2G3G1G6
HWI-D00119:19:H092KADXX:1:1101:12448:3866_1:N:0:	179	chr1	5254199	255	149M	=	5254199	0	CAAAAAATCACCCAATATAAAACTAAAAAACAAAAACCAAAATATCATCCTATCAAATTTCTCTTCACATAAAAATCCCAAAACAAAAAAAATTCCCTCACTTCCTATCTATTCAATAAACAAAACAAACAAAAATTAACTAAATTCCC	BBB<0<FBBFBFFFBBFBFF<0B7BB<FFB77FFBFBFBFII77BBF<<<0BFB''700'0<<7'<B<''0<BB<000'77<<7<07B77'70'000<'000'00'0<<'<'00<<'0<<<<<B'7<B''7<<B'0'0000<<0''<00	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:36	XM:Z:...h..h...........hh....x.hhhh....hh...xhh.........x...xh..............h.........hh.....h.h...............x........x.h.........xh..xh.h..h...x.h.....	XR:Z:GA	XX:Z:3G2G11GG4G1GGGG4GG3GGG9G3GG14G5TA2GG5G1G7A7G8G1G9GG2GG1G2G3G1G5
HWI-D00119:19:H092KADXX:1:1101:17706:11652_1:N:0:	115	chr1	9776942	255	150M	=	9776942	0	AAATCCTAAAAATCCTAATATCCAAAAAATAATTAAAACCGCCCCACCAACCGCTCACCCTACACCCCGTCTCAATACATCTACAACTACCTACACAATAAATTAACCCCTCACCTAACCATAATCCATTCCTCCTCCATCCTCGCCATA	<<0<B7<BBB<'0B0<BB<00<7FFFFB<<<'0BB70'0'0FFFFFFB<7<70BBB7FBB<B<7FB<7<BBBFFB<BFFBBBBBFBB<<FB<FFFFFFFIFFBBB<IIIFBFBIIFIFIIIBIIFIIIFBIIFIIFIFFFFFBFFFFBBB	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:35	XM:Z:hhh....xhh......xh.h....xhh.h.hh..hhhh..Z........x..Z........x......Z.....x.......x..x..x...x....x.hhh..h...........x.....hh....................Z....h	XR:Z:CT	XX:Z:GGG4GGG6GG1G4GGG1G1GG2GGGG11G11G12G7G2G2G3G4G1GGG2G11G5GG25G
HWI-D00119:19:H092KADXX:1:1101:17706:11652_1:N:0:	179	chr1	9776942	255	149M	=	9776942	0	AAATCCTAAAAATCCTAATATCCAAAAAATAATTAAAACCGCCCCACCAACCGCTCACCCTACACCCCGTCTCAATACATCTACAACTACCTACACAATAAATTAACCCCTCACCTAACCATAATCCATTCCTCCTCCATCCTCGCCAT	BBBFFFFFFFFFBFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIFIIIFFFFBFFFBBFBFFFFFBBFFFFBFBBBFFBB<BBBBBFBBBBBBBBBFBBFFFBFFF<BBBF<BBFFBBBBBFFBFFFFFFFFFFFFFFFFFFFBB	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:34	XM:Z:hhh....xhh......xh.h....xhh.h.hh..hhhh..Z........x..Z........x......Z.....x.......x..x..x...x....x.hhh..h...........x.....hh....................Z....	XR:Z:GA	XX:Z:GGG4GGG6GG1G4GGG1G1GG2GGGG11G11G12G7G2G2G3G4G1GGG2G11G5GG25
It appears to happen when one read is fully contained by the other (due to trimming I presume)

Cheers,
Aaron
frozenlyse is offline   Reply With Quote
Old 07-23-2013, 02:44 AM   #266
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

This supposedly happens only for Bowtie 2 alignments? I will take a look at the issue.
fkrueger is offline   Reply With Quote
Old 07-23-2013, 03:00 AM   #267
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

Yes bowtie2 -
"bismark -p 4 --bowtie2 -X 1000 --unmapped --ambiguous --gzip --bam " are the bismark options i am using
frozenlyse is offline   Reply With Quote
Old 07-23-2013, 04:30 AM   #268
cnoirot
Junior Member
 
Location: Toulouse, France

Join Date: Feb 2012
Posts: 3
Default methylation_extractor bug

Hi,
I just want to report a little bug in methylation_extractor with option --CX_context , if the output directory is not the current directory, the CX_report won't be create as the bed file is searched in the current directory and not the output directory.
Cheers
Céline
cnoirot is offline   Reply With Quote
Old 07-23-2013, 04:44 AM   #269
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Quote:
Originally Posted by cnoirot View Post
Hi,
I just want to report a little bug in methylation_extractor with option --CX_context , if the output directory is not the current directory, the CX_report won't be create as the bed file is searched in the current directory and not the output directory.
Cheers
Céline
thanks for your message, I shall look into this soon.
fkrueger is offline   Reply With Quote
Old 07-24-2013, 04:37 AM   #270
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Quote:
Originally Posted by frozenlyse View Post
Hi Felix - never mind about read groups, I've pipelined adding them during postprocessing!

One small bug I've just seen is that the TLEN of some reads in the output bam file is sometimes 0 (this is with v0.7.12 in case this has been fixed in the new 0.8 versions)

eg on a paired test paired end alignment of 250,000 150bp PE reads
Code:
samtools view test_reads.rmdup.bam | awk '{if ($9=="0") print $0}' | wc -l
2518
samtools view test_reads.rmdup.bam | awk '{if ($9=="0") print $0}' | head
HWI-D00119:19:H092KADXX:1:1101:19147:6616_1:N:0:	115	chr1	1248010	255	136M	=	1248010	0	CCCTACAAAAATACTAAAACTTATAAACAAGGTCACCGTCTCGCCATTAACCAACATATAACAATTAACCCCTAAACCCCGAAAAAAAAAAAAACCTCAACCCAAACTACCCGTACTAACCCGAAATAAACCCCAC	'BB<B7BFFBBB07BBB<0000<<BB0000'0'7'70''0'07BBBBBB7B<<B<B<BBB<<FFBBB<BBBB7B77<BB7B<BBBBBFFIIFFFIIFFFFIIIFFBB00FFBB7FFBFFFFFFFFFFFFFFFF<BB	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:45	XM:Z:......zxhh..h..xhhh...h.hhh..x.Z.....Z....Z.....hh..zx...h..h..x..h......xh.....Zxhhh..h.h.h.h.....x....xh..x...Z.h..x....Z.hh.hhh......	XR:Z:CT	XX:Z:G5GGGG2G2GGGG3G1GGG2GC17GG2GG3G2G2G2G6GG6GGGG2G1G1G1G5G4GG2G5G2G6GG1GGG6
HWI-D00119:19:H092KADXX:1:1101:19147:6616_1:N:0:	179	chr1	1248010	255	96M	=	1248010	0	ACCTACAAAAATACTAAAACTTATAAACAACGTCACCGTCTCCCCATTAACCAACATATAACAATTAACCCCTAAACCCCGAAAAAAAAAAAAACC	BBBFFFFFFFFFFFFFFIFFFFFFFFFFFFF00BBFF0<B0B0BB7''BBFFBBFF0<0<7BB<B<B<BFFF<7<BBFFF'00777B7<B<B707B	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:34	XM:Z:h.....zxhh..h..xhhh...h.hhh..x.Z.....Z..........hh..zx...h..h..x..h......xh.....Zxhhh..h.h.h.h..	XR:Z:GA	XX:Z:G5GGGG2G2GGGG3G1GGG2G12G5GG2GG3G2G2G2G6GG6GGGG2G1G1G1G2
HWI-D00119:19:H092KADXX:1:1101:20264:11719_1:N:0:	115	chr1	2628067	255	150M	=	2628067	0	CAACACCCACATCGCCAAACAAACATCCGCCAACCTAAAACAACACCCACACCCCCAAATAAACATCCGACAACCTAAAACAAAACCCACACCCCTAGATAAACATCCGACACCGTAAAACAACACCCCACACCCACAAATAAACATCTA	7BB70BFFBBBBBBBBFBBBFBBBB70'07BBBBBBFBBB<BBB<BBBB<7FFFBBFFFBFBBBB7<7<FBFBFBBFFFBBFFBFFFFFB<FFFFBFFFFFFFFF<<7<FB<BFFFIIIIIFFFBIIIIFFFIFFFIFFFFFFFFFFBBB	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:32	XM:Z:..x..........Z...xh.z.h.....Z...x...xh....x..............xh.h.h.....Z...x...xh.h..xh.............Hh.h.......Z.....Z.hh.h..x...............hh.h.h.....x	XR:Z:CT	XX:Z:2G14GG1G1G9G3GG4G14GG1G1G9G3GG1G2GG14G1G11T3GG1G2G13T1GG1G1G5G
HWI-D00119:19:H092KADXX:1:1101:20264:11719_1:N:0:	179	chr1	2628067	255	149M	=	2628067	0	CAACACCCACATCGCCAAACAAACATCCGCCAACCTAAAACAACACCCACACCCCCAAATAAACATCCGACAACCTAAAACAAAACCCACACCCCTAGATAAACATCCGACACCGTAAAACAACACCCCACACCCACAAATAAACATCT	BBBFFFFFFFFFF0BFFIIIIIFBFFII0BFFIIIIIIIIIIIIIIIIFIBFFFFB<BBBBBBBBBBB0007<BBBBBBBB07BBBBBBB7BBBB77'0BB<<<B<<B'077<B'077BBBBBB7BB<707BBBB7<<BBBBBBBBBB<	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:31	XM:Z:..x..........Z...xh.z.h.....Z...x...xh....x..............xh.h.h.....Z...x...xh.h..xh.............Hh.h.......Z.....Z.hh.h..x...............hh.h.h.....	XR:Z:GA	XX:Z:2G14GG1G1G9G3GG4G14GG1G1G9G3GG1G2GG14G1G11T3GG1G2G13T1GG1G1G5
HWI-D00119:19:H092KADXX:1:1101:4148:2705_1:N:0:	115	chr1	3104936	255	143M	=	3104936	0	TTTTTTTTTAAAAAAATAACCCGTATCACCATATCCGGGTAATACCAAACCCAACGCAAACTACGCCACTACCTACCGATACCTATTAACTTTATAATATATTTATCCCGAATATATATATACGTAAAAAAACGCGCTCTCCT	BBBBBBB<BBBBBBB<B<'''00'0'0''0<07'''0'0'<'<BBBB770<77000B<00'0''0<BB<'B70'007<700'0<700B<<<<<BBFF<<0<7BB<77007<BBFBBBFBBB77BFIIIIIFBBBBBFFFFBBB	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:34	XM:Z:............h....hh...Z.........h...Z.Z.hh.h...xh....x.Z..x...x.Z.....x...x..Zx.h...x..h.....h.hh.h.h........Zxh.h.h.h.h...Z.h.hhh...Z.Z.......	XR:Z:CT	XX:Z:12G4GG13G4C2GG1G3GG4G4G3G7G3G3G1G3G2G5G1GG1G1G9GG1G1G1G1G5G1GGG13
HWI-D00119:19:H092KADXX:1:1101:4148:2705_1:N:0:	179	chr1	3104936	255	22M	=	3104936	0	TTTTTTTTTAAAAAAATAACCC	<<<BBBBBB<<BBBBB07B'<B	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:3	XM:Z:............h....hh...	XR:Z:GA	XX:Z:12G4GG3
HWI-D00119:19:H092KADXX:1:1101:12448:3866_1:N:0:	115	chr1	5254199	255	150M	=	5254199	0	CAAAAAATCACCCAATATAAAACTAAAAAACAAAAACCAAAATATCATCCTATCAAATTTCTCTTCACATAAAAATCTAAAAACAAAAAAAATTCCCTAACTTCCTATCTATTCAATAAACAAAACAAACAAAAATTAACTAAATTCCCA	7<BB<FB<77'<'<<000<<B<077B<<B<0<BBB<<0BFB<0<'<<0'<0<'0<BB<7'7<<7''B<<7BBB<<700<<B77FFFB77'70'7B'77FB'07BFBB'<BB0<0FF<BBB0<IIIFF<F7F<BFF<FFFFB<B00B0B<B	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:33	XM:Z:...h..h...........hh....x.hhhh....hh...xhh.........x...xh..............h.........hh.....h.h...............x........x.h.........xh..xh.h..h...x.h......	XR:Z:CT	XX:Z:3G2G11GG4G1GGGG4GG3GGG9G3GG14G9GG5G1G15G8G1G9GG2GG1G2G3G1G6
HWI-D00119:19:H092KADXX:1:1101:12448:3866_1:N:0:	179	chr1	5254199	255	149M	=	5254199	0	CAAAAAATCACCCAATATAAAACTAAAAAACAAAAACCAAAATATCATCCTATCAAATTTCTCTTCACATAAAAATCCCAAAACAAAAAAAATTCCCTCACTTCCTATCTATTCAATAAACAAAACAAACAAAAATTAACTAAATTCCC	BBB<0<FBBFBFFFBBFBFF<0B7BB<FFB77FFBFBFBFII77BBF<<<0BFB''700'0<<7'<B<''0<BB<000'77<<7<07B77'70'000<'000'00'0<<'<'00<<'0<<<<<B'7<B''7<<B'0'0000<<0''<00	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:36	XM:Z:...h..h...........hh....x.hhhh....hh...xhh.........x...xh..............h.........hh.....h.h...............x........x.h.........xh..xh.h..h...x.h.....	XR:Z:GA	XX:Z:3G2G11GG4G1GGGG4GG3GGG9G3GG14G5TA2GG5G1G7A7G8G1G9GG2GG1G2G3G1G5
HWI-D00119:19:H092KADXX:1:1101:17706:11652_1:N:0:	115	chr1	9776942	255	150M	=	9776942	0	AAATCCTAAAAATCCTAATATCCAAAAAATAATTAAAACCGCCCCACCAACCGCTCACCCTACACCCCGTCTCAATACATCTACAACTACCTACACAATAAATTAACCCCTCACCTAACCATAATCCATTCCTCCTCCATCCTCGCCATA	<<0<B7<BBB<'0B0<BB<00<7FFFFB<<<'0BB70'0'0FFFFFFB<7<70BBB7FBB<B<7FB<7<BBBFFB<BFFBBBBBFBB<<FB<FFFFFFFIFFBBB<IIIFBFBIIFIFIIIBIIFIIIFBIIFIIFIFFFFFBFFFFBBB	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:35	XM:Z:hhh....xhh......xh.h....xhh.h.hh..hhhh..Z........x..Z........x......Z.....x.......x..x..x...x....x.hhh..h...........x.....hh....................Z....h	XR:Z:CT	XX:Z:GGG4GGG6GG1G4GGG1G1GG2GGGG11G11G12G7G2G2G3G4G1GGG2G11G5GG25G
HWI-D00119:19:H092KADXX:1:1101:17706:11652_1:N:0:	179	chr1	9776942	255	149M	=	9776942	0	AAATCCTAAAAATCCTAATATCCAAAAAATAATTAAAACCGCCCCACCAACCGCTCACCCTACACCCCGTCTCAATACATCTACAACTACCTACACAATAAATTAACCCCTCACCTAACCATAATCCATTCCTCCTCCATCCTCGCCAT	BBBFFFFFFFFFBFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIFIIIFFFFBFFFBBFBFFFFFBBFFFFBFBBBFFBB<BBBBBFBBBBBBBBBFBBFFFBFFF<BBBF<BBFFBBBBBFFBFFFFFFFFFFFFFFFFFFFBB	PG:Z:MarkDuplicates	XG:Z:GA	NM:i:34	XM:Z:hhh....xhh......xh.h....xhh.h.hh..hhhh..Z........x..Z........x......Z.....x.......x..x..x...x....x.hhh..h...........x.....hh....................Z....	XR:Z:GA	XX:Z:GGG4GGG6GG1G4GGG1G1GG2GGGG11G11G12G7G2G2G3G4G1GGG2G11G5GG25
It appears to happen when one read is fully contained by the other (due to trimming I presume)

Cheers,
Aaron
Hi Aaron,

The TLEN value was set to 0 whenever a read was completely contained within the other one, like this:

Code:
------------------------->     read 1
<------------------------      read 2
I have now changed this behavior to just use the coordinates of the longer read instead, giving it a + sign for read 1 if read 1 was longer, and - sign if read 2 was longer (this seems somewhat arbitrary to me though, which is probably why had set it to 0 in the first place). If you think this is alright I'll keep it this way and make a new release as soon as I have looked at the other issues.
fkrueger is offline   Reply With Quote
Old 07-24-2013, 04:49 AM   #271
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

Hi Felix - yep that makes total sense to me, awesome!

One other thing I just noticed is an inconsistency in result file names between single and paired end (at least for bowtie2) - paired end have "_bismark_bt2_pe.bam" appended whereas single end have "_bt2_bismark.bam" appended - the bt2 and bismark are swapped. Not a problem, just made writing my wrapper for bismark a bit more complicated looking!

Cheers,
Aaron
frozenlyse is offline   Reply With Quote
Old 07-24-2013, 04:50 AM   #272
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Fine, I'll have this changed as well then!
fkrueger is offline   Reply With Quote
Old 07-24-2013, 07:36 AM   #273
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

We have just released a new version of Bismark (v0.8.2). The changes address several feature requests and bug fixes raised above:

• Bismark: Changed the values of the TLEN values in paired-end SAM format generated by Bowtie 2 whenever one read was completely contained within the other; in such cases both TLEN values will be set to the length of the longer fragment
• Bismark: Changed the output filename for Bowtie 2 files for single-end reads from '...bt2_bismark.sam' to '...bismark_bt2.sam' so that single-end and paired-end file names are more consistent

• Methylation Extractor: Added a new option '--mbias_only'. If this option is specified, the M-bias plot(s) and their data are being written out. The standard methylation report ('--report') is optional. Since this option will not extract any methylation data, neither bedGraph nor cytosine report conversion are not allowed
• Methylation Extractor: If a specific output directory and '--cytosine_report' are specified at the same time, the bedGraph2cytosine module will now use the bedGraph file located in the output directory as intended
• Methylation Extractor: Added an additional check for the module GD::Graph::colour; if it can't be found on the system drawing of the M-bias plot will be skipped

Bismark can be dowloaded here: http://www.bioinformatics.babraham.a...jects/bismark/.
fkrueger is offline   Reply With Quote
Old 07-24-2013, 07:24 PM   #274
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

Quote:
Originally Posted by fkrueger View Post
Hi Pete,

I had a go at implementing a new function '--mbias_only' which you can find attached. If this option is specified, only a report and the M-bias plot(s) and their data are being written out. It will not extract any methylation data, also the bedGraph and cytosine report conversion are not allowed.

The new version of the extractor (v0.8.1a) is attached; if you find it working alright I will include it into a new release once I am back from ISMB.
Thanks, Felix. My (brief) testing suggests there are no problems except that M-bias text file isn't written to the directory given by the "--output" flag, but rather is written to the current working directory. The pngs are written to correct output directory.
PeteH is offline   Reply With Quote
Old 07-24-2013, 07:28 PM   #275
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

It's probably worth noting in the documentation that bismark_methylation_extractor expects that for paired-end data that the reads are in queryname order, i.e. the same ordering scheme that Bismark natively outputs when writing SAM/BAM. I ran bismark_methylation_extractor on a coordinate sorted paired-end BAM file and got some strange results; obviously because bismark_methylation_extractor expects read_2 to immediately follow read_1 for each read-pair in the SAM/BAM, which is not the case for a coordinate sorted SAM/BAM.

Last edited by PeteH; 07-24-2013 at 07:30 PM.
PeteH is offline   Reply With Quote
Old 07-25-2013, 12:31 AM   #276
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Quote:
Originally Posted by PeteH View Post
Thanks, Felix. My (brief) testing suggests there are no problems except that M-bias text file isn't written to the directory given by the "--output" flag, but rather is written to the current working directory. The pngs are written to correct output directory.
Hi Pete,

I have also found this flaw, and the v0.8.2 release does also write the text file to a specified output folder correctly. Thanks for testing.

Quote:
Originally Posted by PeteH View Post
It's probably worth noting in the documentation that bismark_methylation_extractor expects that for paired-end data that the reads are in queryname order, i.e. the same ordering scheme that Bismark natively outputs when writing SAM/BAM. I ran bismark_methylation_extractor on a coordinate sorted paired-end BAM file and got some strange results; obviously because bismark_methylation_extractor expects read_2 to immediately follow read_1 for each read-pair in the SAM/BAM, which is not the case for a coordinate sorted SAM/BAM.
This is probably really worth mentioning, I believe there were a few cases where users reported some weird results. Most notably, position-sorted BAM files will subsequently look as if non-directional libraries had been used, which can lead to a lot of confusion.
fkrueger is offline   Reply With Quote
Old 07-25-2013, 05:59 PM   #277
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

What kind of weird results did you get? I've always used it on position sorted bam files (had no idea it wanted name sorting), and WGBS data I've got correlates really well with HM450 arrays O_o

I guess bismark_methylation_extractor should error out if the bam file is sorted incorrectly in the future?

Last edited by frozenlyse; 07-25-2013 at 06:13 PM.
frozenlyse is offline   Reply With Quote
Old 07-25-2013, 08:00 PM   #278
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

Quote:
Originally Posted by frozenlyse View Post
What kind of weird results did you get? I've always used it on position sorted bam files (had no idea it wanted name sorting), and WGBS data I've got correlates really well with HM450 arrays O_o

I guess bismark_methylation_extractor should error out if the bam file is sorted incorrectly in the future?
I should emphasise, that it is only for paired-end data that the SAM/BAM needs to be sorted by queryname order. If you have single-end data then it won't matter whether the SAM/BAM is sorted by queryname or coordinate so you needn't worry (Felix, please correct me if I'm wrong on this).

The weird results obtained when running bismark_methylation_extractor on a coordinate sorted paired-end SAM/BAM where described by Felix a few posts above me. Basically, you end up getting methylation calls on the CTOT and CTOB strands when the data are from the 2-stranded protocol (and hence there should only be methylation calls on the OT and OB strands) as well as some other weird issues if the --no_overlap flag is active.

I agree that adding a check regarding the sort order of the SAM/BAM would be a good idea. There are a couple of ways to do this:
(1) check the SO field in the SAM/BAM header, however, this assumes the SAM/BAM has been sorted by a program that correctly sets this field (e.g. Picard's SortSam).
(2) A more direct way is to check that the read names are identical for read_1 and read_2 for each read-pair that is processed. I'd recommend that this check is included when the -p flag is active in bismark_methylation_extractor.
PeteH is offline   Reply With Quote
Old 07-26-2013, 07:26 AM   #279
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

We have just released a new version of Bismark (v0.8.3) that implements earlier suggestions raised here on SeqAnswers or via email (many thanks to Pete Hickey for contributing the idea for the FLAG tags, see below). This new version will now die and warn about using positionally sorted paired-end SAM/BAM input files for the methylation extractor, and Bismark is going to use new FLAG values for paired-end SAM/BAM files to allow better visualisation and processing in other software suites.

The changes in detail include:

• Bismark: Changed the FLAG values of paired-end SAM/BAM output files to comply with other downstream applications such as Picard. In addition, reads will no longer have /1 or /2 appended to the read IDs. For the time being, the old FLAG values and read ID tags can still be obtained using the option '--old_flag'. For more information on the change of FLAG tags please see the RELEASE NOTES or type 'bismark --help'

Code:
                         new default                         old_flag
                      ===================              ===================
                      Read 1       Read 2              Read 1       Read 2

             OT:         99          147                  67          131

             OB:         83          163                 115          179

             CTOT:       83          163                 115          179

             CTOB:       99          147                  67          131

• Methylation Extractor: Changed the additional check for the module GD::Graph::colour to an 'eval {require ...}' statement instead of using 'use'. This should now properly skip drawing the M-bias plot if the module is not installed on the system
• Methylation Extractor: Implemented two quick tests for paired-end SAM/BAM files to see if the file had been sorted by chromosomal position prior to using the methylation extractor, because this would cause problems with the strand identity and overlaps since both reads 1 and read 2 are expected to follow each other directly in the Bismark alignment file. The first test attempts to find an @SO (for sorted) tag in the SAM header. If this cannot be found, the first 100000 sequences are checked for whether or not their ID is the same. If the file appears to have been sorted, the methylation extractor will bail and ask for an unsorted file instead

Bismark can be downloaded here: http://www.bioinformatics.babraham.a...jects/bismark/.
fkrueger is offline   Reply With Quote
Old 07-26-2013, 07:52 PM   #280
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

Thanks for incorporate those changes, Felix. One thing, you've got the CTOT and CTOB flags swapped around in that table. They should be:
Code:
                         new default                         old_flag
                      ===================              ===================
                      Read 1       Read 2              Read 1       Read 2

             OT:         99          147                  67          131

             OB:         83          163                 115          179

             CTOB:       83          163                 115          179

             CTOT:       99          147                  67          131
Here's a screenshot that shows the difference between using the new flag values (labelled bismark v0.8.1 patched) and to when using the old flag values (labelled bismark v0.8.1).
.
PeteH is offline   Reply With Quote
Reply

Tags
alignment, bisulfite, bisulphite, methylation, sequencing

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:04 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO