SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
CisGenome -- an integrated tool for ChIP-seq data analysis hji Bioinformatics 66 12-30-2014 01:55 PM
Bismark Bisulfite Aligner - Now supporting CpG, CHG and CHH context fkrueger Bioinformatics 27 10-11-2013 05:40 AM
Bismark v0.6.beta1: Now supporting gapped Bisulfite-Seq alignments fkrueger Bioinformatics 6 03-19-2012 05:06 AM
Apparent duplication levels incongruence between bismark and fastqc with BS-Seq data gcarbajosa Bioinformatics 2 12-13-2011 08:43 AM

Reply
 
Thread Tools
Old 07-01-2013, 01:55 AM   #241
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Hi forzenlyse,

It seems that you were running paired-end alignments with Bowtie2, correct? Bismark does indeed append /1/1 to read 1, one of which is removed during the alignment step. The second one is then removed at a later step, but to keep the output in a way that allows immediate identification of what was R1 and R2, /1 and /2 are appended just before the SAM/BAM output is created. If you don't want these trailing fragment numbers you can just locate the following two lines in the subroutine "paired_end_SAM_output":

Code:
  my $id_1 = $id.'/1';
  my $id_2 = $id.'/2';
and comment them out like so:
Code:
  # my $id_1 = $id.'/1';
  # my $id_2 = $id.'/2';
Let me know if this fixes your problem of if I can anything else for you.
fkrueger is offline   Reply With Quote
Old 07-01-2013, 11:24 PM   #242
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

Hi Felix - thanks for the prompt reply, and sorry I forgot to include my library type/bowtie version. I was just making sure this was the intended behaviour before I updated my pipeline to ignore the /1 /2 tags in remove PCR duplicates - thanks
frozenlyse is offline   Reply With Quote
Old 07-05-2013, 10:44 AM   #243
oria34
Member
 
Location: Finland

Join Date: Feb 2013
Posts: 15
Default

Hi all

I am trying to re-align my reads again using a new assembly of the reference genome and I am finding many problems in the alignment step.

After running genome_preparation I am trying to run bismark for paired end reads trimmed with trim_galore (--phred64 -t --paired filename_2 filename_2). Right after the script creates the two version os the reads (c-t, a-g) it gives back a un ending number of errors. Many of them are related to the impossibility to determine the genomic context (due to Cs located at the end of the contigs) but others are quite new for me.

The resulting SAM file is fairly small and the number of Cs analyzed (from the report file) are ridiculously low; around 50 000.

I am using the last version of Bismark with the --bowtie2 option (also --botwtie2 option for the genome preparation step)

I am running in parallel an alignment with the same .fq files against a masked version of the genome and no errors so far (4 millions reads...)

Any thoughts?

Cheers


PS. some of the lines with multiple errors

Reading in the sequence files BPM24_FIS17_trim_galore_1.fq and BPM24_FIS17_trim_galore_2.fq
Processed 100000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 203898.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 203898.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:2724:10559#ACCAGACT/1 scaffold_1495
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 223364.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 223364.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:7055:11458#ACCAGACT/1 scaffold_402
Processed 200000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 592170.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 592170.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:8410:27041#ACCAGACT/1 scaffold_1917
Processed 300000 sequences so far
Processed 400000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 957754.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 957754.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:14266:27423#ACCAGACT/1 scaffold_1862
Processed 500000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1095048.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1095048.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:14067:33341#ACCAGACT/1 scaffold_297
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1171846.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1171846.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:20587:36545#ACCAGACT/1 scaffold_391
Processed 600000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1323878.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1323878.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:10776:58374#ACCAGACT/1 scaffold_1028
Processed 700000 sequences so far
Processed 800000 sequences so far
Processed 900000 sequences so far
Processed 1000000 sequences so far
Processed 1100000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 2259346.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 2259346.

Last edited by oria34; 07-05-2013 at 10:53 AM.
oria34 is offline   Reply With Quote
Old 07-05-2013, 11:43 AM   #244
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Hi oria,

Interestingly I encountered the same error messages yesterday myself; they are caused by trying to print out the starting position of the read which aligns so close to the edge of a chromosome or contig/scaffold that additional 2bp sequence can't be extracted. These warnings are harmless though and can be ignored safely; I have still fixed this bug yesterday (if you want the new version just send me an email).

From the output below it seems that you are getting roughly 1 of these warning per 100000 analysed sequences, so this can hardly be the reason for poor overall mapping results. Are you by any chance running the other alignment to the repeat masked genome from the same files in the same folder? This would certainly have the potential to interfere with the results... In any case, if you continue to have problems can you email me further details of your problems? If necessary I could also create an FTP server for you to upload your files in question so I could take a look myself.
fkrueger is offline   Reply With Quote
Old 07-05-2013, 12:05 PM   #245
oria34
Member
 
Location: Finland

Join Date: Feb 2013
Posts: 15
Default

Thanks for your quick reply!

I am not really worried for the errors, I remember reading in this thread that they are due to the Cs located at the very end and, as you said, they are not relevant...Is the outcome what is killing me!

I have created two different folders for the two different alignments (unmasked and masked), each with copies of bismark´s scripts. Each folder has two subfolders, one per individual (two .fq files each)...I have even duplicated the bowtie2 folder just in case....

It is true that the two genomes are in different folders but each genome is used at the same time for the alignment of the different individuals (2x)...but this doesn´t seem to affect to the masked version of the genome.

I´ll post tomorrow how it is going.

Many thanks!
oria34 is offline   Reply With Quote
Old 07-05-2013, 12:07 PM   #246
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Alright, thanks for keeping me posted.
fkrueger is offline   Reply With Quote
Old 07-06-2013, 12:23 AM   #247
oria34
Member
 
Location: Finland

Join Date: Feb 2013
Posts: 15
Default

Hi again,

OK, I have some more news. I made a mistake checking online my runs ....the one that is running error-free is the alignment against the standard genome.

All the errors I´ve reported have been produced in the alignment run against the masked genome. These runs are now over (see report below) and they didn´t work at all.

To my understanding, Bowtie doesn´t have any type of restrictions in the use of masked genomes, does it? Do you think I could miss something during the genome preparation step?

Cheers


Bismark report for: M24_FIS17_trim_galore_1.fq and M24_FIS17_trim_galore_2.fq (version: v0.7.12)
Bowtie was run against the bisulfite genome of /Bisulfite/BISMARK_REALIGNMENT/MASKED_Version_Genome/ with the specified options: -q --phred64 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500

Option '--directional' specified: alignments to complementary strands will be ignored (i.e. not performed)
Final Alignment report
======================
Sequence pairs analysed in total: 62787165
Number of paired-end alignments with a unique best hit: 1587
Mapping efficiency: 0.0%
Sequence pairs with no alignments under any condition: 62785501
Sequence pairs did not map uniquely: 77
Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 557 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 1030 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 50806

Total methylated C's in CpG context: 1118
Total methylated C's in CHG context: 99
Total methylated C's in CHH context: 172

Total C to T conversions in CpG context: 2166
Total C to T conversions in CHG context: 16576
Total C to T conversions in CHH context: 30675

C methylated in CpG context: 34.0%
C methylated in CHG context: 0.6%
C methylated in CHH context: 0.6%

Last edited by oria34; 07-06-2013 at 12:39 AM.
oria34 is offline   Reply With Quote
Old 07-06-2013, 02:42 AM   #248
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Good to hear that the alignments against the standard genome went well.

The errors from the masked genome are quite likely to be a result of the masking process itself, do you have an idea about how much sequence has been masked into Ns (presumably)?

As far as I am aware Bowtie2 supports indexing of Ns in there reference sequence (whereas Bowtie 1 does not), so in theory it should work in the same way. N's in the reads or reference do receive a mapping penalty as well though (albeit not as much as true mismatches), so if you simply have masked too much of your genome this might just exceed the fairly stringent penalty allowance of Bismark's default setting which is governed by the "--score_min" parameter. The longer your reads and the paired-end nature of your library could make it possible that effectively all reads would fall into a 'masked' region.

To see if it is the degree of masking you could just run a few sample reads (e.g. with the parameter --qupto 500000 for the firt 500K reads), run only Read 1 in single end mode and use --score_min L,0,-0.6. This should finish in a couple of minutes and tell you quickly if the mapping efficiency increases notably.

Just out of interest, why would you want to align reads to both a masked and unmasked genome in parallel (especially since the normal genome alignments were fine)?
fkrueger is offline   Reply With Quote
Old 07-06-2013, 04:23 AM   #249
oria34
Member
 
Location: Finland

Join Date: Feb 2013
Posts: 15
Default

Well, apparently errors started to appear also in the alignment run against the standard genome, but a bit later I left the office (see below). Not really concerning if I get a good enough mapping efficiency.... Still running

Well this is embarrassing…… No idea how "they" masked the genome for the interspersed repeats. I can't even ask till next Monday...Sorry for this also, but I have no idea how to extract the first 500K reads. I am quite newbie in all this

The reason for aligning to both versions of the genome is just to compare results...well, I am just curious about how the interspersed repeats can determine certain patterns of methylation in certain parts of a couple of particular chromosomes. I just thought it was worth the shot.



Processed 47700000 sequences so far
Processed 47800000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 95642710.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 95642710.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:2821:99210#TCTTATAT/1 scaffold_1878
Processed 47900000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 95989234.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 95989234.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:12271:44482#TCTTATAT/1 scaffold_232
Processed 48000000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96116284.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96116284.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2315:3124:23585#TCTTATAT/1 scaffold_297
Processed 48100000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96230734.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96230734.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:15481:55756#TCTTATAT/1 scaffold_1105
Processed 48200000 sequences so far
Processed 48300000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96765254.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96765254.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2316:2905:27999#TCTTATAT/1 scaffold_1711
Processed 48400000 sequences so far
Processed 48500000 sequences so far
Processed 48600000 sequences so far
Processed 48700000 sequences so far
Processed 48800000 sequences so far
Processed 48900000 sequences so far


FYI - Updated info: The alignment to standard genome is now finished. It took around 17h and the mapping efficiency was fairly high....thanks again for this worderful tool! I copy/paste the report:

Final Alignment report
======================
Sequence pairs analysed in total: 48900263
Number of paired-end alignments with a unique best hit: 31991336
Mapping efficiency: 65.4%

Sequence pairs with no alignments under any condition: 13461118
Sequence pairs did not map uniquely: 3447809
Sequence pairs which were discarded because genomic sequence could not be extracted: 215

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 15981506 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 16009615 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 1298097901

Total methylated C's in CpG context: 122944840
Total methylated C's in CHG context: 870088
Total methylated C's in CHH context: 2311989

Total C to T conversions in CpG context: 71226620
Total C to T conversions in CHG context: 301248948
Total C to T conversions in CHH context: 799495416

C methylated in CpG context: 63.3%
C methylated in CHG context: 0.3%
C methylated in CHH context: 0.3%

Last edited by oria34; 07-06-2013 at 05:19 AM.
oria34 is offline   Reply With Quote
Old 07-06-2013, 12:01 PM   #250
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Good news. So apparently the warning messages you saw amounted to a total of 215 (out of 49 million read pairs), indeed something you should be able to live with.

Bismark has an option '--qupto' to just use the first so many reads; just type bismark --help to see further explanations on it

So you could run single end alignments against your masked genome for the first 500000 reads with not-so-stringent criteria using the following command:

bismark --bowtie2 --qupto 500000 --score_min L,0,-0.6 /path/to/genome/ file_val_1.fq
fkrueger is offline   Reply With Quote
Old 07-12-2013, 02:04 PM   #251
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default M-Bias plot and alignment stats pie chart

We have just released a new version of Bismark (v0.8.0). The main purpose of this release is the introduction of an M-bias plot for the methylation extractor (thanks to K.D. Hansen for the suggestion, see reference below). The M-bias plot allows researchers to see whether there are fundamental technical biases in the methylation calling of their reads.

Attached are two examples:

a) the profile of a PBAT-Seq library (read1) which should in theory start with a 'random' 4-mer oligo that was used to bind to bisulfite converted reads. The M-bias (and also base composition profiles of these libraries as judged by FastQC (not shown here)) do show some serious biases primarily within, but not limited to, the first 4bp of the library.

b) Read 2 of any paired-end BS-Seq library. Due to the end-repair reaction and A-tailing procedure for Illumina libaries, the 3' ends of sonicated fragments are filled in by the Klenow enzyme with unmethylated cytosines. These will then be converted to Ts in the bisulfite treatment, resulting in a very high amount of apparently unmethylated cytosines at the first couple of positions of Read 2. (The problem is somewhat reminiscent of the fill-in problem in RRBS libraries).

In addition to just the methylation profile, the M-bias plot also shows the actual number of events for an entire read file. This should enable investigators to make an informed decision of whether to remove biased positions in the reads, e.g. by using the option --ignore <int> of the methylation extractor or by hard-trimming read 2 by 2bp prior to running the alignments, or whether it is OK to live with what they observe.

Other changes in the latest version include:
Bismark: Added new option '--prefix' to add a prefix to the output filenames. For example, '--prefix test' with 'file.fq' would result in the output file 'test.file.fq_bismark.sam' etc.
Bismark: Fixed a warning message that occurred when chromosomal sequences could not be extracted in paired-end Bowtie2 mode
Bismark: will now generate a pie chart with the alignment statistics once a run has finished; this allows to get a quick overview of how many sequences aligned uniquely or sequences that did not align, either due to producing no alignment at all, multiple mapping or because it was impossible to extract the chromosomal sequence

Methylation Extractor: upon completion, the methylation extractor will now produce an M-bias (methylation bias) plot, which shows the methylation proportion across each possible position in the reads (described in: Hansen et al., Genome Biology, 2012, 13:R83). The data for the M-bias plot will be written into a text file (to generate graphs by alternative means) and drawn into a .png file. The plot also contains the absolute number of methylation calls per position (methylated + unmethylated)

Bismark is avaialable for download here: http://www.bioinformatics.babraham.a...jects/bismark/
Attached Images
File Type: png PBAT M-bias_R1.png (9.1 KB, 16 views)
File Type: png M-bias_R2.png (8.2 KB, 11 views)
fkrueger is offline   Reply With Quote
Old 07-16-2013, 12:12 AM   #252
oria34
Member
 
Location: Finland

Join Date: Feb 2013
Posts: 15
Default

Hi Felix,

Sorry for the delay in replying. For some reason I couldn't use the --qupto option, Bismark gives back an error message:

Unknown option: qupto
Please respecify command line options

Any thoughts?

Thanks for the new release of Bismark! This new M-bias tool is really helpful!

Also, I am looking for some advice regarding the post processing of all the methylation calls. In order to have a statistically based tool to decide whether a C is actually a methylated C, I've decide to perform the well-known Binomial probability test.

When I started in my present position, all the sequencing experiments were already performed and the data ready to be analyzed so, to my knowledge, no lambda control sequence exists. The question is, Is there any straightforward way to use the methylation extractor output as a input file for any other soft performing this Binomial probability test?

I would be very grateful in receiving some advice on this

Thanks a lot
oria34 is offline   Reply With Quote
Old 07-16-2013, 12:51 AM   #253
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Hi oria,

Sorry the upto command is actually:

Code:
-u/--upto <int>          Only aligns the first <int> reads or read pairs from the input. Default: no limit.
(just do bismark --help if in doubt).

We personally use SeqMonk for most downstream methylation analyses, so I am probably not the best person to advise you on other downstream applications. There are couple of packages though that devoted to such analyses, e.g. methylKit by A. Akalin or the Bioconductor package bsseq by K.D. Hansen.
fkrueger is offline   Reply With Quote
Old 07-16-2013, 12:32 PM   #254
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

We just released a new version of Bismark (v0.8.1) which addresses a few minor issues such as plot colours and legends, but notably also extends the '--ignore' option of the methylation extractor to now work independently for Read 1 and Read 2 of paired-end files. This allows more flexible control of removing biases that can now be visualised via the M-bias plot. Here are the current changes in more detail:

• Methylation Extractor: Changed the function of the option '--ignore <int>' to ignore the first <int> bp from the 5' end of single-end reads or Read 1 of paired-end files. In addition, added a new option '--ignore_r2 <int>' to ignore the first <int> bp from the 5' end of Read 2 of paired-end files. Since the first couple bases in Read 2 of BS-Seq experiments show a severe bias towards non-methylation as a result of the end-repair of sonicated fragments with unmethylated cytosines (see M-bias plot), it is recommended that the the first couple of bp of Read 2 are removed before starting downstream analysis. Please see the section on M-bias plots in the Bismark User Guide for more details
• Methylation Extractor: Changed colours, legends and background colour of the M-bias plot
• Bismark: Changed the way in which the alignment overview file is being named to actually work

Together with the options '--clip_r1' and '--clip_r2' of Trim Galore one can now choose to remove known biases, e.g. for PBAT-Seq, prior to running the alignments using Trim Galore, or ignoring biased methylation calls after the alignments have been carried out using the methylation extractor.

Bismark can be downloaded here: http://www.bioinformatics.babraham.a...jects/bismark/
fkrueger is offline   Reply With Quote
Old 07-16-2013, 03:01 PM   #255
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

Hi Felix,

Any chance read groups could be incorporated into the output bams in a future version of bismark?

Cheers!
frozenlyse is offline   Reply With Quote
Old 07-17-2013, 05:30 AM   #256
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Quote:
Originally Posted by frozenlyse View Post
Hi Felix,

Any chance read groups could be incorporated into the output bams in a future version of bismark?

Cheers!
Hi frozenlyse,

Sorry but I am not too familiar with the concept of read groups in the output BAMs, would that be more or less just adding an @RG tag and ID into the header section? Or would you want to specify more details yourself?
fkrueger is offline   Reply With Quote
Old 07-17-2013, 05:50 AM   #257
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

In addition to the @RG tag and such in the header, there's then an "RG:Z:Something" auxiliary field in each read.

@frozenlyse, Bison might work for you (it's more of a niche application), if you have a bit more computational resources to use and are happy with using bowtie2 as the underlying aligner. I haven't tried using read groups, but they should work exactly as if you were using bowtie2 directly.
dpryan is offline   Reply With Quote
Old 07-17-2013, 04:39 PM   #258
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

Quote:
Originally Posted by fkrueger View Post
We just released a new version of Bismark (v0.8.1) which addresses a few minor issues such as plot colours and legends, but notably also extends the '--ignore' option of the methylation extractor to now work independently for Read 1 and Read 2 of paired-end files. This allows more flexible control of removing biases that can now be visualised via the M-bias plot. Here are the current changes in more detail:

• Methylation Extractor: Changed the function of the option '--ignore <int>' to ignore the first <int> bp from the 5' end of single-end reads or Read 1 of paired-end files. In addition, added a new option '--ignore_r2 <int>' to ignore the first <int> bp from the 5' end of Read 2 of paired-end files. Since the first couple bases in Read 2 of BS-Seq experiments show a severe bias towards non-methylation as a result of the end-repair of sonicated fragments with unmethylated cytosines (see M-bias plot), it is recommended that the the first couple of bp of Read 2 are removed before starting downstream analysis. Please see the section on M-bias plots in the Bismark User Guide for more details
• Methylation Extractor: Changed colours, legends and background colour of the M-bias plot
• Bismark: Changed the way in which the alignment overview file is being named to actually work

Together with the options '--clip_r1' and '--clip_r2' of Trim Galore one can now choose to remove known biases, e.g. for PBAT-Seq, prior to running the alignments using Trim Galore, or ignoring biased methylation calls after the alignments have been carried out using the methylation extractor.

Bismark can be downloaded here: http://www.bioinformatics.babraham.a...jects/bismark/
I'm very happy to see the M-bias feature added to Bismark! Especially because it is better than my own basic implementation of this idea and thus saves me the trouble of refining mine

I've had a quick look at the code, but not yet had a chance to run it, and have a couple of questions:
(1) Since the M-bias subroutines are part of the bismark_methylation_extractor, does this mean it is compatible with Bismark files generated from earlier versions of Bismark (version < 0.8)?
(2) Is a different M-bias plot computed for each different read length? For instance, the BAM contains reads from multiple sequencing runs, 75bp PE and 100bp PE, of the same sample.
(3) How do you deal with 5' trimming of reads? I ran into this challenge with my own script since the first position of a read may not in fact correspond to the first cycle of the sequencing run. This can mean that if you 'stack' all reads according to their first position you might be comparing different sequencing cycles across reads for a given read position. If trimming was applied prior to alignment then it seems impossible to tell which sequencing cycle the first base of the read actually corresponds to. However, if (soft) clipping of the 5' end of the read has been performed by the aligner then it should be possible to retain the full correspondence between read position and sequencing cycle by using the CIGAR (I'm not actually sure whether Bismark allows Bowtie1/2 to perform soft-clipping of reads, so this may be a moot point).
PeteH is offline   Reply With Quote
Old 07-17-2013, 11:13 PM   #259
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Hi Pete,

To 1) Yes it should be downward compatible, at least for SAM files (the old vanilla format does a version check at the start which might fail, but I doubt anyone is using the old format still)

To 2) As it stands, only 1 M-bias plot is generated per read. Since reads are routinely trimmed before alignment it would probably mean that you get >150 M-bias plots for a paired-end library, and I doubt that anyone would want to look through all of them. Technically it should be possible though.

To 3) I don't see this as a problem because Bismark doesn't allow soft-clipping to take place (Bowtie 1 doesn't do it, and for Bowtie 2 it is not enabled). Trim Galore for example also doesn't do adaptive trimming from the 5' end, and the new options '--clip_r1' etc would trim reads uniformly. So in any case, position 1 in the read should always be the very first position.
fkrueger is offline   Reply With Quote
Old 07-18-2013, 01:38 AM   #260
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

Quote:
Originally Posted by fkrueger View Post
Hi Pete,

To 1) Yes it should be downward compatible, at least for SAM files (the old vanilla format does a version check at the start which might fail, but I doubt anyone is using the old format still)

To 2) As it stands, only 1 M-bias plot is generated per read. Since reads are routinely trimmed before alignment it would probably mean that you get >150 M-bias plots for a paired-end library, and I doubt that anyone would want to look through all of them. Technically it should be possible though.

To 3) I don't see this as a problem because Bismark doesn't allow soft-clipping to take place (Bowtie 1 doesn't do it, and for Bowtie 2 it is not enabled). Trim Galore for example also doesn't do adaptive trimming from the 5' end, and the new options '--clip_r1' etc would trim reads uniformly. So in any case, position 1 in the read should always be the very first position.
1) Great, I can retire my (slow) Python script!
2) True, however Kasper Hansen convinced me of the benefit of this level of stratification in M-bias plots (see e.g. Figure 2b of Hansen, K. D., Langmead, B. & Irizarry, R. A. Genome Biol 13, R83 (2012)). Having given this a bit more thought, for this particular example I think that read length is a proxy for library/sequencing run. I suppose that dealing with this extra complication in the M-bias plots isn't really the role of Bismark; this example is perhaps more a warning to the user to be very careful when combining data from multiple libraries/sequencing runs and in how you perform data QC and methylation calling in such datasets since different libraries/sequencing runs can and will have very different biases and error profiles.
3) Thanks for clarifying the nature of 5' trimming in Trim Galore and Bismark. That makes my concerns irrelevant
PeteH is offline   Reply With Quote
Reply

Tags
alignment, bisulfite, bisulphite, methylation, sequencing

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:33 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO