SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
CisGenome -- an integrated tool for ChIP-seq data analysis hji Bioinformatics 66 12-30-2014 02:55 PM
Bismark Bisulfite Aligner - Now supporting CpG, CHG and CHH context fkrueger Bioinformatics 27 10-11-2013 06:40 AM
Bismark v0.6.beta1: Now supporting gapped Bisulfite-Seq alignments fkrueger Bioinformatics 6 03-19-2012 06:06 AM
Apparent duplication levels incongruence between bismark and fastqc with BS-Seq data gcarbajosa Bioinformatics 2 12-13-2011 09:43 AM

Reply
 
Thread Tools
Old 02-08-2011, 11:50 AM   #61
sgrimm
Junior Member
 
Location: North Carolina

Join Date: Sep 2010
Posts: 6
Question strand bias?

I have 3 lanes of of PE bisulf-seq data that I've aligned via bismark. I've merged the data and converted it to a UCSC track to visualize coverage. We've noticed what appears to be regions of significant strand bias randomly throughout the genome, with 1kb+ regions that have (for example) 5x coverage of reads mapped to the + strand and 40x coverage of reads mapped to the - strand.
Has anyone else noticed behavior like this? Any ideas on the source?
Thanks!
sgrimm is offline   Reply With Quote
Old 02-08-2011, 12:23 PM   #62
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

Hi Sara,

I can offer you a possible solution for this behavior, but I am not quite sure if it is really true for all cases you are describing. Bismark is as you know performing 4 alignments in parallel, and then deciding whether there is a uniquely best alignment from one of the four alignment instances.

There is however also the possibility that a sequence (or sequence pair) produces the same uniquely best alignment to the same position in the genome in two different alignment threads:

- Either a sequence (or pair) is completely devoid of any Cs or Gs. I once ran a test to find out how frequent this is, and it turned out to be not very frequent at all.

- Alternatively, if the sequence at a position you are looking at is fully methylated, i.e. all Cs are protected, you will also find 2 alignments to the same location in different threads. If a sequence containing only lets say 1 or 2 methylated CpGs (and no other Cs) is a 100% methylated sequence, it will thus produce 2 alignments.

In these cases both alignments will yield the same methylation information, however it is impossible to determine exactly whether the sequence originated from the top strand or from the bottom strand (as they look identical). Thus, if Bismark encounters two alignments to the same position, it will score the alignment which is reported first (which is normally on the OT or OB strand). The second case, namely a sequence which is fully methylated, seems to be the major factor which can produce such a "bias" in strand alignments. Please note that the underlying methylation information coming from two different read/genome conversions would be exactly the same whichever alignment would have been analysed, it thus only looks like there was a bias in the strands the reads came from.

Could you try to verify this hypothesis by looking at such a region and send me an email about the outcome?

Best wishes
Felix
fkrueger is offline   Reply With Quote
Old 02-10-2011, 07:54 AM   #63
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

We have just made two amendments to Bismark which were prompted by user requests.

The bismark genome preparation will now write the two bisulfite converted genomes out into multi-FastA files by default instead of generating a single file per chromosome (this is still possible though using --single_fasta). Concatenating single-entry FastA files for the bowtie-build indexing process could previously exceed the OS size limit a list can have (for newly assembled genomes with up to 50000 thousand scaffold and contig files...) and thus crash the indexing process.

The other one affects the way paired-end alignments are handled internally. In particular, this affected the reporting of two special kind of sequences when --directional was specified:

There are 2 special cases under which a sequence will align to the exact same location in two different alignment conditions:

a) If a read pair consists of purely AT reads
b) If a read pair has a methylation percentage of 100%, i.e. none of the Cs (or Gs) is converted.

These reads are now correctly preferentially assigned to the original top strand (OT) or the original bottom strand (OB) and not to their complementary strands. Using --directional will thus not remove any reads described in a) or b) and hence not introduce bias anymore (previously, CTOB strand alignments were preferred over OB strand alignments which would have caused a bias against fully methylated reads originating from the bottom strand).

The latest version (Bismark v0.4.1) can be downloaded at www.bioinformatics.bbsrc.ac.uk/projects/bismark/ (If you can't see the latest files the BBSRC cache might need a force update (Ctrl+Refresh)).

Please let me know if you encounter any problems with the latest version.
fkrueger is offline   Reply With Quote
Old 03-15-2011, 09:11 PM   #64
zeam
Member
 
Location: USA

Join Date: Oct 2010
Posts: 37
Default Somebody might met a problem when use the scripts.

Quote:
Originally Posted by olivertam View Post
To follow up on the BEDGraph/BigWig idea, we have developed a workaround for this. Again, this is based on single-end analysis, so I haven't tested paired-ends

1) Use methylation-extractor on the Bismark output, with '--comprehensive' option ('--merge_non_CpG' is optional)
2) Run the following script (genome_methylation_bismark2bedGraph.pl). This script sorts the methylation extractor output, then parses the results to generate an "overall methylation level" as a BEDGraph file, with one sampled cytosine site per line.
3) Use the bedGraphToBigWig program (available online, I believe) to convert the BEDGraph to BigWig.

Here's the usage for the genome_methylation_bismark2bedGraph.pl

Usage: genome_methylation_count.pl (--cutoff [threshold] ) [Bismark methylation caller output] > [output]

--cutoff [threshold] - The minimum number of times a methylation state was
seen for that nucleotide before its methylation
percentage is reported.
Default is no threshold

The output file is a tab-delimited BedGraph file with the following information:

<Chromosome> <Start Position> <End Position> <Methylation Percentage>

Bismark methylation caller (v0.2.0 or later) should produce three output files
(CpG, CHG and CHH) when using the "--comprehensive" option
(Two files if using the "--merge_non_CpG" option).
To count both CpG and Non-CpG, combine the output files.

Bismark methylation caller (v0.1.5 or earlier) should produce two output files
(CpG and Non-CpG) when using the "--comprehensive" option.
To count both CpG and Non-CpG, combine the two output files.


Let me know if you have any questions, issues or bugs (since this is a workaround)

Cheers,
Oliver
First,thanks for your perl scripts!
When I use this to processing my own data,it works pefect;but when I use it to proccessing the data downloaded from NCBI SRA,It doesn't work.

My data was downloaded from the NCBI SRA,and after bismark methylation extractor,the output was as below:
######################################################
Bismark methylation extractor version v0.4.1
SRRXXXXXX.4.1 HWUSI-EAS585:2:1:2:1917.1 length=90 - chr05 6490339 x
SRRXXXXXX.4.1 HWUSI-EAS585:2:1:2:1917.1 length=90 - chr05 6490291 x
######################################################
The colums of chromosomes and coordinates are 5 and 6,respectively.So in your scripts,"sort -k3,3 -k4,4n" actually should be "sort -k5,5 -k6,6n",so I think you shoud call users' attention to this.Or you can do something to fix this.
zeam is offline   Reply With Quote
Old 03-15-2011, 09:24 PM   #65
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Hi zeam,

Thanks for finding the issue. I have never dealt with SRA data, so I was unaware of their "altered" format. Did you run the methylation extractor on the SRA data itself, or was it stored as methylation extractor output?
If you could give me the dataset from SRA that gave you the issue, as well as the cmd line parameters that you used to analyze the data, I'll be happy to try and fix the issue.

Again, thanks for bringing this bug up.

Cheers,
Oliver
olivertam is offline   Reply With Quote
Old 03-15-2011, 09:30 PM   #66
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Hi zeam,

It seems like the SRA has a really bad naming system, where they add spaces in the name. This throws out all the column separation completely.
If you look at the data, you'll probably find that the sequence name is actually "SRRXXXXXX.4.1 HWUSI-EAS585:2:1:2:1917.1 length=90". Unfortunately, the name (which should be in one column) is now treated as three columns thanks to the spaces (hence chr column is column 5 instead of column 3). I'm not sure what would be the best way to fix it, other than to make methylation extractor to remove all spaces from the name (or replace them with underscores).
Really not sure what the best approach is.

Cheers,
Oliver
olivertam is offline   Reply With Quote
Old 03-16-2011, 12:29 AM   #67
zeam
Member
 
Location: USA

Join Date: Oct 2010
Posts: 37
Default

Quote:
Originally Posted by olivertam View Post
Hi zeam,

Thanks for finding the issue. I have never dealt with SRA data, so I was unaware of their "altered" format. Did you run the methylation extractor on the SRA data itself, or was it stored as methylation extractor output?
If you could give me the dataset from SRA that gave you the issue, as well as the cmd line parameters that you used to analyze the data, I'll be happy to try and fix the issue.

Again, thanks for bringing this bug up.

Cheers,
Oliver
Hi,Oliver,
You are right about the SRA name,thanks for your replying,and I just think you should let the users know how to modify your script when dealing with SRA data.

Best wishes,
Zeam
zeam is offline   Reply With Quote
Old 03-16-2011, 02:38 AM   #68
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

Hi Zeam and Oliver,

I took the freedom to modify the genome_methylation_bismark2bedGraph script so that you can now specify an option --remove_spaces which will replace all whitespaces with underscores before running the actual bedGraph conversion. Let me know if you spot any bug.

I will attach the file here, and it is also available for download at www.bioinformatics.bbsrc.ac.uk/projects/bismark/

Best wishes,
Felix
Attached Files
File Type: pl genome_methylation_bismark2bedGraph.pl (5.4 KB, 5 views)
fkrueger is offline   Reply With Quote
Old 03-16-2011, 08:29 AM   #69
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Hi Felix,

Thanks for doing this. I guess I shouldn't be so lazy next time
The code looks good to me, but I'm sure zeam will be able to test it with his SRA data.

Again, thanks heaps for your help.

Cheers,
Oliver
olivertam is offline   Reply With Quote
Old 03-16-2011, 12:03 PM   #70
cwhelan
Member
 
Location: Cambridge, MA

Join Date: Nov 2010
Posts: 23
Default bismark_to_SAM script for paired end data

Hi Oliver, Felix, and others,

I think there may be a small problem in the bismark_to_SAM_v2.pl script when converting paired end bismark output to SAM format. Specifically, I don't think it's calculating the insert size field in the SAM file correctly. For example, for the Bismark line:

GA2_0015_FC:1:1:2010:1104#0 + 1 208840227 208840334 GGTAGAAGAAGTTTTGATGAGGTGGAGAAGTTATTTATTTGTAGGTTTAT GGCAGAAGAAGCCCTGATGAGGTGGAGAAGTCATCCATTTGCAGGCCTATGA ..x........hhx.................h..hh.....x...hh... NNTTCTCTAAATTAAAACTATTAAAACAAACCTTCTTAAAATATTTAAAA TACTTTCTCTGAGTTAGAACTATTAAAACAGACCTTCTTAGGATATTTAAAA ........x.h...h.............x.........hh.......... CT CT

It's generating the SAM output:

GA2_0015_FC:1:1:2010:1104#0 99 1 208840227 255 50M 1 208840284 57 GGTAGAAGAAGTTTTGATGAGGTGGAGAAGTTATTTATTTGTAGGTTTAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:10 MD:Z:2T8TTT17T2TT5T3TT3

The insert size of 57 in that record doesn't seem to be taking into account the length of one of the reads - in this case 50bp, meaning that the insert size should really be 107bp, I believe.

You can see a visual example of what I'm talking about in the attached SeqMonk snapshot. In the first data set I imported the SAM output as single end data. In the second, I imported it as paired end data, and SeqMonk is displaying the location of the insert. (This was done with the latest version of SeqMonk which fixed a bug with where the inserts for paired end data were displayed.) The last data set is a paired-end import of a SAM file generated by a version of the script that I modified to try to get the right insert size - it's actually a simplification of the current calculations. I've attached my modified version, but I'd appreciate it if someone else could take a look at it and verify that my changes are correct - I find SAM format pretty confusing sometimes so please correct me if I've done something wrong!

Thanks,

Chris
Attached Images
File Type: png issue_with_bismark_to_sam_pe.png (9.4 KB, 28 views)
Attached Files
File Type: pl bismark_to_SAM_v3.pl (12.4 KB, 8 views)
cwhelan is offline   Reply With Quote
Old 03-16-2011, 12:42 PM   #71
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Hi cwhelan,

Thank you for picking up this error. I must admit that the SAM format for paired ends was very confusing when they first introduced it. I obtained a much better manual explaining the SAM format, and I completely agree with your changes. I do apologize for the issue, but when they previously used the term "insert size" (though it has been renamed to template length in the newest documentation), I thought it meant the size of the fragment inserted between the two sequenced ends.
Again, I apologize for the mistake, and thank you heaps for fixing it. I would totally recommend Felix to put up your version and give you credit for the change.

Cheers,
Oliver
olivertam is offline   Reply With Quote
Old 03-16-2011, 07:12 PM   #72
zeam
Member
 
Location: USA

Join Date: Oct 2010
Posts: 37
Default Queries on Bismark

Hi,felix,

Recently,I have been using the Bismark,Thanks for your wonderful program.

There is something confused me:
1)In the user'guide,I only saw option to handle mismatch of the seed,but what about the left?
2)Whar criteria has been used to define a unique map read in Bismark?

Best wishes,
Zeam
zeam is offline   Reply With Quote
Old 03-17-2011, 01:03 AM   #73
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

Hi Zeam,

Thanks for your kind words.

To answer your question about seed length:
Bismark uses Bowtie and thus read alignments and seed length are handled exactly as in Bowtie, i.e. reads can potentially have more mismatches after the seed, however all mismatches together must not exceed the mismatch ceiling (which can be set with the -e parameter). Even though it is possible we wouldn't recommend to tolerate a large amount fo mismatches as this can have dramatic consequences on the inferred methylation levels. What we tend to do is set the seed length to the entire read length (with -l) and allow only 1 or 2 mismatches with the -n parameter, as this minimises the amount of mismappings and thus false mthylation calls.

Bismark considers alignments unique if a read has an alignment to the genome which has a lowest amount of mismatches. So if a read aligns to one position with 0 mimatches, and with 2 mismatches to another position, the perfect match will be picked, whereas a read with 2 different matches to the genome with 0 mismatches each will be removed as an ambiguous match. For paired-end alignments the lowest sum of mismatches of both reads determines whether a read pair can be placed uniquely.


Hope this helps,
Best wishes,
Felix
fkrueger is offline   Reply With Quote
Old 03-17-2011, 02:16 AM   #74
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

Dear Chris,

Thanks very much for providing the new bismark2SAM_v3 script, it is now also available for download from the bismark homepage.

Best wishes,
Felix
fkrueger is offline   Reply With Quote
Old 03-24-2011, 07:41 PM   #75
zeam
Member
 
Location: USA

Join Date: Oct 2010
Posts: 37
Default A query on bismark read trimming

Hi Felix,

I have noticed that the bismark didn't have a parameter for read trimming like bwa.If I need read trimming,am I use another tool or bismark have this function that I didn't see?

Best wishes,
Zeam
zeam is offline   Reply With Quote
Old 03-25-2011, 09:44 AM   #76
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

Hi Zeam,

you are right that Bismark doesn't dynamically trim reads at the moment, so you would have to run the sequence file through an appropriate trimmer prior to alignments.

Best,
Felix
fkrueger is offline   Reply With Quote
Old 03-27-2011, 08:19 PM   #77
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

The bismark_to_SAM_v3.pl script appears to have a small bug where the negative strand alignments are off by 1 base. This can be rectified in the script by adding 1 to the "-" strand alignment start position.
zee is offline   Reply With Quote
Old 03-27-2011, 08:22 PM   #78
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Hi Zee,

Is this in the single-end or paired-end analysis?

Cheers,
Oliver
olivertam is offline   Reply With Quote
Old 03-27-2011, 08:24 PM   #79
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

Hi Oliver,

We have picked it up in paired-end analysis. However we have seen the same bit of code in the single-end subroutine code so it will probably be the same thing because these are "GA" alignments.

Z
zee is offline   Reply With Quote
Old 03-27-2011, 08:35 PM   #80
olivertam
Member
 
Location: Cold Spring Harbor

Join Date: Jun 2010
Posts: 22
Default

Hi Z,

I made the changes. Could you test whether it works properly now?

Thanks heaps for picking it up.

Cheers,
Oliver
Attached Files
File Type: pl bismark_to_SAM_v3a.pl (12.4 KB, 12 views)
olivertam is offline   Reply With Quote
Reply

Tags
alignment, bisulfite, bisulphite, methylation, sequencing

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:16 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO