SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
CisGenome -- an integrated tool for ChIP-seq data analysis hji Bioinformatics 66 12-30-2014 02:55 PM
Bismark Bisulfite Aligner - Now supporting CpG, CHG and CHH context fkrueger Bioinformatics 27 10-11-2013 06:40 AM
Bismark v0.6.beta1: Now supporting gapped Bisulfite-Seq alignments fkrueger Bioinformatics 6 03-19-2012 06:06 AM
Apparent duplication levels incongruence between bismark and fastqc with BS-Seq data gcarbajosa Bioinformatics 2 12-13-2011 09:43 AM

Reply
 
Thread Tools
Old 01-14-2014, 06:15 AM   #321
linnet
Junior Member
 
Location: Oxford

Join Date: Sep 2012
Posts: 2
Default deduplicate_bismark

Hello,

Does anyone knows input file size to memory footprint of deduplicate_bismark ? I have 156 GB WGBS bam file (10^9 PE reads) and I suspect 48 GB workstation is getting into HDD swap usage, which is not great... Any ideas or solutions? Alternatively I have tried samtools sort; rmdup; sort by name, but then methylation extractor is not happy, probably due to some pair-mates removed.

Thanks,
Skirmantas
linnet is offline   Reply With Quote
Old 01-14-2014, 06:40 AM   #322
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

Hi Skirmantas,

Deduplicating such a large file with only 48GB of memory is probably not going to work... Admittedly, Perl is probably not the best option to store and quickly look up such an extraordinary number of entries, but the deduplication procedure is storing a composite string of chromosome:start:end:strand which is sort of essential to the process. One could probably reduce the memory footprint somewhat by not including the chromosome into the string but using a different hash for each chromosome, and replacing the end coordinate by the length of the fragment, but I don't know how much this would reduce the footprint.
Alternatively I could offer you to set up an FTP site for your file and deduplicate it here overnight ...
fkrueger is offline   Reply With Quote
Old 01-14-2014, 07:50 AM   #323
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by linnet View Post
Alternatively I have tried samtools sort; rmdup; sort by name, but then methylation extractor is not happy, probably due to some pair-mates removed.
Hi- I think samtools rmdup doesn't handle paired reads correctly. I would try to use picard MarkDuplicates instead.

Good luck!
Dario
dariober is offline   Reply With Quote
Old 01-17-2014, 07:48 AM   #324
linnet
Junior Member
 
Location: Oxford

Join Date: Sep 2012
Posts: 2
Default duplicates

Dear fkrueger and Dario,

Thank you both for lighting replies and suggestions! I will try picard, since I want to have more permanent solution rather than bothering fkruager every time I have a big dataset. If I am completely stuck, I may be in touch though.

Regarding picard I was under a probably false impression that samtools and picard are using the same algorithm to eliminate duplicates... I had a difficulty to find how picard defines PE duplicate (which is clearly explained in deduplicate_bismark)

Thanks again,
Skirmantas
linnet is offline   Reply With Quote
Old 01-17-2014, 08:23 AM   #325
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by linnet View Post
Regarding picard I was under a probably false impression that samtools and picard are using the same algorithm to eliminate duplicates... I had a difficulty to find how picard defines PE duplicate (which is clearly explained in deduplicate_bismark)
Hi- See this link for an explanation of MarkDuplicates. if you haven't already seen it http://sourceforge.net/apps/mediawik...icates_work.3F

Quote:
Q: How does MarkDuplicates work?

A: Essentially what it does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15.

If your reads have been divided into separate BAMs by chromosome, inter-chromosomal pairs will not be identified, but MarkDuplicates will not fail due to inability to find the mate pair for a read.
And here http://www.biostars.org/p/3917/ for some comments about rmdup vs MarkDuplicates.

Dario
dariober is offline   Reply With Quote
Old 03-04-2014, 08:27 PM   #326
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

Hi Felix,

Currently the SAM header produced by Bismark has the @SQ fields in a more-or-less random order. This means that the @SQ fields will almost certainly not be in the same order as the they are in the reference genome/index. It's not a huge deal but it means I have to re-header my SAM files because I rely on the order of the @SQ fields in the SAM header in downstream tools. Is it possible to change this behaviour so that the @SQ fields in the SAM header are in the same order as in the reference genome/index?

My understanding is that the @SQ information is stored in a hash ("chromosomes") and so there is no guarantee on the order that these will be printed in when called by the function generate_SAM_header(). I'm not sure how hard that is to do this in the current framework. My idea was to store an index variable as an additional value in that hash, where the index is the position of that chromosome in the reference genome/index, but my limited Perl skills couldn't get it working.

I understand if you don't consider it an issue and I will just continue to re-header my SAM files.

Thanks,
Pete
PeteH is offline   Reply With Quote
Old 03-05-2014, 03:30 AM   #327
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

Quote:
Originally Posted by PeteH View Post
Hi Felix,

Currently the SAM header produced by Bismark has the @SQ fields in a more-or-less random order. This means that the @SQ fields will almost certainly not be in the same order as the they are in the reference genome/index. It's not a huge deal but it means I have to re-header my SAM files because I rely on the order of the @SQ fields in the SAM header in downstream tools. Is it possible to change this behaviour so that the @SQ fields in the SAM header are in the same order as in the reference genome/index?

My understanding is that the @SQ information is stored in a hash ("chromosomes") and so there is no guarantee on the order that these will be printed in when called by the function generate_SAM_header(). I'm not sure how hard that is to do this in the current framework. My idea was to store an index variable as an additional value in that hash, where the index is the position of that chromosome in the reference genome/index, but my limited Perl skills couldn't get it working.

I understand if you don't consider it an issue and I will just continue to re-header my SAM files.

Thanks,
Pete
Hi Pete,
I have changed the order of the @SQ header lines to reflect the order in which the reference files are read in. This is normally the same order the files are sorted in the reference genome directory, or for multi-fasta files it should be the order the chromosomes are listed inside that file. So instead of random ordering it would now look like this for the mouse genome:

Code:
@HD     VN:1.0  SO:unsorted
@SQ     SN:1    LN:197195432
@SQ     SN:10   LN:129993255
@SQ     SN:11   LN:121843856
@SQ     SN:12   LN:121257530
@SQ     SN:13   LN:120284312
@SQ     SN:14   LN:125194864
@SQ     SN:15   LN:103494974
@SQ     SN:16   LN:98319150
@SQ     SN:17   LN:95272651
@SQ     SN:18   LN:90772031
@SQ     SN:19   LN:61342430
@SQ     SN:2    LN:181748087
@SQ     SN:3    LN:159599783
@SQ     SN:4    LN:155630120
@SQ     SN:5    LN:152537259
@SQ     SN:6    LN:149517037
@SQ     SN:7    LN:152524553
@SQ     SN:8    LN:131738871
@SQ     SN:9    LN:124076172
@SQ     SN:MT   LN:16299
@SQ     SN:X    LN:166650296
@SQ     SN:Y    LN:15902555
@PG     ID:Bismark      VN:v0.10.2      CL:"bismark /data/public/Genomes/Mouse/NCBIM37/ simulated.fastq"
Does this help? The latest version (v0.10.2) is attached.
Attached Files
File Type: zip bismark.zip (60.5 KB, 3 views)
fkrueger is offline   Reply With Quote
Old 03-05-2014, 04:22 PM   #328
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

Fantastic, thanks Felix! Simply having a second hash storing the SQ_order was so much simpler than my idea.
PeteH is offline   Reply With Quote
Old 03-10-2014, 10:00 AM   #329
adeirossi
Member
 
Location: California

Join Date: Jan 2012
Posts: 11
Default MAPQ with Bowtie2?

Hi Felix,

First and foremost, thank you for a wonderful tool.

I am curious to know why Bismark always reports 255 as the MAPQ (mapping quality), even when running Bowtie2, which reports a continuous range of mapping qualities. It would be useful to have some idea of relative mapping quality among the unique alignments. Are there reasons not to simply report the MAPQ values generated by Bowtie2?

Thanks in advance,
Andrew
adeirossi is offline   Reply With Quote
Old 03-10-2014, 10:58 AM   #330
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

Quote:
Originally Posted by adeirossi View Post
Hi Felix,

First and foremost, thank you for a wonderful tool.

I am curious to know why Bismark always reports 255 as the MAPQ (mapping quality), even when running Bowtie2, which reports a continuous range of mapping qualities. It would be useful to have some idea of relative mapping quality among the unique alignments. Are there reasons not to simply report the MAPQ values generated by Bowtie2?

Thanks in advance,
Andrew
Hi Andrew,

To be perfectly honest we have never really considered the MAPQ value of alignments very much so far. As I understand it the mapping quality is a measure of the probability that an alignment is the true origin of a read over its next best alignment(s). Historically at least, when we started with Bismark and reads were only 28bp long or so, the C to T conversion really meant a big problem when it comes to sequence complexity (i.e. a sequence that would never align using a 4-letter alphabet might only have a single mismatch in BS-Seq). On top of that one would not only have to deal with a sequence and its second best alignment, but also with the best alignment and all other alignments coming from the other 1 (or 3 alignment) threads, depending on the type of experiment/alignment mode. We therefore decided that a sequence either has a unique alignment (as in an alignment that is better than all potentially other alignments), in which case it is given MAPQ value of 255, or it doesn't, in which case it would be discarded.

If you think that it would make sense to feed the MAPQ value of Bowtie 2 through to the actual Bismark output we might have a look into this again (I suppose one would have to look at the MAPQ values from all alignment threads in and choose the lowest one then or something like that).
fkrueger is offline   Reply With Quote
Old 03-12-2014, 06:27 PM   #331
adeirossi
Member
 
Location: California

Join Date: Jan 2012
Posts: 11
Default MAPQ with Bowtie 2?

Quote:
Originally Posted by fkrueger View Post
If you think that it would make sense to feed the MAPQ value of Bowtie 2 through to the actual Bismark output we might have a look into this again (I suppose one would have to look at the MAPQ values from all alignment threads in and choose the lowest one then or something like that).
Thanks, Felix, for your reply. As you mention, things are complicated when you need to integrate information across alignment threads. If you'll bear with me, I will try to illustrate my thought process with a contrived example.

Consider a read that aligns well to exactly two sites in the genome -- one on the OT (original top) strand, the other on the OB (original bottom) strand. Bowtie 2 aligns the read uniquely to the OT strand of chr1 with a high MAPQ of 40, but in a separate thread, Bowtie 2 aligns the same read uniquely to the OB strand on chr2 with an almost-as-high MAPQ value of 38. I understand that Bismark reports the alignment with the highest alignment score (AS tag); let's say this is the first alignment, so the alignment to chr1 is printed to the SAM output.

What to report as the MAPQ value for this SAM entry? A few options:
  1. Report 40, the MAPQ value from Bowtie 2 for the printed alignment. This would be simplest, but it ignores the presence of an almost-as-good alignment from the other thread.
  2. Report 38, the minimum of MAPQ values from all alignment threads. This at least acknowledges the existence of another alignment from the other thread, but it seems like an inflated MAPQ value for a read that aligns quite well to two different sites.
  3. Report a value considerably less than 38 to reflect that, when both OT and OB alignments are considered, the best alignment is not so unique.
Intuitively, option 3 seems the best; a high MAPQ value should be reported only when Bowtie 2 gives a high MAPQ in exactly one of the alignment threads. But this option is clearly ill-defined as I have not suggested an actual formula. (Maybe report [MAPQ of printed alignment] - [highest MAPQ among non-printed alignments] = 40 - 38 = 2, or something similar?)

Anyway, it seems to me that even an imperfect MAPQ value would be better than none at all. But I'm still new to bisulfite sequencing, so I will defer to your wisdom.

Andrew
adeirossi is offline   Reply With Quote
Old 03-12-2014, 06:34 PM   #332
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

This may become moot as bowtie2 can now support large genomes so separate alignments for different strands of the genome could be a thing of the past and simplify bisulfite alignment quite a bit! Of course this will take some work upgrading the various bisulfite alignment packages to be compatible with the new bowtie.
frozenlyse is offline   Reply With Quote
Old 03-13-2014, 02:43 AM   #333
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

If you need MAPQ values then you have a number of different options, including:
  1. BSmooth: Like bismark, it's a perl wrapper around bowtie2. It's always been slower than bismark when I've used it. It simply passes through the MAPQ of the best alignment, rather than recalculating it.
  2. bwa-meth: A python wrapper around bwa mem that happens to work like frozenlyse just mentioned (i.e., it concatenates two versions of the converted genome together and then aligns against that in one go). Only supports directional and paired-end libraries.
  3. Bison: A C wrapper around bowtie2 that otherwise functions similar to bismark, but recalculates MAPQ scores according to alignments to different strands using the same MAPQ algorithm in bowtie2. The setup can be non-trivial, so I wouldn't bother for a small one-off dataset (you might try bwa-meth in that case, unless you have a non-directional library).

There are a number of others out there in addition to these, of course.

Last edited by dpryan; 03-13-2014 at 02:52 AM.
dpryan is offline   Reply With Quote
Old 03-13-2014, 02:50 AM   #334
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

Devon, can the calculation of MAPQ be summarized in a straight forward formula (described in detail in the Bowtie2 documentation?) or does one have to delve into the source code to find it? It shouldn't be hard to implement since all the mapping values are available and being processed anyway...
fkrueger is offline   Reply With Quote
Old 03-13-2014, 02:56 AM   #335
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

You have to go through the source code to find it, unfortunately. It would seem straight-forward enough to add to bismark (more or less, there are some oddities so implementing it in perl might produce ever so slightly different results, though they'd be close enough). You can find a C version of the algorithm here (starting on line 109). It's never been clear to me how they came up with that MAPQ algorithm, so don't ask me to explain why it is the way it is
dpryan is offline   Reply With Quote
Old 03-13-2014, 03:03 AM   #336
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

Cheers for that. It doesn't look overly complicated (and only slightly random), I might have a go at this next time I am running out of other things to do
fkrueger is offline   Reply With Quote
Old 03-13-2014, 03:20 AM   #337
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by fkrueger View Post
I might have a go at this next time I am running out of other things to do
Hah! I suppose you'll retire at some point... :P
dpryan is offline   Reply With Quote
Old 03-14-2014, 03:52 PM   #338
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72
Default

Quote:
Originally Posted by dpryan View Post

bwa-meth: A python wrapper around bwa mem that happens to work like frozenlyse just mentioned (i.e., it concatenates two versions of the converted genome together and then aligns against that in one go). Only supports directional and paired-end libraries.
it does support single-end as well, but yeah, only directional for now.

glad to see bismark outputting the sorted header, that does simplify things for downstream analyses.

what filtering to bismark/bison methylation extractors do?
brentp is offline   Reply With Quote
Old 03-15-2014, 07:31 AM   #339
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I can't speak to what bismark does at the moment, but bison filters by (1) read/pair MAPQ (2) base phred score and (3) position in a read (to ignore biased regions). There's also some filtering in paired-end reads if the pairs disagree on a methylation call (this is relatively rare).

Edit: I forgot to mention that flagged PCR duplicates are ignored.

Last edited by dpryan; 03-15-2014 at 10:43 AM. Reason: Forgot PCR duplicates
dpryan is offline   Reply With Quote
Old 03-25-2014, 02:22 PM   #340
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64
Default

Hi Felix,

A seemingly harmless error is raised right at the end of running the methylation extractor with the --mbias_only flag (v0.10.1):

bismark_methylation_extractor -p --mbias_only sampleName.bam
# Various
# Lines
# Of
# Output
Failed to read from methylation extractor output file CpG_OT_sampleName.txt: No such file or directory

As far as I can tell nothing has actually gone wrong and there are no "orphan" files left behind by the methylation extractor.

Cheers,
Pete
PeteH is offline   Reply With Quote
Reply

Tags
alignment, bisulfite, bisulphite, methylation, sequencing

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:32 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO