**simonandrews** · 04-11-2013, 11:48 PM

Originally posted by oria34

Failed to open filehandle: Too many open files at bismark_methylation_extractor line 2921

We actually had a separate report for this last week. The problem is that when it's extracting data for the bedGraph files bismark keeps a set of files open, one for each 'chromosome', so that it can write out all of the data in parallel.

If, however, you're working on a genome which isn't assembled into chromosomes but instead has a large number of assembly contigs, then bismark will try to open a file for each contig. On all operating systems there is a limit to the number of files which can simultaneously be open for writing, and if the number of contigs is larger than the number of allowed filehandles then the script will fail. On the Linux systems we checked here the limit is 1024 files, so if you had more contigs than this it would trigger the problem.

The quick and ugly fix is to increase the number of allowed filehandles on your system. Another option would be to remove very short contigs from your assembly, as these usually make up the majority of total contigs, but contribute very little uniquely mappable sequence.
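To check up front whether a given assembly is likely to hit this limit, something along these lines could help. It is only a rough sketch: the function names and the `reserved` allowance are made up for illustration, it assumes one output file per FASTA record as described above, and it relies on Python's Unix-only `resource` module.

```python
import resource  # Unix-only standard-library module


def count_fasta_records(path):
    """Count sequence records (lines starting with '>') in a FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def open_file_headroom(n_contigs, reserved=16):
    """Return how many descriptors remain after opening one file per contig.

    `reserved` is a guess at descriptors already in use (stdin, stdout,
    stderr, the input file and so on). A negative result means the run
    would exceed the current soft limit.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return float("inf")
    return soft - reserved - n_contigs
```

In a shell, `ulimit -n` prints the same soft limit. On many systems a normal user can raise the soft limit up to the hard limit for the current session without administrator rights, so it may be worth checking both values before giving up on this route.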

We have thought about whether we could easily fix this within bismark but the compromises in terms of efficiency to dynamically close and reopen filehandles as required are quite nasty and this probably isn't something we're going to implement in the near future.
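For readers wondering what "dynamically close and reopen filehandles" would look like, here is a minimal sketch of the general idea. It is not how bismark works and not a proposed patch; `HandlePool` and `max_open` are hypothetical names. It keeps a bounded number of files open and evicts the least recently used one when the cap is reached.

```python
from collections import OrderedDict


class HandlePool:
    """Keep at most `max_open` append-mode file handles open at once."""

    def __init__(self, max_open=256):
        self.max_open = max_open
        self._handles = OrderedDict()  # path -> open handle, oldest first

    def write(self, path, text):
        handle = self._handles.pop(path, None)
        if handle is None:
            if len(self._handles) >= self.max_open:
                # Evict the least recently used handle to stay under the cap.
                _old_path, old_handle = self._handles.popitem(last=False)
                old_handle.close()
            # Append mode, so data written before an eviction is preserved.
            handle = open(path, "a")
        self._handles[path] = handle  # re-insert as most recently used
        handle.write(text)

    def close(self):
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()
```

The cost mentioned above shows up when consecutive reads keep landing on different contigs: every cache miss then means a close plus an open, which is far slower than writing to a handle that is already open. Note also that append mode would add to files left over from an earlier run.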

**oria34** · 04-12-2013, 02:09 AM

Thank you for the fast reply!

Yes, my genome has a lot of unassembled contigs so that is the problem.

I will extract particular chromosomes of interest from the output SAM file and work only with that subset of data (so I need to run the methylation extractor again, right?).

I don't think that increasing the number of allowed filehandles is even an option, since I am just a "remote user" (not sure about the term) at a supercomputing center...

This was really helpful, many thanks

**fkrueger** · 04-12-2013, 02:23 AM

Hi Oria,

The user last week was able to get the Linux file limit increased by his IT department; once this was done the bedGraph and cytosine report steps seem to have worked fine.
If you don't have the option to increase this limit, concentrating on a smaller number (a couple of hundred?) of contigs would probably be the easiest option. You would indeed have to process the SAM file and only write out the lines you are interested in, and then run the methylation extractor again on those. By the way I am sorry for the slow speed of the methylation extractor right now, I am in the process of speeding it up (will probably take a while though).
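Writing out only the lines of interest could be done roughly like this. This is a sketch rather than a tested part of the Bismark workflow: the function name is made up, it assumes an uncompressed SAM file, and it relies on the SAM convention that the third tab-separated column holds the reference name.

```python
def filter_sam_by_contig(sam_in, sam_out, keep):
    """Copy header and alignment lines for the contigs named in `keep`.

    Returns the number of alignment lines written.
    """
    keep = set(keep)
    written = 0
    with open(sam_in) as src, open(sam_out, "w") as dst:
        for line in src:
            if line.startswith("@"):
                # Drop @SQ header entries for contigs that are filtered out.
                if line.startswith("@SQ"):
                    fields = line.rstrip("\n").split("\t")
                    names = [f[3:] for f in fields if f.startswith("SN:")]
                    if not names or names[0] not in keep:
                        continue
                dst.write(line)
                continue
            fields = line.split("\t")
            # Column 3 (index 2) of a SAM record is the reference name.
            if len(fields) > 2 and fields[2] in keep:
                dst.write(line)
                written += 1
    return written
```

For example, `filter_sam_by_contig("sample.sam", "subset.sam", ["chr1", "chr2"])` (file and contig names are placeholders) would write a smaller file that can then be passed to the methylation extractor. Trimming the `@SQ` header lines is a precaution, not something the extractor is known to require.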

PS: Your library must be directional since you got a fairly high mapping efficiency with only OT and OB alignments (most libraries are directional anyway)

**fkrueger** · 04-17-2013, 01:00 PM

We have just released a new version of Bismark (v0.7.10), which is probably the last release before the implementation of multi-threading for both Bismark and the methylation extractor.

This new release adds several new features, such as BAM or compressed temporary output as well as a dedicated option for aligning PBAT-Seq libraries. Most notably, the methylation extractor sees a 60% speed increase for the processing of ungapped reads (which is the default). Here are all changes in more detail:

• Bismark: Added new option '--gzip' that causes temporary bisulfite conversion files to be written out in a GZIP compressed form to save disk space. This option is available for most alignment modes with the exception of paired-end FastA files
• Added new option '--bam' that causes the output file to be written out in BAM format instead of the default SAM format. Bismark will attempt to use the path to Samtools that was specified with '--samtools_path', or, if it hasn't been specified explicitly, attempt to find Samtools in the PATH. If no installation of Samtools can be found the SAM output will be compressed with GZIP instead (yielding a .sam.gz output file)
• Added new option '--samtools_path' to point Bismark to your Samtools installation, e.g. /home/user/samtools/. Does not need to be specified explicitly if Samtools is in the PATH
• Added new option '--pbat' which is to be used for PBAT-Seq libraries (Post-Bisulfite Adapter Tagging; Kobayashi et al., PLoS Genetics, 2012). This is essentially the exact opposite of alignments in 'directional' mode, as it will only launch two alignment threads to the CTOT and CTOB strands instead of the normal OT and OB ones. The option '--pbat' works currently only for single-end and paired-end FastQ files for use with Bowtie1 and uncompressed temporary files only (there are no plans to extend this to other alignment modes at present)
• Methylation extractor: The methylation extractor does now also read BAM files, however this requires a working copy of Samtools. The new option '--samtools_path' may point the methylation extractor to your Samtools installation, e.g. /home/user/samtools/. This does not need to be specified explicitly if Samtools is in the PATH
• Added new option '--gzip' to write out the primary methylation extractor files (CpG_OT_..., CpG_OB_... etc) in a GZIP compressed form to save disk space. This option does not work on bedGraph and genome-wide cytosine reports as they are 'tiny' anyway
• The methylation extractor does now treat InDel free reads differently than before which leads to a ~60% increase in extraction speed for ungapped alignments in SAM format!
• Deduplication script: The deduplication script does now also read BAM files, however this requires a working copy of Samtools. The new option '--samtools_path' may point the script to your Samtools installation, e.g. /home/user/samtools/. This does not need to be specified explicitly if Samtools is in the PATH
• The deduplication script also received the new option '--bam' to write out deduplicated files directly in BAM format. If no installation of Samtools can be found the SAM output will be compressed with GZIP instead (yielding a .sam.gz output file)
• The Bismark User Guide and RRBS Guide have been updated
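The Samtools lookup and GZIP fallback described for '--bam' above can be pictured roughly as follows. This is an illustration in Python, not Bismark's own (Perl) code; the function names are invented, and returning nothing when an explicitly given folder contains no samtools executable is an assumption.

```python
import os
import shutil


def find_samtools(samtools_path=None, which=shutil.which):
    """Return a path to a samtools executable, or None if none is found."""
    if samtools_path is not None:
        candidate = os.path.join(samtools_path, "samtools")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
        return None  # assumption: no fallback to PATH for a bad explicit path
    return which("samtools")


def output_suffix(samtools_path=None, which=shutil.which):
    """Pick '.bam' when samtools is available, otherwise '.sam.gz'."""
    return ".bam" if find_samtools(samtools_path, which) else ".sam.gz"
```

So if you ask for BAM and end up with a .sam.gz file instead, the first thing to check is whether Samtools is actually visible to the script.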

Bismark is available here: http://www.bioinformatics.babraham.a...jects/bismark/.

**fkrueger** · 04-22-2013, 03:12 AM

We have just released a new version of Bismark, v0.7.11 (so v0.7.10 wasn't quite the last release before the introduction of multi-threading after all...). This version addresses some bugs and splits out the bedGraph and genome-wide cytosine report conversion options from the methylation extractor into the modules bismark2bedGraph and bedGraph2cytosine. These modules replace the former scripts 'genome_methylation_bismark2bedGraph.pl' and 'genome_wide_cytosine_report.pl'. The Bismark methylation extractor now calls these modules, but they can also be run independently as stand-alone tools.

Here are all changes in some more detail:

• Bismark: Fixed non-functional single-end alignments with Bowtie2 which were accidentally broken by introducing the option '--pbat' in v0.7.10 (an evil 'if' instead of 'elsif'...)
• For paired-end alignments with Bowtie 1, the option '--non_bs_mm' would accidentally confuse the number of mismatches of read 1 and read 2 whenever the first read aligned in reverse orientation, i.e. for OB and CTOT alignments. This has now been corrected
• Previously, the option '--non_bs_mm' would potentially output non-integer values for Bowtie 2 alignments if the read (or reference) contained 'N' characters. Alignment scores from 'N's are now adjusted so that they count as mismatches, similar to what Bowtie 1 does. This works fine for reads with up to and including 5 N's (which is quite a lot...)
• Methylation extractor: To avoid duplication and keep code modular, the bedGraph conversion step invoked by the option '--bedGraph' has now been farmed out to the module 'bismark2bedGraph'. This script is independent of the methylation extractor and also works as a stand-alone tool from the methylation extractor output (uncompressed or gzip compressed files). To work well from within the methylation extractor this script (which is now included in the Bismark package) needs to reside in the same folder as the 'bismark_methylation_extractor' itself
• bismark2bedGraph: Temporary chromosome files now have an input file name included in their file name to enable parallel processing of several files in the same directory at the same time
• To avoid duplication and keep code modular, the bedGraph to genome-wide cytosine methylation report step invoked by the option '--cytosine_report' has now been split out to the module 'bedGraph2cytosine'. This script is independent of the methylation extractor and also works as a stand-alone tool from the Bismark bedGraph '--counts' output (uncompressed or gzip compressed files). To work well from within the methylation extractor this script (which is now included in the Bismark package) needs to reside in the same folder as the 'bismark_methylation_extractor' itself
• Deduplication script: Fixed some warnings that were thrown if '--bam' was not specified

Bismark is available for download from: http://www.bioinformatics.babraham.a...jects/bismark/

**fkrueger** · 05-10-2013, 02:02 AM

We have just released a new version of Bismark (v0.7.12) that is intended to fix the single-end alignment mode for Bowtie 2, which was accidentally slowed down by forgetting to remove a sleep() command while debugging... The changes in more detail:

- Bismark: Removed a rogue sleep(1) command that would slow down single-end Bowtie 2 alignments for a single lane of HiSeq (200M sequences) from ~1 day to 6 years and 4 months (roughly)
- bismark2bedGraph: now keeps track of the temp files it has just created in a session instead of using all files in the output folder ending in ".methXtractor.temp". This lets you kick off the bedGraph conversion step from already sorted, individual methXtractor.temp files if desired
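For anyone curious about the "6 years and 4 months" figure for the rogue sleep(1): it matches one second of sleep per sequence for 200M sequences. That the sleep ran exactly once per sequence is an assumption here, but the arithmetic is easy to check.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # Julian year, 31,557,600 s


def sleep_overhead(n_sequences, seconds_per_sequence=1.0):
    """Return the total sleep time as (whole years, whole months)."""
    years_float = n_sequences * seconds_per_sequence / SECONDS_PER_YEAR
    years = int(years_float)
    months = int((years_float - years) * 12)
    return years, months


print(sleep_overhead(200_000_000))  # (6, 4)
```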

Bismark can be downloaded here: http://www.bioinformatics.babraham.a...jects/bismark/.

**luuloi** · 05-13-2013, 10:37 PM

Hi Felix,
Can I run Bismark with Bowtie 1 in multi-threaded mode (the -p option) to speed things up? I did this with Bowtie 2, but as you mentioned, Bowtie 2 seems to be slower than Bowtie 1 in your experience. I have been waiting for 4 days and the .bam file is only 21M in size; it is so slow. BTW, when will you release multi-threaded Bismark? I am really looking forward to it. I have 14 WGBS samples to process.

**fkrueger** · 05-14-2013, 12:26 AM

Originally posted by luuloi

Hi Felix,
Can I run Bismark with Bowtie 1 in multi-threaded mode (the -p option) to speed things up? I did this with Bowtie 2, but as you mentioned, Bowtie 2 seems to be slower than Bowtie 1 in your experience. I have been waiting for 4 days and the .bam file is only 21M in size; it is so slow. BTW, when will you release multi-threaded Bismark? I am really looking forward to it. I have 14 WGBS samples to process.

Hi there, unfortunately there is no simple way to use the -p option for Bowtie 1 in the current implementation of Bismark. This is because Bismark requires the alignments to be reported in the same order as the sequences appear in the input file, and multi-threaded Bowtie 1 (-p > 1) does not guarantee this order. In the multi-threaded version of Bismark we are planning to take care of this shortcoming by splitting the input file into several chunks while transcribing the sequence files into bisulfite converted versions, and then essentially running several instances of Bismark at the same time. However, I cannot currently promise when I will find the time for the implementation.
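Until then, the chunking idea can be approximated by hand by splitting the FastQ file yourself and running one Bismark instance per chunk. The sketch below is not part of Bismark; the function name and output naming are made up, and it assumes an uncompressed FastQ file with the usual four lines per record.

```python
from itertools import islice


def split_fastq(path, reads_per_chunk, prefix):
    """Write consecutive chunks of a FastQ file; return the chunk file names.

    Assumes the common four-lines-per-record FastQ layout. Each chunk is
    held in memory while it is written, so keep `reads_per_chunk` moderate.
    """
    chunk_paths = []
    with open(path) as src:
        while True:
            lines = list(islice(src, 4 * reads_per_chunk))
            if not lines:
                break
            chunk_path = f"{prefix}.{len(chunk_paths):04d}.fastq"
            with open(chunk_path, "w") as dst:
                dst.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

For paired-end data both files would need to be split with the same `reads_per_chunk` so that mates stay in matching chunks. The -s and -u options described in the v0.6.4 release notes quoted further down this thread offer a way to get a similar effect without writing extra files.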

We do indeed see a somewhat slower speed with Bowtie 2, but what you are describing sounds more like something is going wrong. I have just had 4 lanes of HiSeq with ~200M sequences each that aligned in parallel in less than 30h. Also, with a nice cluster you should be able to align all your 14 samples with Bowtie 1 overnight... maybe you want to contact me via email so we can talk about what is currently taking so long?

Cheers,
Felix

**oria34** · 05-14-2013, 03:51 AM

Hi all,

I finally managed to increase the allowed number of filehandles on my system and the Bismark methylation extractor ran like a champ.

It took around 12-15h to write out all five files for a BC whole genome assembled in 18xx linking groups, so, in my opinion, it was really fast (I used --buffer_size 6GB).

An error message appeared at the end of the run, but all the files seem to be OK:

Can't move to /XX/XX/XX/XX/Bisulfite/Bismark_Genome/XXXXXXX/: Not a directory at bismark_methylation_extractor line 3300.

Thank you so much for your support

PS. Just for the record: I used Bowtie 2 for the alignment and it took no more than one day (~50 million read pairs), quite fast I would say!

**cnoirot** · 05-14-2013, 05:55 AM

Hi
I'd like to clearly understand the meaning of PE directional read in bismark.

Here is the description of the sequencing kit provider : "First of all our BS-SEQ kit is not directional [...]. Secondly, you are correct in saying that read 1 is for the original strands and the second read is for the neo synthesized DNA."

So do I have to use the --non_directional option or not?

Thanks for your help.

**fkrueger** · 05-14-2013, 05:58 AM

If read 1 always aligns to the original strands you can just run it in default mode and do not need to specify --non_directional.

**luuloi** · 05-14-2013, 09:09 AM

Originally posted by luuloi

Hi Felix,
Can I run Bismark with Bowtie 1 in multi-threaded mode (the -p option) to speed things up? I did this with Bowtie 2, but as you mentioned, Bowtie 2 seems to be slower than Bowtie 1 in your experience. I have been waiting for 4 days and the .bam file is only 21M in size; it is so slow. BTW, when will you release multi-threaded Bismark? I am really looking forward to it. I have 14 WGBS samples to process.

It has been resolved, thanks a lot Felix! If anyone else encounters this, please just download the new version of Bismark, v0.7.12.

**pengchy** · 05-14-2013, 06:52 PM

Hi all,

I have two questions about Bismark.
1. The read IDs in the BAM file are not the same as in the original FastQ file.
The original read IDs looked like this:

Code:

HISEQ700708:127:C1LUKACXX:3:1101:1153:42732/1
HISEQ700708:127:C1LUKACXX:3:1101:1153:42732/2

After Bismark alignment, the read IDs in the BAM file looked like this:

Code:

HISEQ700708:127:C1LUKACXX:3:1101:1153:42732/1/1
HISEQ700708:127:C1LUKACXX:3:1101:1153:42732/1/2

2. The report file gives no information about how many reads were mapped with only one end of the paired-end data.

**pengchy** · 05-14-2013, 07:23 PM

Originally posted by fkrueger

We have just released a new version of Bismark (v0.6.4) to address a few minor issues.

The changes include:

- Adjusted the options -u and -s so that only the non-skipped part of the input file will be transcribed and analysed. This allows splitting up very large files into smaller chunks to allow parallel processing, e.g -s 10000000 -u 20000000 would analyse sequences 10000001 to 20000000. The alignment report will be based on this reduced number of reads analysed
- In paired-end mode, the options --unmapped and --ambiguous do now output unaligned or multiply aligned reads, respectively, to their correct output files as intended
- Sequences in FastA format do now receive Phred score qualities of 40 throughout (ASCII 'I') to prevent the SAM to BAM conversion in SAMtools from failing
- If a genomic sequence could not be extracted it will now also be counted and reported for use with Bowtie 1
- Suppressed debugging warning messages that were printed in error for Bowtie2 alignments (single-end mode only)

Bismark is available here.

Hi fkrueger,
In the report file of bismark, one line is:

Code:

Sequence pairs which were discarded because genomic sequence could not be extracted:    592

I can't understand this term. What do you mean when you say that the genomic sequence could not be extracted?
Thank you.

**fkrueger** · 05-15-2013, 12:47 AM

Hi pengchy,

To 1) It is true that Bismark appends segment numbers to the end of the read IDs. This is because Bowtie and Bowtie 2 tend to delete these tags internally while aligning, and to make it more difficult they don't do it in the same way. To properly keep track of which read is doing what I had to make this change (btw, white spaces and tab characters in the read ID are also replaced by _).
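If the extra tag gets in the way downstream, it can be removed again after alignment. The snippet below is a sketch based only on the example IDs posted above, where '/1' or '/2' was added to the end; the function name is made up.

```python
import re

_SEGMENT_SUFFIX = re.compile(r"/[12]$")


def strip_segment_suffix(bam_read_id):
    """Remove one trailing '/1' or '/2' segment tag from a read ID."""
    return _SEGMENT_SUFFIX.sub("", bam_read_id, count=1)
```

Note that in the example above both mates come back as the read 1 ID (ending in '/1') once the tag is removed, so this recovers the pair name rather than the original read 2 ID. The replacement of white space by _ cannot be undone this way, since an underscore in the output may or may not have been a space originally.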

To 2) Bismark does not report singleton alignments for paired-end data but only reports paired alignments. In the Bismark help you can find:

Code:

--no-mixed               This option disables Bowtie 2's behavior to try to find alignments for the individual mates if
                         it cannot find a concordant or discordant alignment for a pair. This option is invariable and
                         and on by default.

--no-discordant          Normally, Bowtie 2 looks for discordant alignments if it cannot find any concordant alignments.
                         A discordant alignment is an alignment where both mates align uniquely, but that does not
                         satisfy the paired-end constraints (--fr/--rf/--ff, -I, -X). This option disables that behavior
                         and it is on by default.

If you wanted to look for singleton alignments for reads that do not produce valid paired-end alignments you could always write out unaligned reads and re-align them in single-end mode, but I would probably not advise doing this since comparing SE and PE alignments can have its own pitfalls.

To 3): In order to determine the sequence context of a read Bismark is extracting 2 extra basepairs at the start or the end of a read (where appropriate). If a read happens to align to the very end of a chromosome, Bismark can't extract 2 additional bp from the chromosomal sequence (because there is no more sequence), throws this warning message and moves on. This happens mostly for the MT, and it is normally fine to just ignore these warnings.
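The edge case can be illustrated with a small sketch. This is not Bismark's actual code: the function name, the 0-based coordinates and the `extend_at_end` switch are assumptions made for the example.

```python
def extract_with_context(chrom_seq, start, length, extend_at_end=True):
    """Return the aligned stretch plus 2 extra bases, or None if unavailable.

    `start` is 0-based. The 2 extra bases are taken after the read when
    `extend_at_end` is true, otherwise before it.
    """
    if extend_at_end:
        lo, hi = start, start + length + 2
    else:
        lo, hi = start - 2, start + length
    if lo < 0 or hi > len(chrom_seq):
        return None  # read sits at the very edge of the chromosome
    return chrom_seq[lo:hi]
```

A read that returns `None` here corresponds to one of the discarded sequences counted in the report line quoted above: there is simply no reference sequence left to work out the context of a cytosine at the very end of the read.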
