SEQanswers

05-16-2018, 08:41 PM   #1
craigt
Junior Member
 
Location: Utah

Join Date: May 2018
Posts: 2
bcftools is slow

Hi,

I am subsetting a vcf by positions stored in a tab delimited file using bcftools. I noticed the program is very slow. Here is the command:

bcftools view -R ./chr1.passing.markers.txt chr1.vcf.gz -Oz -o ./chr1.reduced.vcf.gz

where chr1.passing.markers.txt is tab-delimited chromosome and position for multiple positions, with no header. It has 68K positions; the original VCF has 524K positions.

The bcftools command is not using NFS (it reads from and writes to local disk, with the executable running from the analysis directory), and there are no competing jobs. It is still running after 120 minutes.

I wrote an equivalent Perl script that completes this in 10 minutes, even though it uses flat files and so should be slower than bcftools with its binary file format.
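
For reference, a single-pass filter of the kind described above might look roughly like this Python sketch. The original Perl script was not posted, so the function names and file handling here are assumptions, not the poster's code. It loads the wanted positions into a set, then streams the gzipped VCF once and keeps header lines plus matching records.

```python
import gzip


def load_positions(path):
    """Read tab-delimited CHROM<TAB>POS lines (no header) into a set."""
    wanted = set()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                wanted.add((fields[0], fields[1]))
    return wanted


def filter_vcf(vcf_in, positions_path, vcf_out):
    """Stream a gzipped VCF once; return the number of records kept.

    Hypothetical helper for illustration. The output is plain gzip,
    not BGZF, so it would need recompressing with bgzip before indexing.
    """
    wanted = load_positions(positions_path)
    kept = 0
    with gzip.open(vcf_in, "rt") as src, gzip.open(vcf_out, "wt") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)  # keep all header lines
                continue
            fields = line.split("\t", 2)
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            if (fields[0], fields[1]) in wanted:
                dst.write(line)
                kept += 1
    return kept
```

Because each record costs one set lookup, the run time should scale with the size of the VCF rather than with the number of positions requested.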

bcftools version is up to date.

Does anyone have an idea how I can speed this up or what might be wrong?

Thanks,
Craig
05-17-2018, 01:08 AM   #2
Markiyan
Senior Member
 
Location: Cambridge

Join Date: Sep 2010
Posts: 111
RDBMS are usually more efficient with these types of queries...

It looks like it tries to ungzip and scan the whole input file for each chromosome position range...
So to do it 68 thousand times... it takes a bit of time...

To fix:
1. Your Perl script needs the following in the VCF open section in order to be able to read gzipped files:

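# Open the VCF for reading; pipe it through zcat if the name ends in .gz or .Z
if ($vcf_file_in =~ m/\.(?:gz|Z)$/i) {
    # List-form pipe open avoids shell quoting problems with the file name
    open(VCF_IN, '-|', 'zcat', $vcf_file_in) or die "\nUnable to open gzipped vcf input file: $vcf_file_in: $!\n";
} else {
    open(VCF_IN, '<', $vcf_file_in) or die "\nUnable to open vcf input file: $vcf_file_in: $!\n";
}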


OR:
2. Try running it on an ungzipped input file...

OR:
3. If you are good with Perl, DBI, SQL and MySQL/Postgres/etc., you can try loading the input VCF file into MySQL table(s), indexing it properly and running a set of SQL queries to select the needed data.
PS: Make sure to crank up the MySQL server memory limits (/etc/my.cnf) before attempting to do it...
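
As a rough illustration of option 3, here is a minimal sketch that swaps in Python's built-in sqlite3 module for MySQL/Postgres so it runs without a server. The table names (vcf, wanted), the index name and the function name are all made up for the example. It loads the records and the wanted positions into two tables, indexes the VCF table on (chrom, pos), and selects the matches with a join.

```python
import sqlite3


def select_records(records, positions):
    """Return VCF lines whose (chrom, pos) appears in positions.

    records:   iterable of (chrom, pos, line) tuples
    positions: iterable of (chrom, pos) tuples
    Hypothetical helper; uses an in-memory SQLite database.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vcf (chrom TEXT, pos INTEGER, line TEXT)")
    conn.execute("CREATE TABLE wanted (chrom TEXT, pos INTEGER)")
    conn.executemany("INSERT INTO vcf VALUES (?, ?, ?)", records)
    conn.executemany("INSERT INTO wanted VALUES (?, ?)", positions)
    # Index after bulk loading so the join can look positions up directly
    conn.execute("CREATE INDEX vcf_chrom_pos ON vcf (chrom, pos)")
    rows = conn.execute(
        "SELECT v.line FROM vcf AS v "
        "JOIN wanted AS w ON v.chrom = w.chrom AND v.pos = w.pos "
        "ORDER BY v.chrom, v.pos"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

With a real MySQL or Postgres server the same schema and query should carry over through DBI, with the memory settings mentioned above adjusted for the size of the VCF.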

Last edited by Markiyan; 05-17-2018 at 01:17 AM. Reason: Refinement
05-17-2018, 07:06 AM   #3
craigt
Junior Member
 
Location: Utah

Join Date: May 2018
Posts: 2

"It looks like it tries to ungzip and scan the whole input file for each chromosome position range...
So to do it 68 thousand times... it takes a bit of time..."

> It is my impression that the program doesn't unzip anything. I believe it streams an indexed binary file and writes a binary file that can also be indexed.

The call I made is very standard; if it is poorly phrased, let me know.

Has anyone experienced this issue of slow performance with bcftools compared to some other benchmark program and been able to resolve it?

Craig

Tags
bcftools, slow
