SEQanswers

Old 05-31-2011, 02:46 PM   #1
bnfoguy
Member
 
Location: New Jersey, USA

Join Date: May 2011
Posts: 17
Question Accessing .vcf.gz files on a Windows platform

Hello,

I'm a first timer to bioinformatics research and have recently been working on a project involving genetic variations. I am supposed to use the latest 1k genomes release file which is "ALL.2of4intersection.2010084.genotypes.vcf.gz". I have downloaded the file, which is about 61.2GB, but am facing trouble in extracting its contents. I would really appreciate some guidance in this matter.
Old 05-31-2011, 10:15 PM   #2
ulz_peter
Senior Member
 
Location: Graz, Austria

Join Date: Feb 2010
Posts: 219
Default

As far as I know there was a new release of 1000genome SNP calls:

ftp://ftp-trace.ncbi.nih.gov/1000gen...hase1_release/

However, these files are compressed with GNU zip (gzip). I found a link to the Windows version:
http://gnuwin32.sourceforge.net/packages/gzip.htm

You should then be able to decompress the file and view it using a text editor of your choice (as .vcf files are nothing but plain text files).

Nevertheless:

1) Uncompressing large files takes a very, very long time. If you use a conventional PC this could be in the hours range.

2) I don't know if you could actually open the resulting .vcf file as it is extremely large (62GB is the compressed version!)

3) Are you sure you need the genotypes file? I guess some participants of the 1000genomes project will know better, but as far as I know the genotypes files contain all the individual genotypes. That's what makes them that big. The .sites files are a more condensed version: they should include the same sites, but no individual genotypes.
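
If installing a separate gzip program on Windows is a hassle, Python's standard gzip module can do the same decompression. This is only a rough sketch: the function name gunzip and the chunk size are my own choices, and the output will still be enormous for a 60GB input.

```python
import gzip
import shutil


def gunzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress a .gz file to disk in fixed-size chunks (hypothetical helper).

    Only chunk_size bytes are held in memory at a time, so the size of the
    input does not matter for RAM, only for disk space and patience.
    """
    with gzip.open(src_path, "rb") as compressed, open(dst_path, "wb") as plain:
        shutil.copyfileobj(compressed, plain, chunk_size)
```

You would call it as gunzip("file.vcf.gz", "file.vcf"), with the file names replaced by your own.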

Hope that helps
Old 06-01-2011, 07:44 AM   #3
bnfoguy
Member
 
Location: New Jersey, USA

Join Date: May 2011
Posts: 17
Default

Thanks for your reply,

I have manually extracted the file but, as you said, traditional text editors like Notepad or Word have trouble opening it due to its size. Is there a way I could use command-line tools or programs to access specific portions of it? Linux is new to me and I am still learning the commands. Would Python snippets be of any help?
Old 06-01-2011, 10:35 AM   #4
RDW
Member
 
Location: London

Join Date: Oct 2008
Posts: 63
Default

Quote:
Originally Posted by bnfoguy View Post
Is there a way that I could use specific command line prompts or programs to access specific portions of it?
You could always use 'less':

http://en.wikipedia.org/wiki/Less_%28Unix%29

This is standard on Linux and other Unix-like systems, and there are versions for Windows that a quick search will find. But to do anything sensible with this file, you're going to need a program that knows how to parse it and extract what you need (this might, of course, be something simple you could write yourself, or maybe one of the utilities from the vcftools package).
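
As a rough idea of what "something simple you could write yourself" might look like in Python, the sketch below streams a gzipped VCF and returns the column header plus the first few records, without ever unpacking the whole file. The function name peek_vcf is made up for illustration, and it assumes a standard tab-delimited VCF with "##" meta lines and a "#CHROM" header line.

```python
import gzip


def peek_vcf(path, n_records=5):
    """Return (columns, records) for the first n_records data lines of a gzipped VCF.

    Hypothetical helper: gzip.open streams the file, so memory use stays small
    however large the compressed file is.
    """
    columns = None
    records = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                columns = fields  # the single column-header line
                continue
            records.append(fields)
            if len(records) >= n_records:
                break  # stop early; the rest of the file is never read
    return columns, records
```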
Old 06-01-2011, 11:57 AM   #5
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

According to some googling, zgrep will work.
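
Where zgrep is not available (plain Windows, for instance), roughly the same thing can be sketched with Python's standard library. The function below is a stand-in I named after the tool, not the tool itself, and it only covers the basic "print matching lines" case.

```python
import gzip
import re


def zgrep(pattern, path, max_hits=None):
    """Yield lines of a gzip-compressed text file that match a regular expression.

    Hypothetical stand-in for the zgrep command: the file is decompressed on
    the fly, line by line, and nothing is written to disk.
    """
    regex = re.compile(pattern)
    hits = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if regex.search(line):
                yield line.rstrip("\n")
                hits += 1
                if max_hits is not None and hits >= max_hits:
                    return  # stop scanning once enough matches were found
```

For example, zgrep(r"^1\t", "file.vcf.gz", max_hits=10) would give the first ten records on chromosome 1, assuming the chromosome is named "1" rather than "chr1".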
Old 06-01-2011, 01:32 PM   #6
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default

Hi bnfoguy,

Not quite a direct answer to your question, but I thought I would mention a tool I have been working on that addresses a similar issue of trying to view very large alignment files. It is called BAMseek and is available at http://code.google.com/p/bamseek/ . Currently, it works for BAM and SAM files but I could easily extend that to work on VCF files. This would allow you to at least view the file, get familiar with its contents, and even do some copying from the file. For more complex needs or repetitive tasks (such as extracting a large number of regions), then a knowledge of command line and scripting is always useful - such as perl or python. As mentioned above, vcftools might have what you need too.
Old 06-05-2011, 10:17 PM   #7
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default

Hi bnfoguy,

I added VCF support to the BAMseek large file viewer. You can find the download here. You should be able to open both the uncompressed text file and the compressed gz file. Let me know how it works out for you.

While adding the VCF support, I found out some additional information that you may find useful. Those .gz files on 1000 genomes are actually BGZF-compressed files - you can decompress them as usual using gzip but you can also jump to locations within the file and begin decompressing from there which allows you to extract chunks from the file without decompressing the whole thing.
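
To make the BGZF point a bit more concrete: as I understand the format, each BGZF block is an ordinary gzip member whose header carries an extra subfield tagged "BC" holding the block size, which is what lets a reader hop from block to block. The sketch below checks the first bytes of a file for that subfield. Both function names are my own, and make_bgzf_block exists only so the check can be tried without a real 1000 genomes file.

```python
import gzip
import struct
import zlib


def is_bgzf(first_bytes):
    """Return True if the bytes start with a gzip header carrying a 'BC' subfield."""
    if len(first_bytes) < 18:
        return False
    id1, id2, method, flags = first_bytes[:4]
    if (id1, id2, method) != (0x1F, 0x8B, 8) or not flags & 4:
        return False  # not gzip, or gzip without an extra field
    (xlen,) = struct.unpack_from("<H", first_bytes, 10)
    pos, end = 12, min(12 + xlen, len(first_bytes))
    while pos + 4 <= end:  # walk the extra subfields
        si1, si2, slen = struct.unpack_from("<2BH", first_bytes, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


def make_bgzf_block(data):
    """Build one BGZF-style block from a small bytes payload (illustration only)."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    deflated = compressor.compress(data) + compressor.flush()
    block_size_minus_1 = 18 + len(deflated) + 8 - 1
    header = struct.pack("<4BI2BH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    extra = struct.pack("<2BHH", ord("B"), ord("C"), 2, block_size_minus_1)
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + extra + deflated + trailer


block = make_bgzf_block(b"1\t100\trs1\n")
print(is_bgzf(block))                         # block built with the BC subfield
print(is_bgzf(gzip.compress(b"plain gzip")))  # ordinary gzip, no extra field
print(gzip.decompress(block))                 # still readable as normal gzip
```

For a real file you would pass in the first 18 or so bytes, e.g. open(path, "rb").read(64).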

You may have noticed that those files also have an associated .tbi file in the 1000 genomes repository. These are Tabix index files which allow you to extract all features overlapping a genomic region, without requiring you to download the entire file locally first.

For example,
Quote:
tabix ftp://ftp.1000genomes.ebi.ac.uk/vol1...4.sites.vcf.gz 1:2,000,000-2,100,000
would query the file on the ftp server and give you back all features on chromosome 1 between 2 million and 2.1 million. A nice thread on this subject can be found here.
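
If Tabix is not an option, the same kind of region query can be approximated in plain Python. To be clear, this sketch does a linear scan over lines you feed it and does not use the .tbi index at all, so it is far slower than Tabix on a big file. The function names are made up, and it compares only the POS column against the region, treating both ends as inclusive.

```python
def parse_region(region):
    """Split a region such as '1:2,000,000-2,100,000' into (chrom, start, end)."""
    chrom, _, span = region.partition(":")
    start_text, _, end_text = span.replace(",", "").partition("-")
    return chrom, int(start_text), int(end_text)


def records_in_region(lines, region):
    """Yield VCF data lines whose CHROM and POS fall inside the region.

    Hypothetical helper: a plain scan over an iterable of text lines, with no
    index involved. Header lines (starting with '#') are skipped.
    """
    chrom, start, end = parse_region(region)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t", 2)
        if len(fields) < 2:
            continue  # blank or malformed line
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line.rstrip("\n")
```

The lines argument can be any iterable of text, for instance a handle from gzip.open(path, "rt").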

Hope that helps!
Old 06-06-2011, 01:27 PM   #8
bnfoguy
Member
 
Location: New Jersey, USA

Join Date: May 2011
Posts: 17
Default

Hello BAMseek,

Thanks for your suggestions. I will try these and let you know if I have any success. Would you know of a way to read the .tbi file, such as opening it with Notepad or another Windows program?

Thank you again,

Bnfoguy
Old 06-06-2011, 05:33 PM   #9
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default

Hi Bnfoguy,
The .tbi files are external indexes that help locate regions within the .gz files. They use a binning scheme similar to the one used for fast range queries on BAM files. The Tabix program can create and use the indexes to do range queries, using commands like the one I showed in my earlier post. It looks like you have to build the program from source, which is most easily done on Linux or Mac. On Windows, you might have the most luck working with the TabixReader.java file that comes in the Tabix download, but I would think it takes some programming skill to turn that into a working program.

The .tbi files are not human readable and would really only need to be looked at if you were interested in understanding how the binning scheme works. The Tabix program would take care of building and using those indexes for you. In case you are interested in how the index works, here is the .tbi schema tabix.pdf
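
For the curious, the heart of the binning scheme is a small function that maps an interval to the smallest bin that fully contains it. The sketch below is my Python transcription of the reg2bin routine as I understand it from the BAM and Tabix documentation. Coordinates are zero-based and half-open, and the bin sizes run from 16kb up to 512Mb.

```python
def reg2bin(beg, end):
    """Return the bin number for the zero-based, half-open interval [beg, end).

    Transcribed from the reg2bin routine described for BAM/Tabix indexes;
    treat it as a sketch for understanding rather than a reference version.
    """
    end -= 1  # make the end inclusive for the comparisons below
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16kb bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128kb bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1Mb bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8Mb bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64Mb bins
    return 0  # the single 512Mb bin covering everything
```

A query then only has to look at the handful of bins that can overlap the requested region instead of scanning the whole file.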

Hope that at least partially answers your question. I know that is a lot to bite off.

BAMseek
Old 06-06-2011, 07:51 PM   #10
JohnK@Genome_Quest
Junior Member
 
Location: Worcester, MA

Join Date: Jun 2011
Posts: 7
Default

Quote:
Originally Posted by bnfoguy View Post
Hello,

I'm a first timer to bioinformatics research and have recently been working on a project involving genetic variations. I am supposed to use the latest 1k genomes release file which is "ALL.2of4intersection.2010084.genotypes.vcf.gz". I have downloaded the file, which is about 61.2GB, but am facing trouble in extracting its contents. I would really appreciate some guidance in this matter.
I used to use Cygwin when I used Windows. Once you figure out how Cygwin's files are set up and tied in to Windows, which should be fairly easy, you can navigate to whichever directory on your Windows system stores your data. Then you can use gunzip and all those other 'great' 'nix binaries:

http://cygwin.com/

Once you've done this, you can also use this basic perl command line (CML) template to pull out whichever columns you need:

< file_name perl -e 'while(<>){ $line = $_; ($var1, $var2, $var3) = split("\t", $_); print "$var1\n"; }' > new_file

You can modify this to get the job done, and once you figure out the basic syntax from above I'm sure you'll be a perl CML guru.
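
Since Python was asked about earlier in the thread, here is roughly the same column-picking idea as the perl template, sketched in Python. The function name select_columns is my own invention, and column numbers are zero-based here, so 0 is the first column.

```python
def select_columns(lines, indexes):
    """Yield the chosen tab-separated columns of each line, re-joined with tabs.

    Hypothetical helper: indexes are zero-based, and columns missing from a
    short line are silently skipped rather than raising an error.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[i] for i in indexes if i < len(fields))
```

To mimic the one-liner above you could loop over select_columns(sys.stdin, [0]) and print each result, after importing sys.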

Last edited by JohnK@Genome_Quest; 06-06-2011 at 07:54 PM.
Old 08-09-2012, 06:55 AM   #11
bpb9
Member
 
Location: NYC

Join Date: Aug 2012
Posts: 24
Default Split VCF by chromosome?

Hello, I am also new to the world of genomic datasets, tabix and vcf files. I understand that vcf files are basically giant text files, however due to their size I am unable to open them. I was able to upload my file on Galaxy and view it there, and extract just the columns I wanted, but the file is still huge--something like 20 million rows. If I could split up the file by chromosome, it would be manageable size for working on. However I can't figure out how to do this in Galaxy. Is there a specific program or command (in Galaxy or elsewhere) that can do this? It seems like such a simple task, but I can't find any obvious way to do it. If it helps, I am a Mac user. Thanks!
Old 08-09-2012, 05:52 PM   #12
Kennels
Senior Member
 
Location: Sydney

Join Date: Feb 2011
Posts: 149
Default

Quote:
Originally Posted by bpb9 View Post
Hello, I am also new to the world of genomic datasets, tabix and vcf files. I understand that vcf files are basically giant text files, however due to their size I am unable to open them. I was able to upload my file on Galaxy and view it there, and extract just the columns I wanted, but the file is still huge--something like 20 million rows. If I could split up the file by chromosome, it would be manageable size for working on. However I can't figure out how to do this in Galaxy. Is there a specific program or command (in Galaxy or elsewhere) that can do this? It seems like such a simple task, but I can't find any obvious way to do it. If it helps, I am a Mac user. Thanks!
Hi bpb9,

You can flexibly view or output the contents of your huge file by a series of commands in the terminal. awk, sed, grep, perl one liners would be good choices.

Assuming your .vcf file is tab delimited, you can use awk in the terminal. Open a terminal window on your mac, and 'cd' (change directory) into the directory where your file is saved. (press enter to execute a command)
e.g.
Code:
cd /home/user/directorycontainingyourfile
If you do not know which directory you are in when you open your terminal, type

Code:
pwd
It should show you 'where' you are. If you are unsure about directory structures in unix/linux, do some googling, it should become apparent pretty quick.

Then type the following:

Code:
awk ' { if ( $1 == "1") print $0 } ' filename.txt > output.txt
This means:
$1 refers to the first column, so if it equals 1 (the chromosome ID), awk prints every line that meets that condition. $0 means the entire line. You can substitute '1' with the name of your chromosome, e.g. chr1. You can test other columns by changing $1 to $2, etc. The '>' symbol means the output of the awk command is saved in a file called output.txt.

If you just want to look at what it does first, you can make it show you the lines without outputting to a file.

Code:
awk ' { if ( $1 == "1") print $0 } ' filename.txt | head
Same deal as above, except the '|' (pipe) takes the output from the awk command and passes it to head, which shows the first 10 lines of the output. You can vary the number of lines with the '-n' option, e.g. head -n 20 gives you 20 lines.

There are many options for these commands, so if you do some searching you should be able to do pretty flexible filtering.
Btw, these commands work on a Linux platform. You might need to adjust them on a Mac, but just try it out first.
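
If you want every chromosome in its own file in one go, rather than running awk once per chromosome, a single pass in Python could look something like the sketch below. The function name and the ".txt" suffix are my own choices. It assumes the chromosome is in the first tab-separated column and that chromosome names are safe to use as file names.

```python
import os


def split_by_chromosome(lines, out_dir, suffix=".txt"):
    """Write each data line to out_dir/<chromosome><suffix> in a single pass.

    Hypothetical helper: header lines (starting with '#') seen before a
    chromosome's first record are copied to the top of that chromosome's file.
    One file handle stays open per chromosome, which is fine for a few dozen.
    """
    handles = {}
    header = []
    try:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            if line.startswith("#"):
                header.append(line)
                continue
            chrom = line.split("\t", 1)[0]
            if chrom not in handles:
                handles[chrom] = open(os.path.join(out_dir, chrom + suffix), "w")
                handles[chrom].writelines(header)
            handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)  # the chromosome names that were written
```

You would pass it an open file, e.g. split_by_chromosome(open("filename.txt"), "."), and end up with 1.txt, 2.txt and so on in the current directory.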

hope that helps.
Old 08-10-2012, 03:54 AM   #13
xied75
Senior Member
 
Location: Oxford

Join Date: Feb 2012
Posts: 129
Default

To all who struggle with large files: go to http://mh-nexus.de/en/hxd/ and download HxD. You can use it to open and view a file of any size, at the speed of light.

Best,

dong
Old 08-10-2012, 09:51 AM   #14
bpb9
Member
 
Location: NYC

Join Date: Aug 2012
Posts: 24
Default

Thanks for the tips. I will definitely try this and let you know how it goes. I ended up using R yesterday to do this, but it took about an hour just to read in the file. After that I was able to split the files by converting the txt file to a data frame and splitting it into smaller files based on the value in the first column (the chromosome number):

Code:
# txtfilename is the table already read into R; name its columns first
colnames(txtfilename) <- c("CHR", "POSITION", "ID", "Allele1Ref", "Allele2Var", "Ancestral")
dataframe <- data.frame(txtfilename)
# Split into one data frame per chromosome, then write each to "<CHR>.txt"
apart <- split(dataframe, dataframe$CHR)
lapply(apart, function(x)
  write.table(x, quote = FALSE, row.names = FALSE, col.names = TRUE,
              file = paste(x$CHR[1], ".txt", sep = "")))

This way I have one file named after each chromosome.

The way you guys suggested is probably much faster!

P.S. How do you guys enter the code in a separate box like that?
Old 08-10-2012, 09:58 AM   #15
bpb9
Member
 
Location: NYC

Join Date: Aug 2012
Posts: 24
Default

Dong, it appears that program is only for Windows users.

Kennels, your way was SO much faster! Thanks!
Old 08-13-2012, 02:32 AM   #16
xied75
Senior Member
 
Location: Oxford

Join Date: Feb 2012
Posts: 129
Default

Quote:
Originally Posted by bpb9 View Post
Dong, it appears that program is only for Windows users.
Yes, but the thread title mentioned 'Windows' there.

Anyway, don't use R to process text files, that's not what it was designed for.
All times are GMT -8.

