New Resources for 1000 Genomes


laura

Senior Member

Join Date: Sep 2008

Posts: 151
#1

New Resources for 1000 Genomes

05-09-2011, 04:42 AM

General Info

As well as posting new announcements on the front page of http://www.1000genomes.org, we have both an RSS feed http://www.1000genomes.org/announcements/rss.xml and a Twitter account http://twitter.com/1000genomes.

You can also subscribe to an announcements list we have set up. http://listserver.1000genomes.org/ma...o/1000announce [email protected]

We have started an FAQ http://www.1000genomes.org/faq to help you find particular data sets associated with the 1000 genomes project and to answer other similar questions.

Data Search

You can now search both our website and our ftp site.

To search the main website you can use the search box which appears in the top right-hand corner of each page on http://www.1000genomes.org.

Our ftp search http://www.1000genomes.org/ftpsearch is linked from the menu bar at the top of each page. It is based on an index of the ftp site, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree, which is updated every night to reflect the contents of the ftp site.

The search itself will look for strings in the names of files and directories on the ftp site. This means the search can be used to find all vcf files or files associated with a particular release date or particular individual.

The search options allow you to include md5s in the output and have the ftp paths point to either the NCBI or the EBI ftp site. Because of the volume of results which would be returned, the search excludes fastq and bam files by default, but you can add these back to the results. Currently the search will only return the first 1000 results due to the large volume of files on the ftp site.
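To illustrate how a substring search over the nightly index behaves, here is a minimal Python sketch. It assumes current.tree has one entry per line with the path in the first tab-separated column. The function name `search_tree`, the excluded suffix list and the sample paths are made up for illustration. This is not the code behind the ftp search page.

```python
from typing import Iterable, Iterator

# Suffixes hidden by default, mirroring the search's default of leaving out
# fastq and bam files (the exact suffix list here is an assumption).
DEFAULT_EXCLUDED = (".fastq.gz", ".fastq", ".bam")


def search_tree(
    lines: Iterable[str],
    term: str,
    include_large: bool = False,
    limit: int = 1000,
) -> Iterator[str]:
    """Yield paths from a current.tree-style listing whose name contains term.

    Assumes one entry per line with the path in the first tab-separated column.
    """
    found = 0
    for line in lines:
        path = line.rstrip("\n").split("\t", 1)[0]
        if not path or term not in path:
            continue
        if not include_large and path.endswith(DEFAULT_EXCLUDED):
            continue
        yield path
        found += 1
        if found >= limit:  # the web search also stops at 1000 results
            return


# Hypothetical listing, only to show the shape of the input.
listing = [
    "ftp/release/20110521/example.chr20.vcf.gz\tfile\t123\n",
    "ftp/data/NA12878/alignment/NA12878.example.bam\tfile\t456\n",
    "ftp/data/NA12878/sequence_read/example_1.fastq.gz\tfile\t789\n",
    "ftp/release/20110521\tdirectory\t0\n",
]
vcf_hits = list(search_tree(listing, "vcf"))
```

With the defaults, searching this listing for NA12878 returns nothing because both matches are bam or fastq files. Passing include_large=True brings them back.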

Accessibility

Many of our releases contain very large files which can be challenging to download in their entirety. Both bam and vcf files have indexes which allow subsections to be downloaded using samtools or tabix respectively. There are descriptions of how to do this in our FAQ. We also now have a web-based tool within our Ensembl browser which allows you to request a 10KB subsection of these files.
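As a rough sketch of the samtools/tabix route, the Python below only builds the command lines for fetching one region. It assumes both tools are installed and that the index (.tbi or .bai) sits next to the remote file. The function names, the default output filename and the example URLs are hypothetical.

```python
from typing import List


def _region(chrom: str, start: int, end: int) -> str:
    """Format a 1-based inclusive region string such as 20:1000-2000."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return f"{chrom}:{start}-{end}"


def tabix_slice_cmd(vcf_url: str, chrom: str, start: int, end: int) -> List[str]:
    """Command that prints the VCF header plus the records in one region."""
    return ["tabix", "-h", vcf_url, _region(chrom, start, end)]


def samtools_slice_cmd(
    bam_url: str, chrom: str, start: int, end: int, out_path: str = "slice.bam"
) -> List[str]:
    """Command that writes the alignments overlapping one region to a new bam."""
    return ["samtools", "view", "-b", "-o", out_path, bam_url,
            _region(chrom, start, end)]


# Placeholder URLs; substitute a real indexed file from the ftp site.
vcf_cmd = tabix_slice_cmd("http://example.org/calls.vcf.gz", "20", 1000, 2000)
bam_cmd = samtools_slice_cmd("http://example.org/sample.bam", "20", 1000, 2000)
```

Either list can be passed to subprocess.run(cmd, check=True). For the tabix command you would normally capture or redirect standard output to save the slice.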

The Data Slicer (http://browser.1000genomes.org/tools.html) needs the URL of an indexed bam or vcf file and will then present a view of this file and a bam or vcf file to download. The Data Slicer can be accessed from the tools link in the top right-hand corner of all browser pages. It should work for any remotely accessible tabix-indexed vcf file. It will work for any indexed bam over http but may only work for ftp bams within the EBI.

You can also upload data from bam or vcf files on our ftp site into the browser. To do this you need to click on the manage your data link in the left-hand menu of a page. This is best done from Location view. The section of the menu you need to click on is labeled attach remote file. Only bam files from the EBI ftp site will be visible, but any remotely accessible vcf which is accompanied by a tabix index should work. Once your file is loaded you should be able to see the snps or aligned reads displayed and also share these links with others. This is described with screenshots in our Ensembl tutorial http://www.1000genomes.org/sites/100...l_20110506.doc

The browser also has a variant effect predictor tool which will take in up to 750 snps and indels in VCF format or an Ensembl-specific format. This tool provides functional consequences with respect to the current gene and regulatory annotation, including SIFT and PolyPhen predictions for any non-synonymous snps. http://browser.1000genomes.org/tools.html. You can also download
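If you have more than 750 variants, one way to stay under the limit is to split your VCF into batches before uploading. The following is a minimal sketch, assuming a plain-text VCF whose header lines all start with # and come first. The function name `batch_vcf` is made up here and is not part of the browser.

```python
from typing import Iterable, Iterator, List

MAX_VARIANTS = 750  # upload limit quoted above


def batch_vcf(
    lines: Iterable[str], batch_size: int = MAX_VARIANTS
) -> Iterator[List[str]]:
    """Yield lists of lines, each holding the header plus up to batch_size records."""
    header: List[str] = []
    batch: List[str] = []
    for line in lines:
        if line.startswith("#"):
            header.append(line)  # repeat the header in every batch
            continue
        if not line.strip():
            continue  # skip blank lines
        batch.append(line)
        if len(batch) == batch_size:
            yield header + batch
            batch = []
    if batch:
        yield header + batch


# Tiny synthetic input: 2 header lines and 1600 records.
example = ["##fileformat=VCFv4.1\n", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"]
example += [f"20\t{pos}\t.\tA\tG\t.\tPASS\t.\n" for pos in range(1, 1601)]
batches = list(batch_vcf(example))
```

Each batch can then be written to its own file and submitted separately.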

If you have any questions about these new features or any other aspects of the project please email [email protected]
laura

Senior Member

Join Date: Sep 2008

Posts: 151
#2

06-16-2011, 05:38 AM

We have also now added a public MySQL instance for the Ensembl databases which back our browser.

You can find more details of this on http://www.1000genomes.org/public-en...mysql-instance
laura

Senior Member

Join Date: Sep 2008

Posts: 151
#3

09-01-2011, 06:30 AM

Our browser has been updated to version 63 of the Ensembl code and we have a new Variation Pattern Finder tool to go alongside it.

http://browser.1000genomes.org/Homo_sapiens/UserData/VariationsMapVCF

http://www.1000genomes.org/variation-pattern-finder

The Data Slicer now also allows you to subsample vcf files by sample and population.

http://browser.1000genomes.org/Homo_sapiens/UserData/SelectSlice?db=core
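For anyone wanting to do the same kind of per-sample subsetting offline, here is a minimal Python sketch. It relies on the VCF layout of nine fixed columns followed by one column per sample. The function name `subset_samples` is made up, and this is not how the Data Slicer itself is implemented. To subset by population you would first need a list of the sample names in that population, which is assumed to come from elsewhere.

```python
from typing import Iterable, Iterator, List, Optional, Set

FIXED_COLUMNS = 9  # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT


def subset_samples(lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield VCF lines restricted to the fixed columns plus the samples in keep."""
    columns: Optional[List[int]] = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Work out once which column indexes to keep.
            columns = [
                i for i, name in enumerate(fields)
                if i < FIXED_COLUMNS or name in keep
            ]
        elif columns is None:
            raise ValueError("data line seen before the #CHROM header")
        yield "\t".join(fields[i] for i in columns)


# Synthetic three-sample example.
vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n",
    "20\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1\n",
]
subset = list(subset_samples(vcf, {"S1", "S3"}))
```

Note that this only drops columns. Any allele counts or frequencies stored in the INFO column still describe the full sample set and are not recalculated.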

laura

Senior Member

Join Date: Sep 2008

Posts: 151
#4

03-01-2012, 03:17 AM

We now have a tutorial about using 1000 genomes data

http://www.1000genomes.org/announcements/using-1000-genomes-data-tutorial-2012-03-01

Joann

Senior Member

Join Date: Oct 2008

Posts: 233
#5

03-29-2012, 07:12 AM

Amazon puts it in the cloud

s3.amazonaws.com/1000genomes

1000 Genomes - Registry of Open Data on AWS

http://aws.amazon.com/1000genomes/

Last edited by Joann; 03-29-2012, 07:15 AM. Reason: additional link
Richard Finney

Senior Member

Join Date: Feb 2009

Posts: 700
#6

03-29-2012, 09:46 AM

Amazon 1000 genomes?

From the Amazon blog: "Researchers pay only for the additional AWS resources they need for further processing or analysis of the data."

I'm guessing that's the "gotcha": you can view chunks for free (which you can anyway ... from other sources) but you get to pay for analyzing it.

I am wary of this "we'll keep the data and you can pay us" concept of "the cloud".

I think a better model would be: here's a shell login to your own VM and you can write or use your own python/java/c/bash programs to quickly access the 200TB.

I wish TCGA would do something like this but the data is locked down pretty hard. Maybe we'll get some open access disease samples as more Asian countries provide less encumbered data.
laura

Senior Member

Join Date: Sep 2008

Posts: 151
#7

07-02-2012, 01:51 AM

http://www.1000genomes.org/announcements/phase-1-analysis-results-including-chry-and-chrmt-variant-calls-2012-07-02

A relatively complete set of variant and other files associated with our Phase 1 analysis is now available on the ftp site.
gsgs

Senior Member

Join Date: Oct 2009

Posts: 139
#8

12-05-2012, 07:46 AM

currently I estimate (wild guess) you have ~500 complete human genomes (1500GB)
at ~10fold coverage but they are scattered in lots of different formats and directories
and it would take me ~10 hours to figure out how to find the data and decompress and
convert it and another ~5 hours to just download the compressed data

I'd like to see the estimates of others

----------new estimates-------
they have all 1092 genomes(people,"samples") sequenced at 2-6 fold coverage
(which I assume means that they have lots of small segments (~500 nucleotides
per segment ?) from the genome and those may have many errors but overlap
the genome at ~2-6 fold at each position)
critical positions, those with expected mutations overlap more often (50-100 fold)
So they have a total of ~2e13 overlapping nucleotides

the data is in "vcf" files with complicated format, so I stay with my estimate
of ~10hours work to convert them into a workable format.

The data could be ~700MB only, the y-chr came in 2 files of 29MB compressed
-------------------------------------------------
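The ~2e13 figure can be checked with a quick back-of-the-envelope calculation. A minimal Python sketch, assuming a genome length of about 3.1e9 bases (that number is my assumption, not from the thread) and ignoring the deeper coverage at the "critical positions":

```python
def total_sequenced_bases(n_samples: int, genome_length: float, coverage: float) -> float:
    """Approximate number of sequenced bases: samples x genome length x coverage."""
    return n_samples * genome_length * coverage


GENOME_LENGTH = 3.1e9  # assumed approximate human genome length in bases

low = total_sequenced_bases(1092, GENOME_LENGTH, 2)
high = total_sequenced_bases(1092, GENOME_LENGTH, 6)
print(f"{low:.1e} to {high:.1e} bases")
```

Under these assumptions the range is roughly 7e12 to 2e13 bases, so the ~2e13 estimate corresponds to the 6-fold end of the range.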

Last edited by gsgs; 12-05-2012, 08:52 PM.
laura

Senior Member

Join Date: Sep 2008

Posts: 151
#9

12-05-2012, 07:50 AM

What would you like to do with the data? That will very much determine the best way to approach the data set.

1000 genomes is a large data set with a variety of different data formats, but to answer a single question you rarely need more than one sort of file.
gsgs

Senior Member

Join Date: Oct 2009

Posts: 139
#10

12-05-2012, 07:55 AM

I don't know yet.
Probably compare them, #mutations,distances
calculate the consensus,ancestor, plot the distances,
make my cloud-graphics(plot amino acid mutations over nucleotide mutations),
and mutation pictures(binary arrays,sequences over positions,pixel
at (x,y) iff x differs from consensus at position y) etc.

maybe this also works for "STR"s over normal mutations (these are new to me)

calculate recombination frequency
estimate mutation rates and what changes them
statistics of codon-usage
search for retrovirus

Last edited by gsgs; 12-05-2012, 11:49 AM.
laura

Senior Member

Join Date: Sep 2008

Posts: 151
#11

12-05-2012, 08:19 AM

I would strongly recommend starting with our recent paper and the analysis results associated with it

http://www.nature.com/nature/journal/v491/n7422/full/nature11632.html

ftp://ftp.1000genomes.ebi.ac.uk/vol1...lysis_results/

That is a great starting point
gsgs

Senior Member

Join Date: Oct 2009

Posts: 139
#12

12-05-2012, 08:37 AM

thanks.
10 pages the paper (pdf) ... printing...
2 pages the readme
that will keep me busy for a while ...
well, I'll probably only read and understand parts of it

I know, there is also the "hapmap" project, I managed to get
one of their tables into computer and analyze
laura

Senior Member

Join Date: Sep 2008

Posts: 151
#13

12-05-2012, 10:38 AM

Do feel free to email [email protected] if you have any questions.

We also have a recent set of slides which were presented in a tutorial at ASHG2012.

http://www.1000genomes.org/announcements/1000-genomes-tutorial-and-poster-slides-ashg2012-2012-11-09

gsgs

Senior Member

Join Date: Oct 2009

Posts: 139
#14

12-05-2012, 11:07 AM

no Y-chromosome ?

how would I pack the data ?
I want the 1092*36.7M SNPs in 23 binary files, one per chromosome.
Bit i in chromosome j in file(sample) k should be set, iff that SNP is present.
Then compressed with gzip.
23 files, ~50MB per file, I estimate
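A minimal Python sketch of that packing, for one sample and one chromosome at a time. It assumes the presence/absence flags have already been extracted from the vcf, and the function names are made up for illustration.

```python
import gzip
from typing import List, Sequence


def pack_bits(present: Sequence[bool]) -> bytes:
    """Pack presence flags into bytes, most significant bit first."""
    out = bytearray((len(present) + 7) // 8)
    for i, flag in enumerate(present):
        if flag:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def unpack_bits(data: bytes, n_bits: int) -> List[bool]:
    """Inverse of pack_bits for the first n_bits positions."""
    return [bool(data[i // 8] & (0x80 >> (i % 8))) for i in range(n_bits)]


def compress_flags(present: Sequence[bool]) -> bytes:
    """Bit-pack the flags and gzip the result."""
    return gzip.compress(pack_bits(present))


# Nine SNP positions for one sample; bit i is set iff SNP i is present.
flags = [True, False, False, False, False, False, False, True, True]
packed = pack_bits(flags)
restored = unpack_bits(gzip.decompress(compress_flags(flags)), len(flags))
```

For scale, 1092 samples times 36.7M SNPs is about 4e10 bits, or roughly 5 GB before gzip. How close the compressed files get to ~50MB each depends on how sparse the bit vectors are.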
gsgs

Senior Member

Join Date: Oct 2009

Posts: 139
#15

12-05-2012, 11:38 AM

wait, I have a better idea.
You compute the genetical distance between any pair of two samples, 1092^2 integers,4MB.
Just the number of set bits in the logical xor of the two 37M-bit-vectors.
Then you (circular) sort the 1092 samples so the sum of the distances between two neighbors
is minimal (traveling salesman problem, typically easy to solve for n=1092)
Then you compute the logical xors of any two adjacent samples, which presumably has lots of zeros.
1092 binary vectors of length 37M again, but this time with much better compression
via gzip or such because of the many zeros.
I can write you the programs for encoding and decoding, if you want.
Self-expanding executable, easy to use, all automatic.
The size of that file would be a measure of the genetical variability of your set of 1092 samples.
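A minimal Python sketch of the idea, with each sample's bit vector held as a Python int. Two simplifications, both mine: the ordering uses a greedy nearest-neighbour walk instead of solving the traveling salesman problem, and the tour is an open path and not circular. The function names are made up.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Number of set bits in a xor b, i.e. positions where the vectors differ."""
    return bin(a ^ b).count("1")


def greedy_order(vectors: List[int]) -> List[int]:
    """Order samples by repeatedly moving to the nearest unvisited one.

    This is a nearest-neighbour heuristic, not an optimal tour.
    """
    if not vectors:
        return []
    order = [0]
    remaining = set(range(1, len(vectors)))
    while remaining:
        last = vectors[order[-1]]
        nxt = min(remaining, key=lambda i: (hamming(last, vectors[i]), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def delta_encode(vectors: List[int], order: List[int]) -> List[int]:
    """Keep the first vector, then the xor of each vector with its predecessor."""
    if not order:
        return []
    encoded = [vectors[order[0]]]
    for prev, cur in zip(order, order[1:]):
        encoded.append(vectors[prev] ^ vectors[cur])
    return encoded


def delta_decode(encoded: List[int], order: List[int]) -> List[int]:
    """Rebuild the original vectors, in their original positions."""
    vectors = [0] * len(order)
    current = 0
    for idx, delta in zip(order, encoded):
        current ^= delta
        vectors[idx] = current
    return vectors


# Four toy samples with 4 SNP positions each.
samples = [0b1111, 0b0000, 0b1110, 0b0001]
order = greedy_order(samples)
encoded = delta_encode(samples, order)
decoded = delta_decode(encoded, order)
```

The xor vectors in encoded are what would then be bit-packed and gzipped. The order list has to be stored alongside them so the decoder can undo the sort.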