SEQanswers

Old 01-29-2010, 06:45 AM   #1
bair
Member
 
Location: London

Join Date: Jan 2010
Posts: 65
Reference genome

The 1000 Genomes Project uses a GRCh37-based reference (the reference FASTA file is generated in three steps, Ensembl 56), but its reference SNPs come from dbSNP129 mapped to NCBI36.3 (built in Aug 2009).

dbSNP130 was built on May 03, 2009. Is there any reason dbSNP130 was not chosen by 1000 Genomes? Does it matter that NCBI36.3 is used for the SNPs while GRCh37 (NCBI37) is used for the genome?

The SNP annotation in Ensembl Variation 56 differs from the NCBI dbSNP130 annotation (positions may differ), and I'm not sure the FASTA file 1000 Genomes used is the same as the NCBI37/hg19 genome.

hg18 (NCBI36.1) is a heavily used reference, so there are published results available for comparison, but more and more results are coming from 1000 Genomes. I'm confused about which version of the reference to use.

Any suggestions?
Old 01-31-2010, 01:39 PM   #2
whsqwghlm
Member
 
Location: Cambridge, UK

Join Date: Jun 2009
Posts: 14

Second-guessing 1000 Genomes, I presume that they used the latest versions of the assembly and dbSNP that were available when they started their production build, which will have been a few months before the final release. Now that Ensembl and, I presume, NCBI offer tools for converting coordinates between one assembly and another, my suggestion would be to use the latest assembly for any new work and update legacy data as required.
Old 02-01-2010, 03:57 AM   #3
bair
Member
 
Location: London

Join Date: Jan 2010
Posts: 65

Thank you, whsqwghlm. I found the website for downloading the conversion tools and other programs. It's very helpful, so I can use the latest assembly without any problem.
Old 02-03-2010, 02:55 PM   #4
bair
Member
 
Location: London

Join Date: Jan 2010
Posts: 65

Hello whsqwghlm,

Coordinate conversion cannot map 100% of positions, so some parts remain unmapped. If we use the latest assembly for alignment, the unmapped parts will affect comparisons between analyses. For example, dbSNP130 has HuRef information, which is based on the NCBI36.3 assembly. How can the unmapped parts be linked to GRCh37?
Old 02-04-2010, 01:55 AM   #5
whsqwghlm
Member
 
Location: Cambridge, UK

Join Date: Jun 2009
Posts: 14

Hi bair,
You have identified the key limitation of the approach. However, I'm trusting that the GRC did a good job on the update, i.e. one can assume that any regions of build 36 that did not map to build 37 had underlying problems, and hence any information mapped to these regions should be treated with suspicion.
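
To make the mapped/unmapped split concrete, here is a minimal sketch of how coordinate conversion between two assemblies can work in principle. It is not the Ensembl or NCBI tool itself: the block list, the names `lift_position` and `split_mapped`, and all the numbers are invented for illustration. The idea is that positions inside a block shared by both builds get a new coordinate, and positions that fall between blocks are reported as unmapped so they can be reviewed separately.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Each block is (old_start, new_start, length): an ungapped stretch assumed to be
# shared by both assemblies. Real tools read such blocks from a mapping file;
# the values below are hypothetical.
Block = Tuple[int, int, int]


def lift_position(blocks: List[Block], pos: int) -> Optional[int]:
    """Map a 0-based position on the old build to the new build, or None."""
    starts = [b[0] for b in blocks]  # blocks must be sorted by old_start
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None  # before the first block
    old_start, new_start, length = blocks[i]
    if pos >= old_start + length:
        return None  # falls in a gap between blocks: unmapped
    return new_start + (pos - old_start)


def split_mapped(blocks: List[Block], positions: List[int]) -> Tuple[Dict[int, int], List[int]]:
    """Separate positions into those that convert and those that do not."""
    mapped: Dict[int, int] = {}
    unmapped: List[int] = []
    for p in positions:
        q = lift_position(blocks, p)
        if q is None:
            unmapped.append(p)
        else:
            mapped[p] = q
    return mapped, unmapped


HYPOTHETICAL_BLOCKS: List[Block] = [(0, 0, 1000), (1500, 1200, 500), (3000, 1700, 2000)]
mapped, unmapped = split_mapped(HYPOTHETICAL_BLOCKS, [10, 1200, 1600, 2500, 4000])
print(mapped)    # old position -> new position
print(unmapped)  # positions to treat with suspicion
```

Keeping the unmapped list rather than silently dropping it makes it possible to see how much legacy data is affected before deciding whether it matters for a comparison.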
Old 02-04-2010, 03:32 AM   #6
bair
Member
 
Location: London

Join Date: Jan 2010
Posts: 65

Although dbSNP130 is based on NCBI36.3, Ensembl biomaRt provides SNP coordinates based on GRCh37; I'm not sure whether some SNPs have null coordinates.

Does anybody know where to download the NCBI36.3 reference assembly for alignment? It's easy to find the hg18/hg19 whole-genome sequences, but hard to find NCBI36.3.

Does anybody know the IDs for James Watson's and J. C. Venter's genome sequences in 1000 Genomes, like NA18507 for Yoruba?
Old 03-11-2010, 05:04 AM   #7
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133

Hi, I think this is what you are after:

ftp://ftp.ncbi.nih.gov/genomes/H_sap...VE/BUILD.36.3/

I am trying to do the same, but I am stuck now. How can I align against the whole genome? Do I have to merge all the chromosome files into one using cat?

Dave
Old 03-11-2010, 06:31 AM   #8
bair
Member
 
Location: London

Join Date: Jan 2010
Posts: 65

Quote:
Originally Posted by dnusol View Post
Hi, I think this is what you are after

ftp://ftp.ncbi.nih.gov/genomes/H_sap...VE/BUILD.36.3/


I am trying to do the same, but I am stuck now. How can I align against the whole genome? Do I have to merge all the chromosome files in one using cat?

Dave
Thanks for your email.

Yes, you have to merge them all into one FASTA file.
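
For anyone who prefers a script to `cat`, the merge can be sketched in Python as below. This is only one possible way to do it: the function name `merge_fasta`, the output file name, and the local directory layout in the usage comment are assumptions, and the glob pattern is guessed from the `hs_ref_chrY.fa.gz` naming seen on the FTP site. It decompresses `.gz` inputs on the fly and makes sure a file that lacks a trailing newline does not get glued to the next header.

```python
import gzip
from pathlib import Path
from typing import Iterable


def merge_fasta(parts: Iterable[Path], merged: Path) -> int:
    """Concatenate FASTA files (plain or .gz) into one plain FASTA file.

    Returns the number of input files written.
    """
    count = 0
    with open(merged, "wb") as out:
        for part in parts:
            opener = gzip.open if part.suffix == ".gz" else open
            last = b"\n"
            with opener(part, "rb") as src:
                # Copy in 1 MiB chunks so whole chromosomes never sit in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                out.write(b"\n")  # keep the next ">" header on its own line
            count += 1
    return count


# Example with a hypothetical local mirror of the per-chromosome directories:
# parts = sorted(Path("BUILD.36.3").glob("CHR_*/hs_ref_chr*.fa.gz"))
# merge_fasta(parts, Path("ncbi36.3_ref.fa"))
```

Note that a plain sorted glob orders chromosomes as text (1, 10, 11, ...), so pass an explicit list if the order of sequences in the merged file matters to the aligner or downstream tools.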
Old 03-11-2010, 07:42 AM   #9
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133

Hi, thanks for the help. I can't find the mtDNA though; does anyone know where it is?

Edit: Oops, found it! It was there all the time!
Old 03-11-2010, 11:45 AM   #10
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

Could you share where this information is available from the 1000 Genomes Project? Their website is not that detailed, I guess.

Quote:
Originally Posted by bair View Post
Human 1000genomes is using GRCh37 based reference (three steps to generate the reference fasta file, ensembl 56), but the reference snp is from dbsnp129 mapped to NCBI36.3 (built in Aug, 2009)
Old 07-18-2010, 09:49 PM   #11
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140

When I click on the link above and choose CHR_Y, I get this subdirectory:

FTP directory /genomes/H_sapiens/ARCHIVE/BUILD.36.3/CHR_Y/ on ftp.ncbi.nih.gov

--------------------------------------------------------------------------------
Up one level

03/14/2008 12:00 153,864 hs_alt_chrY_Celera.asn.gz
03/14/2008 12:00 2,874,402 hs_alt_chrY_Celera.fa.gz
03/14/2008 12:00 4,074,665 hs_alt_chrY_Celera.gbk.gz
03/14/2008 12:00 48,082 hs_alt_chrY_Celera.gbs.gz
03/17/2008 12:00 3,057,950 hs_alt_chrY_Celera.mfa.gz
03/14/2008 12:00 156,296 hs_alt_chrY_HuRef.asn.gz
03/14/2008 12:00 5,491,065 hs_alt_chrY_HuRef.fa.gz
03/14/2008 12:00 7,785,563 hs_alt_chrY_HuRef.gbk.gz
03/14/2008 12:00 100,498 hs_alt_chrY_HuRef.gbs.gz
03/17/2008 12:00 5,842,179 hs_alt_chrY_HuRef.mfa.gz
03/04/2008 12:00 299,047 hs_ref_chrY.asn.gz
03/04/2008 12:00 7,534,471 hs_ref_chrY.fa.gz
03/04/2008 12:00 10,703,507 hs_ref_chrY.gbk.gz
03/04/2008 12:00 176,727 hs_ref_chrY.gbs.gz
03/05/2008 12:00 8,008,160 hs_ref_chrY.mfa.gz


But apparently none of these files are aligned; they are only ~26MB, but the HapMap files refer to positions >57M.

I also got hg18 from UCSC (60MB), but the positions don't match.

GenBank has info on the builds:
http://www.ncbi.nlm.nih.gov/projects...ta/index.shtml

So in build 36, CHR_Y has length 57772954 (NC_000024.8), but the link has only the info, not the nucleotides:
http://www.ncbi.nlm.nih.gov/nuccore/NC000024.8

They have build 36 and build 37, but not build 36.3.
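
A quick way to check whether a downloaded file can carry positions above 57M is to count the bases in each FASTA record and compare them with the expected chromosome length. The sketch below is only a diagnostic under stated assumptions: the helper names `fasta_lengths` and `open_text` are made up here, and it does not say why a file is shorter than expected. If a file turns out to hold several short records instead of one chromosome-length record, its coordinates are per record and will not line up with chromosome positions directly.

```python
import gzip
from typing import Dict, IO


def fasta_lengths(handle: IO[str]) -> Dict[str, int]:
    """Return the number of sequence characters in each FASTA record."""
    lengths: Dict[str, int] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # record ID is the first token
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # counts N as well as A/C/G/T
    return lengths


def open_text(path: str) -> IO[str]:
    """Open a plain or gzip-compressed file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


# Example, assuming the file has been downloaded to the working directory:
# with open_text("hs_ref_chrY.fa.gz") as fh:
#     lengths = fasta_lengths(fh)
# print(len(lengths), "records,", sum(lengths.values()), "bases in total")
```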


I'm trying to match the positions of the files in
http://hapmap.ncbi.nlm.nih.gov/downl...mat/consensus/
and
http://hapmap.ncbi.nlm.nih.gov/downl...t/polymorphic/


I don't understand the meaning of column 2, "allele", in those files. Are the many letter pairs at the end of each line presumably different people in that group, sampled on 2 different machines?

Last edited by gsgs; 07-19-2010 at 12:54 AM.

All times are GMT -8.