
  • Reference genome

    The 1000 Genomes Project uses a GRCh37-based reference (the reference FASTA file is generated in three steps; Ensembl 56), but its reference SNPs come from dbSNP129 mapped to NCBI36.3 (built in Aug 2009).

    dbSNP130 was built on May 03, 2009. Is there any reason 1000 Genomes did not choose dbSNP130? Does it matter that the SNPs are on NCBI36.3 while the genome is on NCBI37?

    The SNP annotation from Ensembl Variation 56 differs from the NCBI dbSNP130 annotation (positions may differ), and I'm not sure the FASTA file used by 1000 Genomes is the same as the NCBI37/hg19 genome.

    hg18 (NCBI36.1) is a heavily used reference genome, so there are published results that can be used for comparison, but more and more results are coming from 1000 Genomes. I'm confused about which version of the reference to use.

    Any suggestions?

  • #2
    Second-guessing 1000 Genomes, I presume that they used the latest versions of the assembly and dbSNP that were available when they started their production build, which would have been a few months before the final release. Now that Ensembl and, I presume, NCBI offer tools for converting coordinates between one assembly and another, my suggestion would be to use the latest assembly for any new work and update legacy data as required.



    • #3
      Thank you, whsqwghlm. I found the website for downloading the conversion tools and other programs; it's very helpful. So I can use the latest assembly without any problem.



      • #4
        Hello whsqwghlm,

        Since coordinate conversion cannot match 100%, some parts remain unmapped. If we use the latest assembly for alignment, the unmapped parts will affect comparisons between analyses. For example, dbSNP130 includes HuRef information, which is based on the NCBI36.3 assembly; how can the unmapped parts be linked to GRCh37?



        • #5
          Hi Blair,
          You have identified the key limitation of the approach. However, I'm trusting that the GRC did a good job on the update. That is, one can assume that any regions in build 36 that did not map to build 37 had underlying problems, so any information mapped to those regions should be treated with suspicion.
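
The unmapped-region issue raised in posts #4 and #5 can be illustrated with a small sketch. It assumes the old-to-new assembly mapping is available as a list of ungapped aligned blocks, which is roughly how chain-style conversion files describe it. The block values, `TOY_BLOCKS`, and the function names below are made up for illustration; they are not real NCBI36 to GRCh37 data and not the API of any conversion tool.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Each block is (old_start, new_start, length), 0-based, half-open.
# Hypothetical illustration values, NOT real NCBI36 -> GRCh37 alignments.
Block = Tuple[int, int, int]


def build_index(blocks: List[Block]) -> Tuple[List[int], List[Block]]:
    """Sort blocks by old_start and return (starts, blocks) for binary search."""
    ordered = sorted(blocks)
    return [b[0] for b in ordered], ordered


def convert(pos: int, starts: List[int], blocks: List[Block]) -> Optional[int]:
    """Map an old-assembly position to the new assembly, or None if unmapped."""
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    old_start, new_start, length = blocks[i]
    if pos < old_start + length:
        return new_start + (pos - old_start)
    return None  # position falls in a gap between aligned blocks


def split_mapped(
    positions: List[int], blocks: List[Block]
) -> Tuple[Dict[int, int], List[int]]:
    """Separate positions into converted ones and ones to flag as unmapped."""
    starts, ordered = build_index(blocks)
    mapped: Dict[int, int] = {}
    unmapped: List[int] = []
    for p in positions:
        q = convert(p, starts, ordered)
        if q is None:
            unmapped.append(p)
        else:
            mapped[p] = q
    return mapped, unmapped


TOY_BLOCKS: List[Block] = [(0, 0, 100), (150, 120, 50), (300, 400, 20)]
mapped, unmapped = split_mapped([10, 120, 160, 319, 320], TOY_BLOCKS)
print(mapped)    # {10: 10, 160: 130, 319: 419}
print(unmapped)  # [120, 320]
```

Keeping the unmapped positions in a separate list, rather than dropping them silently, makes it possible to report how much of a legacy data set is affected before comparing results across assemblies.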



          • #6
            Though dbSNP130 is NCBI36.3-based, Ensembl biomaRt provides SNP coordinates based on GRCh37; I'm not sure whether some SNPs have null coordinates.

            Does anybody know where to download the NCBI36.3 reference assembly for alignment? It's easy to find the hg18/hg19 whole-genome sequences, but hard to find NCBI36.3.

            Does anybody know the accession IDs for James Watson's and J. C. Venter's genome sequences on 1000 Genomes, like NA18507 for Yoruba?



            • #7
              Hi, I think this is what you are after

              ftp://ftp.ncbi.nih.gov/genomes/H_sap...VE/BUILD.36.3/


              I am trying to do the same, but I am stuck now. How can I align against the whole genome? Do I have to merge all the chromosome files into one using cat?

              Dave



              • #8
                Originally posted by dnusol
                Hi, I think this is what you are after

                ftp://ftp.ncbi.nih.gov/genomes/H_sap...VE/BUILD.36.3/


                I am trying to do the same, but I am stuck now. How can I align against the whole genome? Do I have to merge all the chromosome files into one using cat?

                Dave

                Thanks for your email.

                Yes, you have to merge them all into one FASTA file.
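
The merge discussed above is what `cat` does on uncompressed files. As a minimal sketch of the same step for files that are still gzipped, the Python function below concatenates per-chromosome FASTA files into a single uncompressed file. The function name `merge_fasta` and the directory in the usage note are assumptions for illustration; only the `hs_ref_chr*.fa.gz` naming pattern comes from the FTP listing in this thread.

```python
import gzip
from pathlib import Path
from typing import Iterable


def merge_fasta(inputs: Iterable[Path], output: Path) -> None:
    """Concatenate FASTA files (plain or .gz) into one uncompressed FASTA file."""
    with open(output, "wb") as out:
        for path in inputs:
            # Pick the reader from the file extension.
            opener = gzip.open if path.suffix == ".gz" else open
            last = b"\n"
            with opener(path, "rb") as src:
                # Stream in 1 MiB chunks so whole chromosomes never sit in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            # If a file lacks a trailing newline, add one so the next
            # ">" header does not get glued onto the last sequence line.
            if last != b"\n":
                out.write(b"\n")
```

A possible call, assuming the files were downloaded into a hypothetical local directory `build36_3/`, would be `merge_fasta(sorted(Path("build36_3").glob("hs_ref_chr*.fa.gz")), Path("build36_3_all.fa"))`. Note that `sorted` orders names as text (chr10 before chr2), so pass an explicit list if the aligner or downstream tools expect a particular chromosome order.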



                • #9
                  Hi, thanks for the help. I can't find the mtDNA though; does anyone know where it is?

                  Edit: Oops, found it! It was there all the time!



                  • #10
                    Could you share where this information is available on the 1000 Genomes Project site? Their website is not that detailed, I guess.

                    Originally posted by bair
                    Human 1000genomes is using GRCh37 based reference (three steps to generate the reference fasta file, ensembl 56), but the reference snp is from dbsnp129 mapped to NCBI36.3 (built in Aug, 2009)
                    --
                    bioinfosm



                    • #11
                      When I click on the link above and choose CHR_Y, I get this
                      subdirectory:

                      FTP directory /genomes/H_sapiens/ARCHIVE/BUILD.36.3/CHR_Y/ at ftp.ncbi.nih.gov

                      --------------------------------------------------------------------------------
                      Up one level

                      03/14/2008 12:00 153,864 hs_alt_chrY_Celera.asn.gz
                      03/14/2008 12:00 2,874,402 hs_alt_chrY_Celera.fa.gz
                      03/14/2008 12:00 4,074,665 hs_alt_chrY_Celera.gbk.gz
                      03/14/2008 12:00 48,082 hs_alt_chrY_Celera.gbs.gz
                      03/17/2008 12:00 3,057,950 hs_alt_chrY_Celera.mfa.gz
                      03/14/2008 12:00 156,296 hs_alt_chrY_HuRef.asn.gz
                      03/14/2008 12:00 5,491,065 hs_alt_chrY_HuRef.fa.gz
                      03/14/2008 12:00 7,785,563 hs_alt_chrY_HuRef.gbk.gz
                      03/14/2008 12:00 100,498 hs_alt_chrY_HuRef.gbs.gz
                      03/17/2008 12:00 5,842,179 hs_alt_chrY_HuRef.mfa.gz
                      03/04/2008 12:00 299,047 hs_ref_chrY.asn.gz
                      03/04/2008 12:00 7,534,471 hs_ref_chrY.fa.gz
                      03/04/2008 12:00 10,703,507 hs_ref_chrY.gbk.gz
                      03/04/2008 12:00 176,727 hs_ref_chrY.gbs.gz
                      03/05/2008 12:00 8,008,160 hs_ref_chrY.mfa.gz


                      But apparently none of these files are aligned: they are only ~26 MB,
                      whereas the HapMap files refer to positions above 57M.

                      I also got hg18 from UCSC (60 MB), but the positions don't match.


                      GenBank has info on the builds:


                      So in build 36, chrY (NC_000024.8) has length 57772954,
                      but the link has only the info, not the nucleotides.


                      They have build 36 and build 37, but not build 36.3.


                      I'm trying to match the positions of the files in

                      and



                      I don't understand the meaning of column 2 ("allele") in those files.
                      The many letter pairs at the end of each line are presumably different
                      people in that group, sampled on 2 different machines?
                      Last edited by gsgs; 07-18-2010, 11:54 PM.
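
The size mismatch described in post #11 (files of ~26 MB against a chrY length of 57772954) can be checked by counting bases per FASTA record instead of comparing file sizes. The sketch below reports, for each record, the total number of bases and how many of them are `N`. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that a file holds contigs without the gap bases that assembled chromosome coordinates include. The function name `fasta_lengths` is hypothetical.

```python
import gzip
from pathlib import Path
from typing import Dict, Optional, Tuple


def fasta_lengths(path: Path) -> Dict[str, Tuple[int, int]]:
    """Return {record_name: (total_bases, n_bases)} for a FASTA file (plain or .gz)."""
    opener = gzip.open if path.suffix == ".gz" else open
    stats: Dict[str, Tuple[int, int]] = {}
    name: Optional[str] = None
    total = 0
    n_count = 0
    with opener(path, "rt") as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith(">"):
                # Store the finished record before starting the next one.
                if name is not None:
                    stats[name] = (total, n_count)
                parts = line[1:].split()
                name = parts[0] if parts else ""  # first word of the header
                total = 0
                n_count = 0
            elif name is not None:
                total += len(line)
                n_count += line.upper().count("N")
    if name is not None:
        stats[name] = (total, n_count)
    return stats
```

If the sum of `total_bases` over all records in a file is far below 57772954, its coordinates cannot be the chromosome coordinates used by the HapMap files. A single record close to that length with a large `n_bases` count would instead suggest an assembled chromosome with gaps filled by `N`.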
