  • Human Reference Genome

    Hello everyone!

    Sorry to bother the more experienced people with a beginner question like this, but which human reference genome should one use nowadays, and where can I download it in FASTA format? Are there different reference genomes that yield differing results when aligning data?

    When browsing the NCBI homepage I found a remark somewhere that one should use the same reference genome as the 1000 Genomes Project, but the links only led me to an FTP server page (ftp://ftp-trace.ncbi.nih.gov/1000gen...cal/reference/) that looks like this:

    Oct 08 2009 00:00 579 README.human_g1k_v37.fasta.txt
    Aug 27 2009 00:00 136 README_gencode_gtf_format
    Aug 13 2009 00:00 4313 SNPChrPosAllele_b129.README
    Aug 13 2009 00:00 189073716 SNPChrPosAllele_b129.txt.gz
    Oct 29 2010 00:00 Directory ancestral_alignments
    Nov 03 2010 00:00 398589572 dbsnp132_20101103.vcf.gz
    Oct 13 2011 02:31 Directory exome_pull_down_targets
    Jul 22 2010 00:00 8930799 gencode.v4.pc_translations.fa.gz
    Jul 22 2010 00:00 594881 gencode.v4.polyAs.GRCh37.gtf.gz
    Jul 22 2010 00:00 15059 gencode.v4.tRNAs.GRCh37.gtf.gz
    Jul 02 2010 00:00 21227244 gencode_v4.annotation.GRCh37.gtf.gz
    Oct 27 2010 00:00 1396 human_ancestor_GRCh37_e59.README
    Oct 27 2010 00:00 794022511 human_ancestor_GRCh37_e59.tar.bz2
    May 17 2010 00:00 2746 human_g1k_v37.fasta.fai
    May 17 2010 00:00 892331003 human_g1k_v37.fasta.gz
    Nov 01 2010 00:00 33054817 merge_rs_b129_b132.txt.gz
    Sep 23 2011 02:32 Directory phase2_mapping_resources
    Jul 13 2011 02:34 Directory phase2_reference_assembly_sequence
    Jul 13 2011 02:34 Directory reference_assembly_sequence
    Feb 24 2010 00:00 22291 sample_genders.csv
    Nov 03 2010 00:00 33280 snp_info_tags_b132.xls

    That is all, without any further information.

    What do all these abbreviations mean? What's the difference between a fasta.fai and a fasta.gz file?

    The README.human_g1k_v37.fasta.txt file tells me to:

    1. Download individual chrs from ensembl ftp

    ftp://ftp.ensembl.org/pub/current_fa...o_sapiens/dna/

    2. Download the newer version of the MT (NC_012920) from:



    3. Create a reference with chrs1-22, X, Y, NC_012920 MT, and include the non-chromosomal supercontigs. The new single fasta is posted:

    ftp://ftp.sanger.ac.uk/pub/1000genom...ect_reference/

    The sanger homepage then shows me these files:

    Parent Directory

    Oct 07 2009 00:00 579 README
    Oct 08 2009 00:00 2746 human_g1k_v37.fasta.fai
    Oct 08 2009 00:00 67 human_g1k_v37.fasta.fai.md5
    Oct 07 2009 00:00 869925027 human_g1k_v37.fasta.gz
    Oct 07 2009 00:00 57 human_g1k_v37.fasta.gz.md5
    Oct 07 2009 00:00 Directory old

    So are "human_g1k_v37.fasta.fai" and "human_g1k_v37.fasta.gz" the complete reference genomes? What does the ending ".md5" mean?

    How can I merge several FASTA files into one big file?


    Thanks in advance for your help.

    Greetings,

    Alexander

  • #2
    Hi,

    for md5 google "md5 sum".
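    A minimal sketch of what an md5 check does, assuming the ".md5" file follows the usual md5sum layout (hex digest first, then the file name). The function names are made up for illustration:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's digest with the first field of its .md5 companion.

    Assumption: the .md5 file starts with the hex digest (md5sum style).
    """
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(data_path) == expected
```

    If matches_md5_file("human_g1k_v37.fasta.gz", "human_g1k_v37.fasta.gz.md5") returns False, the download is most likely incomplete or corrupted and should be repeated.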

    The human genome should be around 3 - 3.2 Gb, depending, as you say, on whether you include the extra contigs.

    You're partially right: human_g1k_v37.fasta.gz seems to me to be the correct file from this source.

    fai is a fasta index, which can be generated by Samtools.
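    To show what such an index contains, here is a rough sketch that computes the five .fai columns (name, length, byte offset of the first base, bases per line, bytes per line) for an uncompressed FASTA file. It is an illustration only, not a replacement for "samtools faidx", and it assumes every sequence uses a consistent line length:

```python
def build_fai(fasta_path):
    """Return .fai-style rows: (name, length, offset, linebases, linewidth)."""
    rows = []
    name = None
    length = offset = linebases = linewidth = 0
    position = 0  # byte position just after the line being processed
    with open(fasta_path, "rb") as handle:
        for raw in handle:
            position += len(raw)
            if raw.startswith(b">"):
                if name is not None:
                    rows.append((name, length, offset, linebases, linewidth))
                # The sequence name is the header text up to the first space.
                name = raw[1:].split()[0].decode()
                length, offset = 0, position
                linebases = linewidth = 0
            elif name is not None:
                bases = len(raw.rstrip(b"\r\n"))
                if linebases == 0:
                    linebases, linewidth = bases, len(raw)
                length += bases
    if name is not None:
        rows.append((name, length, offset, linebases, linewidth))
    return rows


def total_length(rows):
    """Sum the sequence lengths, e.g. to check the expected genome size."""
    return sum(row[1] for row in rows)
```

    Summing the second column this way is a quick sanity check that a downloaded reference is in the expected 3 - 3.2 Gb range.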

    Most people seem to build a complete genome from the individual contigs.

    See the first post in
    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

    for a nice manual on how to build your own human genome with "cat".



    • #3
      Not a trivial question. It depends on what you want to do with it. Many people simply can't deal with highly variable regions such as the HLA region on chromosome 6 or the V(D)J regions, so they choose to ignore them. That is a bit sad, because most people working with the human genome are in medicine and should be very interested in HLA, as it is crucial for immune system function.



      • #4
        The reference genomes for human, mouse and zebrafish are improved, maintained and released by the Genome Reference Consortium (GRC).



        The last major release was GRCh37, which you see in most of the browsers. However, since that release there have been regional fixes in the form of "patches". The latest assembly in that case is GRCh37.p5. You can download the latest data from the above website. Other information, including problematic regions and fixes, is also displayed on the website.

        Hope that helps.



        • #5
          Hi,

          Need help from the sequencing community.

          I've downloaded the whole GRCh37 assembled reference from ftp://ftp.ncbi.nlm.nih.gov/genbank/g...mosomes/FASTA/.

          But what I got was 48 files, one set per individual chromosome. I was thinking of merging all the files together, but there are two types of files for each chromosome:
          1) chr*.fa.gz
          2) chr*.rm.out.gz

          Would it be OK to merge them together with the RepeatMasker output (.rm.out.gz) files to build my reference chromosomes?

          Also, does anyone know how to mask out the PAR from the reference?



          • #6
            I expect merging the regular fasta files with the repeat masked files is not what you want to do, at least if you plan to use the resulting file for mapping or anything else that's standard. Just concatenate the various chr*.fa.gz files together.
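            A rough sketch of that concatenation, assuming files named chr1.fa.gz ... chr22.fa.gz, chrX.fa.gz and chrY.fa.gz as in the question. The function names and the output path are made up for illustration. It writes an uncompressed FASTA in natural chromosome order and makes sure each file ends with a newline so that headers never get glued onto the previous sequence:

```python
import gzip
import os
import re

_SEX_RANK = {"X": 23, "Y": 24}


def chromosome_key(path):
    """Order chr1..chr22 numerically, then X, Y, then anything else by name."""
    base = os.path.basename(path)
    match = re.match(r"chr(\w+)\.fa\.gz$", base)
    label = match.group(1) if match else base
    if label.isdigit():
        return (int(label), "")
    return (_SEX_RANK.get(label, 25), label)


def merge_fasta(gz_paths, out_path, chunk_size=1 << 20):
    """Decompress each chr*.fa.gz in order and append it to one FASTA file."""
    ordered = sorted(gz_paths, key=chromosome_key)
    with open(out_path, "wb") as out:
        for path in ordered:
            last = b"\n"
            with gzip.open(path, "rb") as src:
                for chunk in iter(lambda: src.read(chunk_size), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                out.write(b"\n")  # keep the next header on its own line
    return ordered
```

            For example, merge_fasta(glob.glob("chr*.fa.gz"), "genome.fa") after importing glob. The pattern chr*.fa.gz does not match the chr*.rm.out.gz files, so they stay out of the merged reference.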



            • #7
              Thanks for the input, dpryan. I appreciate it.

              I'm a bit confused. What are the *.rm.out.gz files for, if I may ask?



              • #8
                They're the output from repeatmasker, saying which regions are repeats and what type (LINEs, SINEs, LTRs, etc.). They aren't fasta files.
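                As an illustration, a small sketch that reads such a file, assuming the standard whitespace-separated RepeatMasker .out layout (score, three percentage columns, query sequence, begin, end, bases left, strand, repeat name, class/family, ...). The function names are made up for this example:

```python
import gzip
from collections import Counter


def read_repeatmasker_out(path):
    """Yield (sequence, start, end, strand, repeat_name, repeat_class) tuples."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines and header lines, which do not start with a score.
            if len(fields) < 11 or not fields[0].isdigit():
                continue
            # RepeatMasker writes "C" for matches on the complementary strand.
            strand = "-" if fields[8] == "C" else "+"
            yield (fields[4], int(fields[5]), int(fields[6]),
                   strand, fields[9], fields[10])


def count_repeat_classes(path):
    """Count annotated repeats per class/family (LINE/L1, SINE/Alu, ...)."""
    return Counter(record[5] for record in read_repeatmasker_out(path))
```

                The records are coordinates and labels only; no sequence is stored in these files, which is why they cannot be concatenated into a reference.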



                • #9
                  OK, got it now. Thank you, dpryan =) You've been a great help.



                  • #10
                    Oh, another question came to mind. How do I remove the PAR from the reference? Or has it already been removed from the .fa files?
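                    One common way to "remove" the PAR is to hard-mask it on one of the two sex chromosomes (usually chrY) by replacing those bases with N, so that reads from the region map to the other copy. A minimal sketch of the masking step follows. The function name is made up, and the coordinates in the example are toy values; the real PAR coordinates for a given assembly are an assumption you would need to look up, for example from the GRC:

```python
def mask_regions(sequence, regions, mask_char="N"):
    """Return the sequence with each 1-based inclusive (start, end) region masked."""
    masked = bytearray(sequence, "ascii")
    fill = mask_char.encode("ascii")
    for start, end in regions:
        if not 1 <= start <= end <= len(masked):
            raise ValueError(f"region {start}-{end} is outside the sequence")
        # Python slices are 0-based and end-exclusive, hence start - 1.
        masked[start - 1:end] = fill * (end - start + 1)
    return masked.decode("ascii")


# Toy example with made-up coordinates, not real PAR positions.
print(mask_regions("ACGTACGTAC", [(1, 3), (9, 10)]))  # prints NNNTACGTNN
```

                    Masking keeps the chromosome length and all downstream coordinates unchanged, which is why it is generally preferred over deleting the region.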