Hello everyone!
Sorry to bother the more experienced people with a basic question like this, but which human reference genome should one use nowadays, and where can I download it in FASTA format? Are there different reference genomes that yield different results when aligning data?
While browsing the NCBI homepage I found a remark somewhere that one should use the same reference genome as the 1000 Genomes Project, but the links only led me to an FTP server page (ftp://ftp-trace.ncbi.nih.gov/1000gen...cal/reference/) that looks like this:
Oct 08 2009 00:00 579 README.human_g1k_v37.fasta.txt
Aug 27 2009 00:00 136 README_gencode_gtf_format
Aug 13 2009 00:00 4313 SNPChrPosAllele_b129.README
Aug 13 2009 00:00 189073716 SNPChrPosAllele_b129.txt.gz
Oct 29 2010 00:00 Directory ancestral_alignments
Nov 03 2010 00:00 398589572 dbsnp132_20101103.vcf.gz
Oct 13 2011 02:31 Directory exome_pull_down_targets
Jul 22 2010 00:00 8930799 gencode.v4.pc_translations.fa.gz
Jul 22 2010 00:00 594881 gencode.v4.polyAs.GRCh37.gtf.gz
Jul 22 2010 00:00 15059 gencode.v4.tRNAs.GRCh37.gtf.gz
Jul 02 2010 00:00 21227244 gencode_v4.annotation.GRCh37.gtf.gz
Oct 27 2010 00:00 1396 human_ancestor_GRCh37_e59.README
Oct 27 2010 00:00 794022511 human_ancestor_GRCh37_e59.tar.bz2
May 17 2010 00:00 2746 human_g1k_v37.fasta.fai
May 17 2010 00:00 892331003 human_g1k_v37.fasta.gz
Nov 01 2010 00:00 33054817 merge_rs_b129_b132.txt.gz
Sep 23 2011 02:32 Directory phase2_mapping_resources
Jul 13 2011 02:34 Directory phase2_reference_assembly_sequence
Jul 13 2011 02:34 Directory reference_assembly_sequence
Feb 24 2010 00:00 22291 sample_genders.csv
Nov 03 2010 00:00 33280 snp_info_tags_b132.xls
The page gives no further information.
What do all these abbreviations mean? What's the difference between a fasta.fai and a fasta.gz file?
The README.human_g1k_v37.fasta.txt file tells me to:
1. Download individual chrs from ensembl ftp
ftp://ftp.ensembl.org/pub/current_fa...o_sapiens/dna/
2. Download the newer version of the MT (NC_012920) from:
3. Create a reference with chrs1-22, X, Y, NC_012920 MT, and include the non-chromosomal supercontigs. The new single fasta is posted:
ftp://ftp.sanger.ac.uk/pub/1000genom...ect_reference/
The Sanger page then shows me these files:
Parent Directory
Oct 07 2009 00:00 579 README
Oct 08 2009 00:00 2746 human_g1k_v37.fasta.fai
Oct 08 2009 00:00 67 human_g1k_v37.fasta.fai.md5
Oct 07 2009 00:00 869925027 human_g1k_v37.fasta.gz
Oct 07 2009 00:00 57 human_g1k_v37.fasta.gz.md5
Oct 07 2009 00:00 Directory old
So are "human_g1k_v37.fasta.fai" and "human_g1k_v37.fasta.gz" the complete reference genome? What does the ending ".md5" mean?
How can I merge several FASTA files into one big file?
Thanks in advance for your help.
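My guess is that the ".md5" files hold MD5 checksums for verifying that a download arrived intact. If that is right, a check could look roughly like the Python sketch below. It assumes the .md5 file starts with the hex digest followed by whitespace (the layout the md5sum tool writes); I have not confirmed that these particular files use that layout, and the function names are my own.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large genome file is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's digest with the first token of its .md5 file.

    Assumption: the .md5 file begins with the hex digest, optionally
    followed by whitespace and a file name.
    """
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(data_path) == expected


# Example call with the file names from the listing above:
# matches_md5_file("human_g1k_v37.fasta.gz", "human_g1k_v37.fasta.gz.md5")
```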
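As far as I understand step 3 of the README, building the single reference is just writing the per-chromosome FASTA files one after another into one file. I imagine something like the Python sketch below. The function name and the input file names are hypothetical, and the order of the inputs is up to the caller; I do not know the exact order the 1000 Genomes reference uses.

```python
import gzip
from pathlib import Path


def concat_fasta(inputs, output):
    """Write several FASTA files (plain or .gz) into one plain FASTA file.

    Records are copied unchanged, in the order the inputs are given.
    """
    with open(output, "wb") as out:
        for path in inputs:
            path = Path(path)
            # Transparently decompress inputs whose name ends in .gz.
            opener = gzip.open if path.suffix == ".gz" else open
            with opener(path, "rb") as src:
                last = b"\n"
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
                # Make sure the next header line starts on its own line.
                if last != b"\n":
                    out.write(b"\n")


# Hypothetical usage; these file names are placeholders, not real downloads:
# concat_fasta(["chr1.fa.gz", "chr2.fa.gz", "chrMT.fa"], "reference.fasta")
```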
Greetings,
Alexander