#1
Junior Member
Location: Germany | Join Date: Jan 2012
Posts: 2
Hi,
I would like to use some data from the 1000 Genomes Project. As I understand it, the BAM files for the individuals contain reads aligned as differences against a reference sequence (HG18/HG19). Now, given the reference sequence as (compressed) FASTA, one file per chromosome, and the BAM file for, say, NA12283, I would like to obtain the FASTA files for each chromosome of NA12283. Is this possible, and how?

I tried a number of (s/b)am2X tools, but they only extract all the reads from the BAM file and write them into one FASTA file, without (re-)creating the actual chromosomes with respect to the reference.

Thanks and best regards,
Daniel
#2
Senior Member
Location: UK | Join Date: Jan 2010
Posts: 390
I'm not quite sure what you're asking. BAM files contain reads (mapped and potentially unmapped) aligned to a reference sequence. You indicate you already have the reference, so is the question "How can I extract the reads for individual chromosomes from a BAM file?" If not, this answer might be no use.

You could just subset the BAM file by chromosome (http://www.1000genomes.org/faq/how-d...ction-bam-file) and then extract the reads with bam2fastq (http://www.hudsonalpha.org/gsl/software/bam2fastq.php).

I'm assuming you want FASTQ and not FASTA, as converting to FASTA loses the quality information. If you do want FASTA, the fastq-to-fasta conversion is trivial and implemented in many forms.
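A minimal sketch of that workflow, assuming samtools and bam2fastq are on your path and the chromosome is named "20" in the BAM header (the NA12283 file names are just placeholders, and the '#' in the bam2fastq -o argument is expanded per read mate; check bam2fastq --help for the exact behaviour):

# index the BAM if it is not already indexed, then pull out one chromosome
samtools index NA12283.mapped.bam
samtools view -b NA12283.mapped.bam 20 > NA12283.chr20.bam

# extract the reads from that subset as FASTQ
bam2fastq -o NA12283.chr20#.fq NA12283.chr20.bam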
#3
Junior Member
Location: Germany | Join Date: Jan 2012
Posts: 2
Well, maybe I am just not quite sure whether what I want to do really makes sense.

For instance, given Chromosome 1 of the reference genome and some reads of human NA12283 from the 1000 Genomes Project, I ask myself: "What does Chromosome 1 of NA12283 look like?" I would assume (maybe this is my mistake?) that I can just overlay all the mapped reads onto the reference Chromosome 1 and then obtain "the" Chromosome 1 of NA12283 - at least roughly.

If all this makes (biological) sense, then I would be interested in the FASTA file of Chromosome 1 of NA12283 - computed from the reference chromosome and the mapped reads.

About your solution: once converted to FASTQ, don't I lose all the mapping information? I really want to *combine* the reads and the corresponding reference chromosome, not only convert reads to reads.
#4
Senior Member
Location: UK | Join Date: Jan 2010
Posts: 390
Ah, so you want to create a consensus sequence from the mapped reads, per chromosome, for a given individual?

You could subset your BAMs by chromosome and then do something like:

samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
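If you then need FASTA rather than FASTQ, the conversion is trivial; for example, assuming seqtk is installed (any fastq-to-fasta converter will do):

seqtk seq -a cns.fastq > cns.fasta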
#5
Senior Member
Location: Boston | Join Date: Feb 2008
Posts: 693
For 4X coverage, there is essentially no way to generate a good consensus.
#6
Member
Location: Toronto, ON | Join Date: Dec 2009
Posts: 16
I tried to run this, but then I realized pileup was changed to mpileup and -c is not supported, so I ran samtools mpileup -uf instead, but I get many errors.
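For reference, the mpileup-era equivalent of the pileup command above, as documented on the samtools mpileup page (linked later in this thread), is roughly:

samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fastq

(This assumes the old bcftools and vcfutils.pl bundled with samtools 0.1.x; ref.fa and aln.bam are placeholders.)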
#8
Junior Member
Location: Quebec | Join Date: Nov 2012
Posts: 8
I'm trying to do the exact same thing! Did you ever find a way to get a decent, complete FASTA chromosome from one of the 1000 Genomes samples?

My understanding is that the reference they use is hs37d5.fa. So, can you use a BAM/BAI file combo from individual HG00XXX, with its reads mapped to hs37d5.fa, to get the rough 'genome' of HG00XXX?
#9
Junior Member
Location: Estonia | Join Date: Jan 2013
Posts: 5
See here for one way to get a .vcf file with SNPs and indels from the .bam file, or a consensus sequence:
http://samtools.sourceforge.net/mpileup.shtml

The consensus sequence generated by this method has the problem that it only applies the SNPs to the reference sequence, not the indels. The .vcf file is better since it includes both SNPs and indels. The .vcf file can be converted to a .fasta sequence using this tool:
https://www.broadinstitute.org/gatk/...Reference.html

However, note that this tool will only take into account indels of length up to 2 bases (as of January 2013). You may want to write your own script to insert all the indels (including the longer ones) from the .vcf into the .fasta.

This method should get you the whole sequence from the .bam file; however, I don't know how to extract individual chromosomes from it.
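A rough two-step sketch of that route (old samtools 0.1.x syntax; the FastaAlternateReferenceMaker invocation roughly follows the GATK documentation of the time, and all file names are placeholders):

# 1. call variants from the BAM against the reference to get a VCF
samtools mpileup -uf ref.fa aln.bam | bcftools view -vcg - > sample.raw.vcf

# 2. apply those variants back onto the reference to get a consensus FASTA
java -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R ref.fa --variant sample.raw.vcf -o sample.consensus.fa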
#10
Junior Member
Location: Quebec | Join Date: Nov 2012
Posts: 8
The mpileup method seems to work... I have two questions.

1. Is there a way to know how many SNPs I should expect in this 'consensus' sequence? Say I want to know how many SNPs HGxxxx has in his chromosome 2, etc.

2. When you say the VCF file is better, is there another way to get a VCF from 1000 Genomes, other than using the mpileup method? I was under the impression the FTP site only contained BAM files?

Thanks a lot!
#11
Junior Member
Location: Estonia | Join Date: Jan 2013
Posts: 5
1. There could be, but I don't know of one. I usually just check whether the last position numbers in the .vcf file are close to the number of base pairs in the chromosome. (Note that if you use the whole genome, then the last positions will be the last positions in the last supercontig, not in the whole genome.)

2. 1000 Genomes has some .vcf files in their /release folder (description of the contents here: http://www.1000genomes.org/faq/what-bas-file). As for other methods to generate a .vcf file, there might be some, but I personally don't know of any. I'm not sure what you mean by the 1000 Genomes FTP site only containing .bam files - you can see the link above for all the file types they have. But, sorry, I'm not sure what you're asking; maybe you can clarify?

I'd also like to mention that when you give mpileup a whole-genome .bam file from the 1000 Genomes site and generate a .vcf from it, the first field of each line in the .vcf is the chromosome (1...22, X, Y, or the ID of a supercontig). Therefore, it is possible to write a simple script to split the .vcf into several .vcf files, one per chromosome, and, by mapping the changes to individual chromosome FASTA files, get the sequence of each chromosome separately.
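For example, a minimal splitting sketch (whole_genome.vcf is a placeholder name; each output file gets the header plus the records whose first column matches that chromosome):

for chr in $(seq 1 22) X Y; do
    (grep '^#' whole_genome.vcf; awk -v c="$chr" '$1 == c' whole_genome.vcf) > chr${chr}.vcf
done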
#12
Junior Member
Location: Quebec | Join Date: Nov 2012
Posts: 8
I mean, do the VCF files in the release folder contain all SNPs for a given sample (e.g. all SNPs for HG00254)?

Say I want all the SNPs of a consensus chromosome 3 for HG00254 - my understanding was that the easiest method is to get the BAM, run mpileup on it, and create the FASTA from that.
#13
Junior Member
Location: Estonia | Join Date: Jan 2013
Posts: 5
I downloaded one of the .vcf files to see, and, as far as I can tell, they don't contain the SNPs of any individual samples, just the collective SNPs of all the samples, with the population frequencies of each SNP. Something like this for each SNP:

pos: 123456, ref: A, alt: T, african_freq: 0.12, american_freq: 0.01, european_freq: 0.5, asian_freq: 0.14

I think you can only get all SNPs and indels of a given sample by doing what you said, i.e. getting the .bam, running mpileup on it, generating a .vcf, and creating a .fasta from that .vcf.
#14
Junior Member
Location: Quebec | Join Date: Nov 2012
Posts: 8
Thanks a lot! Will do it this way.

The other thing I don't get is how they sometimes indicate a certain genotype in the 1000 Genomes browser, yet I can't find it in the 1000 Genomes data. An example is rs1805007 for sample HG00108. It says T|C in the genome browser (see the URL below), yet I can't find any instance of a T in the 3 tracks provided by 1000 Genomes.

URL in the 1000 Genomes browser: http://www.ncbi.nlm.nih.gov/variatio...>s=rs1805007
#15
Junior Member
Location: Estonia | Join Date: Jan 2013
Posts: 5
Sorry, I don't have much experience with the genome browser. I see C in two of the three HG00108 tracks shown in the genome browser (the exome one is empty). The reference genome shows C|G. Where are you seeing the T|C? (I think linking to the genome browser doesn't work, by the way, so it would be better if you described it.)
#16
Senior Member
Location: Cambridge UK | Join Date: Sep 2008
Posts: 151
The best way to get a consensus sequence based on the 1000 Genomes Project official SNP set is to use the VCF files and vcftools.

Our genotypes can be found in our release directories; we always distribute one file which contains all the sites and then one file per chromosome which contains the genotypes for all the individuals typed.

The data referred to in our recent paper is available from here: http://ftp.1000genomes.ebi.ac.uk/vol...ted_call_sets/ (see here for the paper, which is freely available: http://www.1000genomes.org/announcem...mes-2012-10-31).

You can get a specific piece of one of our files using http://www.1000genomes.org/faq/how-d...ction-vcf-file

You can generate a consensus using the vcftools perl script vcf-consensus: http://vcftools.sourceforge.net/perl...#vcf-consensus

Please note: first, this will reflect SNPs and indels but of course misses large copy number changes (I don't know if the script would handle our deletions). Second, the samtools mpileup version does work, but it does not give you the consortium view of an individual, just the SNPs/indels predicted by samtools, which is a different thing.
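A rough sketch of the vcf-consensus step for a single individual and chromosome (all file names are placeholders; the per-chromosome genotype VCF must be bgzipped and tabix-indexed, and hs37d5.fa is assumed to be the matching reference):

# index the per-chromosome genotype VCF from the release directory, if not already indexed
tabix -p vcf ALL.chr3.genotypes.vcf.gz

# apply HG00254's genotypes to the chromosome 3 reference sequence
samtools faidx hs37d5.fa 3 | vcf-consensus -s HG00254 ALL.chr3.genotypes.vcf.gz > HG00254_chr3.fa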
#17
Junior Member
Location: Philadelphia, Pa | Join Date: Dec 2012
Posts: 7
""
Quote:
The link you posted for the .vcf-to-FASTA conversion is not found. Can you repost the proper link?
#18
David Eccles (gringer)
Location: Wellington, New Zealand | Join Date: May 2011
Posts: 838
I feel like a broken record.... About a year ago I modified vcftools (and posted patches to sourceforge) to work with a 'variant-only' VCF file coupled with a reference sequence, and was able to generate a consensus sequence that includes both INDELs (of any length specified in the VCF file) as well as SNPs. Code can be found at this post:
http://seqanswers.com/forums/showthr...628#post122628
Tags: bam, chromosomes, conversion, fasta, reference