SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
New Resources for 1000 Genomes laura Bioinformatics 36 04-15-2017 08:48 AM
Fasta with one entry per chromosome (for methylation analysis and others) mixter Epigenetics 0 06-28-2011 01:52 AM
bam files of exomes from 1000 genomes jorjial Bioinformatics 3 06-06-2011 03:16 AM
1000 genomes Nataiki Bioinformatics 4 02-04-2011 04:42 AM
.bam alignment format from 1000 Genomes Project CellsDividing Bioinformatics 2 02-06-2009 08:21 AM

Reply
 
Thread Tools
Old 01-16-2012, 01:26 AM   #1
ce.log
Junior Member
 
Location: germany

Join Date: Jan 2012
Posts: 2
Default Convert 1000-Genomes-proje BAM to FASTA (aligned to reference, grouped by chromosome)

Hi,

I would like to use some data from the 1000 Genome project. As I understand, the BAM files for the individuals contain difference-reads to a reference sequence (HG18/HG19).

Now, given the reference sequence as (compressed) FASTA, one file for each chromosome, and the BAM file for let's say NA12283, I would like to obtain the FASTA files for each chromosome of NA12283. Is this possible, and how?

I tried to download a bulk of (s/b)am2X tools, but they only extract all the reads from the BAM file and write them into one FASTA file, without (re-)creating the actual chromosomes with respect to the reference.

Thanks and best Regards,
Daniel
ce.log is offline   Reply With Quote
Old 01-16-2012, 05:11 AM   #2
Bukowski
Senior Member
 
Location: Aberdeen, Scotland

Join Date: Jan 2010
Posts: 388
Default

I'm not quite sure what you're asking. BAM files contain reads (mapped and potentially unmapped) to a reference sequence. You indicate you already have the reference, so is the question "How can I extract reads from individual chromosomes from a BAM file?". If not, then this answer might be no use

You could just subset the bam file (http://www.1000genomes.org/faq/how-d...ction-bam-file) by chromosome and then extract the reads with bam2fastq (http://www.hudsonalpha.org/gsl/software/bam2fastq.php) I'm assuming you want fastq and not fasta, as if you convert to fasta you will lose quality information. If you do want fasta then the fastq->fasta conversion is trivial and implemented in many forms.
Bukowski is offline   Reply With Quote
Old 01-16-2012, 05:25 AM   #3
ce.log
Junior Member
 
Location: germany

Join Date: Jan 2012
Posts: 2
Default

Well, maybe I am just not quite sure, whether what I want to do really makes sense

For instance, given reference Chromosome 1 (of a/the reference genome), and some reads of human NA12283 from the 1000 Genome project, I just ask myself: "What does the Chromosome 1 of NA12283 look like". I would assume (maybe this is my mistake?) that I can just overlay all the mapped reads to the reference Chromosome 1 and then obtain "the" Chromosome 1 of NA12283 - at least roughly.

If all this makes (biologically) sense, then I would be interested in the FASTA file of Chromosome 1 of NA12283 - computed from the reference chromosome and the mapped reads.

About your solution: Once converted to FASTQ - don't I lose all the mapping information? I really want to *combine* the reads and the corresponding reference chromosome, not only convert reads to reads.
ce.log is offline   Reply With Quote
Old 01-16-2012, 05:32 AM   #4
Bukowski
Senior Member
 
Location: Aberdeen, Scotland

Join Date: Jan 2010
Posts: 388
Default

Ah, so you want to create a consensus sequence from the mapped reads per chromosome for a given individual?

So you could subset your bam's by chromosome and then do something like;

samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
Bukowski is offline   Reply With Quote
Old 01-16-2012, 05:35 AM   #5
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

For 4X coverage, there is essentially no way to generate a good consensus.
lh3 is offline   Reply With Quote
Old 04-10-2012, 08:55 AM   #6
elfuser
Member
 
Location: Toronto, ON

Join Date: Dec 2009
Posts: 16
Default

Quote:
Originally Posted by Bukowski View Post
Ah, so you want to create a consensus sequence from the mapped reads per chromosome for a given individual?

So you could subset your bam's by chromosome and then do something like;

samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
I tried to run this but then I realized pileup was changed to mpileup and -c is not supported. so i run mpileup -uf, but I get may errors.
elfuser is offline   Reply With Quote
Old 04-10-2012, 06:48 PM   #7
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

perhaps you can use the variant calls from this sample along with the reference sequence to sort of come up with 'the' sequence for this individual
__________________
--
bioinfosm
bioinfosm is offline   Reply With Quote
Old 11-08-2012, 12:27 PM   #8
Gabeloooooo
Junior Member
 
Location: Quebec

Join Date: Nov 2012
Posts: 8
Question

I'm trying to do the exact same thing! Did you ever find a way to get a decent fasta complete chromosome from one of the 1000 genomes samples?

My understanding is the reference they use is hs37d5.fa.

So, can you use a BAM/BAI file combo from patient HG00XXX and map the reads to hs37d5.fa to get the rough 'genome' of patient HG00XXX?
Gabeloooooo is offline   Reply With Quote
Old 01-13-2013, 05:49 AM   #9
kriikku
Junior Member
 
Location: Estonia

Join Date: Jan 2013
Posts: 5
Default

See here for one way to get a .vcf file with SNPs and indels from the .bam file, or a consensus sequence:
http://samtools.sourceforge.net/mpileup.shtml

The consensus sequence generated by this method has the problem that it only applies the SNPs to the reference sequence, but not the indels.
The .vcf file is better since it includes both SNPs and indels.

The .vcf file can be converted to a .fasta sequence using this tool:
https://www.broadinstitute.org/gatk/...Reference.html
However, note that this tool will only take into account indels of length up to 2 bases (as of January 2013). You may want to write your own script to insert all the indels (including the longer ones) from the .vcf into the .fasta.

This method should get the whole sequence from the .bam file, however, I don't know how to extract individual chromosomes from it.
kriikku is offline   Reply With Quote
Old 01-14-2013, 08:49 AM   #10
Gabeloooooo
Junior Member
 
Location: Quebec

Join Date: Nov 2012
Posts: 8
Default

The mpileup method seems to work... I've two questions.

1. Is there a way to know how many SNPs I should expect in this 'consensus' sequence? Say I want to know how many SNPs HGxxxx has in his chromosome 2, etc.

2. When you say the VCF file is better, is there another way to get a VCF from 1000 genomes, other than using the mpileup method?

I was under the impression the ftp site only contained BAM files?

Thanks a lot!
Gabeloooooo is offline   Reply With Quote
Old 01-14-2013, 10:45 AM   #11
kriikku
Junior Member
 
Location: Estonia

Join Date: Jan 2013
Posts: 5
Default

1. There could be, but I don't know of one. I usually just check if the last position numbers in the .vcf file are close to the number of base pairs in the chromosome. (Note that if you use the whole genome, then the last positions will be the last positions in the last supercontig, not the whole genome.)

2. 1000 Genomes has some .vcf files in their /release folder (description of contents here: http://www.1000genomes.org/faq/what-bas-file). As for other methods to generate a .vcf file, there might be some, but I personally don't know of any.

I'm not sure what you mean by the 1000 Genomes FTP site only containing .bam files. You can see the link above for all the file types they have. But, sorry, I'm not sure what you're asking, maybe you can clarify?

I'd also like to mention that I found out that when you give mpileup a whole genome .bam file from the 1000 Genomes site, and generate a .vcf from it, then the first number of each line in the .vcf is the number of the chromosome (1...22, X, Y or the ID of a supercontig). Therefore, it is possible to write a simple script to split the .vcf into several .vcf-s, one for each chromosome, and by mapping the changes to individual chromosome fasta files, get the sequences of each chromosome separately.
kriikku is offline   Reply With Quote
Old 01-14-2013, 01:04 PM   #12
Gabeloooooo
Junior Member
 
Location: Quebec

Join Date: Nov 2012
Posts: 8
Default

I mean, do the VCF files in the release folder contain all SNPs for a given sample (ex: all SNPs for HG00254)?

Say I want all the SNPs of a consensus chromosome 3 for HG00254, my understanding was that the easiest method it to get the BAM, create mpileup, create fasta from that
Gabeloooooo is offline   Reply With Quote
Old 01-14-2013, 02:28 PM   #13
kriikku
Junior Member
 
Location: Estonia

Join Date: Jan 2013
Posts: 5
Default

I downloaded one of the .vcf files to see, and, as far as I can tell, they don't contain the SNPs of any samples, just the collective SNPs of all the samples, with the population frequencies of each SNP. Something like this for each SNP:
pos: 123456, ref: A, alt: T, african_freq: 0.12, american_freq: 0.01, european_freq: 0.5, asian_freq: 0.14

I think you can only get all SNPs and indels of a given sample by doing what you said, i.e. getting the .bam, running mpileup on it, and generating a .fasta from the .vcf.
kriikku is offline   Reply With Quote
Old 01-14-2013, 04:03 PM   #14
Gabeloooooo
Junior Member
 
Location: Quebec

Join Date: Nov 2012
Posts: 8
Default

Thanks a lot! Will do it this way.

The other thing I don't get is how they sometimes indicate in the 1000 genomes browser a certain genotype, yet I can't find it in the 1000 genomes data.

An example is rs1805007 for sample HG00108. It says T|C in the genome browser (see url below), yet I can't find any instances of a T in the 3 tracks provided by 1000 genomes.

URL in 1000 genomes browser:
http://www.ncbi.nlm.nih.gov/variatio...&gts=rs1805007
Gabeloooooo is offline   Reply With Quote
Old 02-02-2013, 09:22 AM   #15
kriikku
Junior Member
 
Location: Estonia

Join Date: Jan 2013
Posts: 5
Default

Sorry, I don't have much experience with the genome browser. I see C in two of the three HG00108 tracks shown in the genome browser (the exome one is empty). There reference genome shows C|G. Where are you seeing the T|C? (I think linking to the genome browser doesn't work, by the way, so it would be better if you described it.)
kriikku is offline   Reply With Quote
Old 02-15-2013, 02:13 AM   #16
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

The best way to get a consensus sequence based on the 1000genomes project official snp set it to use the vcf file and the vcftools

Our genotypes can be found in our release directories, we always distribute one file which contains all the sites and then one file per chr which contain the genotypes for all the individuals types

The data referred to in our recent paper is available from here

http://ftp.1000genomes.ebi.ac.uk/vol...ted_call_sets/

(See here for the paper http://www.1000genomes.org/announcem...mes-2012-10-31 which is freely available)

You can get a specific piece of one of our files using

http://www.1000genomes.org/faq/how-d...ction-vcf-file

You can generate a consensus using the vcftools perl script vcf-consensus

http://vcftools.sourceforge.net/perl...#vcf-consensus

Please note

First this will reflect snps and indels but of course misses large copy number changes (I don't know if the script would handle our deletions)

Second the samtools mpileup version does work but it does not give you the consortium view on an individual but just the snps/indels predicted by samtools which is a different thing
laura is offline   Reply With Quote
Old 01-13-2014, 02:23 PM   #17
slengyel
Junior Member
 
Location: Philadelphia, Pa

Join Date: Dec 2012
Posts: 7
Default vcf to fasta

""
Quote:
Originally Posted by kriikku View Post
See here for one way to get a .vcf file with SNPs and indels from the .bam file, or a consensus sequence:
http://samtools.sourceforge.net/mpileup.shtml

The consensus sequence generated by this method has the problem that it only applies the SNPs to the reference sequence, but not the indels.
The .vcf file is better since it includes both SNPs and indels.

The .vcf file can be converted to a .fasta sequence using this tool:
https://www.broadinstitute.org/gatk/...Reference.html
However, note that this tool will only take into account indels of length up to 2 bases (as of January 2013). You may want to write your own script to insert all the indels (including the longer ones) from the .vcf into the .fasta.

This method should get the whole sequence from the .bam file, however, I don't know how to extract individual chromosomes from it.
"""

The link you posted for .vcf to fasta seq is not found. Can you repost with proper link?
slengyel is offline   Reply With Quote
Old 01-13-2014, 11:35 PM   #18
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 836
Default

I feel like a broken record.... About a year ago I modified vcftools (and posted patches to sourceforge) to work with a 'variant-only' VCF file coupled with a reference sequence, and was able to generate a consensus sequence that includes both INDELs (of any length specified in the VCF file) as well as SNPs. Code can be found at this post:

http://seqanswers.com/forums/showthr...628#post122628
gringer is offline   Reply With Quote
Reply

Tags
bam, chromosomes, conversion, fasta, reference

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:13 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO