Hi,
I would like to create consensus sequence for an individual from 1000G where the sequence incorporates variants typed for this individual.
Using tabix and vcf-tools (vcf-subset), I extracted the relevant variant information from 1000G SNP call files.
At this point, I have variants per chromosome for individual HG000096.
I would like to incorporate these variants into reference coding sequence.
I downloaded all protein-coding sequences for the GRCh37 version from Ensembl Biomart.
An example with trimmed sequence looks like this:
Note: typical fasta sequence file with the header including gene|transcript|protein|chr|chr_start|chr_end
My understanding is that there are a few alternatives out there to produce consensus sequences given a reference fasta sequence and variant call file: (1) AlternativeReferenceMaker, (2) vcf-consensus, (3) mpileup.
I would like to use vcf-consensus for my task.
vcftools describes the use of vcf-consensus as follows:
I presume that the reference file here refers to the reference DNA sequence including coding and non-coding parts of the genome. In that case, my protein-coding sequence above would not work as an acceptable reference sequence. Nevertheless, I am only interested in the protein-coding part.
How can I modify vcf-consensus or my input sequences to create consensus coding sequence for HG00096 given the reference coding sequence and his variants?
Thank you very much.
As soon as I figure out a way to get this done, I will have a follow-up question concerning the diploidy, that is, how can I tell whether a variant is called at a heterozygous site or not.
I would like to create consensus sequence for an individual from 1000G where the sequence incorporates variants typed for this individual.
Using tabix and vcf-tools (vcf-subset), I extracted the relevant variant information from 1000G SNP call files.
At this point, I have variants per chromosome for individual HG000096.
I would like to incorporate these variants into reference coding sequence.
I downloaded all protein-coding sequences for the GRCh37 version from Ensembl Biomart.
An example with trimmed sequence looks like this:
Code:
>ENSG00000003137|ENST00000001146|ENSP00000001146|2|72356367|72375167 ATGCTCTTTGAGGGCTTGGATCTGGTGTCGGCGCTGGCCACCCTCGCCGCGTGCCTGGTG TCCGTGACGCTGCTGCTGGCCGTGTCGCAGCAGCTGTGGCAGCTGCGCTGGGCCGCCACT CGCGACAAGAGCTGCAAGCTGCCCATCCCCAAGGGATCCATGGGCTTCCCGCTCATCGGA
My understanding is that there are a few alternatives out there to produce consensus sequences given a reference fasta sequence and variant call file: (1) AlternativeReferenceMaker, (2) vcf-consensus, (3) mpileup.
I would like to use vcf-consensus for my task.
vcftools describes the use of vcf-consensus as follows:
Code:
cat ref.fa | vcf-consensus file.vcf.gz > out.fa
How can I modify vcf-consensus or my input sequences to create consensus coding sequence for HG00096 given the reference coding sequence and his variants?
Thank you very much.
As soon as I figure out a way to get this done, I will have a follow-up question concerning the diploidy, that is, how can I tell whether a variant is called at a heterozygous site or not.
Comment