Dear all,
I am currently struggling with annotation options in Annovar.
I have paired tumor-normal exome sequencing data for which VarScan2 was used to call somatic SNPs. The VarScan-generated VCFs were successfully converted into Annovar input files and annotated using the standard annotation command from the tutorial:
perl table_annovar.pl myInputFile humandb/ -buildver hg19 -out myanno -remove -protocol refGene,phastConsElements46way,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp137,ljb2_all -operation g,r,r,f,f,f,f -nastring NA -csvout
This works perfectly and already produces very useful data!
To better characterize the gene candidates, I would also like to annotate the list with the full gene names from "refLink" (e.g. PRRG1: transmembrane gamma-carboxyglutamic acid protein 1 isoform 1 precursor) and the refSeq summaries (full description of the gene function). This would make it much easier to prioritize candidate genes than going back and forth between Excel and a web browser...
The Annovar website states:
Most of the databases that ANNOVAR uses can be directly retrieved from UCSC Genome Browser Annotation Database. In general, users can use "-downdb" in ANNOVAR to download these files. As of Feb2012, there are 6418 databases for hg19, 6443 databases for hg18, 1841 databases for mm9, etc.
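Since the refLink database is already downloaded along with the refGene database, I downloaded the refSeqSummary database into my humandb folder (without errors):
perl annotate_variation.pl -buildver hg19 -downdb -webfrom ucsc refSeqSummary humandb/
However, when I run the following command to annotate my input file with refSeqSummary entries:
perl table_annovar.pl myInputFile humandb/ -buildver hg19 -out annovar -remove -protocol refGene,refSeqSummary,cosmic67,phastConsElements46way,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp137,ljb2_all -operation g,g,f,r,r,f,f,f,f -nastring NA -otherinfo
I encounter this error:
NOTICE: Reading gene annotation from humandb/hg19_refSeqSummary.txt ... Error: invalid record in humandb/hg19_refSeqSummary.txt (>=11 fields expected in refSeqSummary gene definition file): <NR_036941 FullLength >
I get the same result when trying to annotate with refLink:
Reading gene annotation from humandb/hg19_refLink.txt ... Error: invalid record in humandb/hg19_refLink.txt (>=11 fields expected in refLink gene definition file): < NR_036941 0 0 0 0>
After padding the remaining columns with dummy values so that the file has 11 fields, I got this:
NOTICE: Reading gene annotation from humandb/hg19_refLink11.txt ... Error: invalid dbstrand information found in humandb/hg19_refLink11.txt (dbstrand has to be + or -): < NR_036941 0 0 0 0 NA NA NA>
So apparently Annovar needs some kind of chromosomal positions to perform such annotations in "--geneanno" mode?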
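To see how far these tables are from what the error messages ask for, the number of tab-separated fields per line can be counted directly. The following is only a rough Python sketch: the function names are made up for illustration, and the threshold of 11 is simply the number quoted in the error message, not something taken from Annovar's source.

```python
from collections import Counter

# Threshold quoted in the error message (">=11 fields expected").
MIN_GENE_FIELDS = 11


def field_count_histogram(lines):
    """Count how many tab-separated fields each non-empty line has."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line:
            counts[len(line.split("\t"))] += 1
    return counts


def looks_like_gene_definition(lines):
    """True only if every non-empty line has at least MIN_GENE_FIELDS fields."""
    counts = field_count_histogram(lines)
    return bool(counts) and min(counts) >= MIN_GENE_FIELDS


def histogram_for_file(path):
    """Convenience wrapper: field-count histogram for a file on disk."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return field_count_histogram(handle)
```

Calling `histogram_for_file("humandb/hg19_refSeqSummary.txt")` and comparing the result with the same call on `humandb/hg19_refGene.txt` should show whether the downloaded tables have enough columns at all. It does not check the strand column that the last error complains about.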
More generally, do UCSC databases have to be adjusted before Annovar can use them, even when they were downloaded directly through Annovar's "-downdb" parameter?
So my questions are:
1.) Is there a general structure that database files must follow to be suitable for gene-based annotation, and is it correct to use the --geneanno protocol?
2.) How can UCSC data tables like refLink and refSeqSummary be modified for Annovar so that they can be used to annotate VCF files?
3.) Optionally, GeneRIFs would also be interesting to annotate. Is there a way to include NCBI GeneRIFs (obtainable via ftp://ftp.ncbi.nih.gov/gene/GeneRIF/) in VCF annotations?
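For what it's worth, the fallback I can imagine is to skip Annovar for this step and join the refLink product names onto the finished CSV by gene symbol. Below is a minimal Python sketch of that idea, not a tested solution. It assumes that refLink is tab-separated with the gene symbol in the first column and the product description in the second (my reading of the UCSC schema), and that the Annovar CSV has a gene column named "Gene.refGene". The output column name "Product.refLink" and both function names are made up.

```python
import csv
import io
import re


def load_reflink_products(handle):
    """Map gene symbol -> product description from a refLink-style table.

    Assumption: column 1 is the gene symbol, column 2 the product name.
    refLink has one row per transcript, so only the first product seen
    for each symbol is kept.
    """
    products = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 2 and row[0] and row[1]:
            products.setdefault(row[0], row[1])
    return products


def add_product_column(rows, products, gene_column="Gene.refGene"):
    """Yield copies of the rows with an extra 'Product.refLink' column."""
    for row in rows:
        # Drop parenthesised annotations, then split multi-gene entries.
        cleaned = re.sub(r"\(.*?\)", "", row.get(gene_column, ""))
        symbols = [s for s in re.split(r"[,;]", cleaned) if s]
        names = [products.get(s, "NA") for s in symbols]
        yield {**row, "Product.refLink": "; ".join(names) if names else "NA"}
```

In practice the rows would come from `csv.DictReader` opened on the Annovar CSV output and the products from `load_reflink_products(open("humandb/hg19_refLink.txt"))`. The same pattern could probably be used for refSeqSummary, keyed by transcript accession instead of gene symbol, but I would still prefer a solution inside Annovar.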
Any help would be very much appreciated!!
Max