  • PhyloP in annovar

    Hello everyone,
    We use ANNOVAR for annotation, and the output includes a PhyloP score. All values in our output are between 0 and 1. Is it correct that this value is the p-value for conservation? Is the SNP more conserved if the value is closer to 0? Is it significantly conserved if the value is < 0.05?

    Until now I thought that a position only counts as conserved at a PhyloP score > 3, and that the value can also be < 0. So I am a little confused about the ANNOVAR PhyloP score. Can anyone help?

    Best regards
    Robby

  • #2
    Hello, I had the same problem and found this page. See below for information on PhyloP scores. Cut and pasted from here



    ANNOVAR Function III: Filter-based annotation

    1000 Genomes Project (2012 release) annotations
    1000 Genomes Project (2009 release) annotations (obsolete!)
    1000 Genomes Project (2010 March/July/November and 2011 release) annotations (obsolete!)
    dbSNP annotations
    LJB2 non-synonymous variants annotations
    SIFT/PolyPhen functional importance score annotations (obsolete!)
    ESP (exome sequencing project) annotations
    GERP++ annotations
    CG (complete genomics) frequency annotations
    Popfreq annotation
    Generic mutation annotations

    An important and probably highly desirable feature is that ANNOVAR can help identify subsets of variants based on comparison to other variant databases, for example, variants annotated in dbSNP or variants annotated in 1000 Genome Project. The exact variant, with same start and end positions, and with same observed alleles, will be identified.

    The functionality described above is provided by the --filter operation in ANNOVAR. The major difference between --filter and --regionanno is that --filter works on mutations (nucleotide changes), whereas --regionanno works on chromosome locations. For example, --regionanno compares variants with regions such as chr1:1000-1000, but --filter compares variants with specific changes such as an A->G change at position chr1:1000-1000.
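    The distinction between the two comparison styles can be illustrated with a small Python sketch. This is not ANNOVAR's code; the Variant type and both function names are hypothetical and only model the idea that a filter match needs identical position and alleles, while a region match needs only overlapping coordinates.

```python
from typing import NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical in-memory form of one variant (not an ANNOVAR type)."""
    chrom: str
    start: int
    end: int
    ref: str
    alt: str


def filter_match(query: Variant, db_entry: Variant) -> bool:
    """Filter-style comparison: position AND observed alleles must be identical."""
    return query == db_entry


def region_match(query: Variant, region: Tuple[str, int, int]) -> bool:
    """Region-style comparison: only chromosome and coordinate overlap matter."""
    chrom, start, end = region
    return query.chrom == chrom and query.start <= end and query.end >= start


a_to_g = Variant("1", 1000, 1000, "A", "G")
a_to_t = Variant("1", 1000, 1000, "A", "T")

# Same position, different allele: no filter match, but still a region match.
print(filter_match(a_to_t, a_to_g))
print(region_match(a_to_t, ("1", 1000, 1000)))
```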

    1. 1000 Genomes Project (2012 April) annotations

    This is the latest 1000G annotation that ANNOVAR supports. It is based on the phase 1 release v3 called from the 20101123 alignment, and the database was prepared using input compiled here, thanks to Mehdi Pirooznia @ Hopkins. The populations include ALL, AMR, AFR, ASN and EUR.

    To download the database, use the following:

    [kaiwang@biocluster ~/project/annotate_variation]$ annotate_variation.pl -downdb 1000g2012apr humandb -buildver hg19
    NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an...2012_04.txt.gz ... OK
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an..._04.txt.idx.gz ... OK
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an...2012_04.txt.gz ... OK
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an..._04.txt.idx.gz ... OK
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an...2012_04.txt.gz ... OK
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an..._04.txt.idx.gz ... OK
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an...2012_04.txt.gz ... OK
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an..._04.txt.idx.gz ... OK
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an...2012_04.txt.gz ... OK
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an..._04.txt.idx.gz ... OK
    NOTICE: Uncompressing downloaded files
    NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

    To annotate a data set called ex1_hg19.human against the database, use the following:

    [kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 ex1_hg19.human humandb/
    NOTICE: Variants matching filtering criteria are written to ex1_hg19.human.hg19_EUR.sites.2012_04_dropped, other variants are written to ex1_hg19.human.hg19_EUR.sites.2012_04_filtered
    NOTICE: Processing next batch with 12 unique variants in 12 input lines
    NOTICE: Database index loaded. Total number of bins is 2766067 and the number of bins to be scanned is 12
    NOTICE: Scanning filter database humandb/hg19_EUR.sites.2012_04.txt...Done


    [kaiwang@biocluster ~/]$ cat ex1_hg19.human.hg19_EUR.sites.2012_04_dropped
    1000g2012apr_eur 0.87 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
    1000g2012apr_eur 0.06 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
    1000g2012apr_eur 0.54 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
    1000g2012apr_eur 0.05 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
    1000g2012apr_eur 0.01 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
    1000g2012apr_eur 0.01 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
    1000g2012apr_eur 0.53 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease

    The command above annotates the ex1_hg19.human file against the 1000 Genomes Project 2012 April release for European subjects. Known variants are written to the *dropped file together with their allele frequencies. Variants without matching database entries are written to the *filtered file.
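    For downstream work it can help to split each *dropped line back into its parts. The sketch below is a minimal Python example that assumes the layout visible in the listing above (database type, value, then the original input line of chr, start, end, ref, alt and free-text comment) with whitespace-separated fields; parse_dropped_line is a hypothetical helper, not part of ANNOVAR.

```python
def parse_dropped_line(line):
    """Split one *dropped line into the annotation and the original variant.

    Assumes whitespace-separated fields in the order shown in the example
    output: dbtype, value, chr, start, end, ref, alt, then free-text comment.
    """
    dbtype, value, chrom, start, end, ref, alt, *rest = line.split(None, 7)
    return {
        "dbtype": dbtype,
        "value": value,          # kept as text: a frequency here, an rs ID for dbSNP
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "ref": ref,
        "alt": alt,
        "comment": rest[0].strip() if rest else "",
    }


example = ("1000g2012apr_eur 0.06 1 67705958 67705958 G A "
           "comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease")
record = parse_dropped_line(example)
print(record["chrom"], record["start"], record["ref"], record["alt"], record["value"])
```

    The value is left as a string on purpose, since the second column holds an allele frequency for 1000G databases but an identifier or a score for other database types.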

    It is possible to apply a MAF threshold to the filtering procedure:

    [kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 ex1_hg19.human humandb/ -maf 0.05
    NOTICE: Variants matching filtering criteria are written to ex1_hg19.human.hg19_EUR.sites.2012_04_dropped, other variants are written to ex1_hg19.human.hg19_EUR.sites.2012_04_filtered
    NOTICE: Processing next batch with 12 unique variants in 12 input lines
    NOTICE: Database index loaded. Total number of bins is 2766067 and the number of bins to be scanned is 12
    NOTICE: Scanning filter database humandb/hg19_EUR.sites.2012_04.txt...Done


    [kaiwang@biocluster ~/]$ cat ex1_hg19.human.hg19_EUR.sites.2012_04_dropped
    1000g2012apr_eur 0.87 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
    1000g2012apr_eur 0.06 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
    1000g2012apr_eur 0.54 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
    1000g2012apr_eur 0.05 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
    1000g2012apr_eur 0.53 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease

    This means that only variants whose allele frequency is greater than or equal to 0.05 are printed to the *dropped file.

    You can also reverse this threshold:

    [kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 ex1_hg19.human humandb/ -maf 0.05 -reverse
    NOTICE: Variants matching filtering criteria are written to ex1_hg19.human.hg19_EUR.sites.2012_04_dropped, other variants are written to ex1_hg19.human.hg19_EUR.sites.2012_04_filtered
    NOTICE: Processing next batch with 12 unique variants in 12 input lines
    NOTICE: Database index loaded. Total number of bins is 2766067 and the number of bins to be scanned is 12
    NOTICE: Scanning filter database humandb/hg19_EUR.sites.2012_04.txt...Done


    [kaiwang@biocluster ~/]$ cat ex1_hg19.human.hg19_EUR.sites.2012_04_dropped
    1000g2012apr_eur 0.05 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
    1000g2012apr_eur 0.01 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
    1000g2012apr_eur 0.01 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2

    In this case, only rare variants that are observed in 1000G will be printed out to the *dropped file.
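    The effect of the threshold and its reversal can be reproduced on the EUR frequencies listed above. This is only a sketch: the function name is hypothetical, and the inclusive comparison in both directions is inferred from the example output (the 0.05 variant appears in both listings), not taken from ANNOVAR's source. The database may also store more decimals than the two printed here.

```python
def goes_to_dropped(freq, maf=None, reverse=False):
    """Decide whether a database hit is written to the *dropped file.

    Assumption inferred from the example output: without -reverse, hits with
    freq >= maf are kept; with -reverse, hits with freq <= maf are kept.
    """
    if maf is None:
        return True
    return freq <= maf if reverse else freq >= maf


# EUR frequencies copied from the unfiltered example output.
eur = {
    "rs1000050": 0.87,
    "rs11209026": 0.06,
    "rs6576700": 0.54,
    "rs2066844": 0.05,
    "rs2066845": 0.01,
    "rs2066847": 0.01,
    "rs2241880": 0.53,
}

common = [rs for rs, f in eur.items() if goes_to_dropped(f, maf=0.05)]
rare = [rs for rs, f in eur.items() if goes_to_dropped(f, maf=0.05, reverse=True)]
print("common:", common)
print("rare:", rare)
```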

    Similarly, to switch to other ethnicity groups, use 1000g2012apr_asn, 1000g2012apr_afr, 1000g2012apr_amr or 1000g2012apr_all as the database type. Let's try an example to see if the same set of variants is observed in Asians:

    [kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype 1000g2012apr_asn -buildver hg19 ex1_hg19.human humandb/
    NOTICE: Variants matching filtering criteria are written to ex1_hg19.human.hg19_ASN.sites.2012_04_dropped, other variants are written to ex1_hg19.human.hg19_ASN.sites.2012_04_filtered
    NOTICE: Processing next batch with 12 unique variants in 12 input lines
    NOTICE: Database index loaded. Total number of bins is 2743052 and the number of bins to be scanned is 12
    NOTICE: Scanning filter database humandb/hg19_ASN.sites.2012_04.txt...Done


    [kaiwang@biocluster ~/]$ cat ex1_hg19.human.hg19_ASN.sites.2012_04_dropped
    1000g2012apr_asn 0.58 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
    1000g2012apr_asn 0.60 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
    1000g2012apr_asn 0.33 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease

    You'll see that the R381Q in IL23R and R702W in NOD2 (both SNPs are pretty famous) are not found in Asians from the 1000 Genomes Project.



    2. 1000 Genomes Project (2009 release) annotations (obsolete)

    ANNOVAR can annotate variants based on the annotated allele frequency in the CEU population used in the 1000 Genomes Project. This analysis requires downloading the annotation files from the 1000 Genomes Project website (the command is "annotate_variation.pl -downdb 1000g humandb/"). The current version of the program downloads the 2009 April annotations.

    [kai@beta ~/]$ annotate_variation.pl -filter -dbtype 1000g_ceu ex1.human humandb/
    NOTICE: The --buildver is set as 'hg18' by default
    NOTICE: Variants matching filtering criteria are written to ex1.human.hg18_1000g_ceu_dropped, other variants are written to ex1.human.hg18_1000g_ceu_filtered
    NOTICE: Processing next batch with 12 variants
    NOTICE: Scanning filter database humandb/hg18_CEU.sites.2009_04.txt...Done

    The ex1.human file contains a few common SNPs that were annotated in the 1000G project on CEU populations. The output file ex1.human.hg18_1000g_ceu_filtered contains a list of SNPs not reported in 1000G_CEU. The output file ex1.human.hg18_1000g_ceu_dropped contains the list of SNPs that are reported and their non-reference allele frequencies (as the second column below):

    [kai@beta ~/]$ cat ex1.human.hg18_1000g_ceu_dropped
    1000g_ceu 0.017544 1 67478546 67478546 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
    1000g_ceu 0.482456 1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
    1000g_ceu 0.877193 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays

    Similarly, 1000g_ceu can be changed to 1000g_yri or 1000g_jptchb to check the allele frequencies in the YRI and JPT/CHB populations.

    In general, these commands are highly efficient, requiring several minutes to scan 4 million input genetic variants. Additionally, users may use settings such as "-chr 1-9" and "-chr 10-22,X" to run ANNOVAR on selected chromosomes only, which further speeds up searches when the input variants are known to lie on specific chromosomes.

    3. 1000 Genomes Project (2010 March/July/November and 2011 release) annotations (obsolete)

    The procedure above was originally developed for the April 2009 release of 1000G. In March 2010, a new release of 1000G data became available, so the new keyword "1000g2010" must be used if users want to annotate with the new 1000G data. Similarly, the new keywords "1000g2010jul" and "1000g2010nov" must be used to handle additional releases. The 2010 November release is no longer a pilot release, but a full project release in hg19 coordinates.

    [kai@beta ~/]$ annotate_variation.pl -downdb 1000g2010 humandb/

    [kai@beta ~/]$ annotate_variation.pl -filter -dbtype 1000g2010_ceu ex1.human humandb/
    NOTICE: The --buildver is set as 'hg18' by default
    NOTICE: Variants matching filtering criteria are written to ex1.human.hg18_1000g2010_ceu_dropped, other variants are written to ex1.human.hg18_1000g2010_ceu_filtered
    NOTICE: Processing next batch with 12 variants
    NOTICE: Scanning filter database humandb/hg18_CEU.sites.2010_03.txt...Done

    The results are below:

    [kai@beta ~/]$ cat ex1.human.hg18_1000g2010_ceu_dropped
    1000g2010_ceu 0.508 1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
    1000g2010_ceu 0.883 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
    1000g2010_ceu 0.508 2 233848107 233848107 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
    1000g2010_ceu 0.083 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2

    Note that this result differs slightly from the one generated on the 2009 release of 1000G data. This is because the 2009 and 2010 releases used different genotype calling algorithms, potentially resulting in false negative calls in both sets. Users should be aware of these differences. In general, I recommend using the 2010 release, not just because it is newer, but also because it contains indel calls.

    Also note that, as far as I can tell, the 1000G 2010 March release does not provide chromosome X or chromosome Y calls for SNPs. It does provide chromosome X calls for indels. This should be kept in mind when using ANNOVAR for annotation. The 2010 July and November releases do contain sex chromosome calls.

    Technical notes: Most casual ANNOVAR users can safely ignore these notes. If you want to know in detail how ANNOVAR works, read below.

    Some users may wonder how exactly ANNOVAR handles the different variant call sets in the 1000 Genomes Project. In the March 2010 release of the 1000 Genomes Project pilot data, a SNP call file and an indel call file are provided for each of the three HapMap populations. The SNP call file contains consensus calls from 3 groups: a SNP is called only if at least 2 out of 3 groups call it. The indel call file contains many indel calls and is generated by "Dindel on MAQ Illumina only BAMs ... using candidate indel set". The vast majority of indels (>90%) are annotated with QC flags, so I decided to take only the most confident set of indels (that is, indels without any QC flags), and then generate a new "sites" file that contains both SNPs and indels. This is what users will get when "-downdb 1000g2010" is used in ANNOVAR (for example, hg18_CEU.sites.2010_03.txt, etc.). Unlike the 2009 release of 1000G data, these files are not downloaded from 1000G directly but from the ANNOVAR website.
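    The two selection rules described here (a SNP needs support from at least 2 of 3 calling groups; an indel is kept only when it carries no QC flag) can be sketched in Python as follows. The data structures and function names are hypothetical and chosen for illustration; this is not the script that was used to build the ANNOVAR "sites" files.

```python
from collections import Counter


def consensus_snps(call_sets, min_support=2):
    """Return sites called by at least `min_support` of the given call sets."""
    counts = Counter(site for calls in call_sets for site in set(calls))
    return {site for site, n in counts.items() if n >= min_support}


def confident_indels(indels):
    """Keep indels with no QC flags; `indels` holds (site, qc_flags) pairs."""
    return [site for site, flags in indels if not flags]


# Toy call sets from three hypothetical groups, as (chr, pos, ref, alt) tuples.
group_a = {("1", 100, "A", "G"), ("1", 200, "C", "T")}
group_b = {("1", 100, "A", "G")}
group_c = {("1", 200, "C", "T"), ("1", 300, "G", "A")}

print(sorted(consensus_snps([group_a, group_b, group_c])))
print(confident_indels([(("1", 400, "-", "AT"), []), (("1", 500, "TC", "-"), ["flagged"])]))
```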

    The counts of SNPs and indels in the March 2010 release in each of the HapMap populations are given below:

    CEU: SNPs=7,725,713 indels=751,528 total=8,477,241
    YRI: SNPs=10,556,876 indels=978,444 total=11,535,320
    JPTCHB: SNPs=6,109,233 indels=687,884 total=6,797,117

    In comparison, the counts of SNPs in the April 2009 release in each of the HapMap populations are given below:

    CEU: SNPs=9,633,115
    YRI: SNPs=13,759,844
    JPTCHB: SNPs=10,970,708

    So it appears that the 2010 release contains fewer SNPs than the 2009 release, though it provides indel calls.

    The next natural question is how much overlap there is between the 2009 release and the 2010 release. This can be checked easily with ANNOVAR. Just make an ANNOVAR input file using the 2009 release data, and then scan the 2010 release using the -filter operation in ANNOVAR (command line is "annotate_variation.pl -filter -dbtype 1000g2010_ceu ex10.human humandb/"). It takes just 4 minutes on my computer for the CEU population, and adding the "-batchsize 20m" argument further improves the speed.

    CEU: overlapping SNPs=6,145,332
    YRI: overlapping SNPs=9,155,523
    JPTCHB: overlapping SNPs=5,393,065

    So the overlap with the 2009 release is around 80%-90% of the SNPs in the 1000G 2010 release, for each of the three populations.
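    The 80%-90% figure follows directly from the counts quoted above: divide each overlap count by the number of SNPs in the March 2010 release for the same population. A short Python check, using only numbers from this page (variable names are arbitrary):

```python
# SNP counts in the March 2010 release and overlap with the 2009 release,
# copied from the tables above.
snps_2010 = {"CEU": 7_725_713, "YRI": 10_556_876, "JPTCHB": 6_109_233}
overlap = {"CEU": 6_145_332, "YRI": 9_155_523, "JPTCHB": 5_393_065}

# Percentage of 2010-release SNPs that were already present in the 2009 release.
pct = {pop: 100 * overlap[pop] / snps_2010[pop] for pop in snps_2010}

for pop, value in pct.items():
    print(f"{pop}: {value:.1f}% of 2010 SNPs also in the 2009 release")
```

    This gives roughly 80% for CEU, 87% for YRI and 88% for JPTCHB.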

    The 1000 Genomes Pilot Project continues to put out updates to its variant calls. The keyword 1000g2010jul was added in July 2010 to handle the new version of the data:

    [kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype 1000g2010jul_ceu ex1.human ~/project/annotate_variation/humandb/

    [kaiwang@biocluster ~/]$ cat ex1.human.hg18_CEU.sites.2010_07_dropped
    1000g2010jul_ceu 0.508 1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
    1000g2010jul_ceu 0.883 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
    1000g2010jul_ceu 0.508 2 233848107 233848107 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
    1000g2010jul_ceu 0.083 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2

    Comparing the 2010 July and 2010 March releases, we can see that one additional SNP has been detected in the July release.

    Similarly, the non-pilot project also released variant calls on its website (in hg19 coordinates), and users need to use 1000g2010nov to handle the new data, together with the '-buildver hg19' argument. The 2010 November data does not separate CEU/YRI/ASN populations though, as it is based on all 1000G populations.

    The dbtype should include "all" as a suffix, as the databases cover all subjects without specific population identifiers.

    [kaiwang@biocluster ~/]$ annotate_variation.pl -filter inputfile humandb/ -dbtype 1000g2010nov_all -buildver hg19

    [kaiwang@biocluster ~/]$ annotate_variation.pl -filter inputfile humandb/ -dbtype 1000g2011may_all -buildver hg19





    Technical notes: ANNOVAR can handle VCF files directly. Therefore, you do not need to rely on the datasets that I compile; you can interrogate the 1000G data directly. For example, using the 2010 March release of 1000G data:

    [kai@beta ~/]$ annotate_variation.pl -filter -dbtype vcf -vcfdbfile hg18_CEU.SRP000031.2010_03.sites.vcf.txt ex1.human humandb/

    [kai@beta ~/]$ cat ex1.human.hg18_vcf_dropped
    vcf 0.508333333333333 1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
    vcf 0.883333333333333 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
    vcf 0.0833333333333333 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2

    You can do the same thing for 2011 May release of 1000G data:

    [kaiwang@biocluster ~/]$ annotate_variation.pl inputfile ./ -vcfdbfile ALL.wgs.phase1.projectConsensus.snps.sites.vcf -filter -dbtype vcf

    So there is no real need to wait for me to add the latest 1000G data to ANNOVAR. Any user can run ANNOVAR on a VCF file downloaded from 1000G themselves.

    This is especially important for indels: when I compile indels, I ONLY USE THE MOST CONFIDENT CALLS FROM 1000G THAT PASS ALL FILTERS. Some users may want to scan all potential indels called by 1000G regardless of how confident they are, and in this case, it is best to use the VCF file directly in ANNOVAR for annotation.



    4. dbSNP annotations

    ANNOVAR can identify variants that are already reported in dbSNP and also report the corresponding rs identifiers. This can be a filtering step, similar to what was used in the exome sequencing paper to exclude non-pathogenic SNPs in Miller syndrome.

    [kaiwang@beta ~/]$ annotate_variation.pl -downdb snp130 humandb -webfrom annovar
    NOTICE: The --buildver is set as 'hg18' by default
    NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an..._snp130.txt.gz ... OK
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an...130.txt.idx.gz ... OK
    NOTICE: Uncompressing downloaded files
    NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory

    [kai@beta ~/]$ annotate_variation.pl -filter -dbtype snp130 ex1.human humandb/
    NOTICE: The --buildver is set as 'hg18' by default
    NOTICE: Variants matching filtering criteria are written to ex1.human.hg18_snp130_dropped, other variants are written to ex1.human.hg18_snp130_filtered
    NOTICE: Processing next batch with 12 variants
    NOTICE: Scanning filter database humandb/hg18_snp130.txt...Done

    [kai@beta ~/]$ cat ex1.human.hg18_snp130_dropped
    snp130 rs35561142 1 11326183 11326183 - AT comments: rs35561142, a 2-bp insertion
    snp130 rs59770105 1 13133880 13133881 TC - comments: rs59770105, a 2-bp deletion
    snp130 rs11209026 1 67478546 67478546 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
    snp130 rs6576700 1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
    snp130 rs1000050 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
    snp130 rs1801002 13 19661686 19661686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
    snp130 rs2066844 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
    snp130 rs2066845 16 49314041 49314041 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
    snp130 rs2066847 16 49321279 49321279 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2

    Two output files are generated. The ex1.human.hg18_snp130_filtered file contains variants not in dbSNP. The ex1.human.hg18_snp130_dropped file contains variants that are annotated in dbSNP, together with their rs identifiers (as the second column).

    NOTE: dbSNP 129 is generally regarded as the last "clean" dbSNP without "contamination" from 1000 Genomes Project and other large-scale next-generation sequencing projects. Many published papers utilize dbSNP129 only.

    NOTE: Per user request, I now provide dbSNP 129 in hg19 coordinates, so that users can use it to benchmark their variant calling algorithms, given that dbSNP 129 does not contain the "contamination" from variant calls from next-generation sequencing.

    Advanced Notes: Since January 2011, per users' request, ANNOVAR handles tri-allelic or quad-allelic SNPs. For example, rs12931472 can have four alleles (A, C, G, T) with wildtype as A, so any non-A mutation will be filtered by ANNOVAR, and rs12931472 will be printed out during filtering. In previous versions of ANNOVAR, only di-allelic SNPs were handled.

    Advanced Notes: These annotations may be assigned to "SNPs" in dbSNP: 'unknown','single','in-del','het','microsatellite','named','mixed','mnp','insertion','deletion'. ANNOVAR will only care about 'single', 'deletion', 'in-del', 'insertion' and ignore others. 'single' SNP accounts for the vast majority of dbSNP entries.
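    In other words, the dbSNP class acts as a whitelist. A minimal Python sketch of that rule, using the class names listed above (the constant and function names are hypothetical, not ANNOVAR identifiers):

```python
# Classes that ANNOVAR is described as using; all other classes are ignored.
HANDLED_CLASSES = {"single", "deletion", "in-del", "insertion"}

ALL_CLASSES = ["unknown", "single", "in-del", "het", "microsatellite",
               "named", "mixed", "mnp", "insertion", "deletion"]


def is_handled(snp_class):
    """True if a dbSNP entry of this class would be considered during filtering."""
    return snp_class in HANDLED_CLASSES


ignored = [c for c in ALL_CLASSES if not is_handled(c)]
print("ignored classes:", ignored)
```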

    In 2012, several additional "NonFlagged" dbSNP databases were provided by me. Basically, these are dbSNP files with the Flagged dbSNP entries subtracted. Flagged SNPs include SNPs < 1% minor allele frequency (MAF) (or unknown), mapping only once to the reference assembly, and flagged in dbSNP as "clinically associated". Some users have reported that some SNPs are still flagged as "clinically associated" in the NonFlagged set; this is because these SNPs are not found in the Flagged set from UCSC, possibly because they were associated with diseases more recently and so are not yet recorded in the Flagged database.

    The command line for downloading database and annotation is almost identical to regular dbSNP database. For example, to download the database:

    [kaiwang@biocluster ~/]$ annotate_variation.pl -downdb -buildver hg19 -webfrom annovar snp135NonFlagged humandb
    NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an...Flagged.txt.gz ... OK
    NOTICE: Downloading annotation database http://www.openbioinformatics.org/an...ged.txt.idx.gz ... OK
    NOTICE: Uncompressing downloaded files
    NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

    To give users an idea of the size of the various dbSNP databases prepared by me (first column represents number of variants):

    13610296 hg19_snp129.txt
    18396965 hg19_snp130NonFlagged.txt
    18404149 hg19_snp130.txt
    25301548 hg19_snp131NonFlagged.txt
    25312455 hg19_snp131.txt
    32249106 hg19_snp132NonFlagged.txt
    32267005 hg19_snp132.txt
    53473344 hg19_snp135NonFlagged.txt
    53502122 hg19_snp135.txt

    The dbSNP entries do not include an allele frequency measure, so users should exercise caution when using dbSNP as a filtering step to identify causal variants for Mendelian diseases, as some dbSNP entries may well be related to disease susceptibility.


    IMPORTANT NOTE: the dbSNP commonSNP track from UCSC is extremely incomplete and users really should not use it for annotation under any circumstance.



    5. LJB2 (LJB version 2) non-synonymous variants annotation

    Starting from June 2013, LJB2 databases are made available to ANNOVAR users. These include SIFT scores, PolyPhen2 HDIV scores, PolyPhen2 HVAR scores, LRT scores, MutationTaster scores, MutationAssessor scores, FATHMM scores, GERP++ scores, PhyloP scores and SiPhy scores. These scores were retrieved from dbNSFP (http://sites.google.com/site/jpopgen/dbNSFP). Big thanks to the authors (Liu, Jian, Boerwinkle), hence the name ljb.

    The keywords used for downloading these data include: ljb2_sift, ljb2_pp2hdiv, ljb2_pp2hvar, ljb2_lrt, ljb2_mt, ljb2_ma, ljb2_fathmm, ljb2_gerp++, ljb2_phylop, ljb2_siphy, ljb2_all. The ljb2_all keyword includes ALL scores, and it is very useful in table_annovar.pl.

    Some examples are given below:

    LJB2_SIFT annotation

    EXTREMELY IMPORTANT!!!!!! In previous versions of dbNSFP (ljb_sift), the scores were calculated as 1-SIFT. In the updated version 2 (ljb2_sift), the scores are now the SIFT score itself. This means a variant with score<0.05 is predicted as deleterious.

    In the example below, two missense variants were predicted as deleterious based on SIFT scores.

    [kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype ljb2_sift -buildver hg19 ex1_hg19.human humandb
    NOTICE: the --dbtype ljb2_sift is assumed to be in generic ANNOVAR database format
    NOTICE: Variants matching filtering criteria are written to ex1_hg19.human.hg19_ljb2_sift_dropped, other variants are written to ex1_hg19.human.hg19_ljb2_sift_filtered
    NOTICE: Processing next batch with 12 unique variants in 12 input lines
    NOTICE: Database index loaded. Total number of bins is 187938 and the number of bins to be scanned is 7
    NOTICE: Scanning filter database humandb/hg19_ljb2_sift.txt...Done

    [kaiwang@biocluster ~/]$ cat ex1_hg19.human.hg19_ljb2_sift_dropped
    ljb2_sift 0.100000 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
    ljb2_sift 0.010000 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
    ljb2_sift 0.020000 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
    ljb2_sift 0.570000 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
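    Applying the score<0.05 rule to the four scores listed above confirms which two variants are called deleterious. The Python sketch below also shows how an old ljb_sift value (stored as 1-SIFT) maps back to a SIFT score; the function names are hypothetical helpers, not part of ANNOVAR.

```python
SIFT_CUTOFF = 0.05  # ljb2_sift: score below this is predicted deleterious


def sift_is_deleterious(score):
    """Apply the cutoff to a raw SIFT score as stored in ljb2_sift."""
    return score < SIFT_CUTOFF


def ljb1_to_sift(ljb1_score):
    """Convert an old ljb_sift value (stored as 1-SIFT) back to a SIFT score."""
    return 1.0 - ljb1_score


# ljb2_sift scores copied from the example output.
scores = {
    "rs11209026": 0.10,
    "rs2066844": 0.01,
    "rs2066845": 0.02,
    "rs2241880": 0.57,
}

deleterious = [rs for rs, s in scores.items() if sift_is_deleterious(s)]
print(deleterious)
```

    Mixing up the two conventions inverts the prediction, which is why the dbtype (ljb_sift or ljb2_sift) should be checked before any cutoff is applied.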

    PolyPhen 2 annotation

    There are two databases for PolyPhen2: HVAR and HDIV. They are explained below:

    ljb2_pp2hvar should be used for diagnostics of Mendelian diseases, which requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles. The authors recommend calling "probably damaging" if the score is between 0.909 and 1, "possibly damaging" if the score is between 0.447 and 0.908, and "benign" if the score is between 0 and 0.446.

    ljb2_pp2hdiv should be used when evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data. The authors recommend calling "probably damaging" if the score is between 0.957 and 1, "possibly damaging" if the score is between 0.453 and 0.956, and "benign" if the score is between 0 and 0.452.

    An example is given below:

    [kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype ljb2_pp2hdiv -buildver hg19 ex1_hg19.human humandb
    NOTICE: the --dbtype ljb2_pp2hdiv is assumed to be in generic ANNOVAR database format
    NOTICE: Variants matching filtering criteria are written to ex1_hg19.human.hg19_ljb2_pp2hdiv_dropped, other variants are written to ex1_hg19.human.hg19_ljb2_pp2hdiv_filtered
    NOTICE: Processing next batch with 12 unique variants in 12 input lines
    NOTICE: Database index loaded. Total number of bins is 184437 and the number of bins to be scanned is 7
    NOTICE: Scanning filter database humandb/hg19_ljb2_pp2hdiv.txt...Done

    [kaiwang@biocluster ~/]$ cat ex1_hg19.human.hg19_ljb2_pp2hdiv_dropped
    ljb2_pp2hdiv 1.0 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
    ljb2_pp2hdiv 0.999 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
    ljb2_pp2hdiv 1.0 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
    ljb2_pp2hdiv 0.001 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease

    As you can see, three of the four missense variants were predicted as "probably damaging".
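    The recommended cutoffs for both PolyPhen2 databases can be written as a small lookup. The Python sketch below assumes that scores are reported to three decimals, so the ranges quoted above are treated as "at or above the lower bound"; pp2_call and CUTOFFS are hypothetical names.

```python
# (probably damaging lower bound, possibly damaging lower bound) per database,
# taken from the ranges recommended in the text above.
CUTOFFS = {
    "hdiv": (0.957, 0.453),
    "hvar": (0.909, 0.447),
}


def pp2_call(score, dataset="hdiv"):
    """Translate a PolyPhen2 score into the recommended verbal call."""
    probably, possibly = CUTOFFS[dataset]
    if score >= probably:
        return "probably damaging"
    if score >= possibly:
        return "possibly damaging"
    return "benign"


# ljb2_pp2hdiv scores copied from the example output.
hdiv_scores = {
    "rs11209026": 1.0,
    "rs2066844": 0.999,
    "rs2066845": 1.0,
    "rs2241880": 0.001,
}

calls = {rs: pp2_call(s, "hdiv") for rs, s in hdiv_scores.items()}
for rs, call in calls.items():
    print(rs, call)
```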

    If you want to have the "probably damaging", "possibly damaging" and "benign" calls, you can add the -otherinfo argument:

    [kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype ljb2_pp2hdiv -buildver hg19 ex1_hg19.human humandb -otherinfo
    NOTICE: the --dbtype ljb2_pp2hdiv is assumed to be in generic ANNOVAR database format
    NOTICE: Variants matching filtering criteria are written to ex1_hg19.human.hg19_ljb2_pp2hdiv_dropped, other variants are written to ex1_hg19.human.hg19_ljb2_pp2hdiv_filtered
    NOTICE: Processing next batch with 12 unique variants in 12 input lines
    NOTICE: Database index loaded. Total number of bins is 184437 and the number of bins to be scanned is 7
    NOTICE: Scanning filter database humandb/hg19_ljb2_pp2hdiv.txt...Done

    [kaiwang@biocluster ~/]$ cat ex1_hg19.human.hg19_ljb2_pp2hdiv_dropped
    ljb2_pp2hdiv 1.0,D 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
    ljb2_pp2hdiv 0.999,D 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
    ljb2_pp2hdiv 1.0,D 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
    ljb2_pp2hdiv 0.001,B 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease

    In the output, the scores and predictions are separated by a comma. There are three possible predictions: "D" ("probably damaging"), "P" ("possibly damaging") and "B" ("benign").
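    When post-processing this output, the second column has to be split at the comma. A minimal Python sketch, assuming the "score,letter" layout shown in the listing above (parse_score_field and PP2_LABELS are hypothetical names):

```python
PP2_LABELS = {
    "D": "probably damaging",
    "P": "possibly damaging",
    "B": "benign",
}


def parse_score_field(field):
    """Split a field such as '0.999,D' into (score, verbal prediction).

    The prediction is None when the field carries a score only, as in output
    produced without -otherinfo.
    """
    score, _, code = field.partition(",")
    return float(score), PP2_LABELS.get(code)


print(parse_score_field("0.999,D"))
print(parse_score_field("0.001,B"))
print(parse_score_field("1.0"))
```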


    • #3
      A bit late to this thread...
      ANNOVAR extracted these conservation scores from dbNSFP rather than from UCSC, whose scores can take both positive and negative values.
