Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • problem for running ANNOVAR

    Hello,

    I'm using annovar for the first time. I started by running the example provided in the ANNOVAR website. I got no problem, so I probably don't have a problem with the software, the version...
    Now I'm trying it on my data. I started with a very short tab-delimitated file:



    1 6589054 6589231 A S
    or
    1 6589054 6589231 A S comments: rs1000050, a SNP in Illumina SNP arrays
    I think that the "comments" column is not mandatory ; am I right ?

    In the example, the file was:
    1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
    I don't see any difference between the two files, so the problem should not come from my file.
    I also pay attention to be in the correct folder :

    [... annovar]# ls
    annotate_variation.pl coding_change.pl example retrieve_seq_from_fasta.pl VICJE_nonsense211011
    auto_annovar.pl convert2annovar.pl humandb summarize_annovar.pl
    [... annovar]# cd VICJE_nonsense211011/
    [... VICJE_nonsense211011]# ls
    VICJE_nonsense VICJE_nonsense~ VICJE_nonsense.txt~
    Here are the command lines that I used in both cases:
    perl annotate_variation.pl -geneanno example/ex1.human humandb/
    perl annotate_variation.pl -geneanno VICJE_nonsense211011/VICJE_nonsense
    The error message shown below seems to indicate that I provided the wrong input.

    Does someone see where is my problem ?
    Thanks for any help !

    Code:
    [... annovar]# perl annotate_variation.pl -geneanno VICJE_nonsense211011/VICJE_nonsense
    Syntax error
    Usage:
         annotate_variation.pl [arguments] <query-file|table-name> <database-location>
    
         Optional arguments:
                -h, --help                      print help message
                -m, --man                       print complete documentation
                -v, --verbose                   use verbose output
            
                Arguments to download databases or perform annotations
                    --downdb                    download UCSC Genome Browser annotation database
                    --geneanno                  annotate variants by functional consequences on genes
                    --regionanno                annotate variants by targetting specific genomics regions
                    --filter                    filter variants based on a position list
                    --webfrom <string>          specify the source of database (default usually works fine)
            
                Arguments to control input and output
                    --outfile <file>            output file prefix
                    --zerostart                 input query file uses half-open zero-start coordinate
                    --dbtype <string>           database type
                    --buildver <string>         genome build version (default: hg18 for human)
                    --gff3dbfile <file>         specify the GFF3 DB file used in region-based annotation
                    --genericdbfile <file>      specify the generic DB file used in filter-based annotation
                    --vcfdbfile <file>          specify the DB file in VCF format in filter-based annotation
                    --bedfile <file>            specify a BED file in region-based annotation
                    --time                      print out local time during program run
                    --separate                  separately print out all function of a variant (default: one line per variant)
                    --colsWanted <string>       specify which columns to output in -regionanno by comma-delimited numbers
                    --comment                   print out comment line (those starting with #) in output files 
                    --scorecolumn <int>         the column with scores in database file (for region-based annotation)
                    --exonsort                  sort the exon number in output line (for gene-based annotation)
                    --transcript_function       use transcript name rather than gene name in gene-based annotation output
                    --hgvs                      use HGVS format for exonic annotation (c.122C>T rather than c.C122T)
            
                Arguments to fine-tune the annotation procedure
                    --batchsize <int>           batch size for processing variants per batch (default: 5m)
                    --genomebinsize <int>       bin size to speed up search (default: 100k for -geneanno, 10k for -regionanno)
                    --expandbin <int>           check nearby bin to find neighboring genes (default: 2m/genomebinsize)
                    --neargene <int>            distance threshold to define upstream/downstream of a gene
                    --score_threshold <float>   minimum score of DB regions to use in annotation
                    --reverse                   reverse directionality to compare to score_threshold
                    --normscore_threshold <float> minimum normalized score of DB regions to use in annotation
                    --rawscore                  output includes the raw score (not normalized score) in UCSC Browser Track
                    --minqueryfrac <float>      minimum percentage of query overlap to define match to DB (default: 0)
                    --splicing_threshold <int>  distance between splicing variants and exon/intron boundary (default: 2)
                    --maf_threshold <float>     filter 1000G variants with MAF above this threshold (default: 0)
                    --sift_threshold <float>    SIFT threshold for deleterious prediction (default: 0.05)
                    --precedence <string>       comma-delimited to specify precedence of variant function (default: exonic>intronic...)
           
               Arguments to control memory usage
                    --memfree <int>             ensure minimum amount of free system memory (default: 100000, in the order of kb)
                    --memtotal <int>            limit total amount of memory used by ANNOVAR (default: 0, unlimited, in the order of kb)
                    --chromosome <string>       examine these specific chromosomes in database file
                
    
         Function: annotate a list of genetic variants against genome annotation 
         databases saved at local disk.
     
         Example: #download gene annotation database (for hg18 build) and save to humandb/ directory
                  annotate_variation.pl -downdb gene humandb/
                  annotate_variation.pl -buildver mm9 -downdb mce30way mousedb/
                  annotate_variation.pl -downdb snp130 humandb/
            
                  #gene-based annotation of variants in the varlist file (by default --geneanno is ON)
                  annotate_variation.pl ex1.human humandb/
              
                  #region-based annotate variants
                  annotate_variation.pl -regionanno -dbtype mce44way ex1.human humandb/
                  annotate_variation.pl -regionanno -dbtype gff3 -gff3dbfile tfbs.gff3 ex1.human humandb/
              
                  #filter rare or unreported variants (in 1000G/dbSNP) or predicted deleterious variants
                  annotate_variation.pl -filter -dbtype 1000g_ceu -maf 0.01 ex1.human humandb/
                  annotate_variation.pl -filter -dbtype snp130 ex1.human humandb/
                  annotate_variation.pl -filter -dbtype avsift ex1.human humandb/
     
         Version: $LastChangedDate: 2011-10-02 22:13:18 -0700 (Sun, 02 Oct 2011) $

  • #2
    I found some mistakes in my file: first, the positions to give in the columns 2 and 3 are the position(s) of the mutations (I gave the positions of the impacted exon). Moreover, I provided the wild and mut amino acids instead of the wild and mut nucleotides.

    I fixed these problems:
    1 6589165 6589165 G T
    but I still have the same error message ...

    Comment


    • #3
      I finally found my (big) mistake ! I forgot to provide the "humandb"directory !
      perl ./annotate_variation.pl -geneanno VICJE_nonsense211011/VICJE_nonsense humandb
      NOTICE: The --buildver is set as 'hg18' by default
      NOTICE: Reading gene annotation from humandb/hg18_refGene.txt ... Done with 38787 transcripts (including 6152 without coding sequence annotation) for 23270 unique genes
      NOTICE: Finished gene-based annotation on 5 genetic variants in VICJE_nonsense211011/VICJE_nonsense (including 4 with invalid format written to VICJE_nonsense211011/VICJE_nonsense.invalid_input)
      NOTICE: Output files were written to VICJE_nonsense211011/VICJE_nonsense.variant_function, VICJE_nonsense211011/VICJE_nonsense.exonic_variant_function
      I still have one problem: by default, hg18 is used. My alignment have been done on hg19. Do you know how to change it?

      Comment


      • #4
        I just wanted to say that
        To change the database variant to hg19, type
        -buildver hg19
        in your command. You have to download the hg19 databases for the refGene though, if you do not have this already.

        Boetsie

        Originally posted by Jane M View Post
        I finally found my (big) mistake ! I forgot to provide the "humandb"directory !


        I still have one problem: by default, hg18 is used. My alignment have been done on hg19. Do you know how to change it?

        Comment


        • #5
          Thanks ! It's working!
          Now I am using my complete list and the difficulty is to create the input file automatically. But it's no more a software problem !
          I will probably have more questions when getting the complete results !

          Comment


          • #6
            Good that it is working! I wrote myself a script to convert CLCBio's SNP's to Annovar's format. That's maybe the best thing to do.

            Good luck!

            Originally posted by Jane M View Post
            Thanks ! It's working!
            Now I am using my complete list and the difficulty is to create the input file automatically. But it's no more a software problem !
            I will probably have more questions when getting the complete results !

            Comment


            • #7
              Hello,

              I managed to run ANNOVAR on all my data set , but I have several questions concerning the results:


              1) When running:
              Code:
              perl ./annotate_variation.pl -geneanno $dir/$file humandb -buildver hg19
              to get an annotation of the variants by functional consequences on genes, I got:

              NOTICE: Reading gene annotation from humandb/hg19_refGene.txt ... Done with 38781 transcripts (including 6123 without coding sequence annotation) for 23261 unique genes
              NOTICE: Reading FASTA sequences from humandb/hg19_refGeneMrna.fa ... Done with 280 sequences
              WARNING: A total of 319 sequences will be ignored due to lack of correct ORF annotation

              ---------------------------------------------------------------------------------------
              WARNING: 83 exonic SNPs have WRONG reference alleles specified in your input file!
              WARNING: An example input line is <14 60950509 60950509 A G>
              WARNING: ANNOVAR can still annotate exonic_variant_function for the mutation correctly!
              WARNING: you may have used wrong -buildver, or specified incorrect reference allele, or used outdated mRNA FASTA file!
              ---------------------------------------------------------------------------------------

              NOTICE: Finished gene-based annotation on 179 genetic variants in VICJE_missense211011/VICJE_missense211011_file
              Is it possible that the 83 (among 179) results concerning these SNPs in the file are wrong?


              2) I am interested in the mutations occurring in the well conserved regions. So I am using:
              Code:
              perl annotate_variation.pl -regionanno -dbtype mce46way $dir/$file humandb -build hg19
              And I got file.hg19_phastConsElements46way:
              Code:
              mce46way	Score=243;Name=lod=13	1	6589165	6589165	G	T
              mce46way	Score=504;Name=lod=149	1	36915872	36915872	C	A
              I looked through the annovar website and I didn't find the description of the output file... Did I miss it? I don't know how to interpret the score and the name fields...
              Can I say that each mutation presents in this file take places in a region conserved in at least 46 species?


              3) I tried to identify a list of "benign" non-synonymous variants. Again, I haven't found any description of the result files, especially in the file.hg18_avsift_dropped file, which contains the "benign" non-synonymous variants.

              In my case for example, I get:
              Code:
              avsift	0.32	1	152191890	152191890	A	G
              avsift	0.82	1	158990190	158990190	T	G
              Does (AV)SIFT stands for Scale-invariant feature transform?

              I hope someone can help me on any of these points! Thanks!

              Comment


              • #8
                Originally posted by Jane M View Post
                1) When running:
                Code:
                perl ./annotate_variation.pl -geneanno $dir/$file humandb -buildver hg19
                to get an annotation of the variants by functional consequences on genes, I got:

                Is it possible that the 83 (among 179) results concerning these SNPs in the file are wrong?
                See the FAQ of Annovar:


                2) I am interested in the mutations occurring in the well conserved regions. So I am using:
                Code:
                perl annotate_variation.pl -regionanno -dbtype mce46way $dir/$file humandb -build hg19
                And I got file.hg19_phastConsElements46way:
                Code:
                mce46way	Score=243;Name=lod=13	1	6589165	6589165	G	T
                mce46way	Score=504;Name=lod=149	1	36915872	36915872	C	A
                I looked through the annovar website and I didn't find the description of the output file... Did I miss it? I don't know how to interpret the score and the name fields...
                See this page:
                http://www.openbioinformatics.org/an...ar_region.html at the section 'Most conserved element annotation'

                3) I tried to identify a list of "benign" non-synonymous variants. Again, I haven't found any description of the result files, especially in the file.hg18_avsift_dropped file, which contains the "benign" non-synonymous variants.
                "SIFT is a sequence homology-based tool that sorts intolerant from tolerant amino acid substitutions and predicts whether an amino acid substitution in a protein will have a phenotypic effect. The score ranges from 0 to 1. The amino acid substitution is predicted damaging is the score is <= 0.05, and tolerated if the score is > 0.05." See the help at their website:http://sift.jcvi.org/www/SIFT_help.html

                Comment


                • #9
                  Thanks for the links !
                  I am focusing on the point 2, the well preserved regions.
                  The description of the output file is
                  The second column is the normalized score assigned by UCSC Genome Browser, and this score range from 0 to 1000. (Note that the --score_threshold or --normscore_threshold can also be used to filter out specific variants with low conservation scores.) The second column also shows "Name=lod=x" which is used to tell the user the LOD score for the region.
                  In a way, the score represents the "probability" for a region to be really a preserved region... The range in my data is from 200 to 700. Is there a "common threshold" above which we can say that a region is a preserved region?

                  Then, the lod score is an other way to measure the "probability" for a region to be really a preserved region... Is it a tool more precise than the simple score?
                  I read that a "relation" is significant if the lod score is >=3. In my case, I have 115 mutations presents in the output file.
                  Can I conclude that 115 mutations occur in a significant way in well preserved regions?

                  I try to see the statistical conclusion, because there is probably a statistical test behind, to test each mutation. The null hypothesis can be "the mutation x is occurring in a conserved region" and the alternative hypothesis "the mutation x is not occurring in a conserved region"...

                  Comment


                  • #10
                    Hello again !

                    I haven't managed to fix my first problem :
                    perl ./annotate_variation.pl -geneanno $dir/$file humandb -build hg19
                    NOTICE: Reading gene annotation from humandb/hg19_refGene.txt ... Done with 38781 transcripts (including 6123 without coding sequence annotation) for 23261 unique genes
                    NOTICE: Reading FASTA sequences from humandb/hg19_refGeneMrna.fa ... Done with 280 sequences
                    WARNING: A total of 319 sequences will be ignored due to lack of correct ORF annotation
                    ---------------------------------------------------------------------------------------
                    WARNING: 83 exonic SNPs have WRONG reference alleles specified in your input file!
                    WARNING: An example input line is <14 60950509 60950509 A G>
                    WARNING: ANNOVAR can still annotate exonic_variant_function for the mutation correctly!
                    WARNING: you may have used wrong -buildver, or specified incorrect reference allele, or used outdated mRNA FASTA file!
                    I checked the link pointed out by boetie and the solutions are:
                    Why ANNOVAR complains "exonic SNPs have WRONG reference alleles " in gene-based annotation?

                    This happens when ANNOVAR thinks the "reference allele" in your input does not fit the "reference allele" in the mRNA FASTA file in ANNOVAR's database. This could be due to several reason, (1) wrong -buildver, or (2) you did not specify the correct reference allele, or (3) mRNA FASTA file is outdated as the gene model gets updated pretty quickly by UCSC.

                    To solve this problem, first check (1) and (2) to make sure that you did have the correct input. If you cannot find an error, then update the FASTA file by retrieve_seq_from_fasta.pl command, with more details here.
                    (1) wrong buildver: the sequencing have not been done in the lab but we were told that hg19 was used. That's why I'm using it.
                    I tried to use hg18 and I got:
                    perl ./annotate_variation.pl -geneanno $dir/$file humandb -build hg18
                    NOTICE: Reading gene annotation from humandb/hg18_refGene.txt ... Done with 38787 transcripts (including 6152 without coding sequence annotation) for 23270 unique genes
                    NOTICE: Reading FASTA sequences from humandb/hg18_refGeneMrna.fa ... Done with 6 sequences
                    WARNING: A total of 373 sequences will be ignored due to lack of correct ORF annotation
                    ---------------------------------------------------------------------------------------
                    WARNING: 2 exonic SNPs have WRONG reference alleles specified in your input file!
                    WARNING: An example input line is <5 140203159 140203159 C T>
                    WARNING: ANNOVAR can still annotate exonic_variant_function for the mutation correctly!
                    WARNING: you may have used wrong -buildver, or specified incorrect reference allele, or used outdated mRNA FASTA file!
                    There are only 2 SNPs with wrong alleles. Is it by chance that I have less WRONG reference alleles with hg18 than hg19, whereas hg19 is supposed to be the one used?

                    (2) correct input. Well, I cannot be sure about that, because we got the reference codons and the mutated codons. I've just selected the reference nucleotides and the mutated nucleotides.

                    (3) mRNA FASTA file is outdated: I installed ANNOVAR (the version of 11/22) few days ago so I doubt that this file is outdated...
                    I would like to try anyway, just in case. In the website, an example is given:
                    annotate_variation.pl --buildver panTro2 --downdb seq chimpdb/panTro2_seq
                    retrieve_seq_from_fasta.pl chimpdb/panTro2_refGene.txt -seqdir chimpdb/panTro2_seq -format refGene -outfile chimpdb/panTro2_refGeneMrna.fa
                    So I tried:
                    perl annotate_variation.pl -build hg19 --downdb seq humandb/hg19_seq
                    perl retrieve_seq_from_fasta.pl humandb/hg19_refGene.txt -seqdir humandb/human_seq -format refGene -outfile humandb/humandb_refGeneMrna.fa
                    but the first command didn't work...
                    perl annotate_variation.pl -build hg19 --downdb seq humandb/hg19_seq
                    NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
                    NOTICE: Downloading annotation database ftp://hgdownload.cse.ucsc.edu/golden...ps/chromFa.zip ... Failed
                    NOTICE: Downloading annotation database ftp://hgdownload.cse.ucsc.edu/golden...chromFa.tar.gz ... ^C
                    Does someone have a suggestion for my issue???

                    Comment


                    • #11
                      Another update to ANNOVAR was put online 3 days ago that fixes the warning about exonic snps having the wrong reference, so try downloading the latest version and doing it again. You can always plug your reference coordinates into UCSC and see if the bases match the reference bases you think are correct.

                      For your last download command, I think it should end with humandb/ and nothing after the "/".

                      Comment


                      • #12
                        Thanks Heisman. I am looking for the newest version (2011 Nov 28), but I cannot see a link on the website. I can download the 2011 Nov 20 only. Can you indicate where I can download it please?

                        I checked on UCSC some of the nucleotides which are given as my reference nucleotides and they do not agree with the UCSC database. But, they agree with the NCBI database (when checking the coding sequences only)...

                        My command line is still not working...
                        perl annotate_variation.pl -build hg19 --downdb seq humandb
                        NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
                        NOTICE: Downloading annotation database ftp://hgdownload.cse.ucsc.edu/golden...ps/chromFa.zip ... Failed
                        NOTICE: Downloading annotation database ftp://hgdownload.cse.ucsc.edu/golden...chromFa.tar.gz ... ^C
                        So I cannot run
                        perl retrieve_seq_from_fasta.pl humandb/hg19_refGene.txt -seqdir humandb/human_seq -format refGene -outfile humandb/humandb_refGeneMrna.fa

                        Comment


                        • #13
                          Same download link as before. Read the first couple of paragraphs: http://www.openbioinformatics.org/an..._download.html

                          A lot of the annovar databases are based on UCSC, I think, so that could give you problems if there is no option to specify that correctly in annovar.

                          As for : [code=perl annotate_variation.pl -build hg19 --downdb seq humandb[/url]

                          Where on the website does it describe downloading "seq"? I cannot find that.

                          Comment


                          • #14
                            Ok, so in fact I am using the new version of Nov 28 !
                            I thought it was the one of Nov 22 because of this
                            ANNOVAR (2011Nov20 version, ~79Mb) can be downloaded here (a mirror site here). This file was slightly updated a few days later to fix a bug in not printing out het/hom for indels for VCF files in convert2annovar.pl. Note that you can just run "annotate_variation.pl -downdb null ." to check if a new major release of ANNOVAR is available. Updated 2011Nov28: a slightly modifed version is uploaded that fixed the problem of complaining "new version is available" in the 2011Nov20 upload and that contains a slightly updated convert2annovar.pl (contributed by Dr. Hiroyuki MISHIMA) fixed issues in handling certain indels in 500SOLiD-LifeScope GFF3 files.
                            but I checked the modification time of the perl files.
                            Finally, I will try to understand why the references are different between UCSC and NCBI. Maybe I am misunderstanding something...

                            Thanks for your help !

                            Comment


                            • #15
                              Hello,

                              I still have the problem with the warning message: "83 exonic SNPs have WRONG reference alleles specified in your input file".
                              I think this is due to the fact that I'm using the information obtained in reverse sense:

                              Code:
                              SNP position chr         strand    wild.codon    mut.codon
                              60950509	chr14         -            AAA               GAA
                              78177183	chr14         +           CAG               AAG
                              Therefore, in the database, there is no AAA, but rather TTT in this example. Is there a way to use the information obtained in reverse sense in ANNOVAR? Or do I have to turn it as "direct sense" myself ?

                              Thanks

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              30 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              32 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X