  • Questions about ANNOVAR

    Hello everyone,

    I would like to determine whether my called SNPs are in coding regions and whether they affect the protein sequence, so I am using ANNOVAR for annotation.
    However, my target species is maize, which does not even have a UCSC-type annotation database. So I think I should convert my GFF3 maize annotation file to a UCSC-type file. Could you give me any suggestions about the format of the UCSC-type file, or any ideas for annotating maize SNPs?

    Here is a row from the file "hg18_refGene.txt" in the ANNOVAR example database:
    585 NR_028269 chr1 - 4224 7502 7502 7502 7 4224,4832,5658,6469,6719,7095,7468, 4692,4901,5810,6631,6918,7231,7502, 0 LOC100288778 unk unk -1,-1,-1,-1,-1,-1,-1,

    What is the meaning of this row?



    Thank you in advance
    Best wishes
    Xujie

  • #2
    In order to get ANNOVAR to work with Arabidopsis I had to build my own database from scratch like you.

    The fields are described on the ANNOVAR website:

    For a refGene file, each line has 16 tab-delimited columns: $bin, $name, $chr, $dbstrand, $txstart, $txend, $cdsstart, $cdsend, $exoncount, $exonstart, $exonend, $id, $name2, $cdsstartstat, $cdsendstat, $exonframes. The only really important fields are $name (transcript name), $chr (chromosome), $dbstrand (strand of the transcript in the reference genome), $txstart, $txend (transcription start and end), $cdsstart, $cdsend (translation start and end; remember that each transcript has 5' and 3' UTRs, so $cdsstart is not the same as $txstart), $exoncount (number of exons), and $exonstart, $exonend (comma-delimited exon start and end sites). Remember that all start sites use zero-based coordinates.
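To make the column layout concrete, here is a minimal Python sketch that splits one refGene row into the 16 named fields listed above and checks that the exon lists match $exoncount. It is only an illustration, not part of ANNOVAR: the names RefGeneRecord, parse_refgene_line and exons_one_based are made up for this example, and the input row is the hg18 line from the first post, rejoined with tabs.

```python
from typing import List, NamedTuple, Tuple


class RefGeneRecord(NamedTuple):
    """One refGene row; field order follows the 16 columns described above."""
    bin: str
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exon_starts: List[int]
    exon_ends: List[int]
    id: str
    name2: str
    cds_start_stat: str
    cds_end_stat: str
    exon_frames: str


def _int_list(text: str) -> List[int]:
    # Exon lists end with a trailing comma, so drop empty pieces.
    return [int(x) for x in text.split(",") if x]


def parse_refgene_line(line: str) -> RefGeneRecord:
    f = line.rstrip("\n").split("\t")
    if len(f) != 16:
        raise ValueError(f"expected 16 tab-delimited columns, got {len(f)}")
    rec = RefGeneRecord(
        f[0], f[1], f[2], f[3],
        int(f[4]), int(f[5]), int(f[6]), int(f[7]), int(f[8]),
        _int_list(f[9]), _int_list(f[10]),
        f[11], f[12], f[13], f[14], f[15],
    )
    if not (rec.exon_count == len(rec.exon_starts) == len(rec.exon_ends)):
        raise ValueError("exon count does not match the exon start/end lists")
    return rec


def exons_one_based(rec: RefGeneRecord) -> List[Tuple[int, int]]:
    # Starts are zero-based, so add 1 to get 1-based inclusive coordinates
    # (the convention used by GFF3); ends are left unchanged.
    return [(s + 1, e) for s, e in zip(rec.exon_starts, rec.exon_ends)]


example = "\t".join([
    "585", "NR_028269", "chr1", "-", "4224", "7502", "7502", "7502", "7",
    "4224,4832,5658,6469,6719,7095,7468,",
    "4692,4901,5810,6631,6918,7231,7502,",
    "0", "LOC100288778", "unk", "unk", "-1,-1,-1,-1,-1,-1,-1,",
])
rec = parse_refgene_line(example)
print(rec.name, rec.chrom, rec.strand, rec.exon_count)
print(exons_one_based(rec))
```

In this row $cdsstart equals $cdsend, which, as far as I know, is how UCSC-style tables mark a transcript with no coding region (consistent with the NR_ accession).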




    You can start by running gff3ToGenePred or gtfToGenePred (found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/) on your GFF3 or GTF file. The $bin, $id, $name2, $cdsstartstat, $cdsendstat, and $exonframes columns are not critical for ANNOVAR to function, but you will need something in those columns as filler for it to work.

    Here is a sample of what my Arabidopsis refgene file looks like:

    Code:
    1	AT5G01010.4	Chr5	-	1222	5061	1387	4924	16	1222,1571,1744,1913,2104,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,2007,2181,2509,2799,2934,3383,3659,3802,4005,4237,4467,4679,5061,	name	unk	unk	unk	unk
    1	AT5G01010.1	Chr5	-	1250	5043	1387	4924	15	1250,1571,1744,1913,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,1961,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,5043,	name	unk	unk	unk	unk
    1	AT5G01010.2	Chr5	-	1278	4994	1387	4924	16	1278,1571,1744,1913,2104,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,2007,2181,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,4994,	name	unk	unk	unk	unk
    1	AT5G01010.3	Chr5	-	1278	5043	1526	4924	14	1278,1744,1913,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1646,1780,1961,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,5043,	name	unk	unk	unk	unk
    1	AT5G01015.1	Chr5	-	5255	5891	5334	5769	2	5255,5696,	5576,5891,	name	unk	unk	unk	unk
    1	AT5G01015.2	Chr5	-	5366	5801	5515	5769	2	5366,5686,	5576,5801,	name	unk	unk	unk	unk
    It's been a while since I made this, but I think I had to manually add the last 5 columns using sed or something.
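As an alternative to sed, the padding step can be sketched in Python. This is a rough sketch under assumptions, not a tested pipeline: it assumes the converter output is either a basic 10-column genePred row (name through exon ends) or a 15-column extended row, and it prepends a dummy $bin so the result has the 16 columns ANNOVAR expects. The function name genepred_to_refgene and the filler values are hypothetical. Unlike the sample above, it puts the transcript name into $name2, since I believe ANNOVAR reports that column as the gene name; treat that as an assumption too.

```python
def genepred_to_refgene(line: str, bin_value: str = "1", filler: str = "unk") -> str:
    """Turn one genePred row into a 16-column refGene-style row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 10:
        # Basic genePred: append filler for $id, $name2, $cdsstartstat,
        # $cdsendstat and $exonframes. The transcript name is reused as
        # $name2 (an assumption, see the note above).
        name = fields[0]
        fields = fields + [filler, name, filler, filler, filler]
    elif len(fields) != 15:
        raise ValueError(f"expected 10 or 15 columns, got {len(fields)}")
    # Prepend the $bin column; the value is only a placeholder.
    return "\t".join([bin_value] + fields)


ten_col = "\t".join([
    "AT5G01015.1", "Chr5", "-", "5255", "5891", "5334", "5769", "2",
    "5255,5696,", "5576,5891,",
])
padded = genepred_to_refgene(ten_col)
print(padded)
```

Applied to every line of the converter output, this should give a file with the same shape as the Arabidopsis sample above.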

    Also, it's important to know that you need to name your database files "hg18_refGene.txt" and so on. Either that, or go into annotate_variation.pl and replace every instance of hg18 with the prefix of your database files. In my case I replaced hg18 with TAIR10. Otherwise ANNOVAR will complain that it cannot find the right files.
    Last edited by chadn737; 02-23-2012, 09:37 PM.



    • #3
      Thank you so much for your reply. The information means a lot to me.
