**chadn737** · 02-23-2012, 09:29 PM

In order to get ANNOVAR to work with Arabidopsis I had to build my own database from scratch like you.

The fields are described on the ANNOVAR website:

For refGene file, each line has 16 tab-delimited columns: $bin, $name, $chr, $dbstrand, $txstart, $txend, $cdsstart, $cdsend, $exoncount, $exonstart, $exonend, $id, $name2, $cdsstartstat, $cdsendstat, $exonframes. The only really important ones are $name (transcript name), $chr (chromosome), $dbstrand (strand of the transcript in the reference genome), $txstart and $txend (transcription start and end), $cdsstart and $cdsend (translation start and end; remember that there are 5'/3' UTRs in each transcript, so $cdsstart is not the same as $txstart), $exoncount (number of exons), and $exonstart and $exonend (comma-delimited exon start and end sites). Remember that all start sites use zero-based coordinates.

http://www.openbioinformatics.org/annovar/annovar_faq.html#othergenome
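
To make the column layout above concrete, here is a minimal Python sketch that reads one refGene line and checks the fields ANNOVAR cares about. The names `RefGeneRecord` and `parse_refgene_line` are made up for illustration and are not part of ANNOVAR; the sanity checks are my own assumptions about what a well-formed line should satisfy.

```python
from typing import List, NamedTuple


class RefGeneRecord(NamedTuple):
    """The important fields of one refGene line (illustrative names)."""
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exon_starts: List[int]
    exon_ends: List[int]


def parse_refgene_line(line: str) -> RefGeneRecord:
    """Split a 16-column refGene line and sanity-check it."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 16:
        raise ValueError(f"expected 16 tab-delimited columns, got {len(fields)}")
    # Column 0 is $bin and columns 11-15 are filler; both are ignored here.
    name, chrom, strand = fields[1:4]
    tx_start, tx_end, cds_start, cds_end, exon_count = (int(x) for x in fields[4:9])
    # The exon lists end with a trailing comma, so drop the empty piece.
    exon_starts = [int(x) for x in fields[9].split(",") if x]
    exon_ends = [int(x) for x in fields[10].split(",") if x]
    if not (len(exon_starts) == len(exon_ends) == exon_count):
        raise ValueError(f"{name}: exon count does not match the exon lists")
    if not (tx_start <= cds_start <= cds_end <= tx_end):
        raise ValueError(f"{name}: CDS is not contained in the transcript")
    return RefGeneRecord(name, chrom, strand, tx_start, tx_end,
                         cds_start, cds_end, exon_count, exon_starts, exon_ends)
```

Running every line of a home-built file through a check like this before handing it to ANNOVAR is a cheap way to catch a missing column or a mismatched exon list.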

You can start by running gff3ToGenePred or gtfToGenePred (found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/) on your GFF3 or GTF file. The $bin, $id, $name2, $cdsstartstat, $cdsendstat, and $exonframes columns are not critical for ANNOVAR to function, but you will need something in those columns as filler for it to work.

Here is a sample of what my Arabidopsis refgene file looks like:

Code:

1	AT5G01010.4	Chr5	-	1222	5061	1387	4924	16	1222,1571,1744,1913,2104,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,2007,2181,2509,2799,2934,3383,3659,3802,4005,4237,4467,4679,5061,	name	unk	unk	unk	unk
1	AT5G01010.1	Chr5	-	1250	5043	1387	4924	15	1250,1571,1744,1913,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,1961,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,5043,	name	unk	unk	unk	unk
1	AT5G01010.2	Chr5	-	1278	4994	1387	4924	16	1278,1571,1744,1913,2104,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,2007,2181,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,4994,	name	unk	unk	unk	unk
1	AT5G01010.3	Chr5	-	1278	5043	1526	4924	14	1278,1744,1913,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1646,1780,1961,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,5043,	name	unk	unk	unk	unk
1	AT5G01015.1	Chr5	-	5255	5891	5334	5769	2	5255,5696,	5576,5891,	name	unk	unk	unk	unk
1	AT5G01015.2	Chr5	-	5366	5801	5515	5769	2	5366,5686,	5576,5801,	name	unk	unk	unk	unk

It's been a while since I made this, but I think I had to add the last 5 columns manually using sed or something similar.
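
For anyone who would rather not do the padding with sed, here is a rough Python sketch of the same idea. It assumes the converter wrote plain 10-column genePred lines (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds); check your own output first, since that is an assumption. The filler values just copy the ones in the sample above, and `pad_genepred_line` and `pad_genepred_file` are hypothetical helper names.

```python
from typing import Iterable, Iterator

# Filler for $id, $name2, $cdsstartstat, $cdsendstat, $exonframes,
# copied from the sample refGene lines above.
FILLER_TAIL = ["name", "unk", "unk", "unk", "unk"]


def pad_genepred_line(line: str, bin_value: str = "1") -> str:
    """Turn one 10-column genePred line into a 16-column refGene line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 genePred columns, got {len(fields)}")
    # Prepend a placeholder $bin and append the five filler columns.
    return "\t".join([bin_value] + fields + FILLER_TAIL)


def pad_genepred_file(lines: Iterable[str]) -> Iterator[str]:
    """Pad every non-empty line; blank lines are skipped."""
    for line in lines:
        if line.strip():
            yield pad_genepred_line(line)
```

Usage would be something like opening the genePred file, passing the file object to `pad_genepred_file`, and writing each returned line plus a newline to the new database file.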

Also, it's important to know that you need to name your database files "hg18_refgene" and so on. Either that, or go into annotate_variation.pl and replace every instance of hg18 with the name of your database files. In my case I replaced hg18 with TAIR10. Otherwise ANNOVAR will complain that it cannot find the right files.
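
If you go the route of editing the script, the search-and-replace can be sketched in Python as below. This is only an illustration of the hg18 to TAIR10 swap described above: it writes a patched copy instead of touching the original, the output filename in the usage note is hypothetical, and a blind text replacement may also hit comments or help text, so review the result.

```python
from pathlib import Path
from typing import Tuple


def replace_build_name(script_text: str, old: str = "hg18",
                       new: str = "TAIR10") -> Tuple[str, int]:
    """Return the text with every `old` replaced by `new`, plus the hit count."""
    count = script_text.count(old)
    return script_text.replace(old, new), count


def patch_script(src: str, dst: str, old: str = "hg18",
                 new: str = "TAIR10") -> int:
    """Write a patched copy of `src` to `dst`; the original file is left alone."""
    patched, count = replace_build_name(Path(src).read_text(), old, new)
    Path(dst).write_text(patched)
    return count
```

For example, `patch_script("annotate_variation.pl", "annotate_variation_TAIR10.pl")` would return the number of replacements made, which is worth a glance before running the patched copy.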

**xujie** · 02-23-2012, 10:27 PM

Thank you so much for your reply. The information means a lot to me.


Questions about ANNOVAR

Comment

Comment

Latest Articles

ad_right_rmr

News