SEQanswers

02-23-2012, 08:46 PM   #1
xujie
Member
 
Location: China

Join Date: Nov 2010
Posts: 11
Questions about ANNOVAR

Hello everyone,

I would like to determine whether my called SNPs are in coding regions and whether they affect the protein sequence, so I am using ANNOVAR for annotation.
However, my target species is maize, which does not even have a UCSC-type annotation database, so I think I should convert my maize GFF3 annotation file to a UCSC-type file. Could you give me any suggestions about the format of the UCSC-type file, or any other ideas for annotating maize SNPs?

The file "hg18_refGene.txt" in the ANNOVAR example database contains rows like this one:
585 NR_028269 chr1 - 4224 7502 7502 7502 7 4224,4832,5658,6469,6719,7095,7468, 4692,4901,5810,6631,6918,7231,7502, 0 LOC100288778 unk unk -1,-1,-1,-1,-1,-1,-1,

What do the fields in this row mean?



Thank you in advance
Best wishes
Xujie
02-23-2012, 09:29 PM   #2
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392

To get ANNOVAR to work with Arabidopsis, I had to build my own database from scratch, as you will need to.

The fields are described on the ANNOVAR website:

For a refGene file, each line has 16 tab-delimited columns: $bin, $name, $chr, $dbstrand, $txstart, $txend, $cdsstart, $cdsend, $exoncount, $exonstart, $exonend, $id, $name2, $cdsstartstat, $cdsendstat, $exonframes. The only really important fields are $name (transcript name), $chr (chromosome), $dbstrand (strand of the transcript in the reference genome), $txstart, $txend (transcription start and end), $cdsstart, $cdsend (translation start and end; remember that each transcript has 5' and 3' UTRs, so $cdsstart is not the same as $txstart), $exoncount (number of exons), and $exonstart, $exonend (comma-delimited exon start and end sites). Remember that all start sites use zero-based coordinates.


http://www.openbioinformatics.org/an...ml#othergenome
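
To make the column description concrete, here is a minimal Python sketch that splits one refGene row into the 16 named fields listed above. The helper name `parse_refgene_line` and the extra `exons` key are made up for illustration and are not part of ANNOVAR. The input is the hg18 row from the first post, joined with tabs.

```python
# Column names follow the ANNOVAR description quoted above.
REFGENE_COLUMNS = [
    "bin", "name", "chr", "dbstrand", "txstart", "txend",
    "cdsstart", "cdsend", "exoncount", "exonstart", "exonend",
    "id", "name2", "cdsstartstat", "cdsendstat", "exonframes",
]


def parse_refgene_line(line):
    """Split one refGene row into a dict keyed by column name (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(REFGENE_COLUMNS):
        raise ValueError(f"expected 16 columns, got {len(fields)}")
    rec = dict(zip(REFGENE_COLUMNS, fields))
    for key in ("txstart", "txend", "cdsstart", "cdsend", "exoncount"):
        rec[key] = int(rec[key])
    # The exon lists end with a trailing comma, so skip the empty last item.
    starts = [int(x) for x in rec["exonstart"].split(",") if x]
    ends = [int(x) for x in rec["exonend"].split(",") if x]
    if not (len(starts) == len(ends) == rec["exoncount"]):
        raise ValueError("exon lists do not match exoncount")
    # Pair each zero-based start with its end; this key is added for convenience.
    rec["exons"] = list(zip(starts, ends))
    return rec


# The hg18 example row from the first post.
row = "\t".join([
    "585", "NR_028269", "chr1", "-", "4224", "7502", "7502", "7502", "7",
    "4224,4832,5658,6469,6719,7095,7468,",
    "4692,4901,5810,6631,6918,7231,7502,",
    "0", "LOC100288778", "unk", "unk", "-1,-1,-1,-1,-1,-1,-1,",
])
rec = parse_refgene_line(row)
print(rec["name"], rec["chr"], rec["dbstrand"], rec["exons"])
```

In this row $cdsstart equals $cdsend (both 7502), so the transcript has no coding region. That is consistent with the NR_ accession, which RefSeq uses for non-coding RNAs.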

You can start by using the gff3ToGenePred or gtfToGenePred (found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/) on your GFF3 or GTF file. The $bin, $id, $name2, $cdsstartstat, $cdsendstat, and $exonframes are not critical for ANNOVAR function, but you will need something in those columns just as filler for it to work.

Here is a sample of what my Arabidopsis refgene file looks like:

Code:
1	AT5G01010.4	Chr5	-	1222	5061	1387	4924	16	1222,1571,1744,1913,2104,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,2007,2181,2509,2799,2934,3383,3659,3802,4005,4237,4467,4679,5061,	name	unk	unk	unk	unk
1	AT5G01010.1	Chr5	-	1250	5043	1387	4924	15	1250,1571,1744,1913,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,1961,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,5043,	name	unk	unk	unk	unk
1	AT5G01010.2	Chr5	-	1278	4994	1387	4924	16	1278,1571,1744,1913,2104,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,2007,2181,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,4994,	name	unk	unk	unk	unk
1	AT5G01010.3	Chr5	-	1278	5043	1526	4924	14	1278,1744,1913,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1646,1780,1961,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,5043,	name	unk	unk	unk	unk
1	AT5G01015.1	Chr5	-	5255	5891	5334	5769	2	5255,5696,	5576,5891,	name	unk	unk	unk	unk
1	AT5G01015.2	Chr5	-	5366	5801	5515	5769	2	5366,5686,	5576,5801,	name	unk	unk	unk	unk
It's been a while since I made this, but I think I had to manually add the last 5 columns using sed or something similar.
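
The padding step could also be scripted instead of done by hand. Below is a rough Python sketch, not the exact procedure used for the file above. It assumes the converter output starts with the 10 basic genePred columns (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds). It prepends a dummy $bin and fills any missing trailing columns with the same placeholders shown in the sample. The function name `genepred_to_refgene` is hypothetical.

```python
def genepred_to_refgene(line, bin_value="1",
                        filler=("name", "unk", "unk", "unk", "unk")):
    """Turn one genePred row into a 16-column refGene-style row (hypothetical helper).

    Assumes the first 10 columns are the basic genePred fields. Any of the
    5 trailing columns already present are kept; the rest get placeholders.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError(f"expected at least 10 columns, got {len(fields)}")
    core = fields[:10]
    extra = fields[10:15]
    # Pad up to 5 trailing columns: $id, $name2, $cdsstartstat, $cdsendstat, $exonframes.
    extra += list(filler[len(extra):])
    return "\t".join([bin_value] + core + extra)


# A 10-column row matching the AT5G01015.1 transcript in the sample above.
genepred_row = "\t".join([
    "AT5G01015.1", "Chr5", "-", "5255", "5891", "5334", "5769", "2",
    "5255,5696,", "5576,5891,",
])
refgene_row = genepred_to_refgene(genepred_row)
print(refgene_row)
```

To convert a whole file, apply the function to every line of the converter output and write the results to the new refGene file.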

Also, it's important to know that you need to name your database files "hg18_refGene.txt" and so on. Either that, or go into annotate_variation.pl and replace every instance of hg18 with the prefix of your database files. In my case I replaced hg18 with TAIR10. Otherwise ANNOVAR will complain that it cannot find the right files.
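
As a small illustration of the naming step, this Python sketch copies a finished table into a database directory under a `<build>_refGene.txt` name. The pattern is inferred from the "hg18_refGene.txt" file mentioned in the first post. The helper `install_refgene`, the directory name `plantdb`, and the source file name are assumptions for the example. The demo runs in a temporary directory so it touches no real files.

```python
import shutil
import tempfile
from pathlib import Path


def install_refgene(src, db_dir, build):
    """Copy a refGene-style table to <db_dir>/<build>_refGene.txt (hypothetical helper)."""
    db_dir = Path(db_dir)
    db_dir.mkdir(parents=True, exist_ok=True)
    dest = db_dir / f"{build}_refGene.txt"
    shutil.copyfile(src, dest)
    return dest


# Demo with one row from the Arabidopsis sample, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "my_refgene.tsv"
    src.write_text(
        "1\tAT5G01015.1\tChr5\t-\t5255\t5891\t5334\t5769\t2\t"
        "5255,5696,\t5576,5891,\tname\tunk\tunk\tunk\tunk\n"
    )
    dest = install_refgene(src, Path(tmp) / "plantdb", "hg18")
    installed_name = dest.name
    installed_ok = dest.read_text() == src.read_text()

print(installed_name)
```

Passing "TAIR10" instead of "hg18" as the build prefix would only make sense together with the script change described above.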

Last edited by chadn737; 02-23-2012 at 09:37 PM.
02-23-2012, 10:27 PM   #3
xujie
Member
 
Location: China

Join Date: Nov 2010
Posts: 11

Thank you so much for your reply. The information means a lot to me.