  • How to extract an Ensembl GenBank annotation file

    I have downloaded 1.3 GB from here:



    It is listed as "Annotated sequence (GenBank)" for Homo sapiens.

    This is quite a few *.dat.gz files, but how do I get the .gbk (GenBank) file extracted?

  • #2
    What I'm looking for is a "GRCh37.74.gbk" file with annotations.

    Or I would actually prefer the NCBI hg19 version.



    • #3
      Genbank format files for each chromosome are available here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/

      Ensembl 37.74 annotation file: ftp://ftp.ensembl.org/pub/release-74...Ch37.74.gtf.gz

      NCBI annotations: ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/GFF/
      Last edited by GenoMax; 02-01-2014, 08:23 PM.



      • #4
        Thank you very much!
        These look interesting, though I'm not sure what the difference is.

        ref_GRCh37.p13_scaffolds.gff3.gz
        ref_GRCh37.p13_top_level.gff3.gz


        Also, how may I convert from GFF3 to GenBank?



        • #5
          Originally posted by HTnoob
          Thank you very much!
          These look interesting, though I'm not sure what the difference is.

          ref_GRCh37.p13_scaffolds.gff3.gz
          ref_GRCh37.p13_top_level.gff3.gz


          Also, how may I convert from GFF3 to GenBank?

          Explanations are in this README file: ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/README


          {alt,ref}_{assembly_name}_scaffolds.gff3.gz
          -------------------------------------------
          Features annotated on {assembly_name} in scaffold coordinates.
          {alt,ref}_{assembly_name}_top_level.gff3.gz
          -------------------------------------------
          Features annotated on {assembly_name} in top-level object coordinates.
          The top-level objects are: assembled chromosomes, unlocalized
          scaffolds (those scaffolds that are associated with a specific
          chromosome but which cannot be ordered or oriented on that
          chromosome), unplaced scaffolds (those scaffolds that are not
          associated with any chromosome), and in some cases scaffolds from
          alternate locus groups or genome patches (see the NCBI Assembly Model
          web page for an explanation of these terms:
          http://www.ncbi.nlm.nih.gov/genome/assembly/model).

          If you can explain what it is that you are trying to do, there may be a simple option available.
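
One way to see the difference between the two files described in the README excerpt is to tally which sequences the features sit on. The sketch below is only an illustration, not something from the NCBI documentation: the function name `count_features_by_prefix` is made up for this example, and it assumes the first GFF3 column holds RefSeq accessions, where `NC_` marks assembled chromosomes and `NT_`/`NW_` mark scaffolds.

```python
import gzip
from collections import Counter


def count_features_by_prefix(gff3_path):
    """Count GFF3 feature lines per accession prefix of column 1 (e.g. NC, NT, NW).

    Hypothetical helper for illustration; works on plain or gzipped GFF3.
    """
    opener = gzip.open if str(gff3_path).endswith(".gz") else open
    counts = Counter()
    with opener(gff3_path, "rt") as handle:
        for line in handle:
            # An optional embedded FASTA section ends the feature table.
            if line.startswith("##FASTA"):
                break
            # Skip blank lines, directives, and comments.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            # A valid GFF3 feature line has exactly nine tab-separated columns.
            if len(fields) != 9:
                continue
            seqid = fields[0]
            counts[seqid.split("_", 1)[0]] += 1
    return counts
```

Calling it on `ref_GRCh37.p13_top_level.gff3.gz` and on `ref_GRCh37.p13_scaffolds.gff3.gz` should show, if the assumption about accession prefixes holds, that the first places features on chromosome accessions while the second uses scaffold accessions only.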



          • #6
            Thank you again!

            I'm trying to set up a searchable SNP database, as described here:

            Background: A typical bacterial pathogen genome mapping project can identify thousands of single nucleotide polymorphisms (SNP). Interpreting SNP data is complex and it is difficult to conceptualise the data contained within the large flat files that are the typical output from most SNP calling algorithms. One solution to this problem is to construct a database that can be queried using simple commands so that SNP interrogation and output is both easy and comprehensible.

            Results: Here we present snp-search, a tool that manages SNP data and allows for manipulation and searching of SNP data. After creation of a SNP database from a VCF file, snp-search can be used to convert the selected SNP data into FASTA sequences, construct phylogenies, look for unique SNPs, and output contextual information about each SNP. The FASTA output from snp-search is particularly useful for the generation of robust phylogenetic trees that are based on SNP differences across the conserved positions in whole genomes. Queries can be designed to answer critical genomic questions such as the association of SNPs with particular phenotypes.

            Conclusions: snp-search is a tool that manages SNP data and outputs useful information which can be used to test important biological hypotheses.


            To make the database I need two files: the reference genome used to make the VCF file (in EMBL or GenBank format) and the VCF file itself. Like this:

            snp-search -create -d my_snp_db.sqlite3 -r my_ref.gbk -v my_vcf_file.vcf


            I used the NCBI hg19 genes.gtf annotation file. The best thing would be to convert this GTF to GenBank/EMBL, but I had no luck doing that.



            • #7
              I do not think there is a single GenBank file available for the human genome. There are GenBank files available for individual chromosomes. I suppose you could merge them into a single file.
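
The merge suggested above can be sketched as a plain concatenation, since a GenBank flat file may hold several records, each terminated by a `//` line. This is a minimal sketch under assumptions: the function name `merge_genbank_files` and the file names in the usage note are hypothetical, the `.dat.gz` files from the first post are assumed to be gzipped GenBank flat files, and whether snp-search accepts a multi-record human-sized file has not been checked here.

```python
import gzip
from pathlib import Path


def merge_genbank_files(input_paths, output_path):
    """Concatenate GenBank flat files (plain or gzipped) into one multi-record file.

    Hypothetical helper for illustration; records are copied unchanged.
    """
    with open(output_path, "wb") as out:
        for path in input_paths:
            path = Path(path)
            # Transparently decompress files ending in .gz.
            opener = gzip.open if path.suffix == ".gz" else open
            last_line = b""
            with opener(path, "rb") as src:
                # Stream line by line so large chromosome files are not loaded whole.
                for line in src:
                    out.write(line)
                    last_line = line
            # Keep records from fusing if a file lacks a trailing newline.
            if last_line and not last_line.endswith(b"\n"):
                out.write(b"\n")
    return output_path
```

For example, `merge_genbank_files(sorted(Path("genbank_dir").glob("*.dat.gz")), "merged.gbk")` would write all records into `merged.gbk`; note that `sorted` orders the files by name, not by chromosome number.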

              The paper you referred to above seems to be using small files (plasmids) for creating the database. You will need to check whether the tool can handle a human-genome-sized dataset.

              On a different note, why aren't you searching against dbSNP, and what do you want to get from the search (sequence)?



              • #8
                Good points!

                We have run GATK on RNA-seq samples and want to explore the results. Personally, I don't know which questions might be answered.



                • #9
                  Hi HTnoob,
                  I ran into the same problem trying to run snp-search. I will try to find a solution for this issue. Meanwhile, if you come up with a good solution I would appreciate you posting it here.

                  Cheers
                  MSc Bioinformatics student at the Free University Berlin, Germany
