  • get Gene Coordinates of human genes names

    Dear all,

    I have a list of gene names; for example, the first five are:
    ABCB1
    ABCG2
    ACHE
    ACVR2A
    ACVR2B

    I want to get the genomic coordinates of these genes on the human genome (hg19). I want one record per gene, so I do not want information about exons, UTRs, etc. For example, I want the output to look as follows (or to be in any standard format, e.g. BED, GTF, GFF):
    ABCB1 gene chr1 120434 134324 +
    ABCG2 gene chr1 324312 393431 -
    ...etc

    I tried querying the list at UCSC, but it returned information about all exons and UTRs and nothing at the gene level.

  • #2
    Go to the UCSC Genome Browser, browse the tables and download the RefGene track. It has genomic coordinates and a "name2" field that holds the HGNC gene symbol. Of course, each gene can have several transcripts (NM_* identifiers), so you either use all of them or only the longest one (often treated as the canonical one).
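
A minimal Python sketch of the "longest one" option, assuming the table was exported tab-separated with all fields (bin, name, chrom, strand, txStart, txEnd, ..., name2 in column 13). The function name and the demo rows are made up for illustration, and "longest" here means the largest genomic span (txEnd - txStart), which is only one possible definition.

```python
import csv
import io


def longest_transcript_per_gene(lines):
    """Return {gene_symbol: (chrom, txStart, txEnd, strand, transcript_id)}.

    Assumes tab-separated refGene rows in the "all fields" layout, where
    column 2 is the transcript name and column 13 is name2 (gene symbol).
    """
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 13 or row[0].startswith("#"):
            continue  # skip the header line and short/blank rows
        name, chrom, strand = row[1], row[2], row[3]
        start, end = int(row[4]), int(row[5])
        gene = row[12]
        current = best.get(gene)
        # Keep the transcript with the largest genomic span seen so far.
        if current is None or (end - start) > (current[2] - current[1]):
            best[gene] = (chrom, start, end, strand, name)
    return best


# Made-up rows for illustration only (not real identifiers or coordinates).
demo = io.StringIO(
    "#bin\tname\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\t"
    "exonCount\texonStarts\texonEnds\tscore\tname2\n"
    "0\tNM_EXAMPLE1\tchr1\t+\t100\t500\t100\t500\t1\t100,\t500,\t0\tGENEA\n"
    "0\tNM_EXAMPLE2\tchr1\t+\t100\t900\t100\t900\t1\t100,\t900,\t0\tGENEA\n"
)
result = longest_transcript_per_gene(demo)
for gene, (chrom, start, end, strand, tx) in result.items():
    print(gene, "gene", chrom, start, end, strand, tx)
```

Note that UCSC table coordinates are 0-based, half-open, so txStart needs +1 if the output is meant to be GFF/GTF-style 1-based.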

    • #3
      Thanks @mikesh.
      I think you mean the "RefSeq" track rather than the "RefGene" track, correct?

      I followed the steps you described and got what I wanted. However, I am wondering whether there is a tool, on Galaxy or elsewhere, that picks the longest (canonical) transcript from the output so that I have one record per gene.

      Many thanks!

      • #4
        @Fernas,

        For hg19, export the "RefSeq Gene" track and merge the overlapping transcripts of the same gene; that will give you what you want.

        For GRCh37.p13, download "ref_GRCh37.p13_top_level.gff3.gz" from the NCBI FTP site ( ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/GFF/ )
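
For the GFF3 route, here is a rough Python sketch of pulling one row per gene. It assumes the file follows the standard nine-column GFF3 layout, that gene-level rows have the type "gene", and that the symbol is stored in a "Name" attribute; the function name and the demo lines are invented for illustration.

```python
def gff3_gene_records(lines, wanted):
    """Yield (name, seqid, start, end, strand) for 'gene' rows in `wanted`.

    Assumes standard 9-column GFF3 with key=value attributes separated by
    semicolons, and the gene symbol stored under the Name key.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # ignore mRNA, exon, CDS and other feature types
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        name = attrs.get("Name")
        if name in wanted:
            yield name, cols[0], int(cols[3]), int(cols[4]), cols[6]


# Made-up lines for illustration only (coordinates are not real).
demo_lines = [
    "##gff-version 3\n",
    "NC_000007.13\tBestRefSeq\tgene\t1000\t2000\t.\t-\t.\tID=gene1;Name=ABCB1\n",
    "NC_000007.13\tBestRefSeq\tmRNA\t1000\t2000\t.\t-\t.\tID=rna1;Parent=gene1;Name=ABCB1\n",
    "NC_000004.11\tBestRefSeq\tgene\t3000\t4000\t.\t-\t.\tID=gene2;Name=OTHER\n",
]
records = list(gff3_gene_records(demo_lines, {"ABCB1", "ABCG2"}))
print(records)
```

The first column of NCBI GFF3 files is a sequence accession rather than a "chr" name, so a separate accession-to-chromosome mapping may be needed.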

        • #5
          You are going to have to do some parsing if you need only one record per gene. As people have pointed out, you can get the GFF or GTF file (ftp://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens). Then you could do the following (assuming that your file of gene names is called "genes"):

          Code:
          # ensembl.gtf stands for the downloaded, uncompressed GTF/GFF file; -F and -w match the names literally and as whole words
          $ grep -w -F -f genes ensembl.gtf > gtf_records_you_need
          Then you will have to parse the resulting file to get the longest entry (if that is what you need) in the exact format you need.
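
One way that parsing step could look, as a hedged Python sketch: it assumes Ensembl-style GTF attributes containing `gene_name "SYMBOL";` and collapses every row of a gene to its smallest start and largest end, which gives the full gene span rather than the longest single transcript. The function name and the demo lines are made up.

```python
import re

# Assumes attributes of the form: gene_name "ABCB1";
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gene_spans_from_gtf(lines, wanted):
    """Return {(gene, chrom, strand): (min_start, max_end)} for wanted genes."""
    spans = {}
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        match = GENE_NAME.search(cols[8])
        # Exact comparison, so ABCB1 does not also pick up ABCB10.
        if not match or match.group(1) not in wanted:
            continue
        key = (match.group(1), cols[0], cols[6])
        start, end = int(cols[3]), int(cols[4])
        old = spans.get(key)
        if old is None:
            spans[key] = (start, end)
        else:
            spans[key] = (min(old[0], start), max(old[1], end))
    return spans


# Made-up lines for illustration only.
demo_gtf = [
    '7\tprotein_coding\texon\t100\t200\t.\t-\t.\tgene_id "G1"; transcript_id "T1"; gene_name "ABCB1";\n',
    '7\tprotein_coding\texon\t300\t400\t.\t-\t.\tgene_id "G1"; transcript_id "T2"; gene_name "ABCB1";\n',
    '1\tprotein_coding\texon\t50\t60\t.\t+\t.\tgene_id "G2"; transcript_id "T3"; gene_name "ABCB10";\n',
]
spans = gene_spans_from_gtf(demo_gtf, {"ABCB1"})
for (gene, chrom, strand), (start, end) in spans.items():
    print(gene, "gene", chrom, start, end, strand)
```

Grouping by chromosome and strand as well as by name keeps same-named entries on different sequences (for example alternate haplotype patches) from being merged into one span.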

          • #6
            Thanks a lot @jameslz and @GenoMax.

            If I want one record per gene, what is the best strategy for processing the output file? Should I pick the longest entry, or merge the overlapping entries (using the bedtools mergeBed tool)?

            • #7
              I wrote a Perl script to parse the information from the "RefSeq Gene" track.

              The result looks like this:
              #gene chromosome chromosome_length locus transcript_number transcripts transcript_location
              FIBCD1 chr9 141213431 133777824-133814455 2 NM_032843;NM_001145106 133777824-133814239|133777824-133814455

              • #8
                Thanks @jameslz for your reply.

                Is this script available on the web?
                One more question: how did you define the locus start and end positions? Is the locus start the start of the transcript closest to the chromosome start, and the locus end the end of the furthest transcript?

                • #9
                  @Fernas
                  I use the following procedure:

                  1. Sort all transcripts of the same gene by location.
                  2. Merge the overlapping transcripts.
                  3. Use the leftmost and rightmost positions as the locus start and end.

                  I can share the Perl script and the final result. [email protected]
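
The three steps above can be sketched in Python roughly as follows. This is not the Perl script mentioned in the post, only an illustration of the same idea; the function names are invented, and the demo uses the two FIBCD1 transcript ranges from the example output earlier in the thread.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Sort (start, end) pairs and merge the ones that overlap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous block: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]


def gene_loci(transcripts):
    """transcripts: iterable of (gene, chrom, start, end).

    Returns {(gene, chrom): (leftmost, rightmost, merged_blocks)}.
    """
    grouped = defaultdict(list)
    for gene, chrom, start, end in transcripts:
        grouped[(gene, chrom)].append((start, end))
    loci = {}
    for key, intervals in grouped.items():
        blocks = merge_intervals(intervals)
        loci[key] = (blocks[0][0], blocks[-1][1], blocks)
    return loci


# The two FIBCD1 transcripts from the example output above.
loci = gene_loci([
    ("FIBCD1", "chr9", 133777824, 133814239),
    ("FIBCD1", "chr9", 133777824, 133814455),
])
print(loci[("FIBCD1", "chr9")][:2])
```

If the transcripts of a gene do not all overlap, `merged_blocks` holds more than one block, which is a useful warning sign that the gene name may map to several separate loci.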

                  • #10
                    If you are willing please post the script here. (Use Edit --> Go Advanced --> Then use the "paper clip" icon to attach the file to a post).

                    This way you would be helping others who may have a need for something similar.

                    • #11
                      @GenoMax, OK!
                      The Perl script uses the UCSC RefSeq track (exported in "all fields from selected table" format), for example:

                      #bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
                      0 NM_032291 chr1 + 66999824 67210768 67000041 67208778 25 66999824,67091529,67098752,67101626,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755, 67000051,67091593,67098777,67101698,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67210768, 0 SGIP1 cmpl cmpl 0,1,2,0,0,0,1,0,0,0,1,2,1,1,1,1,0,1,1,2,2,0,2,1,1,
                      1 NM_032785 chr1 - 48998526 50489626 48999844 50489468 14 48998526,49000561,49005313,49052675,49056504,49100164,49119008,49128823,49332862,49511255,49711441,50162984,50317067,50489434, 48999965,49000588,49005410,49052838,49056657,49100276,49119123,49128913,49332902,49511472,49711536,50163109,50317190,50489626, 0 AGBL4 cmpl cmpl 2,2,1,0,0,2,1,1,0,2,0,1,1,0,

                      usage: perl track_trans.pl hg19_refGene.tbl hg19_refGene_trans.tbl
                      Attached Files
                      Last edited by jameslz; 11-17-2013, 08:34 PM.

                      • #12
                        Here's my attempt [attached], which parses GTF output as linked by GenoMax. It might work with GFF files as well, but I haven't tested that (changes may be needed for the regular expression on line 112). It's a little bit over-engineered due to being derived from something else, for hackability purposes, and because I'm trying to get used to this pod documentation stuff. The usual "there will be bugs" disclaimer applies. Here's the command line syntax:

                        Code:
                        $ ./gtf2genePos.pl -help
                        Usage:
                            ./gtf2genePos.pl <lookup GTF file>
                        
                            output:
                              a CSV file containing gene names, and locations
                        
                          Basic Options:
                            -summarise
                              Produce gene summaries, rather than individual region information
                        
                            -list *file*
                              Filter gene names by including only genes from this list file
                        
                            -help
                              Show this help message
                        
                            -v
                              increase verbosity of output
                        Attached Files

                        • #13
                          Hey,

                          I am looking to get all the gene coordinates for Pseudomonas aeruginosa genes. I had a look at UCSC GB, but I can't seem to find it there as a reference genome.

                          Does anyone know of any software via PubMed that would allow me to upload an Excel file containing the gene names, or to paste in the list of names?

                          I would really appreciate any help you can give me. I am a bit of a newbie with this kind of bioinformatics.

                          • #14
                            Go here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. Pick out the particular Pseudomonas strain you are interested in. Then go into that folder and get the ".gff" file (right click on the name and then save as, open in excel if you want). That will give you the gene coordinates (e.g. P. aeruginosa PAO1 ftp://ftp.ncbi.nlm.nih.gov/genomes/B.../NC_002516.gff)
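
If opening the file in Excel is not convenient, a lookup table can be built with a few lines of Python. This sketch assumes the ".gff" file has nine tab-separated columns and that gene rows carry a `locus_tag=` attribute (the PA-style identifiers); the function name and the demo line, including its coordinates, are invented for illustration.

```python
def locus_tag_index(gff_lines):
    """Return {locus_tag: (seqid, start, end, strand)} for 'gene' rows.

    Assumes key=value attributes separated by semicolons in column 9.
    """
    index = {}
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        tag = attrs.get("locus_tag")
        if tag:
            index[tag] = (cols[0], int(cols[3]), int(cols[4]), cols[6])
    return index


# Made-up line for illustration only (coordinates are not real).
demo_gff = [
    "NC_002516\tRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=geneX;locus_tag=PA_EXAMPLE\n",
]
index = locus_tag_index(demo_gff)
print(index.get("PA_EXAMPLE"))
```

Looking up each identifier from a gene list is then a plain dictionary access, and identifiers missing from the file come back as `None`.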

                            • #15
                              My genes are all listed in PA# format. Is there a way to convert these names into the NP format that PubMed uses?
