  • Fernas
    Member
    • Apr 2013
    • 74

    Get gene coordinates for a list of human gene names

    Dear all,

    I have a list of gene names; for example, the first five are:
    ABCB1
    ABCG2
    ACHE
    ACVR2A
    ACVR2B

    I want to get the genomic coordinates of these genes on the human genome (hg19). I want one record per gene, so I do not want information about exons, UTRs, etc. For example, I want the output to look as follows (or in any format, e.g. BED, GTF, GFF):
    ABCB1 gene chr1 120434 134324 +
    ABCG2 gene chr1 324312 393431 -
    ...etc

    I tried to query the list at UCSC, but it gave me information about all exons and UTRs, and nothing at the gene level.
  • mikesh
    Member
    • Jul 2012
    • 29

    #2
    Go to the UCSC Genome Browser (GB), browse the tables, and download the RefGene track. It has genomic coordinates and a "name2" field that holds the HGNC gene symbol. Of course, each gene can have several transcripts (NM_* identifiers), so you either use all of them or only the longest one (aka canonical).
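A minimal Python sketch of the "longest transcript per gene" option, assuming the table was exported as tab-separated text with a header row that contains name, chrom, strand, txStart, txEnd and name2. The function name `longest_transcript_per_gene` and the tiny example table are hypothetical, and the coordinates are made up. "Longest" here means the largest txEnd - txStart span, which is only one possible reading of "canonical".

```python
import csv
import io


def longest_transcript_per_gene(handle):
    """Return {gene symbol: row}, keeping the transcript with the largest span."""
    reader = csv.DictReader(handle, delimiter="\t")
    best = {}
    for row in reader:
        span = int(row["txEnd"]) - int(row["txStart"])
        kept = best.get(row["name2"])
        # Keep the first transcript seen when two spans are equal.
        if kept is None or span > kept[0]:
            best[row["name2"]] = (span, row)
    return {gene: row for gene, (span, row) in best.items()}


# Made-up rows for illustration only (not real transcripts or coordinates).
EXAMPLE = (
    "#bin\tname\tchrom\tstrand\ttxStart\ttxEnd\tname2\n"
    "0\tNM_fake1\tchr1\t+\t100\t500\tGENEA\n"
    "0\tNM_fake2\tchr1\t+\t100\t900\tGENEA\n"
    "1\tNM_fake3\tchr2\t-\t300\t400\tGENEB\n"
)

for gene, row in longest_transcript_per_gene(io.StringIO(EXAMPLE)).items():
    # One line per gene: name, feature, chromosome, start, end, strand.
    print(gene, "gene", row["chrom"], row["txStart"], row["txEnd"], row["strand"])
```

For a real export, pass an open file handle instead of the `io.StringIO` wrapper.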


    • Fernas
      Member
      • Apr 2013
      • 74

      #3
      Thanks @mikesh.
      I think you mean "download RefSeq track" instead of "download RefGene track", correct?

      I followed the steps you mentioned above and got what I wanted. However, I am wondering whether there is a tool on Galaxy or elsewhere that picks the longest (canonical) transcript from the output, so that I have one record per gene.

      Many thanks!


      • jameslz
        Member
        • Nov 2009
        • 20

        #4
        @Fernas,

        For hg19, export the "RefSeq Gene track" and merge the overlapping transcripts of the same gene; you will get what you want.

        For GRCh37.p13, download "ref_GRCh37.p13_top_level.gff3.gz" from NCBI FTP ( ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/GFF/ )


        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          #5
          Originally posted by Fernas View Post
          I followed the steps that you mentioned above and got what I want. However, I am wondering if there is any tool on galaxy or others that give the longest transcript (canonical) from the outputs, so, i have one record per gene.

          many thanks!
          You are going to have to do some parsing if you need only one record per gene. As people have pointed out, you can get the GFF or GTF (ftp://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens). Then you could do the following (assuming that your file of gene names is "genes"):

          Code:
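          $ grep -w -F -f genes ensembl.gtf > gtf_records_you_need   # ensembl.gtf stands for your downloaded GFF/GTF file (use logical file names here); -w -F match the names as whole words and fixed strings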
          Then you will have to parse the resulting file to get the longest entry (if that is what you need) in the exact format you need.
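One way the parsing step could look in Python, as a sketch under stated assumptions: the file is a tab-separated GTF whose ninth column carries a `gene_name "..."` attribute, and "one record per gene" is taken to mean the smallest start and largest end over all records of that gene (rather than the longest single entry). The function name `gene_spans` and the example lines are hypothetical, with made-up coordinates. Matching the attribute value exactly also avoids picking up ABCB10 when only ABCB1 is wanted.

```python
import re

# Pulls the value out of: gene_name "ABCB1";
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gene_spans(gtf_lines, wanted):
    """Return {(gene, chrom, strand): (min start, max end)} for the wanted genes."""
    spans = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = GENE_NAME.search(fields[8])
        if match is None or match.group(1) not in wanted:
            continue
        key = (match.group(1), fields[0], fields[6])
        start, end = int(fields[3]), int(fields[4])
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    return spans


# Made-up records for illustration only.
EXAMPLE_GTF = [
    '7\tprotein_coding\texon\t100\t200\t.\t-\t.\tgene_id "G1"; transcript_id "T1"; gene_name "ABCB1";\n',
    '7\tprotein_coding\texon\t700\t900\t.\t-\t.\tgene_id "G1"; transcript_id "T2"; gene_name "ABCB1";\n',
    '1\tprotein_coding\texon\t50\t80\t.\t+\t.\tgene_id "G2"; transcript_id "T3"; gene_name "ABCB10";\n',
]

for (gene, chrom, strand), (start, end) in gene_spans(EXAMPLE_GTF, {"ABCB1"}).items():
    print(gene, "gene", chrom, start, end, strand)
```

With real data, `wanted` would be the set of stripped lines read from the "genes" file, and `gtf_lines` an open handle on the GTF.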


          • Fernas
            Member
            • Apr 2013
            • 74

            #6
            Thanks a lot @jameslz and @GenoMax.

            If I want one record per gene, what is the best strategy for querying the output file? Should I find the longest entry, or merge overlapping entries (using the bedtools mergeBed tool)?


            • jameslz
              Member
              • Nov 2009
              • 20

              #7
              I wrote a Perl script to parse the information from the "RefSeq Gene track".

              The result looks like:
              #gene chromosome chromosome_length locus transcript_number transcripts transcript_location
              FIBCD1 chr9 141213431 133777824-133814455 2 NM_032843;NM_001145106 133777824-133814239|133777824-133814455


              • Fernas
                Member
                • Apr 2013
                • 74

                #8
                Thanks @jameslz for your reply.

                Is this script available on the web?
                One more question: how did you define the locus start/end positions? Is the locus start the start position of the transcript closest to the chromosome start, and the locus end the end position of the furthest transcript?


                • jameslz
                  Member
                  • Nov 2009
                  • 20

                  #9
                  @Fernas
                  I use the following procedure:

                  1. sort all transcripts of the same gene by location.
                  2. overlap and merge
                  3. use the leftmost position and rightmost position.
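The three steps above can be sketched in Python as follows. This is not the Perl script itself, only one reading of the procedure: transcripts are grouped by gene and chromosome, sorted, merged when they overlap, and each merged cluster is reported with its leftmost and rightmost position. The function name `merge_transcripts` is hypothetical; the FIBCD1 coordinates are the two transcript locations quoted earlier in this thread.

```python
from collections import defaultdict


def merge_transcripts(transcripts):
    """transcripts: iterable of (gene, chrom, start, end).

    Returns a list of (gene, chrom, locus_start, locus_end), one entry per
    cluster of overlapping transcripts.
    """
    groups = defaultdict(list)
    for gene, chrom, start, end in transcripts:
        groups[(gene, chrom)].append((start, end))

    loci = []
    for key in sorted(groups):
        intervals = sorted(groups[key])          # step 1: sort by location
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:                 # step 2: overlapping, so merge
                cur_end = max(cur_end, end)
            else:                                # gap: close the current locus
                loci.append(key + (cur_start, cur_end))
                cur_start, cur_end = start, end
        loci.append(key + (cur_start, cur_end))  # step 3: leftmost and rightmost
    return loci


fibcd1 = [
    ("FIBCD1", "chr9", 133777824, 133814239),
    ("FIBCD1", "chr9", 133777824, 133814455),
]
print(merge_transcripts(fibcd1))
```

A gene whose transcripts fall into non-overlapping clusters, or on more than one chromosome, yields more than one locus in this sketch, so the result is not guaranteed to be exactly one record per gene.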

                  I can give the perl script and the final result. [email protected]


                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    #10
                    If you are willing please post the script here. (Use Edit --> Go Advanced --> Then use the "paper clip" icon to attach the file to a post).

                    This way you would be helping others who may have a need for something similar.


                    • jameslz
                      Member
                      • Nov 2009
                      • 20

                      #11
                      @GenoMax, OK!
                      The Perl script takes the UCSC RefSeq track as input (exported in the "all fields from selected table" format)

                      such as:

                      #bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
                      0 NM_032291 chr1 + 66999824 67210768 67000041 67208778 25 66999824,67091529,67098752,67101626,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755, 67000051,67091593,67098777,67101698,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67210768, 0 SGIP1 cmpl cmpl 0,1,2,0,0,0,1,0,0,0,1,2,1,1,1,1,0,1,1,2,2,0,2,1,1,
                      1 NM_032785 chr1 - 48998526 50489626 48999844 50489468 14 48998526,49000561,49005313,49052675,49056504,49100164,49119008,49128823,49332862,49511255,49711441,50162984,50317067,50489434, 48999965,49000588,49005410,49052838,49056657,49100276,49119123,49128913,49332902,49511472,49711536,50163109,50317190,50489626, 0 AGBL4 cmpl cmpl 2,2,1,0,0,2,1,1,0,2,0,1,1,0,

                      usage: perl track_trans.pl hg19_refGene.tbl hg19_refGene_trans.tbl
                      Attached Files
                      Last edited by jameslz; 11-17-2013, 08:34 PM.


                      • gringer
                        David Eccles (gringer)
                        • May 2011
                        • 845

                        #12
                        Originally posted by GenoMax View Post
                        If you are willing please post the script here. (Use Edit --> Go Advanced --> Then use the "paper clip" icon to attach the file to a post).

                        This way you would be helping others who may have a need for something similar.
                        Here's my attempt [attached], which parses GTF output as linked by GenoMax. It might work with GFF files as well, but I haven't tested that (changes may be needed for the regular expression on line 112). It's a little bit over-engineered due to being derived from something else, for hackability purposes, and because I'm trying to get used to this pod documentation stuff. The usual "there will be bugs" disclaimer applies. Here's the command line syntax:

                        Code:
                        $ ./gtf2genePos.pl -help
                        Usage:
                            ./gtf2genePos.pl <lookup GTF file>\n";
                        
                            output:
                              a CSV file containing gene names, and locations
                        
                          Basic Options:
                            -summarise
                              Produce gene summaries, rather than individual region information
                        
                            -list *file*
                              Filter gene names by including only genes from this list file
                        
                            -help
                              Show this help message
                        
                            -v
                              increase verbosity of output
                        Attached Files


                        • KE8
                          Junior Member
                          • May 2015
                          • 3

                          #13
                          Hey,

                          I am looking to get the coordinates of all Pseudomonas aeruginosa genes. I had a look at the UCSC GB, but I can't seem to select it as a reference genome there.

                          Does anyone know of any software via PubMed that would allow me to upload an Excel file containing the gene names, or paste in the list of those names?

                          I would really appreciate any help you can provide me. I am a bit of a newbie with this bioinformatics technique.


                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #14
                            Go here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. Pick out the particular Pseudomonas strain you are interested in. Then go into that folder and get the ".gff" file (right-click on the name and choose "save as"; open it in Excel if you want). That will give you the gene coordinates (e.g. P. aeruginosa PAO1 ftp://ftp.ncbi.nlm.nih.gov/genomes/B.../NC_002516.gff)
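As an alternative to opening the file in Excel, here is a small Python sketch that pulls gene coordinates out of a GFF file. It assumes a nine-column, tab-separated GFF3 layout with `key=value` attributes in the last column, and that the "gene" lines carry a `locus_tag` attribute; that attribute name is an assumption to check against the downloaded file. The function name `gff_gene_coordinates` and the example lines (including their coordinates) are made up for illustration.

```python
def gff_gene_coordinates(gff_lines, key="locus_tag"):
    """Return {attribute value: (seqid, start, end, strand)} for 'gene' features."""
    coords = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        # Attributes look like: ID=gene0;Name=abc;locus_tag=XY0001
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if key in attrs:
            coords[attrs[key]] = (fields[0], int(fields[3]), int(fields[4]), fields[6])
    return coords


# Made-up lines for illustration only (not real coordinates).
EXAMPLE_GFF = [
    "##gff-version 3\n",
    "NC_002516\tRefSeq\tgene\t100\t200\t.\t+\t.\tID=gene0;Name=geneA;locus_tag=PA0001\n",
    "NC_002516\tRefSeq\tCDS\t100\t200\t.\t+\t0\tID=cds0;Parent=gene0\n",
    "NC_002516\tRefSeq\tgene\t300\t450\t.\t-\t.\tID=gene1;Name=geneB;locus_tag=PA0002\n",
]

for tag, (seqid, start, end, strand) in gff_gene_coordinates(EXAMPLE_GFF).items():
    print(tag, "gene", seqid, start, end, strand)
```

Passing `key="Name"` instead would index the same coordinates by the Name attribute, if the file uses that for gene names.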


                            • KE8
                              Junior Member
                              • May 2015
                              • 3

                              #15
                              My genes are all listed in PA# format. Is there a way to convert these names into the NP format that PubMed uses?
