  • retrieve gene name

    Hi everybody,
    I'm Fabio, and I wanted to ask whether anybody knows how to retrieve gene names from genomic locations. Up to now I've been using the UCSC Genome Browser, but it's quite tedious to retrieve the gene names one by one! Thanks a lot,
    Fabio

  • #2
    Hey Fabio,

    Keep in mind I'm not a programmer, so I'm sure someone else here has a better solution! But it's pretty easy to retrieve gene names (or anything, really) using the Table Browser at UCSC combined with some simple Perl. I've used the following subroutine to get info about any gene (from the "knownGene" table) given the chromosome, start and end position. It will undoubtedly need updating as it's a few years old, and it could certainly be coded better (it uses LWP::Simple).

    Code:
    use strict;
    use warnings;
    use LWP::Simple;    # exports get()

    # Look up the knownGene record for a region. Returns a hash reference for
    # the first matching record, or undef if the request fails or nothing matches.
    sub knownGene {
        my ($chr, $start, $end) = @_;
        my %knowngene;
        my $location = "chr" . $chr . ":" . $start . "-" . $end;

        # URL and assembly (hg16) are as originally posted and may need updating.
        my $p = "http://genome.cse.ucsc.edu/cgi-bin/hgText?";
        my $q = "db=hg16&table=hg16.knownGene&phase=Get+all+fields&position=$location&submit=submit&";

        my $c = get($p . $q);
        return undef unless defined $c;    # request failed

        foreach my $line (split /\n/, $c) {
            next if $line =~ /^\#/;    # skip comment/header lines

            if ($line =~ /^(\w+)\s+(\w+)\s+([-+])\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+([\w\,]+)\s+([\w\,]+)/) {
                $knowngene{'name'}       = $1;
                $knowngene{'chr'}        = $2;
                $knowngene{'strand'}     = $3;
                $knowngene{'txStart'}    = $4;
                $knowngene{'txEnd'}      = $5;
                $knowngene{'cdsStart'}   = $6;
                $knowngene{'cdsEnd'}     = $7;
                $knowngene{'exonCount'}  = $8;
                $knowngene{'exonStarts'} = $9;
                $knowngene{'exonEnds'}   = $10;
                return \%knowngene;    # stop at the first matching record
            }
        }
        return undef;    # no line matched
    }
    I can't do it right now, but it's pretty easy to adapt this to read in a list of "chr:XXXXXX-YYYYYY" data and output the genes. Hope that helps.
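A minimal sketch of that adaptation, for what it's worth. It assumes one "chr:XXXXXX-YYYYYY" string per entry and a lookup routine with the same calling convention as knownGene above. The names parse_location, annotate_locations and the stand-in lookup are made up for illustration, and the stand-in avoids any network access:

```perl
use strict;
use warnings;

# Parse a location such as "chr1:1000-2000" into (chromosome, start, end).
# Returns an empty list when the string does not match that pattern.
sub parse_location {
    my ($location) = @_;
    if ($location =~ /^\s*chr(\w+):(\d+)-(\d+)\s*$/) {
        return ($1, $2, $3);
    }
    return;
}

# Annotate a list of locations. $lookup is a code reference called as
# $lookup->($chr, $start, $end); it should return a hash reference with a
# 'name' key, or undef when nothing is found.
sub annotate_locations {
    my ($lookup, @locations) = @_;
    my @rows;
    foreach my $location (@locations) {
        my ($chr, $start, $end) = parse_location($location);
        unless (defined $chr) {
            push @rows, [$location, 'unparsable'];
            next;
        }
        my $gene = $lookup->($chr, $start, $end);
        push @rows, [$location, $gene ? $gene->{'name'} : 'no gene found'];
    }
    return @rows;
}

# Stand-in lookup with hypothetical data, used here instead of a web request.
my $fake_lookup = sub {
    my ($chr, $start, $end) = @_;
    return { name => 'geneA' } if $chr eq '1' && $start < 5000;
    return undef;
};

foreach my $row (annotate_locations($fake_lookup, 'chr1:1000-2000', 'chrX:10-20', 'bad input')) {
    print join("\t", @$row), "\n";
}
```

To use it with the subroutine above, pass \&knownGene in place of the stand-in lookup.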



    • #3
      Hi ECO, thank you very much for your reply. My problem is that I'm not familiar with Perl scripting, so I'll start to learn it. Until now I have worked only in R and Bioconductor, but unfortunately I didn't find any package that handles ChIP-seq data properly. Sorry for the stupid question... where do you insert the Perl code?



      • #4
        Dear Fabio,

        this question is more complex than it seems at first glance.
        With the large numbers of regions that come out of an NGS experiment, many regions won't fall into annotated regions. Then, is the gene name really what you want, or is it rather the transcript, or exon, or promoter, or UTR, or..., or...
        NGS is not always strand-specific, so you need to look at both the sense and the antisense strand, both upstream and downstream.

        An easy way to get all this annotation for a BED file is RegionMiner.

        If you are interested in just the gene names overlapping your regions, ECO's script might help.
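To illustrate the strand point, here is a small sketch that labels a region as overlapping, upstream or downstream of a gene. It assumes half-open coordinates and that "upstream" means before the transcription start on the gene's own strand. The function name relative_position and the numbers are hypothetical, not taken from RegionMiner or any other tool:

```perl
use strict;
use warnings;

# Describe where a region lies relative to a gene on the same chromosome.
# Coordinates are treated as half-open [start, end); strand is '+' or '-'.
sub relative_position {
    my ($region_start, $region_end, $gene_start, $gene_end, $strand) = @_;

    # Any shared base counts as an overlap.
    return 'overlap' if $region_start < $gene_end && $gene_start < $region_end;

    # True when the region lies entirely at lower coordinates than the gene.
    my $left_of_gene = $region_end <= $gene_start;

    # Transcription starts at the low-coordinate end for '+' genes and at
    # the high-coordinate end for '-' genes.
    if ($strand eq '+') {
        return $left_of_gene ? 'upstream' : 'downstream';
    }
    return $left_of_gene ? 'downstream' : 'upstream';
}

print relative_position(100, 200, 500, 900, '+'), "\n";
print relative_position(100, 200, 500, 900, '-'), "\n";
print relative_position(450, 600, 500, 900, '+'), "\n";
```

The same region is reported as upstream of a '+' gene and downstream of a '-' gene, which is why unstranded data needs both checks.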

        Cheers

        Klaus



        • #5
          Hi Kmay,
          thank you very much for your help. I tried to use RegionMiner (Genomatix), but my BED file (raw data) was too big, around 60 MB, and the server told me that I could not upload it. I then uploaded the .wig file (analyzed by someone else) to the UCSC browser and downloaded it as a BED file, but the Table Browser did not include the data points, only the chromosomal locations. Do you know what I can do?



          • #6
            Fabio,

            before uploading the data, you have to cluster the raw data into regions of significant tag enrichment. Annotating the raw data will most likely give you almost every gene in the genome.
            You cannot upload all the raw tags to the online version for visualization or annotation (as I said, the latter does not seem very useful to me). For that you would need to have a GGA on site.
            Our clustering is available only on the GGA.
            However, you might give Shirley Liu's MACS a try and upload the clustered regions afterwards.
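To make the idea of clustering concrete, here is a deliberately naive sketch. It is not MACS and not the GGA clustering, and it does no significance testing. It simply groups tag positions on one chromosome that lie within an assumed maximum gap of each other, and keeps groups with at least an assumed minimum number of tags. The function name cluster_tags, the thresholds and the positions are all hypothetical:

```perl
use strict;
use warnings;

# Group tag positions from one chromosome into clusters. Consecutive tags
# no more than $max_gap apart share a cluster; clusters with fewer than
# $min_tags tags are discarded.
sub cluster_tags {
    my ($positions, $max_gap, $min_tags) = @_;
    my @sorted = sort { $a <=> $b } @$positions;
    my (@clusters, @current);

    foreach my $pos (@sorted) {
        # A gap larger than $max_gap closes the current cluster.
        if (@current && $pos - $current[-1] > $max_gap) {
            push @clusters, [@current] if @current >= $min_tags;
            @current = ();
        }
        push @current, $pos;
    }
    push @clusters, [@current] if @current >= $min_tags;

    # Report each cluster as [first position, last position, tag count].
    return map { [ $_->[0], $_->[-1], scalar @$_ ] } @clusters;
}

my @regions = cluster_tags([100, 120, 130, 5000, 9000, 9010, 9025, 9030], 50, 3);
print join("\t", @$_), "\n" for @regions;
```

A real peak caller models the background and assigns significance, so treat this only as an illustration of why a few regions, rather than every raw tag, get annotated.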

            Cheers

            Klaus



            • #7
              Originally posted by fabio25
              Hi ECO, thank you very much for your reply. My problem is that I'm not familiar with Perl scripting, so I'll start to learn it. Until now I have worked only in R and Bioconductor, but unfortunately I didn't find any package that handles ChIP-seq data properly. Sorry for the stupid question... where do you insert the Perl code?
              Hey Fabio. Klaus is right, there are more comprehensive solutions out there, but they are costly and rarely let you do the exact analysis you need.

              If you are interested in learning Perl (which will undoubtedly help you at some point), there are a ton of great free resources out there for learning it, like this one: http://www.perl.com/pub/a/2000/10/begperl1.html

              You'll need some sort of interpreter if you're working on Windows... ActivePerl is a good place to start. Good luck!



              • #8
                Hi Fabio,
                I have never used Galaxy for NGS data, but you could give it a try:
                Galaxy is a community-driven web-based analysis platform for life science research.
                gabriele bucci



                • #9
                  by fabio

                  Hi Gbucci,
                  thanks a lot for your advice. I tried to work with it once, but it doesn't work so well with custom tracks in wig format. It's probably me, and I'll try again. May I ask what you usually use?



                  • #10
                    Fabio,

                    can I get your data via FTP? I'll do a quick run on it and send you the results. It will take about 15 minutes.

                    If it helps...

                    Klaus



                    • #11
                      If your organism is in Ensembl, you can use the BioMart tool to extract genes (or other elements) by location.



                      • #12
                        Hi dcfargo,
                        I did that, but it's not very precise. When I do it in R, it also returns the neighbouring genes. I'll probably have to try it on the website.



                        • #13
                          Hi Kmay,
                          I would like to do that, but the data are not mine and I cannot send them.
                          However, I'm determined to find an open-source way to deal with these data. If I can't, I'll work with the GGA, as you suggested before. Thank you very much.



                          • #14
                            Fabio,

                            Galaxy and the UCSC Table Browser should do exactly what you need. Break the problem into simple steps before trying to do it all in one go. I would:

                            1) Choose a subset of my query data, e.g. grep -w "chr1" file.bed > chr1.mydata.bed
                            2) Go to the UCSC Table Browser
                            3) Select the gene table
                            4) Select the Union/Intersection option
                            5) Intersect chr1.mydata.bed with the Genes track
                            6) Output the intersection results in comma- or tab-separated format
                            7) Import the file into MS Excel or some other spreadsheet program

                            If this works, then you just need to generalize it to your whole dataset rather than trying to do too many steps at once. This is only one possible solution, and there are probably more elegant open-source methods.
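For anyone who would rather do the intersection step locally, here is a rough sketch of the same logic. It assumes the regions and a gene table have already been read into memory, and it treats coordinates as half-open like BED. The function names and the tiny data set are invented for illustration; the Table Browser itself may apply different overlap rules:

```perl
use strict;
use warnings;

# True when half-open intervals [s1, e1) and [s2, e2) share at least one base.
sub overlaps {
    my ($s1, $e1, $s2, $e2) = @_;
    return ($s1 < $e2 && $s2 < $e1) ? 1 : 0;
}

# Intersect regions with genes, chromosome by chromosome. Both arguments
# are array references of hash references: regions carry chr/start/end,
# genes carry name/chr/start/end. Returns tab-separated result lines.
sub intersect_regions_with_genes {
    my ($regions, $genes) = @_;

    # Index the genes by chromosome so each region is compared only
    # against genes on its own chromosome.
    my %genes_by_chr;
    push @{ $genes_by_chr{ $_->{chr} } }, $_ for @$genes;

    my @hits;
    foreach my $region (@$regions) {
        foreach my $gene (@{ $genes_by_chr{ $region->{chr} } || [] }) {
            next unless overlaps($region->{start}, $region->{end},
                                 $gene->{start},   $gene->{end});
            push @hits, join("\t", @{$region}{qw(chr start end)}, $gene->{name});
        }
    }
    return @hits;
}

# Hypothetical example data.
my @regions = (
    { chr => 'chr1', start => 1000, end => 1500 },
    { chr => 'chr1', start => 9000, end => 9500 },
    { chr => 'chr2', start => 1200, end => 1300 },
);
my @genes = (
    { name => 'geneA', chr => 'chr1', start => 1400, end => 3000 },
    { name => 'geneB', chr => 'chr2', start => 5000, end => 6000 },
);
print "$_\n" for intersect_regions_with_genes(\@regions, \@genes);
```

The tab-separated lines can be written to a file and opened in a spreadsheet, as in step 7.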



                            • #15
                              Originally posted by fabio25
                              Hi Gbucci,
                              thanks a lot for your advice. I tried to work with it once, but it doesn't work so well with custom tracks in wig format. It's probably me, and I'll try again. May I ask what you usually use?
                              Hi Fabio,
                              when I deal with a long list of [chr\tstart\tend\tstrand] genomic coordinates, I use a Perl script much like the one ECO suggested. The script parses your file, reads in the coordinates and passes them to the remote UCSC database using a MySQL query.
                              I'm quite sure there is a Bioconductor way of doing it, but I can't tell you more since I have never tried it. You may want to have a look at the BioC mailing list.
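A sketch of how the MySQL approach could look, limited to building the query so that it runs without a database connection. It assumes the knownGene column names name, chrom, strand, txStart and txEnd, and the function name overlap_query and the sample lines are hypothetical. Connection details for UCSC's public MySQL server are not shown and should be taken from UCSC's own documentation:

```perl
use strict;
use warnings;

# Build a parameterised SQL query for genes overlapping one region.
# Returns the SQL text followed by the values for its placeholders.
sub overlap_query {
    my ($chr, $start, $end) = @_;
    my $sql = 'SELECT name, chrom, strand, txStart, txEnd FROM knownGene '
            . 'WHERE chrom = ? AND txStart < ? AND txEnd > ?';
    # Bind order matches the placeholders: chromosome, region end, region start.
    return ($sql, $chr, $end, $start);
}

# Tab-separated "chr start end strand" lines, as they might be read from a file.
my @lines = ("chr1\t1000\t2000\t+", "chr2\t500\t900\t-");
foreach my $line (@lines) {
    my ($chr, $start, $end, $strand) = split /\t/, $line;
    my ($sql, @bind) = overlap_query($chr, $start, $end);
    print "$sql  [@bind]\n";
    # With DBI and a MySQL driver installed, and $dbh an open handle, the
    # query could be run roughly as:
    #   my $sth = $dbh->prepare($sql);
    #   $sth->execute(@bind);
}
```

Using placeholders instead of pasting the coordinates into the SQL string keeps the query text the same for every region.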

                              Ask if you need help with Perl scripting.

                              My Best

                              G
                              gabriele bucci
