  • Directly analyze GO numbers

    I want to perform a gene ontology analysis on a list of significant genes using the GO numbers directly and not the sequences/gene IDs.

    I started using Blast2GO, but the program first runs all of your sequences through NCBI BLAST, which is a very time-consuming process. This seemed unnecessary to me because I already know my gene IDs. I decided to pull GO numbers directly from an online database, and successfully achieved this in 1 day as opposed to 7+...

    I now have gene ontology information (GO:0005515, GO:0009540, etc.) for several thousand genes but cannot find a tool to analyze the meaning & distribution of these numbers directly. It shouldn't be too hard, because all I have really done is skip the first, and most time-intensive, part of the Blast2GO process.
    Something that could provide graphical output would obviously be ideal, but it's not absolutely necessary.

    Any help is much appreciated!

  • #2
    There are a couple of tools available, but the best (in my opinion) are:
    - GOrilla for human, mouse, rat, C. elegans (http://cbl-gorilla.cs.technion.ac.il/)
    - AgriGO for all plant stuff (http://bioinfo.cau.edu.cn/agriGO/analysis.php)
    - GOeast for all non-model stuff (http://omicslab.genetics.ac.cn/GOEAS...microarray.php)

    All of them are easy to handle and produce nice visual outputs. With AgriGO and GOeast you may also create your own database, so you're not restricted to the available datasets (if this is relevant for you).


    • #3
      Thanks, I am working with tomato, so I used agriGO and the results seem good, but I'm not quite there yet.

      Do you have experience with agriGO? I would be interested to hear how you have used it in the past because I am having a little bit of trouble interpreting the output. I like that this tool provides very specific GO annotations but the bar graphs that it gives leave a lot to be desired. The stats that it gives seem a little strange to me too...

      I also used WEGO (http://wego.genomics.org.cn/cgi-bin/wego/index.pl), which, despite providing pretty-looking graphs, has disappointed me so far. It only provides very vague categories and doesn't really allow for much customization of the x-axis (even though they claim that it does).

      Thanks in advance to anyone who can help me out or suggest a tool to use.


      • #4
        Concerning the settings:
        I have used AgriGO for analyzing Arabidopsis and fungal microarray data.
        For Arabidopsis I ran the Parametric Analysis of Gene Set Enrichment (PAGE) with "Hochberg (FDR)" for adjustment. For SEA I had to use a customized annotation reference, as I designed the chip based on the TAIR10 release, which is not available yet. For the statistical methods I used "hypergeometric" and "Hochberg (FDR)", but this may be different for you depending on the size of your dataset.
        For the fungal data I only ran SEA with a custom annotation reference and the same settings as for Arabidopsis.

        Concerning the output:
        I never used the bar chart, as I like the other one better. The stats displayed in the individual boxes are as follows (I just use data from the provided example for SEA):
        GO:0050896(1.52e-05) <- This is your FDR-corrected significance value. If it is below your set threshold (default: 0.05), the box will be coloured. The lower the value, the higher the significance (the box becomes more red). These values should be equal to the values in the "Detail information" table.
        response to stimulus <- the GO name
        49/168 | 3107/22479 <- The two values to the right of the forward slashes "/" are the number of genes in your input (168) and the number of genes in the background reference (22479 <- all A. thaliana genes). These values never change between boxes. The two values to the left are the number of genes annotated with this specific GO term (in this case "response to stimulus") in your input (49) and in the reference (3107).
        I hope this helps. If you have other questions, please provide some example data and use it to explain what exactly "seems a little strange".
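
To make the four numbers in a box more concrete, here is a minimal Python sketch of a hypergeometric enrichment test on the example values (49/168 | 3107/22479). It only illustrates the general idea behind the "hypergeometric" setting and is not agriGO's actual code. The function name is made up for this example, and the result is the raw p-value before any FDR adjustment, so it is not expected to match the adjusted value printed in the box.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are annotated with the GO term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    # Exact integer arithmetic first, converted to float only at the end.
    return float(Fraction(hits, total))


# Values from the example box: 49/168 | 3107/22479
expected = 168 * 3107 / 22479  # genes expected in the input by chance
p_raw = hypergeom_upper_tail(49, 168, 3107, 22479)
print(f"expected about {expected:.1f} genes, observed 49, raw p = {p_raw:.2e}")
```

The comparison of "expected" against "observed" is the intuition behind the box: the term is enriched because far more input genes carry it than the background frequency would predict.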


        • #5
          All genes under a GO Term

          I have a similar issue... I have some specific GO categories in mind and I would like to get a list of my genes that match those categories. I have about 35,000 expressed genes (in the form of Entrez Gene ID numbers). Any simple solutions for mapping these genes against a single or small set of selected GO terms (i.e. which genes are involved in GO:0006950 response to stress and GO:0007568 aging)?


          • #6
            @cacti:
            Where are these 35k genes stored and how do you access them? Because if they are in the NCBI database and you can select them via the NCBI search bar, you could simply add a GO-name field (e.g. "response to stress"[GO]) to your search.
            If you have a txt file of your genes with associated GO numbers/names, a simple "grep" (assuming you're familiar with Linux command-line programs) would be the fastest solution.
            Otherwise please provide an example of your desired input and output first.
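
For anyone who prefers a script to grep, the same filtering idea can be sketched in Python. This assumes a tab-separated text file with one gene/GO pair per line. The three-column layout, the gene names and the function name are made up for illustration.

```python
import re


def grep_go(lines, go_ids):
    """Return the lines that mention any of the given GO IDs."""
    pattern = re.compile("|".join(re.escape(g) for g in go_ids))
    return [line for line in lines if pattern.search(line)]


# Hypothetical annotation table: gene ID, GO ID, GO name (tab-separated).
example = [
    "gene_001\tGO:0006950\tresponse to stress",
    "gene_002\tGO:0005515\tprotein binding",
    "gene_003\tGO:0007568\taging",
]
for hit in grep_go(example, ["GO:0006950", "GO:0007568"]):
    print(hit)
```

On the command line, something like `grep -F -e 'GO:0006950' -e 'GO:0007568' yourfile.txt` (file name hypothetical) does the same job.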


            • #7
              @WhatsOEver:

              Thanks... they are in a text file of results parsed from a -blastx against the NCBI database, then I used a mapping file to get Entrez Gene IDs from the associated gi numbers. My text file has columns for:
              (1) contig # from de novo transcriptome assembly
              (2) Entrez Gene ID
              (3) annotation (i.e. sodium channel, actin-binding protein, etc)
              (4) raw reads, etc

              I don't have GO terms for each gene. I used GOseq to find overrepresented GO categories, but since this is a non-model species with no published genome, I couldn't figure out a good way to reverse map to find which of my genes are in each category.

              So now I have some GO categories of interest and I want to find which of my genes are involved.


              • #8
                Sorry, but if you used GOseq, you had to provide GO mappings to the method?!

                Originally posted by GOseq Manual
                goseq obtains length data from UCSC and GO mappings from the organism packages (see getgo and getlength for details). If your data is in an unsupported format you will need to obtain the GO category mapping and supply them to the goseq function using the gene2cat argument.
                How did you calculate overrepresentation without a GO mapping?


                • #9
                  I supplied my length data and used the database for the most closely related organism. It gave me some interesting leads for biological processes to look into... now I want to work with my whole set of genes.


                  • #10
                    Ah, OK, so you're actually working with the annotations of a reference, and your Entrez Gene IDs are those of the "closest relative"?!

                    There are then 2 possibilities:
                    1) (the easy way): Here is a link to a post in the Bioconductor forum (https://stat.ethz.ch/pipermail/bioco...er/041019.html) which I used some time ago to do what you want (I was, however, working with an organism from the species package). What you are looking for in particular is the reversemapping function:
                    genes2go=getgo(names(YourGeneData),'hg19','ensGene')
                    go2genes=goseq:::reversemapping(genes2go)
                    2) (the more accurate way): Although option 1 will get you where you want, I would suggest running a complete analysis of your gene set using Blast2GO (http://www.blast2go.com/b2ghome). You just need your genes in FASTA format. Within the program you then perform (a) blastx vs NCBI, (b) GO mapping, (c) GO annotation, (d) InterPro scan, (e) merging of the InterPro GOs with the existing ones. The drawback of the method is that it may take you up to 2 weeks to finish everything with 35k genes. It is possible to speed it up by splitting your data and running multiple Blast2GO instances in parallel (the individual Blast2GO projects can afterwards be merged in the program). You should, however, not be too greedy, because if the BLAST server gets too many requests from the same IP you may be blocked for some time.

                    The reason I favour 2 over 1 is that with option 1 you're only working with proteins that have homologs in your relative. Running a complete analysis of your genes would give you a more complete list.
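
For readers outside R, the idea behind the reverse mapping in option 1 can be sketched in a few lines of Python: invert a gene-to-GO dictionary into a GO-to-gene dictionary. This is only an illustration of the concept, not goseq's implementation, and the gene IDs are hypothetical.

```python
from collections import defaultdict


def reverse_mapping(genes2go):
    """Turn {gene: [GO IDs]} into {GO ID: [genes]}."""
    go2genes = defaultdict(list)
    for gene, go_ids in genes2go.items():
        for go_id in go_ids:
            go2genes[go_id].append(gene)
    return dict(go2genes)


# Hypothetical gene IDs, for illustration only.
genes2go = {
    "geneA": ["GO:0006950", "GO:0005515"],
    "geneB": ["GO:0006950"],
    "geneC": [],  # genes without annotation simply drop out
}
go2genes = reverse_mapping(genes2go)
print(go2genes["GO:0006950"])
```

Looking up a category of interest in the inverted dictionary then gives the list of genes annotated with it.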


                    • #11
                      Hello to all and @WhatsOEver,
                      The sample I have been working with is also a non-model species. After finding differentially expressed unigenes in its transcriptome data, I used Blast2GO and did exactly what @WhatsOEver described above [(a) blastx vs NCBI, (b) GO mapping, (c) GO annotation, (d) InterPro scan, (e) merging of InterPro GOs]. Then I went into the Blast2GO charts menu to get results such as which biological processes are most enriched for the up-regulated transcripts. What I want to know is which transcripts/unigenes (by their IDs in the FASTA file) go into which biological process. How can I do that? Any advice?


                      • #12
                        It sounds like you're looking to reverse map from your GO category to the list of your genes that are in that category. How did you do your GO mapping initially? Can you just do a grep search to pull out your GO categories (and the associated genes) of interest?

                        If you don't have a category mapping file BUT you do have a common gene ID (like Ensembl or Entrez Gene ID), you can make one by using a flat file from NCBI (and merge functions in R or a simple Python script) to link genes to their GO categories. Then do the search to pull out only your categories of interest.
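
A rough Python sketch of that merge step, assuming a tab-separated results table whose first two columns are contig and Entrez Gene ID (as described earlier in the thread) and a list of (gene ID, GO ID) pairs taken from a flat file. The function name, contig names and gene IDs are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def contigs_by_category(table, gene_go_pairs, wanted):
    """Map each wanted GO ID to the contigs whose Entrez Gene ID carries it."""
    go_by_gene = defaultdict(set)
    for gene_id, go_id in gene_go_pairs:
        if go_id in wanted:
            go_by_gene[gene_id].add(go_id)

    result = {go_id: [] for go_id in wanted}
    for row in csv.reader(table, delimiter="\t"):
        contig, gene_id = row[0], row[1]
        for go_id in sorted(go_by_gene.get(gene_id, ())):
            result[go_id].append(contig)
    return result


# Hypothetical table: contig, Entrez Gene ID, annotation, raw reads.
table = io.StringIO(
    "contig_1\t1001\tsodium channel\t250\n"
    "contig_2\t1002\tactin-binding protein\t90\n"
    "contig_3\t1001\tsodium channel\t12\n"
)
pairs = [("1001", "GO:0006950"), ("1002", "GO:0007568"), ("1002", "GO:0005515")]
hits = contigs_by_category(table, pairs, {"GO:0006950", "GO:0007568"})
print(hits)
```

Several contigs can share one gene ID, so a category may list more contigs than genes.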

                        HTH


                        • #13
                          Note that there are a couple of large flat files, if you are programmatically inclined.

                          You can use wget and gunzip to get "gene2go" and "gene_info" from NCBI.

                          wget -nc ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
                          wget -nc ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
                          gunzip gene2go.gz gene_info.gz

                          head -1 gene2go
                          #Format: tax_id GeneID GO_ID Evidence Qualifier GO_term PubMed Category (tab is used as a separator, pound sign - start of a comment)
                          head -1 gene_info
                          #Format: tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome map_location description type_of_gene Symbol_from_nomenclature_authority Full_name_from_nomenclature_authority Nomenclature_status Other_designations Modification_date (tab is used as a separator, pound sign - start of a comment)
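
As a sketch of how the gene2go columns listed in the header above could be read in Python: the function below skips the "#" comment line, splits on tabs, and optionally keeps only one taxon. The function name and the sample rows are made up for illustration. The sample is compressed in memory only so that the example runs without downloading anything.

```python
import gzip
import io


def read_gene2go(handle, tax_id=None):
    """Yield (GeneID, GO_ID, GO_term) from gene2go lines, optionally for one taxon."""
    for line in handle:
        if line.startswith("#"):  # header / comment line
            continue
        fields = line.rstrip("\n").split("\t")
        if tax_id is not None and fields[0] != str(tax_id):
            continue
        yield fields[1], fields[2], fields[5]


# Tiny made-up sample in the same column layout, gzip-compressed in memory.
sample = (
    "#Format: tax_id GeneID GO_ID Evidence Qualifier GO_term PubMed Category\n"
    "3702\t814629\tGO:0006950\tIEA\t-\tresponse to stress\t-\tProcess\n"
    "9606\t1234\tGO:0007568\tIEA\t-\taging\t-\tProcess\n"
)
compressed = gzip.compress(sample.encode())
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    records = list(read_gene2go(handle, tax_id=3702))
print(records)
```

For the real download, the same function should work on a handle from gzip.open on the .gz file in text mode, or from a plain open() if the file has already been decompressed.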


                          • #14
                            Hi @cacti and @Richard Finney,
                            All I have is multiple transcript sequences (from an RNA-seq assembly), so I used the Blast2GO software to BLAST them against NCBI nr, then ran the InterPro scan (which assigns the sequences to their corresponding GO terms based on their domains) and merged the GOs. I did all this on a Win7 machine, and I only get the graphical results of the GO enrichment. But I want to know which transcripts go into which biological process. I only have sequences, no GO mapping and no Ensembl or Entrez Gene IDs. Could you please give me some advice on how I could do that?
                            Last edited by kurban910; 08-03-2015, 04:04 AM.


                            • #15
                              Is there a manual for Blast2GO?

                              Have you tried the command line version:
