  • Clustering annotated sequences based on their GO terms

    Dear all,

    I have a set of 10000 sequences from an RNAseq experiment annotated with GO terms. I would like to cluster the sequences into biologically meaningful groups using the GO term information for each sequence. Is there any software to do that?

    Pau

    Thank you!

  • #2
    Hi,
    Out of curiosity, are these (10000) sequences already clustered (de novo clustering), or do they represent only the sequences with GO terms from your assembly?

    I had written down exactly this in my to-do list. I bookmarked this page to explore in the future. I have not used it, but it might help:


    The grouping algorithm is based on the hypothesis that similar annotations should have similar gene members.

    HTH


    • #3
      Hi Apexy,

      No, I have not yet clustered the sequences. I have just annotated the sequences obtained from the assembly using Blast2go. Now I would like to go one step further and cluster the sequences based on their GO terms in order to obtain groups of genes involved in similar functions.
      Thanks, I have already had a look at the DAVID website. I think it could be a good option, but the web version only accepts 3000 sequences at a time and I would like to cluster all the sequences at the same time... I will keep searching for alternative websites.

      Thank you for your answer!!

      Pau


      • #4
        You could probably write some small bash script. What kind of separators are you using in your headers? Which field is GO? Are there line-breaks in your sequences?
        savetherhino.org


        • #5
          Hi,
          It's better that you have not done any clustering on them before annotation, since de novo clustering sometimes assigns different transcripts from paralogous genes to the same cluster, and for species with extensive gene duplications it can be a potential nightmare. Are these functional labels from annotation transfer with BLAST, with InterPro, or both in Blast2go? Can I also ask what fraction of the entire assembly these 10,000 sequences represent, and which database Blast2go was set to if you used BLAST?

          @rhinoceros, a cluster should be defined by the degree of overlap in the GO terms shared by sequences. This will certainly introduce a new challenge: what threshold of shared GO terms is required to put sequences in one cluster? Do you mean using cat, cut, sort and grep in a loop to write a clustering algorithm?

          Thanks,
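
For what it's worth, the overlap idea can be sketched in a few lines of Python. This is only a minimal illustration, not what DAVID or Blast2go do internally: it assumes the annotations have already been parsed into a dictionary mapping each sequence ID to its set of GO terms, uses the Jaccard index as the overlap measure, and links sequences single-linkage style above a cutoff. The function names, the 0.5 threshold and the toy data are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_go_overlap(annotations, threshold=0.5):
    """Link two sequences when the Jaccard index of their GO term sets is
    >= threshold, then return the connected components (single linkage)
    as sorted lists of sequence IDs."""
    # Union-find forest: every sequence starts as its own cluster.
    parent = {seq_id: seq_id for seq_id in annotations}

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair of sequences (quadratic in the number of sequences).
    for a, b in combinations(annotations, 2):
        if jaccard(annotations[a], annotations[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for seq_id in annotations:
        clusters.setdefault(find(seq_id), []).append(seq_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy input: sequence ID -> set of GO terms.
example = {
    "seq1": {"GO:0006412", "GO:0003735"},
    "seq2": {"GO:0006412", "GO:0003735", "GO:0005840"},
    "seq3": {"GO:0006810"},
    "seq4": {"GO:0006810", "GO:0016020"},
}
clusters = cluster_by_go_overlap(example, threshold=0.5)
print(clusters)
```

The all-against-all comparison means roughly 50 million pairs for 10,000 sequences, so this is slow but workable on a local machine. The threshold question stays open: a higher cutoff gives many small clusters, a lower one merges them.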


          • #6
            Originally posted by Apexy
            Hi,
            @rhinoceros, a cluster should be defined by the degree of overlap in the GO terms shared by sequences. This will certainly introduce a new challenge: what threshold of shared GO terms is required to put sequences in one cluster? Do you mean using cat, cut, sort and grep in a loop to write a clustering algorithm?
            I thought the aim was to sort sequences so that in file Z there would be all the sequences that had GO X in their header. It's not really clustering at all but sorting. But anyway, maybe I misunderstood OP.
            Last edited by rhinoceros; 04-29-2013, 02:55 AM.
            savetherhino.org


            • #7
              Originally posted by rhinoceros
              I thought the aim was to sort sequences so that in file Z there would be all the sequences that had GO X in their header. It's not really clustering at all but sorting. But anyway, maybe I misunderstood OP.
              This would have been an appealing solution if each sequence had only one GO term.
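
To illustrate the point, here is a small Python sketch of the sorting approach. It assumes the headers have already been parsed into a dictionary of sequence ID to GO terms; the function name and the toy data are hypothetical. With several GO terms per sequence, the same sequence ends up in several groups, so the result is an index by GO term and not a partition into clusters.

```python
from collections import defaultdict


def group_by_go_term(annotations):
    """Build an inverted index: GO term -> list of sequence IDs carrying it."""
    groups = defaultdict(list)
    for seq_id, go_terms in annotations.items():
        for term in sorted(go_terms):
            groups[term].append(seq_id)
    return dict(groups)


# Hypothetical toy input: seqA carries two GO terms, seqB only one.
example = {
    "seqA": {"GO:0006412", "GO:0005840"},
    "seqB": {"GO:0005840"},
}
groups = group_by_go_term(example)
for term in sorted(groups):
    print(term, groups[term])
```

Here seqA is listed under both of its GO terms, which is exactly why sorting alone does not give one cluster per sequence.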


              • #8
                Hi Apexy and rhinoceros, thank you for the information. Yes, Apexy is right in the sense that each sequence has more than one GO term, and this makes the process more complex. The annotations comprise GO terms, motifs (InterProScan) and enzyme codes. All of them came from the top 10 hits of a blastx search against the nr database from NCBI with a threshold of 10e-6.
                From 16000 sequences I got significant BLAST hits for 14000 sequences. For these sequences I then performed the different annotation steps and ended up with around 10000 annotated sequences. Now, as you say, I want to cluster these 10000 sequences using the information coming from the annotations. I tried DAVID and BABELOMICS, but they limit the number of sequences they can run at a time. I was wondering whether there is any program based on R or UNIX to do that locally...
