
  • de novo RNAseq contig clustering

    Hi there,

    I performed a de novo RNA-seq analysis using Oases and Trinity and ended up with a list of contigs.

    I now want to cluster the contigs by similarity to see how much redundancy I have. My idea is that if I have, say, 50k contigs and get 1 cluster, the redundancy is 100% because all detected transcripts are the same; conversely, if I get 50k clusters, the redundancy is 0% and all 50k contigs are different. What do you think?

    I thought of using blastclust, but apparently it has been removed from the latest BLAST installations. From the NCBI BLAST manual: "Please note that the NCBI C Toolkit applications seedtop and blastclust are not available in this release."

    Does anyone know where to get it, or whether there is another program I could use for this?

    Thanks for your help,

    Dave
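
A rough way to put a number on the redundancy idea in the post above: with n contigs and k clusters, one option is (n - k) / (n - 1), which gives 100% when everything collapses into one cluster and 0% when every contig is its own cluster. The formula and the function name are assumptions made for illustration, not something blastclust or CD-HIT reports, and the result will depend on the identity threshold used for clustering.

```python
def redundancy_percent(n_contigs: int, n_clusters: int) -> float:
    """Hypothetical redundancy score: 100.0 when all contigs fall into one
    cluster, 0.0 when every contig is its own cluster."""
    if n_contigs < 1 or not 1 <= n_clusters <= n_contigs:
        raise ValueError("need 1 <= n_clusters <= n_contigs")
    if n_contigs == 1:
        return 0.0  # a single contig cannot be redundant with anything
    return 100.0 * (n_contigs - n_clusters) / (n_contigs - 1)


if __name__ == "__main__":
    # The two extremes from the post, plus one case in between.
    for k in (1, 25_000, 50_000):
        print(k, "clusters ->", round(redundancy_percent(50_000, k), 2), "%")
```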

  • #2
    I'm not sure which version you are looking at, but the latest release of the C Toolkit BLAST (NOT BLAST+), which is 2.2.26, has blastclust. See ftp://ftp.ncbi.nih.gov//blast/execut...elease/2.2.26/

    As an alternative I have used CD-HIT very successfully for clustering de novo transcript assemblies.

    • #3
      Dear kmcarr,

      thanks for your help. I found blastclust and also tried CD-HIT as you suggested.

      Do you know if there are any guidelines on how to select a representative from each cluster? Is it possible to just pick one at random, since they are "similar" after all? Or maybe the longest one?

      Also, is there anything that can be done with the clusters that contain only one sequence? How should I handle them?

      Cheers,

      Dave
      Last edited by dnusol; 07-16-2012, 01:58 AM.

      • #4
        Hi again,

        Does anyone know if the maximum header length in the input FASTA file for CD-HIT is 20 characters? That seems rather short, doesn't it? Is there a way to increase it? I have 50 or so characters in my headers and I get this:

        >Cluster 7
        0 15913nt, >Locus_555_Transcrip... *
        1 10294nt, >Locus_555_Transcrip... at +/99.82%
        2 9400nt, >Locus_555_Transcrip... at +/95.45%
        3 15896nt, >Locus_555_Transcrip... at +/98.25%
        4 15511nt, >Locus_555_Transcrip... at +/99.52%
        5 9164nt, >Locus_555_Transcrip... at +/96.75%
        6 14825nt, >Locus_555_Transcrip... at +/98.37%
        7 7308nt, >Locus_555_Transcrip... at +/95.84%
        8 15877nt, >Locus_555_Transcrip... at +/98.34%

        So I cannot tell the sequences apart to choose the representative of each cluster.

        Cheers,

        Dave

        Edit: OK, so the -d flag seems to allow specifying a longer defline
        Last edited by dnusol; 07-25-2012, 12:15 AM. Reason: found answer
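
In the listing above, the representative of the cluster is the line ending in `*`, so once the names are no longer truncated it can be pulled out with a short script. Below is a minimal Python sketch, assuming the .clstr line layout shown in the post; `parse_clstr`, `representatives` and the sample contig names are made up for illustration.

```python
import re
from typing import Dict, Iterable, List, NamedTuple


class Member(NamedTuple):
    name: str
    length: int
    is_representative: bool


# Matches member lines such as "0 15913nt, >Locus_555_Transcrip... *"
MEMBER_RE = re.compile(r"^\s*\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s+(\*|at\s.+)$")


def parse_clstr(lines: Iterable[str]) -> Dict[str, List[Member]]:
    """Group .clstr member lines under their '>Cluster N' header."""
    clusters: Dict[str, List[Member]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = []
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            raise ValueError(f"unrecognised line: {line!r}")
        length, name, tail = match.groups()
        clusters[current].append(Member(name, int(length), tail == "*"))
    return clusters


def representatives(clusters: Dict[str, List[Member]]) -> Dict[str, str]:
    """Name of the '*'-marked sequence in each cluster."""
    return {
        cluster: next(m.name for m in members if m.is_representative)
        for cluster, members in clusters.items()
    }


# Hypothetical sample in the same layout as the listing above.
sample = """>Cluster 0
0 15913nt, >contig_A... *
1 10294nt, >contig_B... at +/99.82%
>Cluster 1
0 900nt, >contig_C... *
""".splitlines()

print(representatives(parse_clstr(sample)))
```

CD-HIT also writes the representative sequences to the FASTA file given with -o, so a script like this is mainly useful for inspecting cluster membership. In the listing above the representative is also the longest sequence in the cluster.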

        • #5
          Originally posted by dnusol
          Dear kmcarr,

          thanks for your help. I found blastclust and also tried CD-HIT as you suggested.

          Do you know if there are any guidelines on how to select a representative from each cluster? Is it possible to just pick one at random, since they are "similar" after all? Or maybe the longest one?

          Also, is there anything that can be done with the clusters that contain only one sequence? How should I handle them?

          Cheers,

          Dave
          I don't know if you are still working on the clustering, but here is what I have done with my de novo transcripts, which were generated by three different assembly algorithms: I clustered them using blastclust and then selected the representative of each cluster based on sequence length (the longest). Clusters that contain only one sequence I kept as they are.
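
The "longest sequence per cluster, singletons kept as they are" approach described above can be sketched in a few lines of Python. This assumes the cluster file has one cluster per line with whitespace-separated sequence IDs (which is how I understand blastclust output) and that those IDs match the first word of each FASTA header. The function names and the toy data are hypothetical.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each sequence ID (first word of the header) to its length."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            current = line[1:].split()[0]
            lengths[current] = 0
        elif line and current is not None:
            lengths[current] += len(line)
    return lengths


def pick_longest(cluster_lines: Iterable[str], lengths: Dict[str, int]) -> List[str]:
    """Return one ID per cluster: the longest member (first one wins ties).
    A singleton cluster simply returns its only member."""
    chosen = []
    for raw in cluster_lines:
        ids = raw.split()
        if ids:  # skip blank lines
            chosen.append(max(ids, key=lambda seq_id: lengths[seq_id]))
    return chosen


# Toy input: three transcripts, two clusters (the second is a singleton).
fasta = """>t1 assembler=oases
ACGTACGTAC
>t2 assembler=trinity
ACGTACGTACGTACG
>t3
ACGT
""".splitlines()
clusters = ["t1 t2", "t3"]

print(pick_longest(clusters, read_fasta_lengths(fasta)))
```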

          • #6
            USEARCH might also be an option:



            After clustering at any level of ID, you can output either a consensus sequence or a centroid sequence for each cluster.

            • #7
              Originally posted by themerlin
              USEARCH might also be an option:



              After clustering at any level of ID, you can output either a consensus sequence or a centroid sequence for each cluster.
              Thanks for the info. Do you know what the optimum identity (id) value in USEARCH would be for clustering de novo transcripts generated by different assemblers?

              • #8
                I think that this will require some testing. Start high and work down until you hit the sweet spot for your analysis.
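
To illustrate what "start high and work down" looks like, here is a toy Python sketch that sweeps the identity threshold and counts the resulting clusters. It is not USEARCH: the identity measure is a crude ungapped comparison and the clustering is a simple longest-first greedy scheme, both assumptions chosen only to keep the example self-contained. With real data you would rerun the clustering tool at each identity level and tabulate the cluster counts the same way.

```python
from typing import Dict, List


def toy_identity(a: str, b: str) -> float:
    """Ungapped, position-by-position identity over the shorter sequence.
    A stand-in for a real alignment, for illustration only."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def greedy_cluster(seqs: List[str], threshold: float) -> List[List[str]]:
    """Longest-first greedy clustering: each sequence joins the first
    centroid it matches at >= threshold, otherwise it starts a new cluster."""
    clusters: List[List[str]] = []
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if toy_identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def sweep(seqs: List[str], thresholds: List[float]) -> Dict[float, int]:
    """Number of clusters obtained at each identity threshold."""
    return {t: len(greedy_cluster(seqs, t)) for t in thresholds}


# Made-up sequences; real transcripts would come from the assemblies.
toy = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACC", "CCCCCCCCCC", "CCCCCCCCCG"]
for t, n in sweep(toy, [1.0, 0.9, 0.8]).items():
    print(f"id >= {t:.2f}: {n} clusters")
```

The point where the cluster count stops dropping sharply as the threshold is lowered is one candidate for the "sweet spot".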

                • #9
                  Originally posted by themerlin
                  USEARCH might also be an option:



                  After clustering at any level of ID, you can output either a consensus sequence or a centroid sequence for each cluster.
                  Could you tell me which option in blastclust you would use to output the consensus sequence? I searched all the options but couldn't find one.

                  Thanks
                  Upendra

                  • #10
                    Hi all,

                    I got an output from cd-hit-est as follows.
                    >Cluster 1
                    0 1997nt, >Locus_3753_Transcript_3/6_Confidence_0.182_Length_1997_UP10_UP11... at 208:1784:3900:5486/+/92.67%
                    1 15188nt, >Locus_416_Transcript_101/105_Confidence_0.255_Length_15188_UP1... at 11777:1:4159:15952/-/85.81%
                    2 15605nt, >Locus_2273_Transcript_25/30_Confidence_0.598_Length_15605_UP7... at 3700:15605:4159:16064/+/100.00%
                    3 16064nt, >Locus_2273_Transcript_26/30_Confidence_0.576_Length_16064_UP7... *
                    4 15812nt, >Locus_2273_Transcript_30/30_Confidence_0.598_Length_15812_UP7... at 1844:15812:2097:16064/+/99.90%
                    5 1973nt, >Locus_1056_Transcript_4/7_Confidence_0.185_Length_1973_UP4... at 340:1760:4052:5486/+/93.33%
                    6 15398nt, >Locus_2370_Transcript_21/28_Confidence_0.628_Length_15398_UP2... at 2321:14533:2883:15100/+/99.27%

                    In the above, what does the trailing information tell us? For example: at 2321:14533:2883:15100/+/99.27%.

                    What does each individual number mean?

                    Thanks,
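
Reading the numbers against the sequence lengths in the listing, the trailing field looks like start:end on the clustered sequence, then start:end on the representative, then the strand and the percent identity. For example, sequence 2 (15605 nt) is aligned from 3700 to 15605, and the representative (16064 nt) from 4159 to 16064; for the minus-strand hit the first pair runs backwards (11777 to 1). I believe this is the alignment overlap that cd-hit-est prints when run with -p 1, but treat the field names below as my interpretation. A small Python sketch to split the field up:

```python
import re
from typing import NamedTuple


class Overlap(NamedTuple):
    query_start: int  # position on the clustered sequence
    query_end: int
    rep_start: int    # position on the cluster representative
    rep_end: int
    strand: str
    identity: float   # percent identity


TAIL_RE = re.compile(r"^at (\d+):(\d+):(\d+):(\d+)/([+-])/(\d+(?:\.\d+)?)%$")


def parse_tail(tail: str) -> Overlap:
    """Split a tail such as 'at 2321:14533:2883:15100/+/99.27%' into fields."""
    match = TAIL_RE.match(tail.strip())
    if match is None:
        raise ValueError(f"unexpected tail: {tail!r}")
    q_start, q_end, r_start, r_end, strand, identity = match.groups()
    return Overlap(int(q_start), int(q_end), int(r_start), int(r_end),
                   strand, float(identity))


print(parse_tail("at 2321:14533:2883:15100/+/99.27%"))
```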
