SEQanswers

Old 06-27-2012, 09:03 AM   #1
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default de novo RNAseq contig clustering

Hi there,

I performed a de novo RNA-seq analysis using Oases and Trinity and ended up with a list of contigs.

I now want to cluster the contigs by similarity to see how much redundancy I have. My idea is that if I have, say, 50k contigs and get 1 cluster, the redundancy is 100% because all detected transcripts are the same; conversely, if I get 50k clusters, the redundancy is 0% and all 50k contigs are different. What do you think?
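
One way to turn that idea into a number is sketched below. The linear scaling between the two endpoints is an assumption made for illustration (the post only fixes 1 cluster = 100% and one cluster per contig = 0%), and the function name `redundancy_percent` is hypothetical.

```python
def redundancy_percent(n_contigs: int, n_clusters: int) -> float:
    """Redundancy as a percentage.

    100 when all contigs collapse into a single cluster,
    0 when every contig is its own cluster.
    The linear interpolation in between is an assumption.
    """
    if n_contigs < 2:
        raise ValueError("need at least two contigs")
    if not 1 <= n_clusters <= n_contigs:
        raise ValueError("n_clusters must be between 1 and n_contigs")
    return 100.0 * (n_contigs - n_clusters) / (n_contigs - 1)


print(redundancy_percent(50_000, 1))       # one cluster for all contigs
print(redundancy_percent(50_000, 50_000))  # one cluster per contig
```

Note that the value depends on the identity threshold used for clustering, so it is only comparable between runs that used the same settings.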

I thought of using blastclust, but apparently it has been removed from the latest BLAST installations. From the NCBI BLAST manual: "Please note that the NCBI C Toolkit applications seedtop and blastclust are not available in this release."

Does anyone know where to get it or if there is another program I could use to achieve this?

Thanks for your help,

Dave
Old 06-27-2012, 12:22 PM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

I'm not sure which version you are looking at, but the latest release of the C toolkit BLAST (NOT BLAST+), which is 2.2.26, does include blastclust. See ftp://ftp.ncbi.nih.gov//blast/execut...elease/2.2.26/

As an alternative I have used CD-HIT very successfully for clustering de novo transcript assemblies.
Old 07-16-2012, 01:09 AM   #3
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

Dear kmcarr,

thanks for your help. I found blastclust and also tried CD-HIT as you suggested.

Do you know if there are any guidelines on how to select a representative from each cluster? Is it possible just to pick one at random, since they are "similar" after all? Or maybe the longest one?

Also, is there anything that can be done with the clusters that contain only one sequence? How should I handle them?

Cheers,

Dave

Last edited by dnusol; 07-16-2012 at 01:58 AM.
Old 07-23-2012, 11:08 PM   #4
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

Hi again,

Does anyone know if the maximum header length in the input FASTA file for CD-HIT is 20 characters? That seems rather short, doesn't it? Is there a way to increase it? I have 50 or so characters in my headers, and they are truncated in the cluster file like this:

>Cluster 7
0 15913nt, >Locus_555_Transcrip... *
1 10294nt, >Locus_555_Transcrip... at +/99.82%
2 9400nt, >Locus_555_Transcrip... at +/95.45%
3 15896nt, >Locus_555_Transcrip... at +/98.25%
4 15511nt, >Locus_555_Transcrip... at +/99.52%
5 9164nt, >Locus_555_Transcrip... at +/96.75%
6 14825nt, >Locus_555_Transcrip... at +/98.37%
7 7308nt, >Locus_555_Transcrip... at +/95.84%
8 15877nt, >Locus_555_Transcrip... at +/98.34%

With the names cut off like this, I cannot tell the sequences apart, so I cannot choose the representative of each cluster.

Cheers,

Dave

Edit: OK, so the -d flag seems to allow specifying a longer defline
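
For anyone finding this later, here is a minimal sketch of building a cd-hit-est call with a longer defline. The file names, the 0.95 identity and the word size of 10 are assumptions for illustration; as far as I understand the CD-HIT usage text, `-d` sets the description length written to the .clstr file (default 20) and `-d 0` keeps the name up to the first whitespace, but check the help output of your own version.

```python
import shlex


def cd_hit_est_command(fasta_in, out_prefix, identity=0.95, word_size=10, desc_len=0):
    """Build the argument list for a cd-hit-est run.

    fasta_in / out_prefix are hypothetical file names.
    desc_len is passed to -d; 0 is assumed to mean
    "keep the header up to the first whitespace".
    """
    return [
        "cd-hit-est",
        "-i", fasta_in,        # input FASTA
        "-o", out_prefix,      # output prefix (.clstr is written next to it)
        "-c", str(identity),   # sequence identity threshold
        "-n", str(word_size),  # word size
        "-d", str(desc_len),   # description length in the .clstr file
    ]


cmd = cd_hit_est_command("contigs.fa", "contigs_cdhit")
print(shlex.join(cmd))
# To actually run it (requires cd-hit-est on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```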

Last edited by dnusol; 07-25-2012 at 12:15 AM. Reason: found answer
Old 10-30-2012, 02:10 PM   #5
upendra_35
Senior Member
 
Location: USA

Join Date: Apr 2010
Posts: 102
Default

Quote:
Originally Posted by dnusol View Post
Dear kmcarr,

thanks for your help. I found blastclust and also tried CD-HIT as you suggested.

Do you know if there are any guidelines on how to select a representative from each cluster? Is it possible just to pick one at random, since they are "similar" after all? Or maybe the longest one?

Also, is there anything that can be done with the clusters that contain only one sequence? How should I handle them?

Cheers,

Dave
I don't know if you are still working on the clustering, but here is what I did with my de novo transcripts, which were generated by three different assembly algorithms: I clustered them using blastclust and then selected the representative of each cluster based on length (the longest sequence). Clusters that contain only one sequence I kept as they are.
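
The same "longest member wins, singletons kept as is" rule can be sketched for a CD-HIT style .clstr file like the one posted earlier in the thread. This is only an illustration, not the blastclust workflow described above; the contig names in the sample text are made up, and the parser assumes full names are present (no truncation).

```python
import re
from collections import defaultdict

# Matches member lines such as: "1  10294nt, >contig_B... at +/99.82%"
MEMBER = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s")


def parse_clstr(lines):
    """Return {cluster_name: [(length, sequence_name), ...]}."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            current = line[1:]
            continue
        match = MEMBER.match(line)
        if match and current is not None:
            clusters[current].append((int(match.group(1)), match.group(2)))
    return dict(clusters)


def longest_per_cluster(clusters):
    """Pick the longest member of each cluster (first one wins on ties)."""
    return {name: max(members, key=lambda m: m[0])[1]
            for name, members in clusters.items()}


# Made-up sample in the .clstr layout
sample = """>Cluster 0
0 15913nt, >contig_A... *
1 10294nt, >contig_B... at +/99.82%
>Cluster 1
0 800nt, >contig_C... *
""".splitlines()

clusters = parse_clstr(sample)
representatives = longest_per_cluster(clusters)
singletons = [name for name, members in clusters.items() if len(members) == 1]
print(representatives)
print(singletons)
```

A singleton simply ends up as its own representative, which matches keeping it as it is.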
Old 10-30-2012, 03:42 PM   #6
themerlin
Member
 
Location: Flagstaff, AZ

Join Date: Feb 2010
Posts: 51
Default

USEARCH might also be an option:

http://www.drive5.com/usearch/

After clustering at any level of ID, you can output either a consensus sequence or a centroid sequence for each cluster.
Old 11-15-2012, 11:36 AM   #7
upendra_35
Senior Member
 
Location: USA

Join Date: Apr 2010
Posts: 102
Default

Quote:
Originally Posted by themerlin View Post
USEARCH might also be an option:

http://www.drive5.com/usearch/

After clustering at any level of ID, you can output either a consensus sequence or a centroid sequence for each cluster.
Thanks for the info. Do you know what the optimum identity threshold in USEARCH would be for clustering de novo transcripts generated by different assemblers?
Old 11-15-2012, 02:32 PM   #8
themerlin
Member
 
Location: Flagstaff, AZ

Join Date: Feb 2010
Posts: 51
Default

I think that this will require some testing. Start high and work down until you hit the sweet spot for your analysis.
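
A rough sketch of that "start high and work down" loop, kept tool-agnostic on purpose. `run_clustering` is a hypothetical callable that you would write yourself to run your clustering tool at one identity threshold and return the number of clusters; the `toy_run` stand-in below exists only so the sketch executes and its numbers mean nothing.

```python
from typing import Callable, List, Tuple


def sweep_identity(run_clustering: Callable[[float], int],
                   start: float = 0.99,
                   stop: float = 0.90,
                   step: float = 0.01) -> List[Tuple[float, int]]:
    """Call run_clustering at descending identity thresholds.

    Returns a list of (identity, n_clusters) pairs, highest identity first.
    """
    n_steps = int(round((start - stop) / step))
    results = []
    for k in range(n_steps + 1):
        identity = round(start - k * step, 4)  # avoid float drift in labels
        results.append((identity, run_clustering(identity)))
    return results


def toy_run(identity: float) -> int:
    """Placeholder only: NOT real clustering, just lets the loop run."""
    return round(1000 * identity)


results = sweep_identity(toy_run)
for identity, n_clusters in results:
    print(f"{identity:.2f}\t{n_clusters}")
```

Looking at how the cluster count changes from one threshold to the next is one way to judge where the "sweet spot" lies for a given data set.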
Old 01-03-2013, 09:04 AM   #9
upendra_35
Senior Member
 
Location: USA

Join Date: Apr 2010
Posts: 102
Default

Quote:
Originally Posted by themerlin View Post
USEARCH might also be an option:

http://www.drive5.com/usearch/

After clustering at any level of ID, you can output either a consensus sequence or a centroid sequence for each cluster.
Could you tell me which option in blastclust you would use to output the consensus sequence? I searched all the options but couldn't find one.

Thanks
Upendra
Old 09-29-2013, 10:16 PM   #10
sivasubramani
Member
 
Location: India

Join Date: Apr 2011
Posts: 14
Default

Hi all,

I got the following output from cd-hit-est:
>Cluster 1
0 1997nt, >Locus_3753_Transcript_3/6_Confidence_0.182_Length_1997_UP10_UP11... at 208:1784:3900:5486/+/92.67%
1 15188nt, >Locus_416_Transcript_101/105_Confidence_0.255_Length_15188_UP1... at 11777:1:4159:15952/-/85.81%
2 15605nt, >Locus_2273_Transcript_25/30_Confidence_0.598_Length_15605_UP7... at 3700:15605:4159:16064/+/100.00%
3 16064nt, >Locus_2273_Transcript_26/30_Confidence_0.576_Length_16064_UP7... *
4 15812nt, >Locus_2273_Transcript_30/30_Confidence_0.598_Length_15812_UP7... at 1844:15812:2097:16064/+/99.90%
5 1973nt, >Locus_1056_Transcript_4/7_Confidence_0.185_Length_1973_UP4... at 340:1760:4052:5486/+/93.33%
6 15398nt, >Locus_2370_Transcript_21/28_Confidence_0.628_Length_15398_UP2... at 2321:14533:2883:15100/+/99.27%

In the above, what does the trailing information tell us? For example: at 2321:14533:2883:15100/+/99.27%.

What does each individual number mean?

Thanks,
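
A possible reading of that field, offered as an assumption rather than a confirmed answer: the four numbers look like alignment coordinates, start and end on the listed sequence followed by start and end on the cluster representative (the line marked with *), then the strand and the percent identity. This fits the example above, where 15100 does not exceed the representative's 16064 nt and 14533 does not exceed the member's 15398 nt. The field names in the sketch below are hypothetical.

```python
import re
from typing import NamedTuple


class ClstrHit(NamedTuple):
    member_start: int   # assumed: alignment start on the listed sequence
    member_end: int     # assumed: alignment end on the listed sequence
    rep_start: int      # assumed: alignment start on the representative
    rep_end: int        # assumed: alignment end on the representative
    strand: str         # "+" or "-"
    identity: float     # percent identity over the aligned region


FIELD = re.compile(r"at (\d+):(\d+):(\d+):(\d+)/([+-])/([\d.]+)%")


def parse_hit(text: str) -> ClstrHit:
    """Split the trailing 'at a:b:c:d/strand/identity%' field of a .clstr line."""
    match = FIELD.search(text)
    if match is None:
        raise ValueError(f"no alignment field found in: {text!r}")
    a, b, c, d, strand, identity = match.groups()
    return ClstrHit(int(a), int(b), int(c), int(d), strand, float(identity))


hit = parse_hit("at 2321:14533:2883:15100/+/99.27%")
print(hit)
# Aligned span on each sequence; abs() covers minus-strand hits,
# where the member coordinates run from high to low.
print(abs(hit.member_end - hit.member_start) + 1)
print(abs(hit.rep_end - hit.rep_start) + 1)
```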