Hey all,
I have the following problem. I have a plasmid sequence database (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/) that is heavily redundant. I have been trying to remove the redundancy and obtain a set of representative sequences using cd-hit-est (http://weizhong-lab.ucsd.edu/cd-hit/...hit_user_guide) as follows:
Code:
cd-hit-est -i fastadb -o outfilename -c 0.95 -n 9 -g 1
This produces two files: one containing the clusters and another containing the representative sequences.
Now to my problem: removing the redundancy from the database does not seem to work. Two sequences that are 100% identical over 100% of their length (they have the same length) end up in different clusters instead of the same one. I have checked this by aligning the sequences with BLAST, and, as stated above, they are identical.
The output clustering file looks like this:
Code:
>Cluster 39
0	6222nt, >gi|410475454|ref|NC... *
>Cluster 40
0	6211nt, >gi|387504713|ref|NC... at +/98.10%
1	6222nt, >gi|41056918|ref|NC_... *
2	6222nt, >gi|118480566|ref|NC... at +/98.09%
>Cluster 41
0	6222nt, >gi|844749291|ref|NZ... *
The sequences that are 6222 bp long are at least 99% identical, so they should end up in the same cluster.
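In case it helps anyone reproduce this, here is a minimal Python sketch for spotting the suspicious clusters in the cluster file (as far as I know, cd-hit writes it next to the output file with a .clstr extension). It groups cluster IDs by the length of their representative sequence, so several clusters whose representatives have the same length stand out. The function names `parse_clstr` and `representatives_by_length` are my own helpers, not part of cd-hit, and the parser assumes the line layout shown in the excerpt above.

```python
import re
from collections import defaultdict

# One member line of a .clstr file, e.g.
#   "1  6222nt, >gi|41056918|ref|NC_... *"
# Assumption: names are truncated and followed by "...", as in the excerpt.
MEMBER = re.compile(r"^(\d+)\s+(\d+)(?:nt|aa),\s+>(\S+?)\.\.\.\s+(.*)$")


def parse_clstr(lines):
    """Return {cluster_id: [(length, name, is_representative), ...]}."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 40"
            current = int(line.split()[1])
            continue
        match = MEMBER.match(line)
        if match and current is not None:
            _index, length, name, status = match.groups()
            # The representative of a cluster is marked with "*"
            clusters[current].append((int(length), name, status.strip() == "*"))
    return dict(clusters)


def representatives_by_length(clusters):
    """Group cluster ids by the length of their representative sequence."""
    by_length = defaultdict(list)
    for cluster_id, members in clusters.items():
        for length, _name, is_rep in members:
            if is_rep:
                by_length[length].append(cluster_id)
    return dict(by_length)


# The excerpt from the post, used here as sample input.
example = """\
>Cluster 39
0 6222nt, >gi|410475454|ref|NC... *
>Cluster 40
0 6211nt, >gi|387504713|ref|NC... at +/98.10%
1 6222nt, >gi|41056918|ref|NC_... *
2 6222nt, >gi|118480566|ref|NC... at +/98.09%
>Cluster 41
0 6222nt, >gi|844749291|ref|NZ... *
""".splitlines()

result = representatives_by_length(parse_clstr(example))
print(result)  # {6222: [39, 40, 41]}
```

On a real run you would pass the lines of the .clstr file (for example `open("outfilename.clstr")`) instead of `example`. For the excerpt, all three clusters turn out to have a 6222 nt representative, which is exactly the situation I do not understand.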
Does anyone know what the problem here might be? Am I missing something?
Thanks in advance!