12-07-2016, 02:41 AM   #9
Markiyan
Senior Member
 
Location: Cambridge

Join Date: Sep 2010
Posts: 115

Quote:
Originally Posted by Brian Bushnell
There's no support for that planned, but nothing technically preventing it. However, Clumpify is not a universal compression utility - it will only increase compression when there is coverage depth (meaning, redundant information). So, for a big 10GB file of amino acid sequences - if they were all different proteins, there would not be redundant information, and they would not compress; on the other hand, if there were many copies of the same proteins from different but very closely-related organisms, or different isoforms of the same proteins scattered around randomly in the file, then Clumpify would group them together, which would increase compression.
OK, so clustering amino acid sequences with the current version of Clumpify would mean:
1. parse the FASTA and reverse-translate to DNA, using a single codon for each amino acid;
2. save as nucleotide FASTQ;
3. run Clumpify;
4. parse the FASTQ and translate back to protein;
5. save as amino acid FASTA.
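
Steps 1, 2 and 4 could look roughly like the Python sketch below. The codon choices, function names and the constant dummy quality character are my own assumptions for illustration, not anything Clumpify requires. It also assumes the residues are the 20 standard amino acids (plus `*` for stop) and that Clumpify returns the reads unchanged apart from their order.

```python
# One arbitrary (assumed) codon per amino acid, taken from the standard genetic code.
AA_TO_CODON = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGT", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAT", "P": "CCT", "Q": "CAA", "R": "CGT",
    "S": "TCT", "T": "ACT", "V": "GTT", "W": "TGG", "Y": "TAT",
    "*": "TAA",
}
# Inverse table; valid because each amino acid maps to exactly one codon.
CODON_TO_AA = {codon: aa for aa, codon in AA_TO_CODON.items()}


def reverse_translate(protein):
    """Step 1: amino acids -> DNA, one fixed codon per residue.

    Raises KeyError on residues outside the table (e.g. X, B, Z).
    """
    return "".join(AA_TO_CODON[aa] for aa in protein.upper())


def translate(dna):
    """Step 4: DNA -> amino acids, reading codons in frame from position 0.

    Raises KeyError if a codon is not one of those used by reverse_translate.
    """
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def fastq_record(name, dna, qual_char="I"):
    """Step 2: format one FASTQ record with a constant dummy quality string."""
    return f"@{name}\n{dna}\n+\n{qual_char * len(dna)}\n"


if __name__ == "__main__":
    dna = reverse_translate("MKVLA")
    print(fastq_record("protein_1", dna), end="")
    print(translate(dna))
```

Because each amino acid always maps to the same codon, identical protein stretches become identical DNA stretches, so the round trip is lossless as long as the reads come back unmodified.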