SEQanswers

01-09-2009, 05:49 AM   #1
foolishbrat
Member
 
Location: South East Asia

Join Date: Nov 2008
Posts: 44
Tag (string) Compression Techniques

Does anybody know of a string compression technique (in R or Perl)?

In particular, we would like to compress tags of length 35 and above
so that we can store them in a data structure for further processing.
Note also that we are talking about ~10 million tags to process.

I have an implementation in R that converts a tag to a numerical value,
but it gets an overflow error when handling tags of length > 30.

Code:
# Encode fixed-length tags as base-4 numbers (a=0, c=1, g=2, t=3), offset by 1.
# Ambiguity codes are collapsed onto one base (n -> a; s, y, b -> c; k -> g).
tagsequence2tagnum <- function (tags, length) 
{
    new.tags <- tolower(unlist(strsplit(as.character(tags), "")))
    # Any character outside the recognised alphabet is treated as "n"
    new.tags[!(new.tags %in% c("a", "g", "c", "t", "s", "y", "b", "k"))] <- "n"
    # One column per tag, one row per position
    new.tags <- matrix(as.numeric(chartr("acgtnsybk", "012301112", 
        new.tags)), nrow = length)
    colSums(new.tags * 4^((length - 1):0)) + 1
}

# Decode tag numbers back to sequences of the given length
tagnum2tagsequence <- function (tags, length) 
{
    # Extract the base-4 digits, most significant first; one row per tag
    new.tags <- t(matrix((rep(tags - 1, each = length) %/% 4^((length - 1):0)) %% 4, nrow = length))
    new.tags <- apply(new.tags, 1, paste, collapse = "")
    chartr("0123", "acgt", new.tags)
}
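
One possible way around the length limit, sketched under assumptions: R stores these tag numbers as doubles, which hold integers exactly only up to 2^53, so a base-4 code stays exact for at most 26 bases. Splitting each tag into chunks of 26 bases or fewer and keeping one number per chunk avoids that. The names encode_chunks and decode_chunks are hypothetical helpers, and the sketch assumes tags contain only a, c, g and t.

```r
# Hypothetical helper: encode one tag as a vector of base-4 chunk values.
# With chunk = 26 every value stays below 2^52, which a double stores exactly.
encode_chunks <- function(tag, chunk = 26) {
    digits <- as.numeric(strsplit(chartr("acgt", "0123", tolower(tag)), "")[[1]])
    groups <- ceiling(seq_along(digits) / chunk)
    vapply(split(digits, groups),
           function(d) sum(d * 4^((length(d) - 1):0)),
           numeric(1))
}

# Hypothetical helper: rebuild the tag from its chunk values and its length.
decode_chunks <- function(values, tag_length, chunk = 26) {
    # Chunk lengths follow from the tag length: full chunks first, remainder last
    lens <- diff(c(seq(0, tag_length - 1, by = chunk), tag_length))
    pieces <- mapply(function(v, len) {
        d <- (v %/% 4^((len - 1):0)) %% 4
        paste(d, collapse = "")
    }, values, lens)
    chartr("0123", "acgt", paste(pieces, collapse = ""))
}

tag <- "acgtacgtacgtacgtacgtacgtacgtacgtacg"   # 35 bases
enc <- encode_chunks(tag)                      # two numbers: 26 + 9 bases
decode_chunks(enc, nchar(tag))
```

This handles one tag at a time; for ~10 million tags the same idea would need to be vectorised, for example by encoding the first 26 and the remaining positions as two separate numeric columns.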
01-09-2009, 10:31 AM   #2
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Check out nuID

We have used nuID (https://prod.bioinformatics.northwestern.edu/nuID/) in the past for something similar. nuID provides a Perl utility and is also part of the 'lumi' R package available through Bioconductor. It is a nice utility because it provides mechanisms for compression and decompression with error checking. On a side note, if anyone knows of something similar written in Python, I would love to hear about it!
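
To illustrate the general idea of string-based compression, here is a minimal sketch that packs three bases into one character from a 64-symbol alphabet. It is not nuID's actual algorithm and does no error checking. The alphabet and the names pack_tag and unpack_tag are assumptions made for illustration, and the sketch assumes tags contain only A, C, G and T.

```r
# Hypothetical 64-symbol alphabet; any 64 distinct characters would do.
alphabet <- c(LETTERS, letters, 0:9, "_", ".")

# Pack a tag: 3 bases (2 bits each) become one alphabet character.
pack_tag <- function(tag) {
    digits <- as.integer(strsplit(chartr("ACGT", "0123", toupper(tag)), "")[[1]])
    # Pad with zeros ("A") so the length is a multiple of 3
    pad <- (3 - length(digits) %% 3) %% 3
    digits <- c(digits, rep(0L, pad))
    m <- matrix(digits, nrow = 3)
    codes <- m[1, ] * 16L + m[2, ] * 4L + m[3, ]
    # The first character records how many padding bases were added
    paste(c(pad, alphabet[codes + 1]), collapse = "")
}

# Unpack: reverse the mapping and drop the padding bases.
unpack_tag <- function(packed) {
    chars <- strsplit(packed, "")[[1]]
    pad <- as.integer(chars[1])
    codes <- match(chars[-1], alphabet) - 1L
    digits <- as.vector(rbind(codes %/% 16L, (codes %/% 4L) %% 4L, codes %% 4L))
    digits <- digits[seq_len(length(digits) - pad)]
    chartr("0123", "ACGT", paste(digits, collapse = ""))
}

packed <- pack_tag("ACGTACGTACGTACGTACGTACGTACGTACGTACG")   # 35 bases
nchar(packed)        # 13 characters: 1 for padding + 12 for 36 packed bases
unpack_tag(packed)
```

Because the result is a string rather than a number, there is no length limit, and a 35-base tag shrinks to 13 characters. Case is not preserved: the unpacked tag comes back in upper case.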
01-09-2009, 08:32 PM   #3
foolishbrat
Member
 
Location: South East Asia

Join Date: Nov 2008
Posts: 44

Thank you so much, lgoff. This is invaluable.