**robs** · 07-02-2010, 08:05 PM

Do you need to count the barcodes or the unique sequences between the barcodes?
You could first count the unique (full) sequences (probably 2-3 mins) to reduce the number of sequences to process, and then check those unique sequences for the barcodes using a regex or an algorithm for approximate string matching.
Do you know the original barcodes? Do your mismatches include indels?
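
A minimal Python sketch of the "count unique full sequences first" idea, assuming the input is plain text with one read per line. The function name `count_unique_reads` and the sample reads are made up for illustration and are not from the original data.

```python
from collections import Counter

def count_unique_reads(lines):
    """Collapse identical reads into a mapping of sequence -> count.

    Assumes one read per line; blank lines are skipped and case is ignored.
    """
    return Counter(line.strip().upper() for line in lines if line.strip())

# Hypothetical toy input; a real run would iterate over the open data file.
reads = ["ACGTACGT", "ACGTACGT\n", "TTGACCAA", "acgtacgt", ""]
unique = count_unique_reads(reads)

for seq, n in unique.most_common():
    print(seq, n)
```

Any later regex or approximate matching then only has to look at the keys of `unique`, while the counts carry the original frequencies.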

**shainaporter** · 07-06-2010, 09:23 AM

Oops, sorry, we refer to the unknown 20bp sequence as a "barcode", but I realize that the term means something else to the rest of the world.
We do not know the original 20bp sequence, as they were created from randomized oligos. The mismatches will not include indels.
A line of our data looks like this:
GGCGCGCCNNNNNNNNNNNNNNNNNNNNGGCCAT
With the Ns being our unknown sequence, flanked by "known" sequence.
Basically, we want to compare bases 9-28 (the 20 Ns) of each line of data and count how many times each one is found among the ~30 million lines of data.
I hope that is clearer, thanks so much for your help!

**robs** · 07-06-2010, 09:51 AM

One more thing to think about. Since you want to group the sequences with 2 allowed mismatches, you run into the problem of clustering. You basically have to calculate the distance between all the sequences and then group them. There are different approaches on how to cluster or classify and each of them might give you a different number.

I would suggest the following:
1) extract the "unknown" sequence
2) remove duplicates, but keep the counts
3) calculate the distance between all sequences (I would suggest Hamming distance, since there are no indels)
4) use a clustering or classification method to get the number of "groups" with at most 2 mismatches

This is kind of similar to finding OTUs for, e.g., 16S sequences. There are a bunch of programs already designed to do the work for you. You could do steps (1) and (2) and then input the data into one of those programs.
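
A rough Python sketch of the four steps above, under several assumptions: the flanks are exactly the ones in the example read (reads with errors in the flanks are simply skipped), the unknown region is exactly 20 bases, and step (4) uses a simple greedy scheme (most abundant sequence first) that is only one of many possible clustering choices. All function names are hypothetical.

```python
import re
from collections import Counter

# Flanks copied from the example read in the thread; exact matching of the
# flanks is an assumption made to keep the sketch short.
READ_PATTERN = re.compile(r"^GGCGCGCC([ACGTN]{20})GGCCAT")

def extract_tags(lines):
    """Steps 1-2: extract the 20 bp unknown region and count duplicates."""
    counts = Counter()
    for line in lines:
        match = READ_PATTERN.match(line.strip().upper())
        if match:
            counts[match.group(1)] += 1
    return counts

def hamming(a, b):
    """Step 3: number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def greedy_cluster(counts, max_mismatches=2):
    """Step 4 (one possible choice): greedy clustering, most abundant first.

    Each sequence joins the first existing centre within max_mismatches;
    otherwise it becomes a new centre. Returns centre -> summed count.
    """
    clusters = {}
    for seq, n in counts.most_common():
        for centre in clusters:
            if hamming(seq, centre) <= max_mismatches:
                clusters[centre] += n
                break
        else:
            clusters[seq] = n
    return clusters

# Hypothetical toy reads: tag_b differs from tag_a at two positions.
tag_a, tag_b, tag_c = "A" * 20, "A" * 18 + "CC", "T" * 20
reads = (["GGCGCGCC" + tag_a + "GGCCAT"] * 3
         + ["GGCGCGCC" + tag_b + "GGCCAT"]
         + ["GGCGCGCC" + tag_c + "GGCCAT"] * 2
         + ["GGCGCGCCACGTGGCCAT"])  # too short, so it is skipped

counts = extract_tags(reads)
clusters = greedy_cluster(counts)
print(clusters)
```

The greedy pass compares every unique sequence against every centre found so far, so with millions of distinct 20-mers it would be slow in pure Python, and a different clustering method may give different group counts.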

comparing seq data to itself for frequency