I have an interesting problem that I'm not sure how to approach. I have a library of randomized short inserts (21 nt) that has been sequenced using the SOLiD platform, with 25 nt reads. The insert will be at the very start of the reads. I want to count the distinct insert sequences. The straightforward way appears to be to convert the reads to fastq, filter based on quality, and count in base space. I'm worried about errors, as I have no way of checking for them that I can see, other than the last 4 nt (22-25) which should be identical in all reads. Any suggestions or interesting approaches to accomplish this?
-
There are two levels to your question.
The first level is, how do I summarize and count sequences that may vary by zero, one or a handful of bases? In other words how does a person group sequences together that have a small edit distance from each other? While I am unaware of a program that does this, multiple sequence alignment programs or sequence comparison programs could potentially help out here. Or I suspect that a custom program could be written relatively quickly (I have done something similar and I don't recall it being too hard with the proper modules or helper program handy.)
The need for an edit-distance-aware counting program exists no matter what sequencing platform you are using, unless you are willing to settle for an edit distance of zero, i.e., exact matches only, in which case 'cut', 'sort', 'uniq' and 'wc' are your friends.
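For the exact-match case, the same counting can also be sketched in Python. This is only a minimal sketch: it assumes each read has already been reduced to a plain string of calls and that the insert sits at the start of the read. The function name `count_inserts` and the toy reads are made up for illustration.

```python
from collections import Counter

def count_inserts(reads, insert_len=21):
    """Count exact-match inserts taken from the start of each read.

    Reads shorter than insert_len are skipped (an assumption of this sketch).
    """
    return Counter(r[:insert_len] for r in reads if len(r) >= insert_len)

# Toy reads with a 4-symbol "insert" followed by a constant tail.
reads = ["0123121", "0123121", "3321121", "0123121"]
counts = count_inserts(reads, insert_len=4)
for insert, n in counts.most_common():
    print(insert, n)
```

With real data the reads would come from your own file parser, and `insert_len` would be set to wherever the insert ends in your read layout.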
The second level is, how can I do the above edit-distance-aware counting with SOLiD data? Here you are working with gold because any color-space sequences with an edit distance of 0 or 1 are almost certainly the same. An edit distance (or mismatch) of 1 in color-space means a machine error, since a real single-base change alters two adjacent colors. That is it. An edit distance of 1 on any other platform (454, Illumina, 3730) could mean either a machine error or a SNP, and no one can tell without further inquiry (which might include looking at quality values). However ... and this is the big however ... you must do all of your work within color-space (or its bastard cousin 'double-encoded' space if required by the counting program), because as soon as you convert from color-space to base-space you not only lose the advantage of edit distance but also potentially screw up the base calls: one wrong color changes every decoded base after it.
The cardinal rule of thumb when working in color-space is to not convert to base-space until the very last possible step.
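The edit-distance-aware counting described above could be sketched like this for mismatches only (no indels). This is a hypothetical greedy heuristic, not an existing tool: sequences are visited from most to least abundant, and each one is merged into an already accepted sequence if they differ at exactly one color. The names `neighbors` and `collapse_single_mismatches` and the toy counts are invented for the example.

```python
from collections import Counter

COLORS = "0123"

def neighbors(seq):
    """Yield every color sequence that differs from seq at exactly one position."""
    for i, c in enumerate(seq):
        for alt in COLORS:
            if alt != c:
                yield seq[:i] + alt + seq[i + 1:]

def collapse_single_mismatches(counts):
    """Greedily merge each sequence into a more abundant 1-mismatch neighbor."""
    merged = Counter()
    # Visit abundant sequences first so that they become the representatives.
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        parent = next((nb for nb in neighbors(seq) if nb in merged), None)
        merged[parent if parent is not None else seq] += n
    return merged

# Toy color-space counts (4 colors long to keep the example readable).
raw = Counter({"0123": 10, "0133": 1, "3321": 5, "3320": 1, "1111": 2})
collapsed = collapse_single_mismatches(raw)
print(dict(collapsed))
```

Looking up the neighbors of each sequence in a dictionary avoids comparing every pair of sequences. Whether a greedy rule like this suits a given library is a judgment call.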
Hope this helps a bit. Sorry I do not have a specific program to recommend.
-
Originally posted by kumar:
This is getting slightly OT, but if I expect a constant region of sequence at the end of my reads (ideally the same 4 nt), can I expect the last 3 colorspace calls to be identical across reads? Assuming everything went perfectly as planned.
Base-space sequences, all ending in GTCA:
>one
AAAAGTCA
>two
ACCCGTCA
>three
ACGTGTCA
>four
GGGGGTCA
>five
GTTGGTCA
>six
CCGGGTCA
>seven
TATAGTCA
The same sequences in color-space:
>one
A0002121
>two
A1003121
>three
A1311121
>four
G0000121
>five
G1010121
>six
C0300121
>seven
T3332121
So you can see that '121' is always going to be there no matter what your start bases are.
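The conversion above can be reproduced with a few lines of Python. This is a small sketch of the standard two-base encoding, where each color is the XOR of the indices of two adjacent bases in "ACGT"; the function name `base_to_color` is made up for the example.

```python
BASES = "ACGT"

def base_to_color(seq):
    """Encode a base-space sequence as its first base followed by color calls."""
    # Each color is the XOR of the two adjacent bases' indices in "ACGT".
    colors = (str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:]))
    return seq[0] + "".join(colors)

examples = ["AAAAGTCA", "ACCCGTCA", "ACGTGTCA", "GGGGGTCA",
            "GTTGGTCA", "CCGGGTCA", "TATAGTCA"]
for seq in examples:
    print(seq, base_to_color(seq))
```

Every encoded sequence ends in '121', while the color just before it depends on the base that precedes GTCA.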
However, this does not mean that '121' is always going to stand for 'GTCA'. The inverse conversion goes from
>one-rev
A3212121
>two-rev
A0000121
>three-rev
A1111121
>four-rev
A2123121
to
>one-rev
ATCAGTCA
>two-rev
AAAAACTG
>three-rev
ACACACTG
>four-rev
AGTCGTCA
Different ending bases: here '121' decodes to either GTCA or ACTG, depending on the base that precedes it. This is yet another example of why to do all of your work in color-space before, at the very end, converting into base-space.
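The inverse conversion can be sketched the same way. Decoding walks along the read, and each base depends on the previous decoded base, which is why the same colors can give different bases and why a single wrong color changes everything after it. The function name `color_to_base` and the read with the deliberately introduced error are made up for the example.

```python
BASES = "ACGT"

def color_to_base(cs):
    """Decode first base + color calls back into base-space."""
    bases = [cs[0]]
    for color in cs[1:]:
        # Next base = previous base index XOR color.
        bases.append(BASES[BASES.index(bases[-1]) ^ int(color)])
    return "".join(bases)

for cs in ["A3212121", "A0000121", "A1111121", "A2123121"]:
    print(cs, color_to_base(cs))

# One wrong color (second call 0 -> 1) changes all bases downstream.
good = color_to_base("A0002121")
bad = color_to_base("A0102121")
print(good, bad)
```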
-
@westerman Thanks again. I'm learning how colorspace can be your friend (hopefully I don't have to eat those words). Any suggestions on places to look for sequence comparison algorithms using colorspace? Are there any libraries (python preferred, but perl and C acceptable) for working with sequences in colorspace?
-
There are a number of mapping programs that work with color-space. If required, you can always convert your 0123 CS into the dreaded (but sometimes useful) ACGT "double-encoded-color-space" and use a base-space-aware package to work in that pseudo-color-space.
As far as your project goes, no one has chimed in yet with a "yes, here is a good edit-distance-aware comparison program" (which is what you need), so it may be time to write your own. I don't think that it would be difficult. The time I did something similar I used Perl's Bio::Grep and the agrep and vmatch options within it. That was for a base-space project, so the tool should work with double-encoded color-space.
If you are unaware of double-encoded space: each 0 in CS is replaced with an 'A', each 1 with a 'C', each 2 with a 'G' and each 3 with a 'T'. Telling the difference between a double-encoded file and a true base-space file is left up to the imagination. :-(
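As a rough illustration of double-encoding, here is a minimal sketch using the 0/1/2/3 to A/C/G/T mapping. It assumes the input looks like the reads above (a leading base followed by color digits) and simply drops that leading base. How a particular tool treats the leading base is not covered here, and the function name `double_encode` is made up.

```python
# Map color calls 0/1/2/3 to the letters A/C/G/T.
DOUBLE_ENCODE = str.maketrans("0123", "ACGT")

def double_encode(cs):
    """Double-encode the color calls of a read such as 'A0002121'.

    Assumption of this sketch: the first character is a real base and is
    dropped, so that only color calls are translated.
    """
    return cs[1:].translate(DOUBLE_ENCODE)

print(double_encode("A0002121"))
print(double_encode("T3332121"))
```

The output looks like ordinary base-space sequence, which is exactly why a double-encoded file is so easy to mistake for a real one.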