  • Counting distinct sequences in csfasta

    I have an interesting problem that I'm not sure how to approach. I have a library of randomized short inserts (21 nt) that has been sequenced using the SOLiD platform, with 25 nt reads. The insert will be at the very start of the reads. I want to count the distinct insert sequences. The straightforward way appears to be to convert the reads to fastq, filter based on quality, and count in base space. I'm worried about errors, as I have no way of checking for them that I can see, other than the last 4 nt (22-25) which should be identical in all reads. Any suggestions or interesting approaches to accomplish this?

  • #2
    There are two levels to your question.

    The first level is: how do I summarize and count sequences that may vary by zero, one or a handful of bases? In other words, how does a person group together sequences that have a small edit distance from each other? While I am unaware of a program that does this, multiple sequence alignment programs or sequence comparison programs could potentially help out here. Or I suspect that a custom program could be written relatively quickly (I have done something similar and I don't recall it being too hard with the proper modules or helper programs handy).

    The need for an edit-distance-aware counting program would exist no matter what sequencing platform you are using, unless you are willing to settle for an edit distance of zero (i.e., exact matches only), in which case 'cut', 'sort', 'uniq' and 'wc' are your friends.
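For the exact-match case, the same counting can be sketched in Python while staying in color-space. This is a minimal sketch, not a tested tool: the function name `count_inserts`, the `insert_len` parameter and the demo read names are made up for illustration, and it assumes each read sits on a single line made of a primer base followed by color calls, with '.' marking a missing call.

```python
from collections import Counter

def count_inserts(csfasta_lines, insert_len=21):
    """Count distinct insert sequences in color-space by exact match."""
    counts = Counter()
    for line in csfasta_lines:
        line = line.strip()
        # Skip blank lines, comment lines (#) and read headers (>).
        if not line or line[0] in "#>":
            continue
        # Key = primer base + the first `insert_len` color calls.
        key = line[:1 + insert_len]
        # Drop reads that are too short or contain a missing call ('.').
        if len(key) < 1 + insert_len or "." in key:
            continue
        counts[key] += 1
    return counts

# Tiny in-memory demo with hypothetical 4-color "inserts".
demo = [">r1", "T0123", ">r2", "T0123", ">r3", "T0.23", ">r4", "T3210"]
for seq, n in count_inserts(demo, insert_len=4).most_common():
    print(seq, n)
```

For a real file, passing the open file handle (for example `count_inserts(open("reads.csfasta"))`, where the file name is only a placeholder) would work the same way, since the function just iterates over lines.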

    The second level is: how can I do the above edit-distance-aware counting with SOLiD data? Here you are working with gold, because any color-space sequences with an edit distance of 0 or 1 almost certainly came from the same sequence. A single isolated mismatch in color-space means a machine error, and that is it: a real single-base difference changes two adjacent colors, not one. An edit distance of 1 on any other platform (454, Illumina, 3730) could mean either a machine error or a SNP, and no one can tell without further inquiry (which might include looking at quality values). However, and this is the big however, you must do all of your work within color-space (or its bastard cousin 'double-encoded' space, if required by the counting program), because as soon as you convert from color-space to base-space you not only lose the advantage of edit distance but also potentially screw up the base calls.

    The cardinal rule of thumb when working in color-space is to not convert to base-space until the very last possible step.
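The grouping idea above (treat color-space sequences that differ by a single color as the same insert) could be sketched as follows. This is one possible greedy approach, not an established program: the names `one_mismatch_neighbors` and `merge_single_color_errors` are hypothetical, the input is assumed to be a mapping from "primer base + colors" strings to read counts, and ties between several possible centers are resolved arbitrarily. One caveat to keep in mind: a real difference in the last base covered by the key changes only one color inside the key, so it may be safer to let the key extend one color past the insert.

```python
from collections import Counter

COLORS = "0123"

def one_mismatch_neighbors(read):
    """Yield every read that differs from `read` in exactly one color call."""
    primer, colors = read[0], read[1:]
    for i, c in enumerate(colors):
        for alt in COLORS:
            if alt != c:
                yield primer + colors[:i] + alt + colors[i + 1:]

def merge_single_color_errors(counts):
    """Greedily fold each sequence into an already accepted one-mismatch neighbor."""
    merged = Counter()
    # Visit abundant sequences first so that they become the cluster centers.
    for read, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        center = next(
            (nb for nb in one_mismatch_neighbors(read) if nb in merged), None
        )
        if center is None:
            merged[read] += n      # no accepted neighbor: new center
        else:
            merged[center] += n    # probable machine error: add to neighbor
    return merged

# Hypothetical counts for 4-color keys.
raw = Counter({"T0123": 10, "T0133": 1, "T0023": 2, "T3210": 4})
print(merge_single_color_errors(raw))
```

Generating the neighbors of each sequence keeps the work proportional to the number of distinct sequences rather than comparing every pair, which matters for a large randomized library.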

    Hope this helps a bit. Sorry I do not have a specific program to recommend.


    • #3
      @westerman That helps a lot. I hadn't considered the benefits of colorspace, only the difficulties. So I can do all my counting in colorspace and decode only at the end when I need to recover the base sequence. Very cool. Thanks!
      Last edited by kumar; 07-20-2011, 04:46 PM.


      • #4
        @eacker Could you post your message publicly or set your preferences to allow reception of private messages? Thanks.


        • #5
          This is getting slightly OT, but if I expect a constant region of sequence at the end of my reads (ideally the same 4 nt), can I expect the last 3 colorspace calls to be identical across reads? Assuming everything went perfectly as planned.


          • #6
            Originally posted by kumar
            This is getting slightly OT, but if I expect a constant region of sequence at the end of my reads (ideally the same 4 nt), can I expect the last 3 colorspace calls to be identical across reads? Assuming everything went perfectly as planned.
            Yes. Here is an example conversion of 7 sequences that all end in the same 4 bases.


            >one
            AAAAGTCA
            >two
            ACCCGTCA
            >three
            ACGTGTCA
            >four
            GGGGGTCA
            >five
            GTTGGTCA
            >six
            CCGGGTCA
            >seven
            TATAGTCA

            >one
            A0002121
            >two
            A1003121
            >three
            A1311121
            >four
            G0000121
            >five
            G1010121
            >six
            C0300121
            >seven
            T3332121

            So you can see that '121' is always going to be there no matter what your start bases are.

            However, this does not mean that '121' always stands for 'GTCA'. Decoding these four color-space reads, which all end in '121',

            >one-rev
            A3212121
            >two-rev
            A0000121
            >three-rev
            A1111121
            >four-rev
            A2123121

            gives

            >one-rev
            ATCAGTCA
            >two-rev
            AAAAACTG
            >three-rev
            ACACACTG
            >four-rev
            AGTCGTCA

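            The ending bases differ ('GTCA' in some reads, 'ACTG' in others) because the decoded bases depend on every color that came before them. This is yet another example of why you should do all of your work in color-space and convert to base-space only at the very end.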
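The conversions in this post can be reproduced with a few lines of Python. This is only a sketch with made-up function names (`encode`, `decode`); it relies on the fact that the SOLiD color for a pair of adjacent bases equals the XOR of their 2-bit codes when A=0, C=1, G=2, T=3, and it does not handle missing calls ('.') or 'N'.

```python
BASES = "ACGT"  # index in this string is the 2-bit code: A=0, C=1, G=2, T=3

def encode(bases):
    """Base-space -> color-space, keeping the first base as the anchor."""
    colors = [
        str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(bases, bases[1:])
    ]
    return bases[0] + "".join(colors)

def decode(read):
    """Color-space -> base-space, walking forward from the leading base."""
    out = [read[0]]
    for c in read[1:]:
        # Each color tells how to get from the previous base to the next one.
        out.append(BASES[BASES.index(out[-1]) ^ int(c)])
    return "".join(out)

print(encode("AAAAGTCA"))  # A0002121
print(decode("A3212121"))  # ATCAGTCA
print(decode("A0000121"))  # AAAAACTG
```

Because each decoded base depends on the one before it, a single wrong color changes every base downstream: `decode("A0012121")` gives `AAACTGAC` rather than `AAAAGTCA`.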


            • #7
              @westerman Thanks again. I'm learning how colorspace can be your friend (hopefully I don't have to eat those words). Any suggestions on places to look for sequence comparison algorithms using colorspace? Are there any libraries (python preferred, but perl and C acceptable) for working with sequences in colorspace?


              • #8
                There are a number of mapping programs that work with color-space. If required, you can always convert your 0123 CS into the dreaded (but sometimes useful) ACGT "double-encoded-color-space" and use a base-space-aware package to work in that pseudo-color-space.

                As far as your project, no one has chimed in yet with a "yes, here is a good edit-distance aware comparative" program (which is what you need) so it may be time to write your own. I don't think that it would be difficult. The time I did something similar I used Perl's Bio::Grep and the agrep and vmatch options within it. That was for a base-space project so the tool should work with double-encoded-color-space.

                If you are unaware of double-encoded space: basically, each 0 in CS is replaced with an 'A', each 1 with a 'C', each 2 with a 'G' and each 3 with a 'T'. Telling the difference between a double-encoded file and a true base-space file is left up to the imagination. :-(
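For what it is worth, the double-encoding substitution is a one-line translation in Python. The sketch below uses hypothetical helper names (`double_encode`, `double_decode`) and assumes the caller has already stripped the leading primer base, so that the input holds color calls only; any other character (such as '.') is passed through unchanged.

```python
_TO_DOUBLE = str.maketrans("0123", "ACGT")
_FROM_DOUBLE = str.maketrans("ACGT", "0123")

def double_encode(colors):
    """Rewrite color calls 0/1/2/3 as the pseudo-bases A/C/G/T."""
    return colors.translate(_TO_DOUBLE)

def double_decode(pseudo_bases):
    """Undo double_encode: pseudo-bases A/C/G/T back to colors 0/1/2/3."""
    return pseudo_bases.translate(_FROM_DOUBLE)

read = "A0002121"                    # primer base + colors
pseudo = double_encode(read[1:])     # drop the primer base first
print(pseudo)                        # AAAGCGC
print(double_decode(pseudo))         # 0002121
```

Since the result looks exactly like ordinary bases, it is probably worth marking such files clearly (for example in the file name) so they are never mistaken for true base-space data.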


                • #9
                  Originally posted by westerman
                  As far as your project, no one has chimed in yet with a "yes, here is a good edit-distance aware comparative" program (which is what you need) so it may be time to write your own. I don't think that it would be difficult. The time I did something similar I used Perl's Bio::Grep and the agrep and vmatch options within it. That was for a base-space project so the tool should work with double-encoded-color-space.
                  I was planning on writing something, which is why I asked about libraries/modules. Always better to check before rolling your own. If I come up with something useful, I'll post back. Thanks!
