  • Filtering SOLiD reads

    I've got 120 million 50 bp SOLiD reads from a eukaryote, and I'd like to remove anything plastid-related. I've got the assembled genome of the plastid, but I need to do the matching in color space, correct? Normally I'd just do this with BLAST. Is there a tool in Corona that will do this?

    Thanks!

  • #2
    1. Align your reads against the plastid genome (I personally like bwa and bfast).
    2. Once you have the alignments, it is trivial to separate the reads that come
    from one organism or the other.

    If you want to go the ABI way, use BioScope instead of Corona.
    -drd
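A minimal sketch of step 2, assuming the aligner wrote SAM output and that the read names in the SAM file match the ids in your csfasta (both are assumptions, so check your own files). It only relies on the SAM FLAG field, where bit 0x4 marks an unmapped read; `mapped_read_ids` is a made-up helper name.

```python
def mapped_read_ids(sam_lines):
    """Yield the names of reads that aligned, given an iterable of SAM lines."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line, not an alignment record
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        name, flag = fields[0], int(fields[1])
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            yield name
```

Writing these names to a file, one per line, gives the list of plastid reads to drop from the full read set.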


    • #3
      Originally posted by drio
      2. Once you have the alignments, it is trivial to separate the reads that come
      from one organism or the other.
      Hmmm, I beg to differ that it's trivial to separate the reads.
      Getting the IDs of the reads that map to the two different organisms is simple,

      but working with such a large number of reads isn't.
      You will have to use disk-based hash tables or load the sequences into MySQL to sort/extract the reads effectively.
      http://kevin-gattaca.blogspot.com/


      • #4
        Originally posted by KevinLam
        Hmmm, I beg to differ that it's trivial to separate the reads.
        Getting the IDs of the reads that map to the two different organisms is simple,

        but working with such a large number of reads isn't.
        You will have to use disk-based hash tables or load the sequences into MySQL to sort/extract the reads effectively.
        Sort the reads by read ID and iterate over the two sets, dropping the reads that don't map to the organism.
        -drd
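A rough sketch of that sort-and-iterate idea. It assumes read ids look like `1_5_37_F3` (panel, x, y, tag) and that both the csfasta and the list of ids to drop are ordered numerically by panel, x and y; the id layout, the ordering and all function names here are assumptions for illustration, not something confirmed in this thread.

```python
def read_key(read_id):
    """Sort key for ids such as '1_5_37_F3': (panel, x, y) as integers (assumed layout)."""
    panel, x, y = read_id.split("_")[:3]
    return int(panel), int(x), int(y)

def csfasta_records(lines):
    """Yield (read_id, sequence) pairs, skipping blank and '#' comment lines."""
    read_id = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            read_id = line[1:]
        elif read_id is not None:
            yield read_id, line
            read_id = None

def drop_listed(records, drop_ids, key=read_key):
    """Yield records whose id is not in drop_ids.

    Both inputs must already be sorted by `key`; only the current id from
    each stream is held in memory.
    """
    drop_iter = iter(drop_ids)
    current = next(drop_iter, None)
    for read_id, seq in records:
        k = key(read_id)
        # Advance the drop list until it catches up with this record.
        while current is not None and key(current) < k:
            current = next(drop_iter, None)
        if current is not None and key(current) == k:
            continue  # id is on the drop list, so skip the read
        yield read_id, seq
```

Because both streams are open file handles in practice, memory use stays flat regardless of how many reads there are.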


        • #5
          Originally posted by drio
          Sort the reads by read ID and iterate over the two sets, dropping the reads that don't map to the organism.
          I would love to look at your code if you got it working the way you described.
          In my case:

          I needed to extract 40 million IDs from a 70-million-read csfasta.
          Looping through the csfasta is simple,
          but I ran into memory issues when I stored 40 million IDs in a normal hash.
          So I split the IDs into chunks of 1 million (I think I could get away with 10 million, but that failed intermittently) and iterated over the csfasta 40 times.

          The next implementation will use a disk-based hash so that I only need to loop through the csfasta once.

          So if you got it working the way you said, I would really love to see where I went wrong.
          http://kevin-gattaca.blogspot.com/
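For the single-pass, disk-backed lookup described above, here is one possible sketch. It swaps in Python's built-in `sqlite3` module as the on-disk index, which is my stand-in rather than the disk hash or MySQL setup mentioned in the thread; the table name `wanted` and both function names are hypothetical.

```python
import sqlite3

def build_id_index(db_path, ids):
    """Store the wanted read ids in an on-disk SQLite table and return the connection."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS wanted (id TEXT PRIMARY KEY)")
    # executemany consumes the generator lazily, so ids are never all in RAM at once.
    con.executemany("INSERT OR IGNORE INTO wanted VALUES (?)", ((i,) for i in ids))
    con.commit()
    return con

def extract_wanted(con, csfasta_lines):
    """Make one pass over csfasta lines, yielding (read_id, sequence) for indexed ids."""
    read_id = None
    for line in csfasta_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            read_id = line[1:]
        elif read_id is not None:
            hit = con.execute(
                "SELECT 1 FROM wanted WHERE id = ?", (read_id,)
            ).fetchone()
            if hit:
                yield read_id, line
            read_id = None
```

The primary-key index lives on disk, so the id list can be much larger than RAM, at the cost of one indexed lookup per read.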


          • #6
            Originally posted by KevinLam
            I would love to look at your code if you got it working the way you described.
            In my case:

            I needed to extract 40 million IDs from a 70-million-read csfasta.
            Looping through the csfasta is simple,
            but I ran into memory issues when I stored 40 million IDs in a normal hash.
            So I split the IDs into chunks of 1 million (I think I could get away with 10 million, but that failed intermittently) and iterated over the csfasta 40 times.

            The next implementation will use a disk-based hash so that I only need to loop through the csfasta once.

            So if you got it working the way you said, I would really love to see where I went wrong.
            If the reads are sorted by read name, then why do you need such a complicated hash? You should be able to use constant memory and linear time.


            • #7
              Originally posted by nilshomer
              If the reads are sorted by read name, then why do you need such a complicated hash? You should be able to use constant memory and linear time.
              I didn't actually try to sort the csfasta by read name. I just assumed that task was doomed to fail (GNU sort might work for the IDs, but sorting the csfasta in BioPerl or Biopython would probably run out of memory) and moved on to other options.
              I am actually not sure whether they are already sorted when they come off the machine.
              http://kevin-gattaca.blogspot.com/


              • #8
                Originally posted by KevinLam
                I didn't actually try to sort the csfasta by read name. I just assumed that task was doomed to fail (GNU sort might work for the IDs, but sorting the csfasta in BioPerl or Biopython would probably run out of memory) and moved on to other options.
                I am actually not sure whether they are already sorted when they come off the machine.
                They are sorted coming off the machine, so there is no need to re-sort.
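If you would rather verify the order than trust it, a quick constant-memory check might look like the sketch below. The numeric (panel, x, y) key is an assumption about how the ids are ordered, and both function names are made up for illustration; pass `key=str` instead to test plain lexicographic order.

```python
def numeric_key(read_id):
    """Assumed ordering for ids like '1_5_37_F3': compare panel, x, y as integers."""
    return tuple(int(part) for part in read_id.split("_")[:3])

def first_out_of_order(read_ids, key=numeric_key):
    """Return the first (previous, current) pair that breaks ascending order, or None."""
    prev = None
    for read_id in read_ids:
        if prev is not None and key(read_id) < key(prev):
            return prev, read_id
        prev = read_id
    return None
```

A `None` result means the ids are in ascending order under the chosen key; anything else points at the first offending pair.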


                • #9
                  I agree with drio. It's an old, classic computer science problem. Google "Intersection of sorted lists". If your lists aren't sorted, run GNU sort on them beforehand. You only need to write a shell script; there is no need for huge hashes in RAM.
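For reference, the two-pointer intersection of sorted lists can be sketched in a few lines. This is a Python illustration rather than the shell script suggested above, it assumes both inputs are sorted ascending under the same comparison (for plain ASCII ids, the byte order produced by GNU sort under `LC_ALL=C` matches Python's string comparison), and `intersect_sorted` is a hypothetical name.

```python
def intersect_sorted(a, b):
    """Yield items present in both iterables; each must be sorted ascending."""
    it_a, it_b = iter(a), iter(b)
    x, y = next(it_a, None), next(it_b, None)
    while x is not None and y is not None:
        if x < y:
            x = next(it_a, None)      # x cannot appear later in b
        elif y < x:
            y = next(it_b, None)      # y cannot appear later in a
        else:
            yield x                   # match: advance both sides
            x, y = next(it_a, None), next(it_b, None)
```

Each input is read once from start to finish, so the running time is linear in the combined length and the memory use is constant.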
