
  • question on running SOCS program

    I tried to use SOCS to map SOLiD reads (50 bp each) to a set of reference sequences (of varying length, 50 bp or longer). I used the default parameter settings: tolerance and mismatch sensitivity set to 2.

    In the output file "alignments.txt", I found one alignment as follows,

    one of the reads:
    TAATTGATCTAGATAGTGTTCGGCTGATCCATTCGGAAACAGGAAAACACG

    is aligned to the reference sequence:
    TAATTGATCTAGATAGTGTTCGGCTGATCCAAAGCCTTTGTCCTTTCACATG

    The first 31 nt of the read and the aligned reference sequence are the same, but the rest of the read is the complement of the reference sequence. It seems that only the first part of the read was used for the alignment. Is this result reasonable? Any suggestions are highly appreciated!

  • #2
    Hi jinghanna,

    Did you get the bases for the read by directly translating from color space to base space? If you compare the color space sequences:

    T30301232232233211102303212320130230200112020001113
    T30301232232233211102303212320100230200112020021113

    There is most likely a sequencing error at color 31. SOLiD errors change every base to the right of them if you translate from left to right (in this case changing them to their complements). That's why SOLiD aligners do alignment in color space. This allows errors to be distinguished, since it's very unlikely that these color space sequences were the same (except for one color) just by chance. The chance gets higher for color space mismatches close to the end of the read, but in this case you can be pretty sure that the reference sequence is actually what the base space sequence of the read is.
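To make the propagation concrete, here is a minimal Python sketch (not part of SOCS) of left-to-right translation from color space to base space. It assumes the standard SOLiD two-base encoding, in which the color between two adjacent bases is the XOR of their two-bit codes (A=0, C=1, G=2, T=3). The function name `decode_colors` and the "corrected" string are illustrative only. Decoding the read's color string reproduces the bases posted above (the posted read has the primer base T prepended), and changing only color 31 from 3 to 0 turns every base from position 31 onward into its complement.

```python
# Two-bit base codes: the color between adjacent bases is the XOR of their codes.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def decode_colors(color_read):
    """Translate a color-space read (primer base + color digits) to bases, left to right."""
    prev = BASE_TO_CODE[color_read[0]]  # primer base, not part of the read itself
    bases = []
    for color in color_read[1:]:
        prev ^= int(color)  # each base depends on every color to its left
        bases.append(CODE_TO_BASE[prev])
    return "".join(bases)


# Color-space read from the post above.
read_colors = "T30301232232233211102303212320130230200112020001113"
# Same read with only color 31 changed from 3 to 0 (hypothetical correction).
fixed_colors = read_colors[:31] + "0" + read_colors[32:]

read_bases = decode_colors(read_colors)
fixed_bases = decode_colors(fixed_colors)

if __name__ == "__main__":
    print(read_bases)
    print(fixed_bases)
    # Bases 1-30 are identical; bases 31-50 are complements of each other.
    print(read_bases[:30] == fixed_bases[:30])
    print(read_bases[30:].translate(COMPLEMENT) == fixed_bases[30:])
```

Only a color change of 3 (as here, 3 to 0) produces exact complements downstream; other single-color changes shift the downstream bases in a different but equally systematic way.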

    By default, SOCS will not give you a translation, since it assumes it's just the reference sequence (I did this to keep the output files small). If you tell it to look for short variants, alignments.txt will show translations of the reads with any variants detected.
    Last edited by ondovb; 05-14-2010, 08:03 AM. Reason: misspelled jinghanna...


    • #3
      Thanks a lot, ondovb. Your reply completely resolved my puzzle.

      Earlier I did not realize that one error in the base space could lead to all wrong bases following that base. The alignment needs to be done in color space.

      One more question, if I want to run SOCS on a cluster, do I simply need to add the option -N, and then specify the number of nodes to be used, just like

      socs -N 5

      Thanks a lot for your help!


      • #4
        Originally posted by jinghanna
        One more question, if I want to run SOCS on a cluster, do I simply need to add the option -N, and then specify the number of nodes to be used, just like

        socs -N 5

        You also need to tell each node which one it is with -n, i.e.:

        socs -N 5 -n 1 ...
        socs -N 5 -n 2 ...
        socs -N 5 -n 3 ...
        ...
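As a rough illustration of the numbering scheme, the Python sketch below builds one command line per node, with -N set to the total number of nodes and -n counting from 1 to N. It uses only the flags and placeholder file names that appear in commands posted in this thread (-p, -r, -c, -q, -d, -N, -n); the helper name `socs_node_commands` and the per-node output directory naming are assumptions for the example, and the script only prints the commands rather than running them.

```python
import shlex


def socs_node_commands(total_nodes, ref, csfasta, qual, out_prefix):
    """Return one socs argument list per node, with nodes numbered 1..total_nodes."""
    commands = []
    for node in range(1, total_nodes + 1):
        commands.append([
            "socs", "-p",                     # -p: used for batch/cluster runs in this thread
            "-r", ref,
            "-c", csfasta,
            "-q", qual,
            "-d", f"{out_prefix}{node}",      # hypothetical per-node output directory
            "-N", str(total_nodes),           # total number of nodes
            "-n", str(node),                  # which node this job is
        ])
    return commands


if __name__ == "__main__":
    # Placeholder file names as used elsewhere in this thread.
    for cmd in socs_node_commands(3, "ref_seq.fa", "xxx.csfasta", "xxx.qual", "output"):
        print(shlex.join(cmd))
```

Each printed line can then be placed in its own job script and queued on the cluster.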


        • #5
          Got it, thanks again!


          • #6
            Hi there,

            I'm sorry, could you please elaborate on how to run the program on a cluster? I installed it on a cluster with about 40 nodes (I intend to use only 5 or 10 as a test).

            As an example, let's say I have a test set with approximately 100,000 reads. I want to run SOCS across 10 nodes, each using all 8 processors on the node. How do I edit the socs.pref file to achieve this? How do I know which nodes the process was allocated to? Perhaps you could give a sample .pref file for reference?

            I'm trying to map to large genomes such as the human or mouse genome. Do you have an estimate of the running time?

            Thanks!
            Last edited by Haneko; 06-29-2010, 07:02 PM. Reason: Added question


            • #7
              run SOCS on computer clusters

              Below is what I did to run SOCS on a computer cluster:

              First create a template script with the command "socs" and add "-n [datagram]" to the command. The template script should look something like this:
              input1 = [datagram1]
              input2 = [datagram2]
              socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d [datagram1] -N 3 -n [datagram2]

              Do not forget the parameter -p, which is necessary for batch or cluster runs.

              Then create the datagram file. In this case, it will be the numbers from 1 to N:
              ~~~
              output1 1
              output2 2
              output3 3
              ~~~

              Finally, you will need a general cluster submission script, which should contain all environment settings and your template script, to submit jobs to the computer cluster, something like

              submitjobs.sh --script template_script --datagrams datagram_file

              Hope this helps.


              • #8
                For an estimate of running time, please refer to this paper published by the original authors:

                Brian D. Ondov, Anjana Varadarajan, Karla D. Passalacqua, and Nicholas H. Bergman, "Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications," Bioinformatics 2008 December 1; 24(23): 2776–2777.


                • #9
                  Haneko, we have an MPI version of novoalign that is able to map color space reads using as many nodes as you like. If you would like to give it a run then PM me. I have been running these sorts of tests on large reference genomes such as human and mouse.





                  • #10
                    Hi jinghanna,

                    Thanks! Just to make sure I've really understood, could I simply have 3 scripts:

                    script1 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output1 -N 3 -n 1
                    script2 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output2 -N 3 -n 2
                    script3 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output3 -N 3 -n 3

                    Then separately queue them into the cluster? They don't necessarily have to run in parallel (as in, at the exact same time), right?

                    Hi zee,

                    I actually want to use the new bisulfite mapping algorithm from SOCS, so I don't think novoalign fits my needs. But thanks for the suggestion!


                    • #11
                      Hi Haneko,

                      I believe you can do that. After all the jobs are done, you will need to run combineAlignments.pl to join the results from different output directories.


                      • #12
                        Hi jinghanna,

                        Thanks a lot for your help!!


                        • #13
                          FYI, and just for clarification: novoalign does bisulfite alignment, but currently not for SOLiD reads.
                          In fact, I'm not aware of anybody doing bisulfite sequencing with SOLiD as yet.



                          • #14
                            Hi zee,

                            Oh ok! But I'm dealing with SOLiD reads now, unfortunately.


                            • #15
                              jinghanna, thanks for answering Haneko's questions.

                              A couple of other notes:

                              - The output directories can be the same for each node, since they will each include their node # in their output file names. If your nodes have a shared file system, this can save you some copying.

                              - Running times for bisulfite are a lot longer than for the standard algorithm. For reference, we aligned ~55M bisulfite reads to Arabidopsis in about 30 hours using 16 threads (with sensitivity=3).
