  • question on running SOCS program

    I tried to use SOCS to map SOLiD reads (50 bp each) to a set of reference sequences of varying length (50 bp or longer each). I used the default parameter settings: tolerance and mismatch sensitivity both set to 2.

    In the output file "alignments.txt", I found one alignment as follows,

    one of the reads:
    TAATTGATCTAGATAGTGTTCGGCTGATCCATTCGGAAACAGGAAAACACG

    is aligned to the reference sequence:
    TAATTGATCTAGATAGTGTTCGGCTGATCCAAAGCCTTTGTCCTTTCACATG

    The first 31 nt of the read and the aligned reference sequence are identical, but the rest of the read is the complement of the reference sequence. It seems that only the first part of the read was used for the alignment. Is this result reasonable? Any suggestions are highly appreciated!

  • #2
    Hi jinghanna,

    Did you get the bases for the read by directly translating from color space to base space? If you compare the color space sequences:

    T30301232232233211102303212320130230200112020001113
    T30301232232233211102303212320100230200112020021113

    There is most likely a sequencing error at color 31. SOLiD errors change every base to the right of them if you translate from left to right (in this case changing them to their complements). That's why SOLiD aligners do alignment in color space. This allows errors to be distinguished, since it's very unlikely that these color space sequences were the same (except for one color) just by chance. The chance gets higher for color space mismatches close to the end of the read, but in this case you can be pretty sure that the reference sequence is actually what the base space sequence of the read is.
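To make the effect concrete, here is a minimal Python sketch of left-to-right color-space decoding. It assumes the usual SOLiD two-bit encoding (A=0, C=1, G=2, T=3, with each color being the XOR of two adjacent base codes); the function names are illustrative and are not part of SOCS.

```python
# Sketch of SOLiD color-space decoding; helper names are illustrative, not SOCS code.
BASES = "ACGT"  # assumed two-bit codes: A=0, C=1, G=2, T=3


def decode(colorspace):
    """Translate a color-space read (primer base + colors) from left to right."""
    prev = BASES.index(colorspace[0])  # leading primer base, e.g. 'T'
    out = []
    for color in colorspace[1:]:
        prev ^= int(color)  # a color is the XOR of two adjacent base codes
        out.append(BASES[prev])
    return "".join(out)


def color_mismatches(a, b):
    """1-based positions (primer excluded) where two color strings differ."""
    return [i for i, (x, y) in enumerate(zip(a[1:], b[1:]), start=1) if x != y]


read = "T30301232232233211102303212320130230200112020001113"
ref = "T30301232232233211102303212320100230200112020021113"

read_bases = decode(read)
ref_bases = decode(ref)
print(color_mismatches(read, ref))
print(read_bases)
print(ref_bases)
```

With the two strings as posted, the color mismatches fall at positions 31 and 46. Bases 1 to 30 decode identically, and every base from position 31 onward differs between the two translations, which is the pattern described in the first post.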

    By default, SOCS will not give you a translation, since it assumes it's just the reference sequence (I did this to keep the output files small). If you tell it to look for short variants, alignments.txt will show translations of the reads with any variants detected.
    Last edited by ondovb; 05-14-2010, 08:03 AM. Reason: misspelled jinghanna...


    • #3
      Thanks a lot, ondovb. Your reply completely resolved my puzzle.

      Earlier I did not realize that one error in color space could make every following base wrong after translation to base space. The alignment needs to be done in color space.

      One more question, if I want to run SOCS on a cluster, do I simply need to add the option -N, and then specify the number of nodes to be used, just like

      socs -N 5

      Thanks a lot for your help!


      • #4
        Originally posted by jinghanna
        One more question, if I want to run SOCS on a cluster, do I simply need to add the option -N, and then specify the number of nodes to be used, just like

        socs -N 5
        You also need to tell each node which one it is with -n, i.e.:

        socs -N 5 -n 1 ...
        socs -N 5 -n 2 ...
        socs -N 5 -n 3 ...
        ...


        • #5
          Got it, thanks again!


          • #6
            Hi there,

            I'm sorry, could you please elaborate on how to run the program on a cluster? I installed it on a cluster with about 40 nodes (I intend to use only 5 or 10 as a test).

            Just as an example, let's say I have a test set with approx. 100,000 reads. I want to run SOCS across 10 nodes, each using all 8 processors on the node. How do I go about editing the socs.pref file to achieve this? How do I know which nodes the process was allocated to? Perhaps you could give a sample .pref file for reference?

            I'm trying to map to large genomes such as the human or mouse genome. Do you have any estimate of the running time?

            Thanks!
            Last edited by Haneko; 06-29-2010, 07:02 PM. Reason: Added question


            • #7
              run SOCS on computer clusters

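              Below is what I did to run SOCS on a computer cluster.

              First, create a template script containing the "socs" command, with placeholders for the values that change per node: the output directory (-d [datagram1]) and the node number (-n [datagram2]). The template script should look something like this: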
              input1 = [datagram1]
              input2 = [datagram2]
              socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d [datagram1] -N 3 -n [datagram2]

              Do not forget the parameter -p, which is necessary for batch or cluster runs.

              Then create the datagram file. In this case, each line gives an output directory and a node number from 1 to N:
              ~~~
              output1 1
              output2 2
              output3 3
              ~~~
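A datagram file like this can be generated rather than typed by hand. The sketch below assumes the two-column layout shown above (output directory, then node number) and the `outputN` naming from the example; the function name is hypothetical.

```python
def datagram_lines(total_nodes, prefix="output"):
    """Return one 'directory node_number' line per node, numbered from 1."""
    return [f"{prefix}{i} {i}" for i in range(1, total_nodes + 1)]


# Print the lines for a 3-node run; redirect to a file to create the datagram file.
print("\n".join(datagram_lines(3)))
```

The value passed to `datagram_lines` should match the number given to -N in the template script.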

              Finally, you will need a general cluster submission script, which should contain all environment settings and your template script, to submit jobs to the computer cluster, something like

              submitjobs.sh --script template_script --datagrams datagram_file

              Hope this helps.


              • #8
                For an estimate of the running time, please refer to this paper by the original authors:

                Brian D. Ondov, Anjana Varadarajan, Karla D. Passalacqua, and Nicholas H. Bergman, "Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications," Bioinformatics 2008 December 1; 24(23): 2776–2777.


                • #9
                  Haneko, we have an MPI version of novoalign that is able to map color space reads using as many nodes as you like. If you would like to give it a run then PM me. I have been running these sorts of tests on large reference genomes such as human and mouse.





                  • #10
                    Hi jinghanna,

                    Thanks! Just to make sure I've really understood, could I simply have 3 scripts:

                    script1 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output1 -N 3 -n 1
                    script2 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output2 -N 3 -n 2
                    script3 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output3 -N 3 -n 3

                    Then separately queue them into the cluster? They don't necessarily have to run in parallel (as in, at the exact same time), right?

                    Hi zee,

                    I actually want to use the new bisulfite mapping algorithm from SOCS, so I don't think novoalign fits my needs. But thanks for the suggestion!


                    • #11
                      Hi Haneko,

                      I believe you can do that. After all the jobs are done, you will need to run combineAlignments.pl to join the results from different output directories.
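For more than a handful of nodes, the per-node command lines can be generated instead of written out. This is only a sketch: it uses the flags that appear in this thread (-p, -r, -c, -q, -d, -N, -n) with the same placeholder file names, the function name is made up for illustration, and how each line is submitted depends on your scheduler, which is not shown here.

```python
import shlex


def socs_commands(total_nodes, ref="ref_seq.fa", csfasta="xxx.csfasta",
                  qual="xxx.qual", out_prefix="output"):
    """Build one socs command line per node (node numbers start at 1)."""
    commands = []
    for node in range(1, total_nodes + 1):
        args = ["socs", "-p", "-r", ref, "-c", csfasta, "-q", qual,
                "-d", f"{out_prefix}{node}",
                "-N", str(total_nodes), "-n", str(node)]
        commands.append(shlex.join(args))  # quotes arguments only when needed
    return commands


for line in socs_commands(3):
    print(line)
```

Each printed line can go into its own job script and be queued independently.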


                      • #12
                        Hi jinghanna,

                        Thanks a lot for your help!!


                        • #13
                          FYI, and just for clarification: novoalign does bisulfite alignment, but currently not for SOLiD reads.
                          In fact, I'm not aware of anybody doing bisulfite sequencing with SOLiD as yet.



                          • #14
                            Hi zee,

                            Oh ok! But I'm dealing with SOLiD reads now, unfortunately.


                            • #15
                              jinghanna, thanks for answering Haneko's questions.

                              A couple of other notes:

                              - The output directories can be the same for each node, since they will each include their node # in their output file names. If your nodes have a shared file system, this can save you some copying.

                              - Running times for bisulfite are a lot longer than for the standard algorithm. For reference, we aligned ~55M bisulfite reads to Arabidopsis in about 30 hours using 16 threads (with sensitivity=3).
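As a rough aid for planning, the figures in the second note can be turned into a throughput number. The inputs are the approximate values quoted above, so the result is only a ballpark for that particular run (bisulfite mode, Arabidopsis, sensitivity=3) and should not be assumed to carry over to other genomes or settings.

```python
reads = 55e6   # "~55M" bisulfite reads (approximate)
hours = 30     # "about 30 hours"
threads = 16

reads_per_hour = reads / hours
reads_per_thread_hour = reads / (hours * threads)

print(f"{reads_per_hour:,.0f} reads/hour")
print(f"{reads_per_thread_hour:,.0f} reads/thread-hour")
```

That works out to about 1.8 million reads per hour overall, or roughly 115,000 reads per thread-hour.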
