  • jinghanna
    Member
    • May 2010
    • 10

    #16
    Ondov, thanks for the notes!


    • Haneko
      Member
      • Jan 2010
      • 36

      #17
      Hi ondovb,

      Many thanks for the notes! That was actually my concern. My team had a discussion about the running time, since we are not working with Arabidopsis samples.

      Which brings me to something else I just thought of: what is the difference between running on multiple threads and on multiple nodes? I currently have threads=1 and total nodes=10.

      On a side note, I realised that my processes, which have been split across 10 nodes, are all going into sleep mode. Could this be because I didn't allocate enough RAM?


      • ondovb
        Member
        • Jan 2010
        • 20

        #18
        Threads: should be the number of cores you want to use on each node. You mentioned you have 8 cores per node, so you'll want 8 threads to use them all.

        Running time: will be linear with respect to genome length. Our data took 480 cpu hours, so yours (assuming a similar # of reads) should take 480 * 30 = 14400 cpu hours. If you use all 40 * 8 cores on your cluster, you're looking at about 45 hours.
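As a rough illustration of the arithmetic above (not part of SOCS; the function name and arguments are made up for this sketch), the estimate scales the known CPU-hour cost linearly by the genome-length ratio and then divides by the total core count. It assumes a similar number of reads and perfect scaling across cores, which a real cluster will only approximate.

```python
def estimate_wall_hours(base_cpu_hours, genome_ratio, nodes, cores_per_node):
    """Scale a known CPU-hour cost linearly with genome length,
    then spread it evenly over every core (idealised, no overhead)."""
    total_cpu_hours = base_cpu_hours * genome_ratio
    total_cores = nodes * cores_per_node
    return total_cpu_hours, total_cpu_hours / total_cores


# Figures quoted in the post: 480 CPU hours, 30x longer genome, 40 nodes x 8 cores.
cpu_hours, wall_hours = estimate_wall_hours(480, 30, 40, 8)
print(cpu_hours, wall_hours)  # 14400 45.0
```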

        Sleeping: if you remembered to include the -p flag, I'm not sure what else could cause this. Have you tried running it locally with the same settings and watching the output?


        • Haneko
          Member
          • Jan 2010
          • 36

          #19
          Hi ondovb,

          Yes, I'm running it locally, but it seems to be stuck at the aligning stage:

          Round 1 / 4 (2101986 reads):

          Sensitivity 4:

          EDIT: I have used strace on the process and found it to be at the following state:

          futex(0x40dd79d0, FUTEX_WAIT, 25312, NULL
          Last edited by Haneko; 07-01-2010, 07:06 PM.


          • ondovb
            Member
            • Jan 2010
            • 20

            #20
            I think each instance might appear to be sleeping to the OS because the parent thread just sits and waits for the child threads to finish their computation (even if only one thread is chosen). What does the CPU usage look like?
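To illustrate the pattern described above, here is a generic sketch (it is not SOCS code, and the worker function is invented for the example): a parent thread starts one worker and then blocks in `join()` until the worker finishes. On Linux, a blocked `join()` typically shows up as a futex wait when only the parent thread is traced, even though the worker is busy.

```python
import threading


def worker(results, index, n):
    # Stand-in for the real computation: a CPU-bound toy loop.
    results[index] = sum(i * i for i in range(n))


results = [None]
child = threading.Thread(target=worker, args=(results, 0, 100_000))
child.start()

# The parent does no work of its own here; it simply waits for the child.
child.join()
print(results[0])
```

If this is what is happening, per-thread CPU usage (for example with `top -H`) or `strace -f`, which follows all threads, should show the worker doing the actual work.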

            Sensitivity 4 will take a pretty long time (even on your cluster), which could make it appear to be stuck. I wouldn't recommend going higher than 3. If you set the trim to at least 3, that should get rid of a lot of the errors and you should still be able to align a lot of reads.


            • volks
              Member
              • Jun 2010
              • 80

              #21
              Originally posted by ondovb
              Running time: will be linear with respect to genome length. Our data took 480 cpu hours, so yours (assuming a similar # of reads) should take 480 * 30 = 14400 cpu hours.
              Can you estimate how the running time behaves with respect to # of reads?


              • ondovb
                Member
                • Jan 2010
                • 20

                #22
                Running time is also approximately linear with respect to # of reads, and exponential with respect to sensitivity.


                • sci_guy
                  Member
                  • Jan 2008
                  • 83

                  #23
                  Originally posted by zee
                  In fact I'm not aware of anybody who is doing bisulfite sequencing with SOLiD as yet.
                  I am also another unfortunate soul dealing with SOLiD bisulfite reads. I also know that Thomas Preiss' group in Sydney is working on RNA methylation using SOLiD.


                  • fwessely
                    Junior Member
                    • Oct 2011
                    • 3

                    #24
                    I have aligned a subset of the reads on my machine and have some questions.

                    I received several warnings (e.g. '5719579 substrings of chr1.fa ignored due to 5718003 character(s) other than [ACGTacgt]'). The Ns in the reference file(s) cause this problem and I don't know the impact of the warnings on the overall analysis.

                    At the end of the aligning part it says 'computing error frequencies'. What does this mean?

                    Does SOCS-B run faster if all reference files are merged into one multi-FASTA reference file?

                    I struggle to understand the difference between the mismatch sensitivity (s) and the tolerance (t). Could you briefly explain these two parameters? Can I set them independently?


                    • ondovb
                      Member
                      • Jan 2010
                      • 20

                      #25
                      Originally posted by fwessely
                      I received several warnings (e.g. '5719579 substrings of chr1.fa ignored due to 5718003 character(s) other than [ACGTacgt]'). The Ns in the reference file(s) cause this problem and I don't know the impact of the warnings on the overall analysis.
                      These are just to keep you informed. If you were expecting that many Ns, you can ignore them. The only way they will affect your results is that you can expect coverage dips within a read's length of any Ns in the reference, since SOCS will not map to any substrings that contain an N.
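A small sketch of why the number of ignored substrings is slightly larger than the number of non-ACGT characters: every read-length window that overlaps an N is skipped, so each run of Ns also costs the windows that straddle its edges. This is only an illustration of the counting idea, not SOCS's implementation; the function name and the toy reference are made up.

```python
def count_ignored_windows(reference, read_length):
    """Count read-length substrings containing any character outside [ACGTacgt]."""
    valid = set("ACGTacgt")
    ignored = 0
    invalid_in_window = 0
    for i, base in enumerate(reference):
        if base not in valid:
            invalid_in_window += 1
        # Drop the character that just slid out of the window.
        if i >= read_length and reference[i - read_length] not in valid:
            invalid_in_window -= 1
        # A full window ends at position i once i >= read_length - 1.
        if i >= read_length - 1 and invalid_in_window > 0:
            ignored += 1
    return ignored


# One N in the middle of a toy reference: the 3 windows of length 3 that overlap it are ignored.
print(count_ignored_windows("ACGTNACGT", 3))  # 3
```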

                      Originally posted by fwessely
                      At the end of the aligning part it says 'computing error frequencies'. What does this mean?
                      It is outputting the observed frequency of color-space errors for each position in the read length. The output should be in the stats folder.

                      Originally posted by fwessely
                      Does SOCS-B run faster if all reference files are merged into one multi-FASTA reference file?
                      The speed shouldn’t be affected by separate files. My only suggestion for efficiency in large genomes is to limit the number of ambiguous matches to keep (assuming you don't need all of them). Each read could map to thousands of places in the whole genome, which affects RAM estimation and can cause multiple "rounds" of alignment when you give it bigger chunks of reads.

                      Originally posted by fwessely
                      I struggle to understand the difference between the mismatch sensitivity (s) and the tolerance (t). Could you briefly explain these two parameters? Can I set them independently?
                      I admit sensitivity and tolerance are confusing...here's an example: if the sensitivity (-s) is 3, you are guaranteed to find the best alignment in the genome with 3 or fewer color space mismatches (ignoring bisulfite changes). However, a lot of alignments will also be found by chance that have 4 or more mismatches. If a read only has alignments with 4 or more, you may or may not want to report the best one that was found, since it is not guaranteed to be the best in the whole genome. The threshold for reporting these is set by the tolerance (-t), and this should always be at least as high as sensitivity.
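The reporting rule described above can be sketched as a simple decision on the best mismatch count found for a read. This is a hypothetical model of the explanation, not SOCS's actual code; the function name and the returned labels are invented for illustration.

```python
def classify_alignment(best_mismatches, sensitivity, tolerance):
    """Decide how the best alignment found for a read would be treated.

    best_mismatches: fewest color-space mismatches found, or None if no alignment.
    sensitivity (-s): up to this many mismatches, the best alignment is guaranteed found.
    tolerance (-t): alignments with more mismatches than this are not reported.
    """
    if tolerance < sensitivity:
        raise ValueError("tolerance should be at least as high as sensitivity")
    if best_mismatches is None:
        return "unaligned"
    if best_mismatches <= sensitivity:
        return "reported, guaranteed best in genome"
    if best_mismatches <= tolerance:
        return "reported, best found but not guaranteed"
    return "not reported"


# With -s 3 and -t 4: a 4-mismatch hit found by chance is still reported.
print(classify_alignment(4, sensitivity=3, tolerance=4))
```

Setting tolerance equal to sensitivity in this model means only guaranteed-best alignments are reported.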
