SEQanswers

05-13-2010, 06:35 PM   #1
jinghanna
Member
Location: San Jose
Join Date: May 2010
Posts: 10
question on running SOCS program

I tried to use SOCS to map SOLiD reads (50 bp each) to a set of reference sequences of varying length (50 bp and longer). I used the default parameter settings: tolerance and mismatch sensitivity both set to 2.

In the output file "alignments.txt", I found one alignment as follows,

one of the reads:
TAATTGATCTAGATAGTGTTCGGCTGATCCATTCGGAAACAGGAAAACACG

is aligned to the reference sequence:
TAATTGATCTAGATAGTGTTCGGCTGATCCAAAGCCTTTGTCCTTTCACATG

The first 31 nt of the read and the aligned reference sequence are the same, but the rest of the read is the complement of the reference sequence. It seems only the first part of the read is used for alignment. Is this result reasonable? Any suggestions are highly appreciated!

05-14-2010, 09:02 AM   #2
ondovb
Member
Location: Maryland
Join Date: Jan 2010
Posts: 20

Hi jinghanna,

Did you get the bases for the read by directly translating from color space to base space? If you compare the color space sequences:

T30301232232233211102303212320130230200112020001113
T30301232232233211102303212320100230200112020021113

There is most likely a sequencing error at color 31 (the two sequences also differ at color 46). SOLiD errors change every base to the right of them if you translate from left to right (in this case changing them to their complements). That's why SOLiD aligners do alignment in color space. This allows errors to be distinguished, since it's very unlikely that these color space sequences would be the same (except for two colors) just by chance. The chance gets higher for color space mismatches close to the end of the read, but in this case you can be pretty sure that the reference sequence is actually what the base space sequence of the read is.

By default, SOCS will not give you a translation, since it assumes it's just the reference sequence (I did this to keep the output files small). If you tell it to look for short variants, alignments.txt will show translations of the reads with any variants detected.

Last edited by ondovb; 05-14-2010 at 09:03 AM. Reason: misspelled jinghanna...

05-14-2010, 12:04 PM   #3
jinghanna
Member
Location: San Jose
Join Date: May 2010
Posts: 10

Thanks a lot, ondovb. Your reply completely resolved my puzzle.

Earlier I did not realize that a single color-space error makes every base after it wrong when the read is translated to base space. The alignment needs to be done in color space.

One more question, if I want to run SOCS on a cluster, do I simply need to add the option -N, and then specify the number of nodes to be used, just like

socs -N 5

Thanks a lot for your help!

05-14-2010, 01:53 PM   #4
ondovb
Member
Location: Maryland
Join Date: Jan 2010
Posts: 20

Quote:
Originally Posted by jinghanna View Post
One more question, if I want to run SOCS on a cluster, do I simply need to add the option -N, and then specify the number of nodes to be used, just like

socs -N 5
You also need to tell each node which one it is with -n, i.e.:

socs -N 5 -n 1 ...
socs -N 5 -n 2 ...
socs -N 5 -n 3 ...
...
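
As a rough sketch of what that looks like for all nodes at once, the Python snippet below builds one command line per node. It only uses the `-N` and `-n` flags mentioned in this thread; the helper name `node_commands` is made up for illustration, and the remaining SOCS arguments are left for you to pass in via `extra_args`.

```python
import shlex

def node_commands(total_nodes, extra_args=()):
    """Build one socs command per node, numbering nodes from 1 to total_nodes."""
    commands = []
    for node in range(1, total_nodes + 1):
        argv = ["socs", "-N", str(total_nodes), "-n", str(node), *extra_args]
        commands.append(shlex.join(argv))   # shell-quoted, ready for a job script
    return commands

for line in node_commands(5):
    print(line)
```

Each printed line can then go into its own job script for whatever scheduler your cluster uses.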

05-14-2010, 01:53 PM   #5
jinghanna
Member
Location: San Jose
Join Date: May 2010
Posts: 10

Got it, thanks again!

06-29-2010, 07:20 PM   #6
Haneko
Member
Location: Singapore
Join Date: Jan 2010
Posts: 36

Hi there,

I'm sorry, could you please elaborate on how to run the program on a cluster? I installed it on a cluster with about 40 nodes (I intend to use only 5 or 10 as a test).

Just as an example, let's say I have a test set with approx. 100,000 reads. I want to run SOCS across 10 nodes, each using all 8 processors on the node. How do I go about editing the socs.pref file to achieve this? How do I know which nodes the process was allocated to? Perhaps you could give a sample .pref file for reference?

I'm trying to map to large genomes such as the human or mouse genome. Do you have any estimate of the running time?

Thanks!

Last edited by Haneko; 06-29-2010 at 08:02 PM. Reason: Added question

06-29-2010, 10:28 PM   #7
jinghanna
Member
Location: San Jose
Join Date: May 2010
Posts: 10
run SOCS on computer clusters

Below is what I did to run SOCS on a computer cluster:

First create a template script with the command "socs" and add "-n [datagram]" to the command. The template script should look something like this:
input1 = [datagram1]
input2 = [datagram2]
socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d [datagram1] -N 3 -n [datagram2]

Do not forget the parameter -p, which is necessary for batch or cluster runs.

Then create the datagram file. In this case, it will be the numbers from 1 to N:
~~~
output1 1
output2 2
output3 3
~~~

Finally, you will need a general cluster submission script, which should contain all environment settings and your template script, to submit jobs to the computer cluster, something like

submitjobs.sh --script template_script --datagrams datagram_file
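
Since submitjobs.sh is specific to my cluster, here is a small Python sketch of what I assume the substitution step amounts to: each line of the datagram file fills the `[datagram1]`, `[datagram2]` placeholders in the template. The function name `expand` is hypothetical, and this is not the actual submission tool, only an illustration of the idea.

```python
TEMPLATE = "socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d [datagram1] -N 3 -n [datagram2]"

DATAGRAMS = """\
output1 1
output2 2
output3 3
"""

def expand(template, datagram_text):
    """Fill [datagram1], [datagram2], ... from each whitespace-separated line."""
    commands = []
    for line in datagram_text.splitlines():
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        command = template
        for i, value in enumerate(fields, start=1):
            command = command.replace(f"[datagram{i}]", value)
        commands.append(command)
    return commands

for command in expand(TEMPLATE, DATAGRAMS):
    print(command)
```

Each resulting command is what one cluster job would run.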

Hope this helps.

06-29-2010, 10:35 PM   #8
jinghanna
Member
Location: San Jose
Join Date: May 2010
Posts: 10

For an estimate of the running time, please refer to this paper published by the original authors:

Brian D. Ondov, Anjana Varadarajan, Karla D. Passalacqua, and Nicholas H. Bergman, "Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications," Bioinformatics 2008 December 1; 24(23): 2776-2777.

http://www.ncbi.nlm.nih.gov/pmc/arti...pdf/btn512.pdf

06-29-2010, 10:45 PM   #9
zee
NGS specialist
Location: Malaysia
Join Date: Apr 2008
Posts: 249

Haneko, we have an MPI version of novoalign that is able to map color space reads using as many nodes as you like. If you would like to give it a run then PM me. I have been running these sorts of tests on large reference genomes such as human and mouse.




06-29-2010, 11:09 PM   #10
Haneko
Member
Location: Singapore
Join Date: Jan 2010
Posts: 36

Hi jinghanna,

Thanks! Just to make sure I've really understood, could I simply have 3 scripts:

script1 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output1 -N 3 -n 1
script2 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output2 -N 3 -n 2
script3 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output3 -N 3 -n 3

Then separately queue them into the cluster? They don't necessarily have to run in parallel (as in, at the exact same time), right?

Hi zee,

I actually want to use the new bisulfite mapping algorithm from SOCS, so I don't think novoalign fits my needs. But thanks for the suggestion!

06-29-2010, 11:13 PM   #11
jinghanna
Member
Location: San Jose
Join Date: May 2010
Posts: 10

Hi Haneko,

I believe you can do that. After all the jobs are done, you will need to run combineAlignments.pl to join the results from different output directories.

06-29-2010, 11:15 PM   #12
Haneko
Member
Location: Singapore
Join Date: Jan 2010
Posts: 36

Hi jinghanna,

Thanks a lot for your help!!

06-29-2010, 11:24 PM   #13
zee
NGS specialist
Location: Malaysia
Join Date: Apr 2008
Posts: 249

FYI, and just for clarification, novoalign does bisulfite alignment, but currently not for SOLiD reads.
In fact, I'm not aware of anybody who is doing bisulfite sequencing with SOLiD as yet.


06-29-2010, 11:39 PM   #14
Haneko
Member
Location: Singapore
Join Date: Jan 2010
Posts: 36

Hi zee,

Oh ok! But I'm dealing with SOLiD reads now, unfortunately.

06-30-2010, 06:59 AM   #15
ondovb
Member
Location: Maryland
Join Date: Jan 2010
Posts: 20

jinghanna, thanks for answering Haneko's questions.

A couple other notes-

- The output directories can be the same for each node, since they will each include their node # in their output file names. If your nodes have a shared file system, this can save you some copying.

- Running times for bisulfite are a lot longer than for the standard algorithm. For reference, we aligned ~55M bisulfite reads to Arabidopsis in about 30 hours using 16 threads (with sensitivity=3).

06-30-2010, 10:57 AM   #16
jinghanna
Member
Location: San Jose
Join Date: May 2010
Posts: 10

Ondov, thanks for the notes!

06-30-2010, 06:19 PM   #17
Haneko
Member
Location: Singapore
Join Date: Jan 2010
Posts: 36

Hi ondovb,

Many thanks for the notes! That was actually my concern. My team had a discussion about the running time, as we are not working with Arabidopsis samples.

Which brings me to something else I just thought of: what is the difference between running on multiple threads and running on multiple nodes? I currently have threads=1 and total nodes=10.

On a side note, I noticed that my processes, which have been split across 10 nodes, are all going into sleep mode. Could this be because I didn't allocate enough RAM?

07-01-2010, 06:24 AM   #18
ondovb
Member
Location: Maryland
Join Date: Jan 2010
Posts: 20

Threads: should be the number of cores you want to use on each node. You mentioned you have 8 cores per node, so you'll want 8 threads to use them all.

Running time: will be linear with respect to genome length. Our data took 480 cpu hours (30 hours x 16 threads), so yours (assuming a similar # of reads and a genome about 30 times longer) should take 480 * 30 = 14400 cpu hours. If you use all 40 * 8 = 320 cores on your cluster, you're looking at about 45 hours.
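
If you want to redo this estimate for a different number of nodes, the arithmetic can be sketched in a few lines of Python. The function name `estimate_wall_hours` is made up, the factor of 30 is the same rough scaling used above, and the calculation assumes perfect scaling across cores, so treat the result as a ballpark figure only.

```python
def estimate_wall_hours(ref_wall_hours, ref_threads, genome_ratio, nodes, cores_per_node):
    """Scale a reference run linearly with genome length and spread it over all cores."""
    ref_cpu_hours = ref_wall_hours * ref_threads      # 30 h * 16 threads = 480 cpu hours
    cpu_hours = ref_cpu_hours * genome_ratio          # linear in genome length
    return cpu_hours / (nodes * cores_per_node)       # assumes perfect parallel scaling

print(estimate_wall_hours(30, 16, 30, nodes=40, cores_per_node=8))   # whole cluster
print(estimate_wall_hours(30, 16, 30, nodes=10, cores_per_node=8))   # 10-node test
```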

Sleeping: if you remembered to include the -p flag, I'm not sure what else could cause this. Have you tried running it locally with the same settings and watching the output?

07-01-2010, 06:11 PM   #19
Haneko
Member
Location: Singapore
Join Date: Jan 2010
Posts: 36

Hi ondovb,

Yes, I'm running it locally, but it seems to be stuck at the aligning stage:

Round 1 / 4 (2101986 reads):

Sensitivity 4:

EDIT: I have used strace on the process and found it to be at the following state:

futex(0x40dd79d0, FUTEX_WAIT, 25312, NULL

Last edited by Haneko; 07-01-2010 at 08:06 PM.

07-07-2010, 06:57 AM   #20
ondovb
Member
Location: Maryland
Join Date: Jan 2010
Posts: 20

I think each instance might appear to be sleeping to the OS because the parent thread just sits and waits for the child threads to finish their computation (even if only one thread is chosen). What does the CPU usage look like?

Sensitivity 4 will take a pretty long time (even on your cluster), which could make it appear to be stuck. I wouldn't recommend going higher than 3. If you set the trim to at least 3, that should get rid of a lot of the errors and you should still be able to align a lot of reads.