SEQanswers

07-14-2010, 02:14 AM   #21
volks
Member
 
Location: hd.de

Join Date: Jun 2010
Posts: 81

Quote:
Originally Posted by ondovb
Running time: will be linear with respect to genome length. Our data took 480 cpu hours, so yours (assuming a similar # of reads) should take 480 * 30 = 14400 cpu hours.
Can you estimate how the running time scales with respect to the number of reads?
07-14-2010, 01:28 PM   #22
ondovb
Member
 
Location: Maryland

Join Date: Jan 2010
Posts: 20

Running time is also approximately linear with respect to # of reads, and exponential with respect to sensitivity.
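
As a rough sketch of that scaling (this is not part of SOCS; the function name and its parameters are made up for illustration), the linear part of the estimate can be written out in Python. The exponential dependence on sensitivity is left out because no base for it is given in this thread.

```python
def estimate_cpu_hours(base_cpu_hours, genome_ratio=1.0, read_ratio=1.0):
    """Scale a measured run time linearly with genome length and read count.

    base_cpu_hours: CPU hours measured on a reference run.
    genome_ratio:   new genome length / reference genome length.
    read_ratio:     new number of reads / reference number of reads.
    Hypothetical helper; assumes sensitivity is unchanged between runs.
    """
    return base_cpu_hours * genome_ratio * read_ratio


# The example quoted earlier: 480 CPU hours, genome 30 times longer, similar read count.
print(estimate_cpu_hours(480, genome_ratio=30))  # 14400.0
```

Doubling the number of reads on top of that would double the estimate again, as long as the linear approximation holds.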
07-22-2010, 09:30 PM   #23
sci_guy
Member
 
Location: Sydney, Australia

Join Date: Jan 2008
Posts: 83

Quote:
Originally Posted by zee
In fact I'm not aware of anybody who is doing bisulfite sequencing with SOLiD as yet.
I am also another unfortunate soul dealing with SOLiD bisulfite reads. I also know that Thomas Preiss' group in Sydney is working on RNA methylation using SOLiD.
11-22-2011, 01:07 PM   #24
fwessely
Junior Member
 
Location: UK

Join Date: Oct 2011
Posts: 3

I have aligned a subset of the reads on my machine and have some questions.

I received several warnings (e.g. '5719579 substrings of chr1.fa ignored due to 5718003 character(s) other than [ACGTacgt]'). The Ns in the reference file(s) cause this problem and I don't know the impact of the warnings on the overall analysis.

At the end of the aligning part it says 'computing error frequencies'. What does this mean?

Would SOCS-B run faster if all reference files were merged into one multi-FASTA reference file?

I struggle to understand the difference between the mismatch sensitivity (s) and the tolerance (t). Could you briefly explain these two parameters? Can I set them independently?
11-22-2011, 01:28 PM   #25
ondovb
Member
 
Location: Maryland

Join Date: Jan 2010
Posts: 20

Quote:
Originally Posted by fwessely
I received several warnings (e.g. '5719579 substrings of chr1.fa ignored due to 5718003 character(s) other than [ACGTacgt]'). The Ns in the reference file(s) cause this problem and I don't know the impact of the warnings on the overall analysis.
These are just to keep you informed. If you were expecting that many Ns, you can ignore them. The only way they will affect your results is that you can expect coverage dips within a read's length of any Ns in the reference, since SOCS will not map to any substrings that contain an N.
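
To get a feel for how many positions are affected, here is a small Python sketch. It is not SOCS code, and it assumes that an "ignored substring" means a read-length window of the reference containing at least one character other than ACGTacgt; the function name is hypothetical.

```python
VALID_BASES = set("ACGTacgt")


def count_ignored_windows(seq, read_len):
    """Return (ignored_windows, invalid_chars) for one reference sequence.

    A window is counted as ignored if it contains any non-ACGT character.
    This mirrors the wording of the warning only under the assumption above.
    """
    invalid_chars = sum(1 for c in seq if c not in VALID_BASES)
    if read_len <= 0 or len(seq) < read_len:
        return 0, invalid_chars

    # Number of invalid characters inside the current window.
    bad_in_window = sum(1 for c in seq[:read_len] if c not in VALID_BASES)
    ignored = 1 if bad_in_window else 0

    # Slide the window one base at a time across the sequence.
    for i in range(read_len, len(seq)):
        if seq[i] not in VALID_BASES:
            bad_in_window += 1
        if seq[i - read_len] not in VALID_BASES:
            bad_in_window -= 1
        if bad_in_window:
            ignored += 1
    return ignored, invalid_chars


# A single N with read length 3 knocks out the three windows that overlap it.
print(count_ignored_windows("ACGTNACGT", 3))  # (3, 1)
```

Under that assumption, each run of Ns costs a few more windows than it has characters, which is why coverage dips extend up to a read's length on either side of the run.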

Quote:
Originally Posted by fwessely
At the end of the aligning part it says 'computing error frequencies'. What does this mean?
It is outputting the observed frequency of color-space errors at each position along the read. The output should be in the stats folder.
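
For illustration only, a per-position error frequency of this kind could be computed as in the Python sketch below. This is not the SOCS implementation; the input format (pairs of equal-length color strings for the read and the matched reference) and the function name are assumptions.

```python
def error_frequencies(pairs):
    """Fraction of reads that mismatch the reference at each read position.

    pairs: list of (read_colors, reference_colors) strings of equal length.
    Hypothetical helper; assumes every read has the same length.
    """
    if not pairs:
        return []
    read_len = len(pairs[0][0])
    mismatches = [0] * read_len
    for read, ref in pairs:
        for i in range(read_len):
            if read[i] != ref[i]:
                mismatches[i] += 1
    return [m / len(pairs) for m in mismatches]


# Two reads of length 4; only the first one mismatches, at the last position.
print(error_frequencies([("0123", "0120"), ("0123", "0123")]))  # [0.0, 0.0, 0.0, 0.5]
```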

Quote:
Originally Posted by fwessely
Would SOCS-B run faster if all reference files were merged into one multi-FASTA reference file?
The speed shouldn’t be affected by separate files. My only suggestion for efficiency in large genomes is to limit the number of ambiguous matches to keep (assuming you don't need all of them). Each read could map to thousands of places in the whole genome, which affects RAM estimation and can cause multiple "rounds" of alignment when you give it bigger chunks of reads.

Quote:
Originally Posted by fwessely
I struggle to understand the difference between the mismatch sensitivity (s) and the tolerance (t). Could you briefly explain these two parameters? Can I set them independently?
I admit sensitivity and tolerance are confusing. Here's an example: if the sensitivity (-s) is 3, you are guaranteed to find the best alignment in the genome with 3 or fewer color-space mismatches (ignoring bisulfite changes). However, many alignments with 4 or more mismatches will also be found by chance. If a read only has alignments with 4 or more mismatches, you may or may not want to report the best one that was found, since it is not guaranteed to be the best in the whole genome. The threshold for reporting these is set by the tolerance (-t), which should always be at least as high as the sensitivity.
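
The same logic can be written as a small Python sketch. It only restates the rule described here and is not taken from the SOCS source; the function name, the returned labels, and the choice to treat the tolerance as an inclusive cutoff are assumptions.

```python
def classify_alignment(best_mismatches, sensitivity, tolerance):
    """Decide how a read's best found alignment would be treated.

    best_mismatches: color-space mismatches of the best alignment found,
                     or None if nothing was found at all.
    sensitivity:     value of -s (alignments up to this many mismatches
                     are guaranteed to be found).
    tolerance:       value of -t (assumed here to be an inclusive cutoff).
    """
    if tolerance < sensitivity:
        raise ValueError("tolerance should be at least as high as sensitivity")
    if best_mismatches is None:
        return "unmapped"
    if best_mismatches <= sensitivity:
        return "reported (guaranteed best in genome)"
    if best_mismatches <= tolerance:
        return "reported (best found, not guaranteed)"
    return "not reported"


# With -s 3 and -t 5, a read whose best hit has 4 mismatches is still reported.
print(classify_alignment(4, sensitivity=3, tolerance=5))
```

Setting -t equal to -s in this sketch means only the guaranteed alignments are reported.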
All times are GMT -8.