SEQanswers

Old 12-15-2009, 07:16 AM   #1
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
dual base encoding and de novo assembly

AB takes a lot of flak for what is arguably a major strength of their methodology: dual base encoding. Yesterday someone mentioned to me that dual base encoding was only an advantage when one had a reference sequence to map to.

We haven't tried de novo assembly on SOLiD data sets much, so I did not really argue the point at the time. However, in retrospect, I do not see any reason that the benefits of dual base encoding would not play out in de novo assembly. That is, a single base miscall might be recognized as a miscall in the context of other reads assembling into the same contig.

But do color space aware de novo assemblers take advantage of dual base encoding? Anyone know?
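
To make this concrete, here is a minimal Python sketch of dual base encoding (my own illustration of the scheme, not AB's code). Each color corresponds to the XOR of the 2-bit codes of two adjacent bases, so a single miscalled color garbles every decoded base downstream of it, while a true SNP changes exactly two adjacent colors; that asymmetry is the signal an error-aware tool can exploit.

Code:
BASE2BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS2BASE = "ACGT"

def encode(seq):
    # color i is the XOR of the 2-bit codes of bases i and i+1
    colors = "".join(str(BASE2BITS[a] ^ BASE2BITS[b])
                     for a, b in zip(seq, seq[1:]))
    return seq[0] + colors  # the leading base plays the role of the primer base

def decode(cs):
    # each color, XORed with the previous base, yields the next base
    bases = [cs[0]]
    for c in cs[1:]:
        bases.append(BITS2BASE[BASE2BITS[bases[-1]] ^ int(c)])
    return "".join(bases)

ref = encode("ATGGCATT")       # 'A3103130'
snp = encode("ATGGGATT")       # 'A3100230': a real SNP flips TWO adjacent colors
err = ref[:4] + "2" + ref[5:]  # a single miscalled color...
print(decode(err))             # 'ATGGACGG': ...corrupts every base downstream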

--
Phillip
Old 12-15-2009, 08:28 AM   #2
aguffanti
Member
 
Location: Milano, Italy

Join Date: Dec 2008
Posts: 29

Yes, there is a wrapper around Velvet that does something like that; it is available on the SOLiD web site. I am also working on something like this, but sloowly

HTH

Alessandro


Quote:
Originally Posted by pmiguel View Post
But do color space aware de novo assemblers take advantage of dual base encoding? Anyone know?
Old 12-19-2009, 11:15 PM   #3
snetmcom
Senior Member
 
Location: USA

Join Date: Oct 2008
Posts: 158

It definitely does when you build your contigs. I'm still trying to wrap my head around their new tool, but it's supposed to do just that.
http://solidsoftwaretools.com/gf/project/saet/
Old 12-20-2009, 09:13 AM   #4
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317

Quote:
Originally Posted by snetmcom View Post
It definitely does when you build your contigs. I'm still trying to wrap my head around their new tool, but it's supposed to do just that.
http://solidsoftwaretools.com/gf/project/saet/
Whoa! Thanks for pointing that out, snetmcom!

From the blurb given on the page, that is not at all what I thought the SOLiD Accuracy Enhancement Tool did. I thought it was a tool to discard reads judged to contain too many errors.

But after reading the pdf, it looks like you are right. The tool actually appears to be using both quality values and dual-base encoding to correct base-calling errors. Interesting.
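
If I had to guess at the mechanism from the pdf, it sounds like k-mer spectrum error correction applied to the color calls, gated by the quality values. A rough Python sketch of that idea (my speculation about the approach, not AB's actual algorithm; the k-mer size and thresholds here are invented for illustration):

Code:
from collections import Counter

K = 5          # toy k-mer size, for illustration only
MIN_COUNT = 2  # color k-mers seen at least this often are "trusted"
MAX_QV = 10    # per the SAET doc, positions with QV above 10 are left alone

def kmer_counts(reads, k=K):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def trusted(read, counts, k=K):
    # a read is trusted if every one of its k-mers is frequent in the dataset
    return all(counts[read[i:i + k]] >= MIN_COUNT
               for i in range(len(read) - k + 1))

def correct(read, quals, counts, k=K):
    # try single-color substitutions at low-quality positions; accept the
    # first one that makes the whole read consistent with the k-mer spectrum
    if trusted(read, counts, k):
        return read
    for i, qv in enumerate(quals):
        if qv > MAX_QV:
            continue
        for c in "0123":
            if c != read[i]:
                cand = read[:i] + c + read[i + 1:]
                if trusted(cand, counts, k):
                    return cand
    return read  # ambiguous or uncorrectable reads are left unchanged

reads = ["0320310030"] * 3 + ["0320310130"]
counts = kmer_counts(reads)
quals = [20] * 7 + [5, 20, 20]               # one low-quality call at position 7
print(correct("0320310130", quals, counts))  # '0320310030': corrected to consensus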

--
Phillip
Old 12-20-2009, 11:08 PM   #5
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 199

Quote:
Originally Posted by pmiguel View Post
The tool actually appears to be using both quality values and dual-base encoding to correct base-calling errors.
SAET looks interesting indeed. Any SOLiD developers here?
I am curious: if I were to use this tool, would the base-space data after running it be good enough for de novo assembly?

Would you still need to run SAET if you are already doing de novo assembly with a color-space aware program?
Old 12-20-2009, 11:48 PM   #6
aguffanti
Member
 
Location: Milano, Italy

Join Date: Dec 2008
Posts: 29

However, you still need a conversion (in the current state of the art) from color space to 'pseudo' nucleotides in order to use an assembler like Velvet; such assemblers are not color-number aware. Since I am working with colleagues on a port of SSAKE to color space, I was thinking this could be a good opportunity to have the assembler work directly in colors. But I am worried by the initial T ... Anyone here working on de novo assembly with SOLiD?
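
For reference, the usual 'pseudo-nucleotide' trick (often called double encoding) just relabels the color digits 0-3 as A/C/G/T after dropping the primer base and the first color, since that first color depends on the primer rather than on the sample. A minimal sketch of the convention (my illustration, not any particular tool's code):

Code:
COLOR2PSEUDO = str.maketrans("0123", "ACGT")

def double_encode(csfasta_read):
    # drop the primer base and the primer-dependent first color, then relabel
    return csfasta_read[2:].translate(COLOR2PSEUDO)

print(double_encode("T0320310030001120012311330"))
# -> 'TGATCAATAAACCGAACGTCCTTA'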

A simpler idea would be to use SAET to obtain a high-quality dataset, convert the sequences into real nucleotide space following the established rules, and then use a 'traditional' assembler. I think I could use this strategy on a small de novo viral genome. Anybody interested, just give me a whistle.

Alessandro
Old 12-21-2009, 01:18 PM   #7
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Unless I skipped something in reading the documentation, all that SAET does is correct reads with missing (dot/period) color-space calls in them. While that is nice, it is hardly significant. One of our last runs had 678K reads with missing calls out of 65,000K total reads, or a missing-call rate of about 1%.
Old 12-21-2009, 02:16 PM   #8
aguffanti
Member
 
Location: Milano, Italy

Join Date: Dec 2008
Posts: 29

Hi. No, apparently it does more than this, but I am testing it just now with a genome resequencing fragment run.

It is also evident from the doc examples that it actually corrects sequences even without dots:

Input: reads.csfasta
>1015_1635_189_F3_I1
T0320310030001120012311330

Output: reads.csfasta
>1015_1635_189_F3_I1
T0320310030001122012311330
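
A quick way to see exactly which call changed between the two reads (note it is not a dot):

Code:
before = "T0320310030001120012311330"
after  = "T0320310030001122012311330"
print([(i, a, b) for i, (a, b) in enumerate(zip(before, after)) if a != b])
# -> [(16, '0', '2')]: a single color call was changed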

HTH

Alessandro



Quote:
Originally Posted by westerman View Post
...all that SAET does is correct reads with missing (dot/period) color-space calls in them.
Old 12-22-2009, 07:32 AM   #9
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

I see what you mean. That di-base call gets changed, and its quality value goes from 8 to 0.

I must say that I am uncomfortable with the idea of changing data without knowing what the data will eventually be used for. However, it is nice to see SOLiD using the quality values. Corona Lite and, I believe, Bioscope do not take QVs into account. SAET does take QVs into account, and in what seems to be a safe manner. The documentation says "... Positions with quality values above 10 should not be corrected."
Old 12-22-2009, 12:20 PM   #10
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

So I downloaded SAET and put one of our datasets, called 'rojo', through it. This is the one that had 678K beads with dots. Out of 65.4 million beads (this was a partial run), the SAET program corrected 30.6 million. Very interesting! This is a SNP-calling project, so it will be interesting to see what extra SNPs can be called (or are now missing) with the corrected data. I'll report when I can.

BTW, SAET took 4 hours to process the 65.4M reads using a 16-core, high-memory computer.
Old 12-22-2009, 11:27 PM   #11
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 199

Quote:
Originally Posted by westerman View Post

BTW, SAET took 4 hours to process the 65.4M reads using a 16-core, high-memory computer.
Neat. I am doing benchmark testing as well, but on sample data; I am going to generate some simulated data later to test de novo assembly.

How do you get a 16-core machine?
Did you run it on a cluster with PBS? How does one do that?
Old 12-23-2009, 05:46 AM   #12
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

1) You buy a 16-core machine. Ours cost about USD $10,000 with 128 GB of memory. With the cost of the SOLiD itself hovering around $500,000 and the cost of a SOLiD run hovering around $10,000, it becomes quite easy to convince the powers-that-be to throw a bit of money towards computer hardware. If nothing else, I point out that the lab techs can go through $10K of reagents in seconds :-)

2) SAET doesn't run under PBS as far as I can tell; i.e., I believe it runs only on a single machine but can use all of the cores on that single machine. Corona Lite and Bioscope do use PBS.

3) I still don't have SNP calling done on the SAET-corrected data. I was hoping that this process would complete overnight but no such luck. I think that Bioscope's messaging service crashed on me. :-( So ... probably no SNP results until after the upcoming holidays.