SEQanswers

12-10-2010, 09:01 AM   #1
boetsie
Senior Member

Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
SSPACE: a new stand-alone scaffolding tool for small and large genomes

Hi all,

During my Master's thesis I developed SSPACE, a stand-alone tool for scaffolding pre-assembled contigs using paired-read data. I wrote it because the only program I could find for this task was Bambus, and we ran into many problems with Bambus, including errors and complicated input datasets.

Therefore, SSPACE was developed. The main features are:

* Inputs are simple FASTA contig sequences as well as (multiple) FASTA/FASTQ paired-read data
* High-quality scaffolds in a short runtime and limited memory requirements
* Large reduction in the number of contigs by merging them into scaffolds, and a high N50 value
* Multiple library input of both paired-end and/or mate pair datasets
* Possible contig extension of unmapped sequence reads
* Easy interpretation of the final scaffolds
* Visualization of the final scaffolds using GraphViz

SSPACE has been tested on the E. coli, Grosmannia clavigera and giant panda genomes, and produced fewer scaffolds and a higher N50 value than the scaffolds from common de novo assemblers such as ABySS and SOAPdenovo.
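
For readers unfamiliar with the N50 statistic mentioned above, here is a minimal Python sketch of the usual definition: the length of the sequence at which the running total, going from longest to shortest, first reaches half of all bases. The function name and the toy lengths are made up for illustration and are not taken from SSPACE.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical example: the same 300 bases, before and after scaffolding.
contig_lengths = [100, 80, 60, 40, 20]
scaffold_lengths = [180, 100, 20]
print(n50(contig_lengths), n50(scaffold_lengths))  # prints "80 180"
```

Joining contigs into fewer, longer scaffolds raises the N50 even though the total number of bases stays the same, which is why the two numbers are usually reported together.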

SSPACE is freely available at
http://www.baseclear.com/sequencing/...-tools/sspace/

The publication has been accepted by Bioinformatics and will be online soon. It gives more detailed information about the produced scaffolds and their quality, including runtime and memory usage.

I hope it is useful; any comments or questions are of course welcome.

Cheers,
Boetsie
12-19-2010, 11:40 PM   #2
boetsie
Senior Member

Location: NL, Leiden

Join Date: Feb 2010
Posts: 245

Hi all,

The SSPACE publication is now available at:

http://bioinformatics.oxfordjournals...s.btq683.short

Boetsie
12-20-2010, 01:32 AM   #3
ganga.jeena
Member

Location: INDIA

Join Date: Jun 2010
Posts: 15
congrats


It's great to hear of such an achievement.
Is your paper freely available?
Can you mail me a downloadable copy of the software?
Regards,
Ganga Jeena
12-20-2010, 01:38 AM   #4
dan
wiki wiki

Location: Cambridge, England

Join Date: Jul 2008
Posts: 265

Congrats!

Before I get into the paper, can I ask if this tool supports 'hierarchical scaffolding' in the way that Bambus (supposedly) does? i.e. If I want to add in 'scaffolding' information based on gene synteny from some related organisms, can I add that in but with a lower priority than the true PE/MP data?

Does it detect repeats from the graph structure like Bambus does now?

I'm curious because Bambus promises a lot of nice functionality, which is why I keep hammering away at it. However, I'm starting to wonder if it's time to jump ship to a tool that is more robust (if perhaps less feature rich).


Cheers,
Dan.
__________________
Homepage: Dan Bolser
MetaBase the database of biological databases.
12-20-2010, 02:27 AM   #5
dan
wiki wiki

Location: Cambridge, England

Join Date: Jul 2008
Posts: 265

Nice paper! The question that arises is whether we can feed PE data directly to the algorithm, rather than having it shoehorned through Bowtie.

For example, Bowtie may not be the best tool for aligning 454 reads to contigs, but I'd still like to use 454 PE data to scaffold my assembly. Is there some intermediate file or Bowtie-like PE format that we can feed to SSPACE?

Unfortunately parts of http://bioinformatics.oxfordjournals.org are down, so I can't see the supplementary figure, sorry if that would help address my question.
__________________
Homepage: Dan Bolser
MetaBase the database of biological databases.
12-20-2010, 09:13 AM   #6
boetsie
Senior Member

Location: NL, Leiden

Join Date: Feb 2010
Posts: 245

Hi Dan,

thanks for your reply!

It does not fully support the same hierarchical scaffolding as Bambus. We use a simple approach:

1) Produce scaffolds using the first library
2) Use scaffolds from 1), and produce scaffolds using the second library
3) and so on...

We do not assign priorities to the libraries as Bambus does. The user determines the order in which the libraries are used.
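
The library-by-library loop described above can be sketched in Python roughly as follows. This is only a toy illustration, not the SSPACE code: the names `scaffold_once` and `scaffold_with_libraries` are hypothetical, a scaffold is simply a list of contig names, and each library is assumed to have already been reduced to ordered (left, right) contig links, whereas the real tool re-aligns the reads of each library to the current scaffolds.

```python
def scaffold_once(scaffolds, links):
    """Toy stand-in for one scaffolding round.

    A link (left, right) joins the scaffold ending in `left`
    to the scaffold starting with `right`.
    """
    scaffolds = [list(s) for s in scaffolds]
    for left, right in links:
        head = next((s for s in scaffolds if s[-1] == left), None)
        tail = next((s for s in scaffolds if s[0] == right), None)
        if head is not None and tail is not None and head is not tail:
            scaffolds = [s for s in scaffolds if s is not tail]
            head.extend(tail)
    return scaffolds


def scaffold_with_libraries(contigs, libraries):
    """Apply the libraries one after another, in the order given by the user."""
    scaffolds = [[c] for c in contigs]
    for links in libraries:
        # The output of the previous library is the input for the next one.
        scaffolds = scaffold_once(scaffolds, links)
    return scaffolds


# Hypothetical data: a paired-end library first, then a mate-pair library.
contigs = ["c1", "c2", "c3", "c4"]
paired_end = [("c1", "c2"), ("c3", "c4")]
mate_pair = [("c2", "c3")]
print(scaffold_with_libraries(contigs, [paired_end, mate_pair]))
# prints [['c1', 'c2', 'c3', 'c4']]
```

Because every round starts from the scaffolds of the previous one, the order of the libraries can change the final result, which is why that choice is left to the user.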

It detects repeats by counting the incoming and outgoing 'links' between contigs. Repeats are reported by the program.
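
As a rough illustration of the link-counting idea, the Python sketch below flags a contig when it has more than one distinct neighbour on its incoming or outgoing side. The exact rule and threshold used by SSPACE are not given in this thread, so the "more than one neighbour" criterion, the function name and the data are all assumptions.

```python
from collections import defaultdict


def find_repeat_candidates(links):
    """Return contigs with more than one distinct incoming or outgoing partner.

    `links` is an iterable of (source_contig, target_contig) pairs.
    """
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for source, target in links:
        outgoing[source].add(target)
        incoming[target].add(source)
    contigs = set(outgoing) | set(incoming)
    return sorted(
        c for c in contigs
        if len(outgoing[c]) > 1 or len(incoming[c]) > 1
    )


# Hypothetical links: contig "r" sits between two different pairs of neighbours.
links = [("c1", "r"), ("c2", "r"), ("r", "c3"), ("r", "c4"), ("c5", "c6")]
print(find_repeat_candidates(links))  # prints ['r']
```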

Bambus does indeed have more functionality. However, we found that the input options were too complex for simple scaffolding purposes.

Regarding your question about Bowtie:
Unfortunately, only Bowtie is supported at the moment, as SSPACE was designed for Illumina input (or other short paired reads) and is based on Bowtie output.

My question: what program do people use for aligning 454 reads, and can it produce output similar to Bowtie's?

Cheers,
Boetsie

12-20-2010, 11:48 AM   #7
dan
wiki wiki

Location: Cambridge, England

Join Date: Jul 2008
Posts: 265

Thanks for the clear reply Boetsie, really great to hear that you do do repeat filtering based on graph structure, and allowing the user to pick the order of the libraries seems like a nice strategy.

I've been using Newbler to align 454's PE data to contigs. Newbler automatically handles the specifics of the 454 style PE reads so, although it isn't the best aligner for 454, it is very easy to use the results, which are just tab delimited... You can read about the format of the Newbler PE data here!

Newbler can be persuaded to output ace-like format too, but it doesn't do SAM/BAM IIRC.

I was looking at the code, and it should be easy enough to feed in the data to SSPACE ;-)
__________________
Homepage: Dan Bolser
MetaBase the database of biological databases.
12-31-2010, 01:45 PM   #8
sjackman
Member

Location: Vancouver, Canada

Join Date: Mar 2009
Posts: 15

Hi Boetsie,

Does SSPACE use the SAM output format of Bowtie? If not, could it?

Cheers,
Shaun
01-01-2011, 08:05 AM   #9
boetsie
Senior Member

Location: NL, Leiden

Join Date: Feb 2010
Posts: 245

Hi Shaun,

No, it does not; it uses Bowtie's standard output. With some modifications to the script, it should be possible to use the SAM format.

Cheers,
Boetsie
01-12-2011, 03:02 PM   #10
corthay
Member

Location: japan

Join Date: Oct 2008
Posts: 25
BAC / Fosmid end

Hi boetsie,

Can I use additional BAC/Fosmid ends for scaffolding the pre-assembled contigs
or scaffolds with SSPACE? If it's possible, is there any parameter for this purpose?

Thanks,
Corthay
01-13-2011, 11:33 PM   #11
boetsie
Senior Member

Location: NL, Leiden

Join Date: Feb 2010
Posts: 245

Quote:
Originally Posted by corthay View Post
Hi boetsie,

Can I use additional BAC/Fosmid ends for scaffolding the pre-assembled contigs
or scaffolds with SSPACE? If it's possible, is there any parameter for this purpose?

Thanks,
Corthay
Hi Corthay,

I'm not very familiar with BAC/fosmid ends, so there is no parameter for this purpose. However, if:
- these are paired sequences
- the sequences' lengths are below 1024 (maximum input of Bowtie)
- the pairs have either orientation of --> <-- (typical paired-end) or <-- --> (typical mate pair)

If your data satisfies the above points, I see no reason not to give it a try.

Kind regards,
Boetsie
01-13-2011, 11:57 PM   #12
dan
wiki wiki

Location: Cambridge, England

Join Date: Jul 2008
Posts: 265

What would be great is a simple tab delimited format for providing paired sequence alignments, rather than going via Bowtie... I had a quick look at the code, but unfortunately I couldn't work out where to add such functionality easily. I'll have another look at some point if nobody else does.
__________________
Homepage: Dan Bolser
MetaBase the database of biological databases.
01-16-2011, 03:48 PM   #13
corthay
Member

Location: japan

Join Date: Oct 2008
Posts: 25

Hi Boetsie,

Thanks for the response.

I've just specified "k=2", as the clone coverage of the BAC ends is almost 5x.
As a result, the scaffold N50 improved a bit and the number of scaffolds was reduced. Thanks for developing such a useful tool.

Corthay.


01-17-2011, 12:40 AM   #14
boetsie
Senior Member

Location: NL, Leiden

Join Date: Feb 2010
Posts: 245

Quote:
Originally Posted by dan View Post
What would be great is a simple tab delimited format for providing paired sequence alignments, rather than going via Bowtie... I had a quick look at the code, but unfortunately I couldn't work out where to add such functionality easily. I'll have another look at some point if nobody else does.
Hi Dan,

I know what you mean, but then multiple library input can't be used, since we do a hierarchical clustering (first generate scaffolds using one library, then align the next library to those scaffolds to produce new scaffolds, etc.). So for each library we align the reads to the new scaffolds. Therefore, no predefined paired sequence alignments can be provided, unless only one library is used. In addition, with such an input we would be very similar to Bambus. Our purpose is to have an easy-to-use scaffolder that needs only simple FASTA input rather than complex input formats.

Next week, I'll try to add support for another alignment tool (e.g. Newbler) to map long reads to the contigs/scaffolds.

Kind regards,
Boetsie
01-17-2011, 12:41 AM   #15
boetsie
Senior Member

Location: NL, Leiden

Join Date: Feb 2010
Posts: 245

Quote:
Originally Posted by corthay View Post
Hi Boetsie,

Thanks for the response.

I've just specified "k=2", as the clone coverage of the BAC ends is almost 5x.
As a result, the scaffold N50 improved a bit and the number of scaffolds was reduced. Thanks for developing such a useful tool.

Corthay.
Hi Corthay,

great that it worked and that it improved your assembly a bit!

Kind regards,
Boetsie
01-17-2011, 12:58 AM   #16
dan
wiki wiki

Location: Cambridge, England

Join Date: Jul 2008
Posts: 265

Quote:
Originally Posted by boetsie View Post
... since we do an hierarchical clustering ... for each library we align the reads to the new scaffolds, therefore, no predefined paired sequence alignments could be provided ...
What you need to do is track the positions of these features from the input contigs onto the output scaffolds to internally generate a new tab-delimited input file with the right coordinates... I tried doing this with BioPerl, but unfortunately got tied in knots with the cryptic class hierarchy.

In theory it shouldn't be hard to say 'position x on contig y in the input is now position j on scaffold k in the output', and simply run it again for the new library. However, I guess there is quite a bit of complexity to such a code.
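
A minimal Python sketch of that coordinate translation, assuming the scaffolder records where each contig was placed: the `Placement` record, its field names and the example values are hypothetical, positions are 0-based, and a reverse-complemented contig has its coordinates mirrored.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    scaffold: str   # scaffold the contig ended up on
    offset: int     # 0-based start of the contig on that scaffold
    length: int     # contig length in bases
    reverse: bool   # True if the contig was reverse-complemented


def lift(placements, contig, pos):
    """Translate a 0-based position on an input contig to scaffold coordinates."""
    p = placements[contig]
    if not 0 <= pos < p.length:
        raise ValueError(f"position {pos} is outside contig {contig}")
    if p.reverse:
        # Mirror the position when the contig was flipped.
        return p.scaffold, p.offset + p.length - 1 - pos
    return p.scaffold, p.offset + pos


# Hypothetical layout: contig2 is flipped and placed after a 200-base gap.
placements = {
    "contig1": Placement("scaffold1", 0, 1000, False),
    "contig2": Placement("scaffold1", 1200, 500, True),
}
print(lift(placements, "contig1", 10))  # prints ('scaffold1', 10)
print(lift(placements, "contig2", 0))   # prints ('scaffold1', 1699)
```

With such a table, alignments made against the input contigs could be rewritten onto the new scaffolds before the next library is processed, instead of re-aligning the reads.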


Anyway, just a suggestion for improvement of an already useful tool!

Cheers,
Dan.
__________________
Homepage: Dan Bolser
MetaBase the database of biological databases.
01-17-2011, 01:49 AM   #17
boetsie
Senior Member

Location: NL, Leiden

Join Date: Feb 2010
Posts: 245

Quote:
Originally Posted by dan View Post
What you need to do is track the positions of these features from the input contigs onto the output scaffolds to internally generate a new tab-delimited input file with the right coordinates... I tried doing this with BioPerl, but unfortunately got tied in knots with the cryptic class hierarchy.

In theory it shouldn't be hard to say 'position x on contig y in the input is now position j on scaffold k in the output', and simply run it again for the new library. However, I guess there is quite a bit of complexity to such a code.


Anyway, just a suggestion for improvement of an already useful tool!

Cheers,
Dan.
Hi Dan,

first of all, thank you for the suggestions and the positive feedback!

I see what you mean, and I think it would indeed be useful to allow other input formats. As a start, it would be nice to allow .sam format inputs.

As for remembering the positions, I'm already doing much the same by tracking which contigs are on which scaffolds after each library. I think the same trick could be applied to the mapping.
I'll see what I can do.

Thanks,
Boetsie
01-18-2011, 11:49 PM   #18
corthay
Member

Location: japan

Join Date: Oct 2008
Posts: 25

Hi boetsie again,

I would like to ask whether only uniquely mapped reads are used for the scaffolding.

If not, I am planning to mask repeat sequences before scaffolding.

Thanks,
Corthay
01-19-2011, 01:41 AM   #19
boetsie
Senior Member

Location: NL, Leiden

Join Date: Feb 2010
Posts: 245

Quote:
Originally Posted by corthay View Post
Hi boetsie again,

I would like to ask whether only uniquely mapped reads are used for the scaffolding.

If not, I am planning to mask repeat sequences before scaffolding.

Thanks,
Corthay
Hi again

I do indeed use only reads that map uniquely to a single position across all the contigs. I use Bowtie's -m 1 option (see http://bowtie-bio.sourceforge.net/ma...html#reporting). Otherwise, it would be impossible to know which link should be made when a read maps to multiple contigs.
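
To make the effect of that filter concrete, here is a small Python sketch that keeps only reads with exactly one reported alignment. It mimics the outcome of the filtering on an already-parsed list of hits; the tuple layout and the function name are assumptions for illustration, not SSPACE's or Bowtie's actual data structures.

```python
from collections import Counter


def uniquely_mapped(alignments):
    """Keep alignments of reads that map to exactly one position.

    `alignments` is an iterable of (read_id, contig, position) tuples.
    """
    alignments = list(alignments)
    hits = Counter(read_id for read_id, _, _ in alignments)
    return [a for a in alignments if hits[a[0]] == 1]


# Hypothetical hits: read r2 maps to two contigs and is therefore dropped.
alignments = [
    ("r1", "c1", 10),
    ("r2", "c1", 50),
    ("r2", "c7", 300),
    ("r3", "c2", 5),
]
print(uniquely_mapped(alignments))
# prints [('r1', 'c1', 10), ('r3', 'c2', 5)]
```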

Is this what you mean?

Kind regards,
Boetsie
01-27-2011, 02:54 PM   #20
corthay
Member

Location: japan

Join Date: Oct 2008
Posts: 25

Hi boetsie,

Thanks for your quick reply. I understand how uniqueness is guaranteed.
I have two more questions, please.

First, I am wondering why the total number of non-N bases in the scaffolds increases even though I set the "-x" option to 0.

Second, how do you calculate the distance between reads within a given contig pair?
Do you estimate the gap size using the reads, or is the gap size just ignored?

Sorry for asking so many questions.

Thanks
Corthay.

