SEQanswers

Old 06-10-2011, 11:38 AM   #1
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Is SSPACE good for Abyss assemblies?

Has anyone used SSPACE to scaffold Abyss data? Abyss already produces a .adj and a .dot file which might be as good as the scaffold is going to get.

Opinions?

--
Phillip
Old 06-10-2011, 08:13 PM   #2
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Originally Posted by pmiguel View Post
Has anyone used SSPACE to scaffold Abyss data? Abyss already produces a .adj and a .dot file which might be as good as the scaffold is going to get.

Opinions?

--
Phillip

Ray (since v1.4.0) now includes a scaffolder (it is pretty good).

See http://denovoassembler.sourceforge.net/ (open source and well-documented !)

p.s.: I am the author of Ray (I am a PhD student).

Old 06-11-2011, 05:52 AM   #3
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Quote:
Originally Posted by pmiguel View Post
Has anyone used SSPACE to scaffold Abyss data? Abyss already produces a .adj and a .dot file which might be as good as the scaffold is going to get.

Opinions?

--
Phillip
I've only tested ABYSS contigs myself for the E.coli dataset, and there it gave some very good results. I do recommend filtering out small contigs (e.g. keeping only those larger than 100 or 200bp), since smaller contigs are likely to be repeats or misassembled contigs.

For E.coli, scaffolding of contigs with a minimum length of 100bp reduced 595 contigs to 127 scaffolds. In addition, the N50 went from 18k to 94k. I've tested these scaffolds with MUMmer and all were valid.
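
The filtering step and the N50 statistic mentioned above can be sketched in a few lines of Python. This is an illustration only, not SSPACE's code: the function names are hypothetical, and N50 is computed with the usual definition (the contig length at which the cumulative sum of lengths, taken from longest to shortest, reaches half of the total).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len):
    """Keep only records whose sequence is at least min_len bases long."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def n50(lengths):
    """Length of the contig at which the running total reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy input (hypothetical contigs, not data from this thread).
toy = [">a", "ACGTACGT", ">b", "AC", ">c", "ACGTAC"]
kept = filter_by_length(read_fasta(toy), min_len=5)
print([h for h, _ in kept], n50([len(s) for _, s in kept]))
```

For a real assembly, `read_fasta` could be given an open file handle for the contig FASTA, with `min_len` set to 100 or 200 as suggested above.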

I must say, I am the developer of SSPACE, so I'm a bit biased.

Another post I found about ABYSS and SSPACE:

http://groups.google.com/group/abyss...505cff5cb974bd

Kind regards,
Boetsie
Old 06-14-2011, 04:43 AM   #4
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Default

Quote:
Originally Posted by seb567 View Post
Ray (since v1.4.0) now includes a scaffolder (it is pretty good).

See http://denovoassembler.sourceforge.net/ (open source and well-documented !)

p.s.: I am the author of Ray (I am a PhD student).

Hi Seb567,
We did try Ray. Maybe we did not configure the Ray assembly correctly, but our Abyss results looked much better. For instance the following command:
/programs/Ray-1.4.0/code/Ray \
-k 43 \
-i ../FastQ/000617_TL3360_both.fastq \
-o 000617_TL3360

produced ~3400 contigs ranging from 130 bp to 8.6 kb, whereas Abyss produced 137 contigs ranging from 41 to 450165 bp using a similar kmer size (41).

These were 2x100 bp reads from ~350bp fragment PEs -- about 200x coverage. The DNA was from the bacterium Salmonella.

--
Phillip
Old 06-14-2011, 05:06 AM   #5
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Default

Quote:
Originally Posted by boetsie View Post
I've only tested ABYSS contigs myself for the E.coli dataset, and there it gave some very good results. I do recommend filtering out small contigs (e.g. keeping only those larger than 100 or 200bp), since smaller contigs are likely to be repeats or misassembled contigs.

For E.coli, scaffolding of contigs with a minimum length of 100bp reduced 595 contigs to 127 scaffolds. In addition, the N50 went from 18k to 94k. I've tested these scaffolds with MUMmer and all were valid.

I must say, I am the developer of SSPACE, so I'm a bit biased.
[...]

Kind regards,
Boetsie
Hi Boetsie,

Yes, I should try it.

After Abyss alone, our N50 for contigs >200 bases is already 17.5kb. (77 contigs, range 214-389830 bases, mean 58691 bases.) This was with setting the kmer higher (63) than the example I gave in the post above.

I will post here the results after SSPACE.

--
Phillip
Old 06-14-2011, 12:32 PM   #6
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Default

Hi Boetsie,
Okay I ran SSPACE. Only one mysterious glitch in getting it to run (described below). I filtered my contigs by removing any shorter than 200 bases prior to running. Here are the initial and final results:

Inserted contig file;
Total number of contigs = 77
Sum (bp) = 5456937
Max contig size = 389830
Min contig size = 214
Average contig size = 70869
N50 = 225952

After extension;
Total number of contigs = 77
Sum (bp) = 5456953
Max contig size = 389830
Min contig size = 222
Average contig size = 70869
N50 = 225952

After scaffolding lib1:
Total number of scaffolds = 69
Sum (bp) = 5457073
Max scaffold size = 389830
Min scaffold size = 680
Average scaffold size = 79088
N50 = 226679

Overall an increase of >10% in the average scaffold length over the initial contigs. Not bad! Actually I think I am likely coming up against a hard limit imposed by our library insert size.
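
The size of the gain can be checked directly from the SSPACE summaries above. A minimal sketch in Python, using only the numbers reported in this post (the function and variable names are made up for illustration):

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values copied from the "Inserted contig file" and "After scaffolding lib1" summaries.
initial = {"count": 77, "average": 70869, "n50": 225952}
final = {"count": 69, "average": 79088, "n50": 226679}

avg_gain = percent_change(initial["average"], final["average"])
n50_gain = percent_change(initial["n50"], final["n50"])

print(f"sequences: {initial['count']} -> {final['count']}")
print(f"average size: {avg_gain:+.1f}%")
print(f"N50: {n50_gain:+.2f}%")
```

With these figures the average sequence length rises by about 11.6%, while the N50 changes by only about 0.3%.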

Also it ran fast -- just a minute or two with -x 1 set.

I did have one problem getting it to run. It took me about 30 minutes with the perl debugger to track down the issue. So I'll describe it and the simple solution for anyone googling the warning SSPACE gave. The warning was:

Bowtie-build error; -1 at /bin/SSPACE/SSPACE-1.1_linux-x86_64/bin/mapWithBowtie.pl line 37.
WARNING: No scaffolding, because no reads found on contigs


This turned out to be because mapWithBowtie.pl was getting a permissions error when it attempted to run bowtie-build via a system call. So

chmod +x /bin/SSPACE/SSPACE-1.1_linux-x86_64/bowtie/bow*

fixed the issue. That is, the programs in the bowtie subdirectory needed to be given execute permission.
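
For anyone who prefers to script the check, here is a rough Python equivalent of the chmod command above. It is a sketch under assumptions: the function name `make_executable` and the `bow` prefix filter are illustrative choices, and it sets the execute bit for user, group and others on matching regular files.

```python
import os
import stat

EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def make_executable(directory, prefix="bow"):
    """Add execute permission to regular files in `directory` whose names
    start with `prefix`. Returns the names of the files that were changed."""
    changed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not name.startswith(prefix) or not os.path.isfile(path):
            continue
        mode = os.stat(path).st_mode
        if (mode & EXEC_BITS) != EXEC_BITS:
            # Keep the existing permission bits and add the execute bits.
            os.chmod(path, mode | EXEC_BITS)
            changed.append(name)
    return changed
```

With the install location from this post, the call would be `make_executable("/bin/SSPACE/SSPACE-1.1_linux-x86_64/bowtie")`; the returned list shows which programs were missing the execute bit.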

--
Phillip
Old 06-14-2011, 03:02 PM   #7
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Hi Phillip,

Your results look OK; <70 contigs with only one paired-end library of 200bp is very good. I think there is not much more to gain from this library. The remaining contigs are probably repeats (especially the small contigs) or contigs/scaffolds that could not be combined with each other because the library insert size is too small.

For example with E.coli we went from 127 to 89 scaffolds with a paired-end 500, and then to 9 scaffolds with a mate pair 2kb.

I'm aware of the permissions problem, and I thought I had fixed it, but apparently I had not. The next release will hopefully not contain this error. Thanks for mentioning it!

regards,
Boetsie
Old 06-15-2011, 04:04 AM   #8
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Default

Hi Boetsie,
Actually the new TruSeq DNA library protocol recommends fragmenting DNA to a mean length of 300-400 bases for genomic DNA. Since our resulting sequence was at or above specifications for the instrument, I think the larger insert sizes are the way to go by default.
Thanks for the info about the effect of mate end (ME) reads. I did not have any for this bacterium. We do have some for a fungal genome we assembled. But they are 454 MEs. We are giving those a shot.

--
Phillip
Old 06-15-2011, 08:43 AM   #9
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Originally Posted by pmiguel View Post
Hi Seb567,
We did try Ray. Maybe we did not configure the Ray assembly correctly, but our Abyss results looked much better. For instance the following command:
/programs/Ray-1.4.0/code/Ray \
-k 43 \
-i ../FastQ/000617_TL3360_both.fastq \
-o 000617_TL3360

produced ~3400 contigs ranging from 130 bp to 8.6 kb, whereas Abyss produced 137 contigs ranging from 41 to 450165 bp using a similar kmer size (41).

These were 2x100 bp reads from ~350bp fragment PEs -- about 200x coverage. The DNA was from the bacterium Salmonella.

--
Phillip
What is the content of these files:

000617_TL3360.CoverageDistributionAnalysis.txt
000617_TL3360.LibraryStatistics.txt

Thank you.
Old 06-15-2011, 09:02 AM   #10
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Default

Quote:
Originally Posted by seb567 View Post
What is the content of these files:
000617_TL3360.CoverageDistributionAnalysis.txt
MinimumCoverage: 46
PeakCoverage: 159
RepeatCoverage: 160
Percentage of vertices with coverage 1: 87.6321%
DistributionFile: 000617_TL3360.CoverageDistribution.txt


Quote:
Originally Posted by seb567 View Post
000617_TL3360.LibraryStatistics.txt
File: ../FastQ/000617_TL3360_both.fastq
NumberOfSequences: 13001302

Total: 13001302

NumberOfPairedLibraries: 1

LibraryNumber: 0
InputFormat: Interleaved,Paired
DetectionType: Automatic
File: ../FastQ/000617_TL3360_both.fastq
NumberOfSequences: 13001302
AverageOuterDistance: 385
StandardDeviation: 628
DetectionFailure: Yes

--
Phillip
Old 06-15-2011, 09:31 AM   #11
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Originally Posted by pmiguel View Post
MinimumCoverage: 46
PeakCoverage: 159
RepeatCoverage: 160
Percentage of vertices with coverage 1: 87.6321%
DistributionFile: 000617_TL3360.CoverageDistribution.txt




File: ../FastQ/000617_TL3360_both.fastq
NumberOfSequences: 13001302

Total: 13001302

NumberOfPairedLibraries: 1

LibraryNumber: 0
InputFormat: Interleaved,Paired
DetectionType: Automatic
File: ../FastQ/000617_TL3360_both.fastq
NumberOfSequences: 13001302
AverageOuterDistance: 385
StandardDeviation: 628
DetectionFailure: Yes

--
Phillip

The CoverageDistributionAnalysis.txt file points to a bad detection of the repeat coverage, so nothing will work correctly for sure after that.

MinimumCoverage: 46
PeakCoverage: 159
RepeatCoverage: 160 <----

Can you put the content of 000617_TL3360.CoverageDistribution.txt on http://pastebin.com/ and link it here ?
Old 06-15-2011, 10:48 AM   #12
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Default

Quote:
Originally Posted by seb567 View Post
The CoverageDistributionAnalysis.txt file points to a bad detection of the repeat coverage, so nothing will work correctly for sure after that.

MinimumCoverage: 46
PeakCoverage: 159
RepeatCoverage: 160 <----

Can you put the content of 000617_TL3360.CoverageDistribution.txt on http://pastebin.com/ and link it here ?
http://pastebin.com/sBQ6k4NY

Thanks
--
Phillip
Old 06-15-2011, 11:58 AM   #13
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Originally Posted by pmiguel View Post
OK, problem solved.

This is your coverage distribution:
http://i.imgur.com/caicf.png

However, it confuses Ray because it is going up and down near the inflection point:

142 1002
143 2012
144 432
145 1032
146 1098
147 1088
148 1166
149 1454
150 778
151 1122
152 1146
153 -720
154 424
155 192
156 -64
157 552
158 418
159 -406 Peak Coverage
160 164
161 -826
162 -434
163 -190
164 -124
165 26
166 1014
167 -1100
168 -562
169 -1376
170 -1288
171 -336
172 -984
173 -500
174 -1064

I added data smoothing and it fixes the problem.

File= /home/boiseb01/coverage-pmiguel
MinCoverage= 45
PeakCoverage= 158
RepeatCoverage= 290
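
To illustrate why smoothing helps, here is a minimal sketch in Python. It is not Ray's actual code (see the commit linked below for that); the centered moving average and the window of 5 are assumptions chosen for illustration. It counts how often the values in the second column of the listing above change sign before and after smoothing.

```python
def moving_average(values, window=5):
    """Centered moving average; the window is truncated at both ends of the list."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def sign_changes(values):
    """Count how often consecutive values switch between positive and negative."""
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0)


# Second column of the listing above, for coverage values 142 to 174.
deltas = [
    1002, 2012, 432, 1032, 1098, 1088, 1166, 1454, 778, 1122, 1146,
    -720, 424, 192, -64, 552, 418, -406, 164, -826, -434, -190, -124,
    26, 1014, -1100, -562, -1376, -1288, -336, -984, -500, -1064,
]

raw = sign_changes(deltas)
smooth = sign_changes(moving_average(deltas, window=5))
print(f"sign changes: raw={raw}, smoothed={smooth}")
```

On this listing the raw values change sign 9 times and the smoothed values only 3 times, so a peak detector has far fewer spurious turning points to choose from.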


https://github.com/sebhtml/ray/commit/6590dd022

https://github.com/sebhtml/ray/tarball/v1.6.1-rc1


seb
Old 06-16-2011, 01:27 AM   #14
SLB
Member
 
Location: Ireland

Join Date: Sep 2010
Posts: 21
Default

Hi,

I have used SSPACE with abyss output after assembly with 180 and 550 PE libraries. I filtered for contigs > 200 and below is the output from SSPACE. I have a quick question about the output relating to repeats. After scaffolding with the final library I get the following:
Number of repeats = 14553
Total size of repeats = 1494450560
What do these figures relate to? It's funny, because if I add the total size of repeats to the total size of the scaffolded assembly after the final library is added I get 1494450560 + 1149222136 = 2643672696, which is the estimated size of my genome!


Inserted contig file;
Total number of contigs = 440783
Sum (bp) = 657546051
Max contig size = 39800
Min contig size = 200
Average contig size = 1491
N50 = 3535

After scaffolding lib1: 3kb
Total number of scaffolds = 326357
Sum (bp) = 844894494
Max scaffold size = 102863
Min scaffold size = 200
Average scaffold size = 2588
N50 = 10046

After scaffolding lib2: 5kb
Total number of scaffolds = 266348
Sum (bp) = 993616335
Max scaffold size = 164536
Min scaffold size = 200
Average scaffold size = 3730
N50 = 17281

After scaffolding lib3: 10kb
Total number of scaffolds = 232199
Sum (bp) = 1149222136
Max scaffold size = 303516
Min scaffold size = 200
Average scaffold size = 4949
N50 = 29100
Old 06-16-2011, 03:50 AM   #15
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

It's a complicated calculation, but basically it counts the number of contigs that are linked left, and the number of contigs that are linked right from the contig.

Say that contigA has three contigs that are linked left and two contigs linked right. The repeat is the highest number of links, thus here 3. This contig is thus said to be repeated 3 times in the assembly.
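
The counting rule described above can be sketched in Python as follows. This is only an illustration of the idea, not SSPACE's own code; the function name and the contig names other than contigA are hypothetical.

```python
def repeat_count(left_links, right_links):
    """Repeat count of a contig: the larger of its number of left-linked
    and right-linked contigs."""
    return max(len(left_links), len(right_links))


# contigA from the example: three contigs linked on the left, two on the right.
links = {
    "contigA": {
        "left": ["contigB", "contigC", "contigD"],
        "right": ["contigE", "contigF"],
    },
}

for name, sides in links.items():
    print(name, repeat_count(sides["left"], sides["right"]))
```

For contigA this gives 3, matching the example in the text.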

Have a look at the *.repeat file in the intermediate_results folder. Here, all repeats are listed.

Remember though, that one of the repeated elements is also included in the final assembly, so the repeats should be subtracted from the final scaffolds. So if contigA is repeated 4 times with a size of 1300bp, the 1300bp should be subtracted from the final assembly, since the contig is already present within the scaffolds.

To improve your assembly, try to include the PE libraries in SSPACE too. Scaffolding with a combination of paired-end and mate-pair libraries is very powerful.

Boetsie
Old 06-16-2011, 04:45 AM   #16
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Default

Quote:
Originally Posted by boetsie View Post
[...]
Remember though, that one of the repeated elements is also included in the final assembly, so the repeats should be subtracted from the final scaffolds. So if contigA is repeated 4 times with a size of 1300bp, the 1300bp should be subtracted from the final assembly, since the contig is already present within the scaffolds.
[...]
Boetsie
Hi Boetsie,
If there were a repetitive element present in 10 copies in a genome that assembled into a single contig, would SSPACE only place a single copy of that element in the final assembly? Or am I misreading you?
--
Phillip
Old 06-16-2011, 05:10 AM   #17
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Quote:
Originally Posted by pmiguel View Post
Hi Boetsie,
If there were a repetitive element present in 10 copies in a genome that assembled into a single contig, would SSPACE only place a single copy of that element in the final assembly? Or am I misreading you?
--
Phillip
In short, yes
Old 06-16-2011, 06:15 AM   #18
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Default

Quote:
Originally Posted by seb567 View Post
Hi seb,
We still get the "DetectionFailure: Yes" line in the ".LibraryStatistics.txt" file.
Also the ".RayVersion.txt" file gives "Ray version: 1.6.0". So maybe the link above is to an older version?

--
Phillip
Old 06-16-2011, 06:34 AM   #19
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Default

Quote:
Originally Posted by boetsie View Post
In short, yes :)
Just wondering what the current state of the art is in full genome assembly...

If there were 10 identical copies of a 1000 bp element scattered across an otherwise single-copy genome, would an assembler be able to reconstruct the genome without gaps? Say none of the elements were near each other and sufficient mate end coverage existed. That is, 30X coverage with 2 kb ME reads.

In principle it seems it should be possible, but I don't know if modern assemblers would do so.

If not, would SSPACE reconstruct a gapless genome, or would it still produce a set of 10 scaffolds with one copy of the repetitive element?

--
Phillip
Old 06-16-2011, 06:42 AM   #20
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

If the library is larger than the repeated element, SSPACE will probably generate a single scaffold, though with gaps. The repeated contig will be present only once though.
If the library is smaller than the repeated element, SSPACE will generate 10 scaffolds, and the repeated contig will be present in only one of them.

I'm not sure how other assemblers/scaffolders handle this, or whether they include all repeats.

The remaining gaps can be filled through gap closing. Currently, I'm working on a script to do this.

Boetsie