
**Kennels** · 08-28-2011, 11:49 PM

Dear boetsie,

On another note, for my -x 0 output, I am noticing that an 'n' is being reported instead of lower case nt's when there is a clear overlap between adjacent contigs in the scaffold output.

For example, the first two contigs of a scaffold are:
>contig29
....TTCTTTTTCTTCCCATCTTCAGCCTTCTTAGCTTCGGCTTCCTCCCTCTCTTTCAACACAACAAGGGCAT
>contig15
TTCTTTTTCTTCCCATCTTCAGCCTTCTTAGCTTCGGCTTCCTCCCTCTCTTTCAACACAACAAGGGCAT....

There is a clear 70nt overlap, however the final.scaffold output is putting an 'n' between these two contigs. These are the details of the scaffold:
>scaffold1.1|size6918|tigs5
f_tig29|size803|links11|gaps-58
f_tig15|size4884|links6|gaps-99
f_tig13|size173|links6|gaps-24|merged38
f_tig24|size387|links45|gaps-52|merged45
f_tig8|size752

However it is reporting lowercase nts between contig 13/24 and 24/8, so I'm wondering if there is some kind of threshold '-n' which determines what is reported.

My invocation was:
perl /usr/local/bin/SSPACE-1.1_linux-x86_64/SSPACE_v1-1.pl -l library.txt -s contigsassem71.fa -x 0 -m 50 -o 20 -n 15 -p 1 -v 1 -b pass11_sspace

Sorry for the spate of questions.
Thanks!
Kennels

**boetsie** · 08-29-2011, 03:22 AM

Hi Kennels,

Try running SSPACE with -x 1 and -v 0. -v is the verbose option, but it will give you lots of lines of intermediate steps, including the reads used for extension; I don't think you want this. I'm familiar with the main::Dumper error; it is fixed in the premium version, and I'll fix it in the new basic version too.

Just run with -v 0 and everything should work fine.

Regards,
Boetsie

Originally posted by Kennels:

Hi boetsie

I'm eager to use your program after having to 'manually' extend contigs using combinations of patman/bowtie/velvet processes.

I initially had a bowtie-build error which was resolved by giving chmod a+x to all the files in the SSPACE subdirectories. I am using the latest version, v1.1.

Unfortunately I'm getting another error when i use the -x 1 option.

######################################
Finished Collecting Overlapping Reads - BUILDING CONSENSUS...
Undefined subroutine &main::Dumper called at /usr/local/bin/SSPACE-1.1_linux-x86_64/bin/ExtendOrFormatContigs.pl line 212, <IN> line 8.

LIBRARY pass7
------------------------------------------------------------

=>Mon Aug 29 13:44:56 2011: Building Bowtie index for contigs (tmp.pass7_sspace/subset_contigs.fasta)
Warning: Empty input file
Reference file does not seem to be a FASTA file
Command: /usr/local/bin/SSPACE-1.1_linux-x86_64/bowtie/bowtie-build --quiet --noref tmp.pass7_sspace/subset_contigs.fasta bowtieoutput/pass7_sspace.pass7.bowtieIndex
#######################################

I can't find the 'tmp.pass7_sspace/subset_contigs.fasta' file anywhere, but perhaps this has something to do with the undefined subroutine &main::Dumper? Also, I do have many unmapped reads, so I'm thinking it should be able to extend?

When I use the -x 0 option however, I am able to finish with no problems. I don't think I have any problems with my inputs.

My invocation was:
perl /usr/local/bin/SSPACE-1.1_linux-x86_64/SSPACE_v1-1.pl -l library.txt -s contigs.fa -x 1 -m 50 -o 20 -p 1 -b pass7_sspace -v 1

Could you comment?
Thank you,
kennels

**boetsie** · 08-29-2011, 03:26 AM

Hmmm, I've looked into the code and I see that it only searches for overlaps of up to 50 bp. I was not aware of this limitation; I'll fix this in the new release.

If you would like them to be fixed immediately, send me a personal message, so I can send you the code by e-mail.

Sorry for these small bugs, and thanks for mentioning them!

Boetsie
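
To illustrate why an upper bound on the overlap search would hide the 70 nt overlap reported above, here is a minimal Python sketch. It is not SSPACE's code: the function `longest_overlap` and its `max_len` parameter are hypothetical, and the only data used is the shared 70 nt sequence from Kennels' post. An exact suffix/prefix comparison limited to 50 nt finds nothing, because the last 50 nt of the shared region differ from its first 50 nt.

```python
def longest_overlap(left, right, min_len=15, max_len=None):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    `min_len` plays the role of a minimum overlap (like -n); `max_len` is a
    hypothetical cap on the overlap sizes that are tried. Returns 0 if no
    overlap in the allowed size range exists.
    """
    upper = min(len(left), len(right))
    if max_len is not None:
        upper = min(upper, max_len)
    for size in range(upper, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


# The 70 nt shared by the end of contig29 and the start of contig15.
shared = "TTCTTTTTCTTCCCATCTTCAGCCTTCTTAGCTTCGGCTTCCTCCCTCTCTTTCAACACAACAAGGGCAT"

print(longest_overlap(shared, shared, max_len=50))  # 0: capped search misses it
print(longest_overlap(shared, shared))              # 70: uncapped search finds it
```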

**narain** · 08-29-2011, 04:26 AM

Dear Boetsi

I tried running SSPACE on the same data with different 'n' parameter values. However, there is no difference in the results for n=3, 5 or 15 (default). In all cases the N50 of the scaffolds comes to around 1995, and all other characteristics, such as the median, sum and maximum length, are the same as well. This was done on contigs generated by ABySS.

To compare it with the scaffolder in SOAPdenovo, I generated contigs with SOAPdenovo and scaffolded them with the SOAPdenovo scaff tool as well as with SSPACE. The N50 and other evaluation criteria are much better for SOAPdenovo! The N50 using SOAPdenovo scaff came to about 21,653 and that with SSPACE to only about 2,677! I am using the default values of k=5 and a=0.7 with n=15. I have tried changing n, as stated in my first paragraph, and there is no advantage in doing so. Do you recommend any different values for k and a?

Aby

**boetsie** · 08-29-2011, 04:51 AM

Changes to the -n parameter will not influence the N50 much. The -n parameter is only used for merging two contigs next to each other (thus removing gaps). If two contigs are merged it will decrease the N50 instead of increasing it.

As stated in my previous post, it is important that sufficient paired reads map to the contigs. If not many paired reads map, you should lower the -k value to, for example, 3 (or even 2), especially since, as you stated before, you have low coverage. Another option may be to trim your reads to remove erroneous nucleotides.
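
For readers comparing the N50 values quoted in this thread, here is a minimal Python sketch of how N50 is usually computed: the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the example lengths are made up for illustration and do not come from any dataset discussed here.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where at least half of all bases are covered.
        if 2 * running >= total:
            return length
    return 0


# Hypothetical scaffold lengths, not taken from the thread.
print(n50([100, 200, 300, 400]))  # 300
```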

**narain** · 08-29-2011, 05:25 AM

Dear Boetsie

Thank you for your suggestion. I will try reducing the 'k' parameter to 2 and 3. Do you recommend any changes to the 'a' parameter value?

What exactly is the 'n' parameter useful for?

Aby

**boetsie** · 08-29-2011, 05:48 AM

You could decrease the -a value to 0.5 (meaning that, if multiple links are found, the best candidate should have at least 2 times more links than the next one).

The -n parameter is useful for merging two contigs. Say you have contigA and contigB, they are scaffolded with a gap of -20bp. Then SSPACE will search for an overlap of -n or more nucleotides:

contigA
AGATGATATAAAAGTATAGATTA
contigB
ATAAAAGTATAGATTAGGGGTTATGATA

overlap:
AGATGATATAAAAGTATAGATTA
-------ATAAAAGTATAGATTAGGGGTTATGATA

So if the size of the overlap is at least the defined -n parameter, they are merged together:
AGATGATATAAAAGTATAGATTAGGGGTTATGATA

regards,
Boetsie
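
The merge step described above can be sketched in a few lines of Python. This is only an illustration of exact suffix/prefix merging, not SSPACE's implementation; the function name `merge_if_overlapping` is hypothetical, and treating -n as an inclusive minimum ("-n or more nucleotides") is an assumption.

```python
def merge_if_overlapping(left, right, min_overlap=15):
    """Merge `right` onto `left` when they share an exact overlap.

    The overlap is a suffix of `left` that equals a prefix of `right` and is
    at least `min_overlap` long (the role played by -n). The longest such
    overlap is used. Returns the merged sequence, or None if there is none.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


contig_a = "AGATGATATAAAAGTATAGATTA"
contig_b = "ATAAAAGTATAGATTAGGGGTTATGATA"

# The two contigs share 16 nt, so they merge with a minimum overlap of 15 ...
print(merge_if_overlapping(contig_a, contig_b, min_overlap=15))
# ... but stay separate (None) when 20 nt are required.
print(merge_if_overlapping(contig_a, contig_b, min_overlap=20))
```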

**narain** · 08-30-2011, 05:12 AM

Dear Boetsie

As per your suggestion I ran SSPACE with the k parameter lowered from 5 to 2. The N50 dropped further, from 1995 to 1424! The value of a was 0.7 and n was 10, as before. Did you rather mean to increase the value of k?

Aby

**boetsie** · 08-30-2011, 06:12 AM

Hi Aby,

well, this I should have expected, since you said that your coverage was very low. Say you first had a nice scaffold with five links between two contigs, but now, with the lower -k, another contig can also be linked with four links; the ratio will be 4/5 = 0.8 (thus above your -a 0.7). This way, fewer scaffolds are formed. You could increase the -a option, but then you should wonder how reliable your scaffolds are!

Could you maybe send me your summary file and library file (by personal message or to my private e-mail), so I can hopefully help you further?

Regards,
Boetsie
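
The interaction between -k and -a in the example above can be sketched as follows. This is a simplified Python model of the decision as described in this post, not SSPACE's code: the function `choose_partner`, its parameter names and the contig names are hypothetical, and rejecting a pairing only when the ratio is strictly above the -a value is an assumption.

```python
def choose_partner(link_counts, min_links=5, max_ratio=0.7):
    """Pick a unique scaffolding partner from {contig name: number of links}.

    Candidates with fewer than `min_links` links (the role of -k) are ignored.
    If the second-best candidate has more than `max_ratio` (the role of -a)
    times the links of the best one, the choice is ambiguous and None is
    returned.
    """
    candidates = sorted(
        ((count, name) for name, count in link_counts.items() if count >= min_links),
        reverse=True,
    )
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0][1]
    best, second = candidates[0], candidates[1]
    if second[0] / best[0] > max_ratio:
        return None
    return best[1]


links = {"contigB": 5, "contigC": 4}  # hypothetical link counts from the example
print(choose_partner(links, min_links=5, max_ratio=0.7))  # contigB
print(choose_partner(links, min_links=2, max_ratio=0.7))  # None (4/5 = 0.8 > 0.7)
```

With `min_links=5` the four-link candidate is never considered, so the five-link pairing is accepted. Lowering the threshold to 2 brings the second candidate into play and the pairing is rejected as ambiguous, which matches the drop in N50 reported above.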

**narain** · 08-30-2011, 06:33 AM

Dear Boetsie

I have 90 bp paired-end reads, about 110 GB in total, for the human genome. This is approximately 15x coverage, which is slightly less than what most assemblers look for (20x or more). With the decrease in the k parameter value there is a decrease in N50, which is not a good sign. Indeed, reliability of the scaffolds generated is of utmost importance. I am sending you the logfile and the summary file as e-mail attachments. Do you still suggest that I go for a lower value of a? I think if I need bigger scaffolds, I should go for a bigger value of a, say 0.9 or higher. Should I do that in combination with a higher value of k? How much higher can I set k, and what is your suggestion?

Aby

**boetsie** · 09-01-2011, 01:19 AM

I've added a new Basic version at http://www.baseclear.com/landingpages/sspacev12/

Main improvements are;
- searches now for overlaps larger than 50bp as suggested by Kennels
- merge-information is now correct in the evidence file. If multiple libraries were used, the merge information of previous libraries was not included in the final evidence file.
- Solved the error of 'main: Dumper not found', if -x 1 and -v 1 are set.
- now able to allow gaps when mapping the reads against the contigs with the -g option. -g 1 allows one gap; the maximum is 3 gaps.

Boetsie

**kulikov** · 09-03-2011, 03:14 AM

No pairs found

Hello Boetsie,

I'm now playing with SSPACE and I'm getting some strange output. I have two files with contigs -- say, contigs1.fasta and contigs2.fasta. They were output by the same assembler on the same data set (E.coli reads). The second file has some contigs from the first file glued together. For some reason, SSPACE successfully scaffolds contigs2, but fails to find a single read pair on contigs1. Could you please help with this? I'm attaching the two summary files.

**Kennels** · 09-04-2011, 06:00 PM

Hi Boetsie,

Thanks for the update, I've started using it in spades for a number of datasets and it's great.

In one project, I have a total of around 770 million 100nt-long PE reads across 7 lanes. Unfortunately I am quite limited in computing capacity (only 16 GB RAM - yes, I have been critiqued before in other posts to get better specs, but we all have our circumstances) till we get access to a better one, so as expected the analysis pretty much stopped at the stage of reading the unmapped reads into memory. I currently just want to extend a small number of separate contigs as much as possible, and it would be great to consider all reads at once.

1. I'm just wondering if it is possible to overcome the memory limitation - is the Premium version using a different way to store/access the data?
2. Should I split my inputs into smaller libraries - does sspace free up memory after reading a library before going on to the next (but i think not)?
3. Or should I carry out separate analyses of sspace - but i'm afraid of losing possibilities to extend contigs by not considering all data at once.

Sorry if the questions are naive.

Cheers,
kennels

**boetsie** · 09-05-2011, 12:10 AM

Hi Kulikov,

Are the contigs of summary1.txt a mix of the two assemblies? In other words, are parts of the contigs present in other contigs? Because what I think has happened is that reads could map to multiple contigs. SSPACE does not allow reads to map to multiple contigs.

Boetsie

**boetsie** · 09-05-2011, 12:19 AM

Hi Kennels,

good that it works great!

To start: it is important to filter your reads on quality, especially with such a large read length. For extension the whole read is mapped to the contigs; if it does not map, it will be used for contig extension. If the quality of a read (or part of it) is bad, the read will not map and it will be used for contig extension.

1.
In SSPACE premium;
- you can run bowtie with gaps, allowing up to 3 gaps, thereby reducing the number of unmapped reads and thus the number of reads stored in memory.
- A different method of storing the unmapped reads is used compared with Basic version, saving 25% of memory.
- extension is faster.

2.
For contig extension all libraries are used at once, so all reads are used. Memory is thus not freed after each library.

3.
You will lose coverage if you split the libraries. If you have sufficient coverage for one library, I would give it a go.

Regards,
Boetsie
