SEQanswers

Old 11-09-2011, 01:37 AM   #121
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 239
Default

Hi Lisa,

No problem, glad I could help and that it's all working now! Good luck, and feel free to contact me if you have any questions.

Regards,
Boetsie

Quote:
Originally Posted by Lisa0508 View Post
Hi Boetzer,
Thank you so much for your quick reply and patient explanation! It's working now. I just registered on this forum yesterday, and all settings were still at their defaults. I'm very sorry I did not check the private message option. It is now possible for me to receive private messages. Thank you again!
Regards,
Lisa
Old 12-12-2011, 04:33 AM   #122
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 239
Default

Hi all,

We have released a new version of both SSPACE Basic and SSPACE Premium. SSPACE Basic is the previous version of SSPACE Premium. The new SSPACE Premium contains the following new features:
  • Included the read mappers BWA and BWA-SW.
  • Changed the multithreading of Bowtie/BWA. Instead of running the aligner in multithreaded mode, SSPACE now runs multiple single-threaded instances of the aligner. This preserves the order of the reads for processing and read tracking, which speeds up the process and reduces memory consumption.
  • Read files are split into files of 1 million read pairs each instead of being kept as one file. This speeds up the alignment (see previous feature).
  • During extension, contigs are extended with subsequences (k-mers) of the unmapped reads instead of the full reads. This increases the coverage for extension, since k-mers overlap the contigs more readily than full reads do.
  • A file is generated with more detailed information about the extension process.
  • Included the option -S, which makes it possible to skip the reading and processing of the paired-read input files.
  • It is now possible to include .gz files (only if gunzip is installed).
  • Changed the folder structure.
  • Changed the format of the final scaffolds.
  • Included some additional statistics in the summary file: GC content, N25/N75, number of gaps, and total size of gaps.
  • Added a tool for quality trimming of paired reads.
  • Added a tool for estimation of the insert size.
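
For readers unfamiliar with the extra summary statistics, the following is a minimal Python sketch of how GC content, N25/N50/N75, and gap counts are commonly computed from scaffold sequences. It is not SSPACE's own code: the function names are made up for illustration, and treating every run of N characters as one gap is an assumption.

```python
import re

def gc_content(seqs):
    """Fraction of G/C among all non-N bases."""
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in seqs)
    known = sum(len(s) - s.upper().count("N") for s in seqs)
    return gc / known if known else 0.0

def nx(lengths, x):
    """Length L such that sequences of length >= L hold at least x% of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= x * total:
            return length
    return 0

def gap_stats(seqs):
    """Number of gaps (assumed here to be runs of N) and their total size."""
    gaps = [m.group() for s in seqs for m in re.finditer(r"[Nn]+", s)]
    return len(gaps), sum(len(g) for g in gaps)

# Toy scaffolds, invented for this example.
scaffolds = ["ACGT" * 25, "GGCC" * 10 + "N" * 20 + "ATAT" * 5, "ACGTNNACGT"]
lengths = [len(s) for s in scaffolds]
print("GC content:", round(gc_content(scaffolds), 3))
print("N25/N50/N75:", nx(lengths, 25), nx(lengths, 50), nx(lengths, 75))
print("gaps, total gap size:", gap_stats(scaffolds))
```

By this definition N25 >= N50 >= N75, so reporting N25 and N75 next to N50 gives a rough picture of how evenly the scaffold lengths are distributed.
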
In addition, we have been working on a tool named GapFiller for closing gaps within scaffolds using paired-read data. GapFiller is currently submitted for publication, and the basic source code will become available upon acceptance. At that time we will make sure academic users can apply for a free license. Until the manuscript is accepted, a pre-release is available at a cost of 250 euro (applicable to both academic and commercial users).

See our website for more information about SSPACE and GapFiller: http://www.baseclear.com/landingpage...ics-solutions/

Kind regards,
Boetsie
Old 12-13-2011, 03:40 AM   #123
stevebaeyen
Member
 
Location: Belgium

Join Date: Aug 2011
Posts: 14
Default k-mers and m parameter

Quote:
Originally Posted by boetsie View Post
  • During extension, contigs are extended with subsequences (k-mers) of the unmapped reads, instead of the full read. This will increase the coverage for extension, since k-mers have a better overlap with the contigs than full reads.
Dear Boetsie,
Does this mean that we should get better results with the -m parameter optimised for k-mer size instead of read length? How can we know the k-mer size used, and how do we best adjust the -m value, for example for a 50 bp read?
regards,
Steve
Old 12-14-2011, 04:46 AM   #124
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 239
Default

Hi Steve,

The k-mer size used is simply the -m value plus one. So -m really means the overlap the k-mer should have with the contig, and the extra nucleotide is the 'overhang'. The difference between the two methods:

previous method:

ctg: GTCGATAGATAGATCGTCGATAGTAGTCGA
read:...GATTGATAGATCGTCGATAGTAGTCGAG


The above read will not be used for extension, since it contains a mismatch and thus does not fully overlap with the contig. The new method cuts the read into k-mers.

Say we use -m 20; the k-mers of the read (21 bases each) are then:

READ: GATAGATCGTCGATAGTAGTCGAGAT
kmer: GATAGATCGTCGATAGTAGTC
kmer: .ATAGATCGTCGATAGTAGTCG
kmer: ..TAGATCGTCGATAGTAGTCGA
kmer: ...AGATCGTCGATAGTAGTCGAG
kmer: ....GATCGTCGATAGTAGTCGAGA
kmer: .....ATCGTCGATAGTAGTCGAGAT
etc...


If we now extend the contig, the overlapping k-mer is:

ctg: GTCGATAGATAGATCGTCGATAGTAGTCGA
read:..........AGATCGTCGATAGTAGTCGAG


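This increases the coverage available for extension, since the part of a read that contains an error no longer prevents the rest of the read from being used. The effect is strongest for longer reads.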
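
To make the idea concrete, here is a minimal Python sketch of k-mer based extension using the sequences from this post. It is an illustration only, not SSPACE's actual code; the function names are hypothetical, and it only looks at exact matches against the 3' end of the contig.

```python
def kmers(read, k):
    """All overlapping subsequences of length k, from left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def extension_bases(contig, read, m):
    """Overhang bases from k-mers (k = m + 1) whose first m bases
    exactly match the last m bases of the contig."""
    tail = contig[-m:]
    return [kmer[-1] for kmer in kmers(read, m + 1) if kmer[:m] == tail]

contig = "GTCGATAGATAGATCGTCGATAGTAGTCGA"
# The read from the 'previous method' example, including its mismatch.
read_with_mismatch = "GATTGATAGATCGTCGATAGTAGTCGAG"

# The full read does not match the contig end, but one of its 21-mers does,
# so the read can still propose the next base.
print(extension_bases(contig, read_with_mismatch, 20))  # ['G']
```

With -m 20 the read is cut into 21-mers; the 21-mer AGATCGTCGATAGTAGTCGAG overlaps the contig end over 20 bases and contributes the overhang G, even though the read as a whole contains a mismatch.
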

Regards,
Boetsie

Old 01-09-2012, 05:50 AM   #125
sphil
Senior Member
 
Location: Stuttgart, Germany

Join Date: Apr 2010
Posts: 186
Default

Hey all,

I've got a rather strange problem. My contig FASTA file looks like this:

>22617
GTCTACTTCAGACAAGGAAGACGGTCTACTTCAGATGAGGAAGATGGTCTGCTACAAAGGGAAGACGGTCTGCTTCAGGCCAGGAAGACGGTCTGCTACA
>22619
CGTCTTCCAATTTTGAATCAGACCGTCTTGATTTTGAATTGGACCGTCTCCCCTGGGCGCATCTGCTGGGCCGCTGGGGCTGGAACCGTGGCTCAAAATT
>22621
TTCCTCAGCAACAACATTGATGGTGTCTTTTGTGTACATGTATGAGTAGTCAGTCAAGTAAAGTATGCGCACCTGTCTTTTGGTAAGCCTACGCAGCCTG
>22623
AGGCACTCTGCCCGAGTGGTTAAGGGGTAAGTCTCGAATACATTATTCGACCGTCCATCATGACGGGTTAACTTATAGGCTCTGCCTGCGTCGGTTCAAA

But the program tells me:

ERROR: Invalid (-s) contig file /home/dpr..../de_novo_assembly_DNA/SOAPdenovo_39/PseudoAfi_K39.contig.fastasorted.fasta ...Exiting.


So can you tell me why my file is considered invalid?

Any help is much appreciated,

best


Phil
Old 01-09-2012, 06:02 AM   #126
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 239
Default

Hi Phil,

The error has nothing to do with the file format. The line where this error occurs only checks whether the contig file exists. Somehow it does not find your file. Can you check that the file really is at the specified location and that its permissions are correct?
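
A quick way to run both checks yourself is sketched below in Python. This is only an illustration of the two conditions mentioned above (existence and read permission), not the check SSPACE performs internally; the function name and the file name contigs.fasta are placeholders.

```python
import os

def check_contig_file(path):
    """Report whether a file exists and can be read by the current user."""
    if not os.path.isfile(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "not readable"
    return "ok"

# Placeholder path: replace it with the exact string you pass to -s.
print(check_contig_file("contigs.fasta"))
```

Passing exactly the same string that is given to -s helps to catch typos and relative paths that resolve differently depending on the working directory.
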

Boetsie
Old 01-12-2012, 11:25 PM   #127
sphil
Senior Member
 
Location: Stuttgart, Germany

Join Date: Apr 2010
Posts: 186
Default

Hey,

Sorry for the late answer, but I was out of the office for the last few days. I checked the location and it is the right one, so maybe I got something wrong in the library file.

Here is the line describing my library:

TrueSeqStd /home/dpr/P/PA/SGII_ATCACG_L003_R1.fastq /home/dpr/P/PA/SGII_ATCACG_L003_R2.fastq 50 0.5 FR



Maybe there is a mistake in it?

Best,


Phil



Got it, thanks for the help.

Last edited by sphil; 01-12-2012 at 11:55 PM. Reason: solved
Old 02-23-2012, 04:54 AM   #128
user1313
Junior Member
 
Location: Lviv

Join Date: May 2011
Posts: 5
Default

Dear boetsie,

Is it possible to implement a feature in SSPACE that recognizes inward-facing reads in an Illumina MP (mate-pair) library? This is a serious problem for some library preparations. This feature is present in the Ray assembler, for example:
http://seqanswers.com/forums/showthr...?t=4301&page=7

Regards,
Nestor
Old 02-24-2012, 02:19 AM   #129
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 239
Default

Hi Nestor,

This is already implemented in SSPACE. Basically, Ray does the same as SSPACE by incorporating a range of allowed insert sizes, for example an insert size of 4000 with 0.25 deviation (the range is thus 3000-5000). This will initially filter out 'paired-end' reads, since these have smaller insert sizes (< 500 bp). In addition, SSPACE requires the orientation of the paired reads for each library. If you specify the orientation <-- -->, then --> <-- paired reads will not be taken into account for scaffolding.
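
The range arithmetic can be sketched in a few lines of Python. The function names are hypothetical and this is not code taken from SSPACE or Ray; it only reproduces the 4000 with 0.25 deviation example from this post.

```python
def allowed_range(insert_size, deviation):
    """Lower and upper bound of accepted pair distances."""
    return insert_size * (1 - deviation), insert_size * (1 + deviation)

def within_range(distance, insert_size, deviation):
    low, high = allowed_range(insert_size, deviation)
    return low <= distance <= high

print(allowed_range(4000, 0.25))       # (3000.0, 5000.0)
print(within_range(450, 4000, 0.25))   # False: a typical paired-end distance
print(within_range(4200, 4000, 0.25))  # True: a plausible mate-pair distance
```
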

Regards,
Boetsie

Old 02-24-2012, 07:40 AM   #130
user1313
Junior Member
 
Location: Lviv

Join Date: May 2011
Posts: 5
Default

Dear boetsie,

What about libraries in which the number of "smaller insert size" read pairs is significantly higher than the number of "long insert size" read pairs? Don't you think that using such libraries with SSPACE could lead to horrible results such as, in some cases, re-orienting the contigs? Is SSPACE now capable of detecting such libraries by counting the PE/MP ratio of reads that were mapped within each contig?

Regards,
Nestor


Old 02-29-2012, 05:43 AM   #131
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 239
Default

Quote:
Originally Posted by user1313 View Post
Dear boetsie,

What about libraries in which the number of "smaller insert size" read pairs is significantly higher than the number of "long insert size" read pairs? Don't you think that using such libraries with SSPACE could lead to horrible results such as, in some cases, re-orienting the contigs? Is SSPACE now capable of detecting such libraries by counting the PE/MP ratio of reads that were mapped within each contig?

Regards,
Nestor
That is indeed a problem; they might influence the scaffolding process. But since the smaller read pairs are --> <-- oriented (and mate pairs are <-- --> oriented), they are filtered out.
I do not see the benefit of including the PE/MP ratio of reads mapped within a contig, since those pairs do not contribute to the scaffolding process. They can only influence the process when the two reads of a pair are aligned to different contigs, but as said, they will be filtered out because of their orientation.
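
As an illustration of the arrow notation, here is a minimal Python sketch that classifies a pair mapped within one contig as --> <-- or <-- -->. The function name and the "+"/"-" strand encoding are assumptions made for this example, and it deliberately covers only pairs on a single contig, not pairs that link two contigs.

```python
def pair_orientation(pos1, strand1, pos2, strand2):
    """Classify a read pair on one contig.

    Returns 'FR' for --> <-- (inward, typical paired-end),
    'RF' for <-- --> (outward, typical mate-pair),
    and 'other' for -->--> or <--<--.
    """
    # Look at the reads in left-to-right order along the contig.
    (_, left), (_, right) = sorted([(pos1, strand1), (pos2, strand2)])
    if left == "+" and right == "-":
        return "FR"
    if left == "-" and right == "+":
        return "RF"
    return "other"

pairs = [(100, "+", 400, "-"), (100, "-", 4000, "+"), (50, "+", 300, "+")]
library_orientation = "RF"  # a library declared as <-- -->
kept = [p for p in pairs if pair_orientation(*p) == library_orientation]
print(kept)  # [(100, '-', 4000, '+')]
```
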
Old 02-29-2012, 06:19 AM   #132
user1313
Junior Member
 
Location: Lviv

Join Date: May 2011
Posts: 5
Thumbs up

Dear boetsie,

Thank you for the answer. I still do not quite agree, however. Please correct me if I am wrong.

If we have contig 1 and contig 2 with some PE reads (short arrow "->") and some MP reads (longer arrow "-->") like this:

Code:
    contig 1             contig 2
5`------------3`     5`------------3`
    <--    ->          <-    -->
            ->          <-
    ---------- 4000bp ----------
Now, let's assume that we have twice as many PE reads as MP reads.
We gave SSPACE the information that the library is MP with a 4000 bp insert size. Won't SSPACE reverse-complement the contigs in this manner so that the more abundant "PE" reads fit the 4000 bp "<-- -->" pattern?

Code:
  contig 1(RC)         contig 2 (RC)
5`------------3`     5`------------3`
    <-    -->          <--    ->
     <-                        ->
    ---------- 4000bp ----------
I am not saying it will happen every time, but in cases where the lengths of the reverse-complemented contigs fit the distance listed in the library file, it could be a disastrous problem. To tell you the truth, in my limited experience I have seen more problematic MP libraries than good ones.

Regards,
Nestor


Last edited by user1313; 02-29-2012 at 06:24 AM.
Old 02-29-2012, 07:30 AM   #133
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 239
Default

Yes, you are right, sorry. But this will only happen if both contigs are short. Say the paired-end reads are mapped as follows:

Code:
    contig 1 (1000bp)            contig 2 (8000 bp)
5`------------>3`     5`----------------------------->3`
            <-           <- 
           pos900    pos100
Since MP reads are <-- --> oriented, contig 2 should be reverse-complemented:

Code:
    contig 1 (1000bp)            contig 2 (8000 bp)
5`------------3`     3`<-----------------------------5`
            <-                                  -> 
           pos900                             pos7900
The distance is now (1000-900) + 7900 = 8000 bp. That is 4000 bp more than the insert size of your library (8000 - 4000 = 4000).

I agree, though, that if contig 2 were 4000 bp shorter, the distance would be 4000 bp, right at the size of your library! This could be a problem, especially for contig orientation and insert size estimation (the true distance in the example above is not 4000 bp but ~200 bp: (1000-900) on contig 1 plus pos100 on contig 2).
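
The arithmetic above can be checked with a small Python sketch. The function name is hypothetical, the gap between the contigs is ignored just as in the example, and the 3000-5000 window assumes a deviation of 0.25, which is not a value stated for this library.

```python
def implied_distance(len1, pos1, len2, pos2, flip_contig2):
    """Distance implied by a pair that links the end of contig 1 to contig 2.

    pos1 and pos2 are read positions on the contigs as given. If contig 2
    is reverse-complemented, its read position is mirrored first.
    """
    tail = len1 - pos1                               # from the read to the end of contig 1
    head = (len2 - pos2) if flip_contig2 else pos2   # from the start of contig 2 to the read
    return tail + head

low, high = 3000, 5000  # assumed window: 4000 bp with 0.25 deviation

for len2 in (8000, 4000):
    d = implied_distance(1000, 900, len2, 100, flip_contig2=True)
    print(len2, d, low <= d <= high)

# True layout of the paired-end reads, without flipping contig 2.
print(implied_distance(1000, 900, 8000, 100, flip_contig2=False))
```

With an 8000 bp contig 2 the flipped layout gives 8000 bp and falls outside the window; with a 4000 bp contig 2 it gives 4000 bp and would be accepted, although the real distance is only about 200 bp.
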

Thanks for the direction, I'll try to dive deeper into this...

Regards,
Boetsie
Old 03-18-2012, 03:30 PM   #134
gaffa
Member
 
Location: Gothenburg/Uppsala, Sweden

Join Date: Oct 2010
Posts: 82
Default

Is it possible to run SSPACE on external read mappings, i.e. can I perform the read mappings on my own and then have SSPACE do the scaffolding based on them?
Old 03-19-2012, 12:37 AM   #135
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 239
Default

Quote:
Originally Posted by gaffa View Post
Is it possible to run SSPACE on external read mappings, i.e. can I perform the read mappings on my own and then have SSPACE do the scaffolding based on them?
Yes, this is possible. The file should be in a TAB-delimited format like:

<contig1> <startpos_on_contig1> <endpos_on_contig1> <contig2> <startpos_on_contig2> <endpos_on_contig2>

E.g.
contig1 100 150 contig1 350 300
contig1 4000 4050 contig2 110 60

There is a script in the 'tools' directory of the package to convert SAM/BAM to a tab format.
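
For anyone producing such a file from their own pipeline, here is a minimal Python sketch that reads the six-column layout shown above. It is not the conversion script from the 'tools' directory; the names are hypothetical, and reading start > end as a reverse-strand alignment is an assumption based on the two example lines.

```python
from typing import NamedTuple

class Link(NamedTuple):
    contig1: str
    start1: int
    end1: int
    contig2: str
    start2: int
    end2: int

def parse_tab(lines):
    """Parse TAB-delimited pair mappings, skipping empty lines."""
    links = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        c1, s1, e1, c2, s2, e2 = line.split("\t")
        links.append(Link(c1, int(s1), int(e1), c2, int(s2), int(e2)))
    return links

def strand(start, end):
    # Assumption: start > end marks a read aligned to the reverse strand.
    return "+" if start <= end else "-"

example = [
    "contig1\t100\t150\tcontig1\t350\t300",
    "contig1\t4000\t4050\tcontig2\t110\t60",
]
for link in parse_tab(example):
    same_contig = link.contig1 == link.contig2
    print(link.contig1, strand(link.start1, link.end1),
          link.contig2, strand(link.start2, link.end2),
          "within contig" if same_contig else "links two contigs")
```
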

Regards,
Boetsie
Old 03-20-2012, 04:01 AM   #136
Hobbe
Member
 
Location: Uppsala, Sweden

Join Date: Apr 2010
Posts: 29
Default

I am having problems using SSPACE Basic with my 454 paired-end data, and was hoping to get some help here. SSPACE runs fine using my Illumina PE data, but my 454 data has much longer insert sizes (3-5 kb), and I think they really could make a difference.

My problem is that SSPACE reads all the 454 pairs in, removes quite a lot of them because they include Ns, and then maps 0 of them. The report is below. It was difficult to get the reads into a format that SSPACE accepts, and I guess that the problem lies in the FASTQ files. Some (very few) reads are too long (over 1024 bases), and Bowtie complains about these. Would this crash the whole run? I know that Bowtie is not the best choice for longer reads, but I thought it would still manage to map some reads? Is SSPACE Premium the answer?

Any/all help would be much appreciated,
Henrik

READING READS Lib454:
------------------------------------------------------------
Total inserted pairs = 1217215
Number of pairs containing N's = 1066178
Remaining pairs = 151037
------------------------------------------------------------
...

LIBRARY Lib454 STATS:
################################################################################

MAPPING READS TO CONTIGS:
------------------------------------------------------------
Number of single reads found on contigs = 0
Number of pairs used for pairing contigs / total pairs = 0 / 0
------------------------------------------------------------

READ PAIRS STATS:
Assembled pairs: 0 (0 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 3709 +/-927.25): 0
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 0
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 0
---
Satisfied in distance/logic within a given contig pair (pre-scaffold): 0
Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 0
---
Total satisfied: 0 unsatisfied: 0


Estimated insert size statistics (based on 0 pairs):
Mean insert size = 0
Median insert size = 0
REPEATS:
Number of repeated edges = 0
------------------------------------------------------------

################################################################################
Old 03-20-2012, 06:31 AM   #137
gaffa
Member
 
Location: Gothenburg/Uppsala, Sweden

Join Date: Oct 2010
Posts: 82
Default

Quote:
Originally Posted by boetsie View Post
yes, this is possible. The file should be in a TAB delimited format like:

<contig1> <startpos_on_contig1> <endpos_on_contig1> <contig2> <startpos_on_contig2> <endpos_on_contig2>

E.g.
contig1 100 150 contig1 350 300
contig1 4000 4050 contig2 110 60

There is a script in the 'tools' directory of the package to convert SAM/BAM to a tab format.

Regards,
Boetsie
Ok, great! Thanks.

On a slightly related note, how well do you think SSPACE would deal with scaffolding information from other sources than paired/mate-reads, such as e.g. physical/genetic linkage data (supplied then in the above file format)? Some scaffolders (notably Bambus) claim to be able to work with essentially any kind of link information between contigs - could the same be said of SSPACE?
Old 03-28-2012, 11:57 PM   #138
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 239
Default

Quote:
Originally Posted by Hobbe View Post
I am having problems using SSPACE Basic with my 454 paired-end data, and was hoping to get some help here. SSPACE runs fine using my Illumina PE data, but my 454 data has much longer insert sizes (3-5 kb), and I think they really could make a difference.

My problem is that SSPACE reads all the 454 pairs in, removes quite a lot of them because they include Ns, and then maps 0 of them. The report is below. It was difficult to get the reads into a format that SSPACE accepts, and I guess that the problem lies in the FASTQ files. Some (very few) reads are too long (over 1024 bases), and Bowtie complains about these. Would this crash the whole run? I know that Bowtie is not the best choice for longer reads, but I thought it would still manage to map some reads? Is SSPACE Premium the answer?
SSPACE Basic does not handle 454 reads well, simply because some of the reads are too long for Bowtie, which aligns reads of up to 1024 bases. Also, Bowtie can handle only up to two mismatches. In SSPACE Premium I've added the BWA-SW aligner to deal with longer reads. Otherwise, you can align the reads yourself with BWA-SW and convert the resulting .SAM file to a .tab file (see post above).
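
To see how many reads are affected before choosing an aligner, a minimal Python sketch like the one below can count reads that contain Ns or exceed the 1024-base limit mentioned above. It assumes plain four-line FASTQ records; the function name and the toy records are made up for this example.

```python
def screen_fastq(lines, max_len=1024):
    """Count reads with N bases and reads longer than max_len.

    Assumes four lines per record, with the sequence on the second line.
    """
    counts = {"total": 0, "with_N": 0, "too_long": 0}
    for i, line in enumerate(lines):
        if i % 4 != 1:
            continue
        seq = line.strip().upper()
        counts["total"] += 1
        if "N" in seq:
            counts["with_N"] += 1
        if len(seq) > max_len:
            counts["too_long"] += 1
    return counts

# Toy records; for a real file, pass an open file handle instead.
records = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGNNCGT", "+", "IIIIIIII",
    "@read3", "A" * 1100, "+", "I" * 1100,
]
print(screen_fastq(records))
```
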

Boetsie
Old 03-28-2012, 11:59 PM   #139
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 239
Default

Quote:
Originally Posted by gaffa View Post
Ok, great! Thanks.

On a slightly related note, how well do you think SSPACE would deal with scaffolding information from other sources than paired/mate-reads, such as e.g. physical/genetic linkage data (supplied then in the above file format)? Some scaffolders (notably Bambus) claim to be able to work with essentially any kind of link information between contigs - could the same be said of SSPACE?
I've not tested it myself, but I think any linking information is suitable. I would suggest giving it a try.
Old 05-03-2012, 01:43 AM   #140
is41985
Junior Member
 
Location: Croatia

Join Date: May 2012
Posts: 2
Post

Hi,

To save me the hassle of going through the code, I have a short question regarding insert sizes.
When scaffolding, does SSPACE use the user-specified insert size (from the library.txt file) or the estimated insert size (the one reported in the summary file)?
This is important, since in my case the two seem to differ, and I need the real (user-specified) value to be used.


Thank you,
Ivan.