SEQanswers

SEQanswers > Sequencing Technologies/Companies > Pacific Biosciences


Old 07-30-2012, 12:34 PM   #1
LSC
Member
 
Location: stanford

Join Date: Jul 2012
Posts: 24
LSC - a fast PacBio long read error correction tool.

Hello SEQanswers Community,

Further Info: http://www.stanford.edu/~kinfai/LSC/LSC.html

We at the Wong Lab have developed a new tool for error correction of PacBio data. It is very sensitive and can improve PacBio reads to a 5% error rate. It is also very fast: it takes only 10 hours (with 8 threads) for ~200k subreads, and it needs only 10-15 GB of hard disk space for temporary files.

In its current form it supports PacBio reads and any type of short reads (from any NGS platform). The current version requires novoalign (single-threaded version, free for academic use).

It is designed for the Linux platform. If you use another platform, leave us a note and we'll see what we can do.

Instructions are on the website, which is still a work in progress; if you run into any trouble, don't hesitate to leave me a note here.

Please give it a try and let me know if you have any issues. We are actively developing this tool, so we welcome all of your comments and concerns! In particular, we are trying to replace novoalign with another aligner, which could save 50% of the running time (5 hours in the example above). Ease of use is very important to us, so let us know if anything annoys you.

Last edited by LSC; 07-30-2012 at 01:10 PM.
LSC is offline   Reply With Quote
Old 07-31-2012, 03:24 AM   #2
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 395
Default

I thought I would have a look at the paper mentioned as a preprint, but the link (http://www.stanford.edu/~kinfai/LSC/LSC.pdf) returns a 'Page not found' error...
flxlex is offline   Reply With Quote
Old 07-31-2012, 09:56 AM   #3
LSC
Member
 
Location: stanford

Join Date: Jul 2012
Posts: 24
Default

Sorry, the paper is still in review and I have only just set up the website. I will fix the problem soon.
LSC is offline   Reply With Quote
Old 08-07-2012, 05:55 AM   #4
ZFHans
Member
 
Location: Leiden

Join Date: Jun 2009
Posts: 10
Default

Hi LSC,

I'm trying to improve a 1.6 GB genome with PacBio data. Celera read correction is slow, so I welcome your effort. I am trying to run LSC 0.2 but have encountered problems.

First, in some of the scripts that make up LSC, the first line is #!/home/stow/swtree/bin/python2.6. Changing this to #!/usr/bin/python got rid of some error messages.

Secondly, I installed novoalign v2.08 as suggested to do the alignments. In the runLSC.py script the aligner is called with no option for the output format, so novoalign produces its native format. The next script, however, expects (I assume) SAM format. So I added -o SAM to the option list on line 207 of runLSC.py (I also had to add the path to novoalign, because it would not run otherwise). That got me to the next problem, in convertNAV.py: this script looks at the first character of each line in the nav file, at lines 78 and 127 of the script. In my version of the nav file the header character is @ instead of #, so I changed this. Now the desired .map file is produced, but with only one column of numbers. I know I have short reads aligned, so I think I should have more columns. Could you please comment on this? I paste below an example of my SAM output, which is different from the example in your script:


Code:
@HD	VN:1.0	SO:unsorted
@PG	ID:novoalign	PN:novoalign	VN:V2.08.02	CL:novoalign -r All -F FA -o SAM -d /mnt/scrap_disk/temp2/pseudochr_LR.fa.cps.nix -f /mnt/scrap_disk/temp2/SR.fa.ai.cps
@SQ	SN:Pac1	AS:pseudochr_LR.fa.cps.nix	LN:50000716
@SQ	SN:Pac2	AS:pseudochr_LR.fa.cps.nix	LN:50000772
@SQ	SN:Pac3	AS:pseudochr_LR.fa.cps.nix	LN:50002188
@SQ	SN:Pac4	AS:pseudochr_LR.fa.cps.nix	LN:50000094
@SQ	SN:Pac5	AS:pseudochr_LR.fa.cps.nix	LN:50001433
@SQ	SN:Pac6	AS:pseudochr_LR.fa.cps.nix	LN:50001526
@SQ	SN:Pac7	AS:pseudochr_LR.fa.cps.nix	LN:50001210
@SQ	SN:Pac8	AS:pseudochr_LR.fa.cps.nix	LN:50000056
@SQ	SN:Pac9	AS:pseudochr_LR.fa.cps.nix	LN:50000143
@SQ	SN:Pac10	AS:pseudochr_LR.fa.cps.nix	LN:50002588
@SQ	SN:Pac11	AS:pseudochr_LR.fa.cps.nix	LN:50001867
@SQ	SN:Pac12	AS:pseudochr_LR.fa.cps.nix	LN:50000245
@SQ	SN:Pac13	AS:pseudochr_LR.fa.cps.nix	LN:28473695
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493	16	Pac10	21159148	3	8S67M19S	*	0	0	GAGTATACTCTCATCACATCAGTCAGAGCTGAGAGCTCTGATGAGAGTGACGTCTCAGACAGAGTCAGTGCTCTGATAGCTGACAGTGAGATAG	*	PG:Z:novoalign	AS:i:242	UQ:i:242	NM:i:0	MD:Z:67	CC:Z:Pac2	CP:i:6239884	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:1	IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493	256	Pac2	6239884	3	14S72M8S	*	0	0	CTATCTCACTGTCAGCTATCAGAGCACTGACTCTGTCTGAGACGTCACTCTCATCAGAGCTCTCAGCTCTGACTGATGTGATGAGAGTATACTC	*	PG:Z:novoalign	AS:i:242	UQ:i:242	NM:i:1	MD:Z:32G39	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:2	IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6754:2491	4	*	0	0	*	*	0	0	CTCTATATCATGACGAGCATGTACTATACATAGCTGTGCAGCATCTAGAGTGTATCAGAGCACACAC	*	PG:Z:novoalign	ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6775:2489	4	*	0	0	*	*	0	0	AGTATATCTAGCATAGCTAGCACTCACTGTCATCTGTCATACATACTATATATATGTATATAGCTCTCTGAGCTAGACTGAGACTCTGATCAGACATCATGTATGAGATGTG	*	PG:Z:novoalign	ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6822:2491	4	*	0	0	*	*	0	0	TGATACTATAGTGAGAGATACTACATGATATCACTGCTCTCTG	*	PG:Z:novoalign	ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483	0	Pac10	4706429	2	28M1I16M1I29M1S	*	0	0	ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT	*	PG:Z:novoalign	AS:i:122	UQ:i:122	NM:i:2	MD:Z:73	CC:Z:Pac7	CP:i:31267643	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:1	IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483	256	Pac7	31267643	2	6M1I21M1I47M	*	0	0	ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT	*	PG:Z:novoalign	AS:i:122	UQ:i:122	NM:i:3	MD:Z:43G30	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:2	IH:i:2
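The shebang fix described above can be applied to all of LSC's scripts at once. A minimal sketch (the directory path is hypothetical; adjust it to your install location):

```python
import glob
import os

def fix_shebangs(script_dir, new_shebang="#!/usr/bin/env python\n"):
    """Rewrite a hard-coded python shebang on the first line of each script."""
    for path in glob.glob(os.path.join(script_dir, "*.py")):
        with open(path) as f:
            lines = f.readlines()
        # Only touch files whose first line is a python shebang
        if lines and lines[0].startswith("#!") and "python" in lines[0]:
            lines[0] = new_shebang
            with open(path, "w") as f:
                f.writelines(lines)

# Example (path hypothetical):
# fix_shebangs("/usr/local/LSC_0.2/bin")
```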
Could we combine this thread with the same one in the bioinformatics section?

Many thanks, Hans Jansen
ZFHans is offline   Reply With Quote
Old 08-07-2012, 08:10 AM   #5
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437
Default

I see the paper is not available but I clicked "How it Works" and that link is also broken. Can you fix?

Link: http://www.stanford.edu/~kinfai/LSC/LSC_howitworks.html
adaptivegenome is offline   Reply With Quote
Old 08-07-2012, 02:24 PM   #6
LSC
Member
 
Location: stanford

Join Date: Jul 2012
Posts: 24
Default

Quote:
Originally Posted by adaptivegenome View Post
I see the paper is not available but I clicked "How it Works" and that link is also broken. Can you fix?

Link: http://www.stanford.edu/~kinfai/LSC/LSC_howitworks.html
The paper is in review now (almost the final round of revision). I will post it as soon as this final revision is submitted. Sorry for the inconvenience.
LSC is offline   Reply With Quote
Old 08-07-2012, 02:24 PM   #7
LSC
Member
 
Location: stanford

Join Date: Jul 2012
Posts: 24
Default

Quote:
Originally Posted by ZFHans View Post
Hi LSC,

I'm trying to improve a 1.6 GB genome with Pacbio data. [...] I paste below an example of my SAM output which is different from the example in your script. [full post and SAM output snipped; see post #4 above]

Many thanks, Hans Jansen
Hi Hans Jansen,
Your feedback is really helpful. Although LSC works well on my computer cluster, I know something may go wrong when it is run on other systems. Your test is a great way for me to find such bugs.
1) Your change to the python path is correct. I will fix it in the coming version.
2) LSC uses novoalign's native output format, not SAM format. Please don't change it; try my original setting with the native format again. In addition, if BWA or bowtie2 could output ALL possible mappable hits, LSC could save over 50% of its running time by using them instead (novoalign is somewhat slow). Do you know of any way to make BWA or bowtie2 output all hits (including detailed indel information)?
LSC is offline   Reply With Quote
Old 08-07-2012, 10:32 PM   #8
ZFHans
Member
 
Location: Leiden

Join Date: Jun 2009
Posts: 10
Default

Hi LSC,

Thanks for your reply. I'll try the native format again, but could you tell me which version of novoalign you used? Could it be that Novocraft changed something in their native format?

Thanks,

Hans
ZFHans is offline   Reply With Quote
Old 08-07-2012, 11:00 PM   #9
LSC
Member
 
Location: stanford

Join Date: Jul 2012
Posts: 24
Default

Quote:
Originally Posted by ZFHans View Post
Hi LSC,

Thanks for your reply. I'll try the native format again, but could you tell me which version of novoalign you used? Could it be that Novocraft changed something in their native format?

Thanks,

Hans
novoalign (V2.07.10) works well with LSC.
LSC is offline   Reply With Quote
Old 08-07-2012, 11:24 PM   #10
ZFHans
Member
 
Location: Leiden

Join Date: Jun 2009
Posts: 10
Default

Hi LSC,

Thanks for your quick reply. If my current run with 2.08 fails, I'll try 2.07.

In the meantime I looked at the bowtie2 manual http://bowtie-bio.sourceforge.net/bo...all-alignments and found this mode:

-a mode: search for and report all alignments

-a mode is similar to -k mode except that there is no upper limit on the number of alignments Bowtie 2 should report. Alignments are reported in descending order by alignment score. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS field. See the SAM specification for details.

Some tools are designed with this reporting mode in mind. Bowtie 2 is not! For very large genomes, this mode is very slow.
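For reference, the 'secondary' bit mentioned in the manual excerpt can be checked with a couple of lines of Python (a minimal sketch, not part of LSC):

```python
def is_secondary(flag: int) -> bool:
    # Bit 0x100 (decimal 256) in the SAM FLAG field marks a secondary
    # alignment; bowtie2 -a sets it on every reported hit beyond the first.
    return bool(flag & 0x100)

# Flag values as seen in the SAM excerpt earlier in this thread:
# 256 -> secondary hit; 16 -> primary hit on the reverse strand; 4 -> unmapped
```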

Is this of any use to you?

Regards,

Hans
ZFHans is offline   Reply With Quote
Old 08-08-2012, 05:41 AM   #11
ZFHans
Member
 
Location: Leiden

Join Date: Jun 2009
Posts: 10
Default

Hi LSC,

As it turned out, it was my mistake all along. I'm using Quake-corrected Illumina reads as SR input. The fasta headers of these reads contain spaces, and that was causing the problems in convertNAV.py. I corrected this by removing the spaces from the headers, and now the novocraft native output format is understood by convertNAV.py. The script now continues to correct_nonredundant.py, but then gives the following error:
Code:
Traceback (most recent call last):
  File "/usr/local/LSC_0.2/bin/correct_nonredundant.py", line 280, in <module>
    n_rep = int(NSR.split('_')[1])
IndexError: list index out of range
This is probably still some problem with too many fields. Could you indicate how the headers of the input files should look (both LR and SR)?
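One way to strip the spaces, as described above, is to keep only the first whitespace-delimited token of each fasta header. A sketch (LSC's expected header format is not documented, and correct_nonredundant.py appears to also parse an underscore-separated field, so this only reproduces the space removal):

```python
def clean_fasta_headers(in_path, out_path):
    """Keep only the first whitespace-delimited token of each '>' header line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # ">read1 quake corrected" becomes ">read1"
                dst.write(line.split()[0] + "\n")
            else:
                dst.write(line)
```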

Many thanks in advance,

Hans
ZFHans is offline   Reply With Quote
Old 08-14-2012, 05:53 AM   #12
Tuinhof
Member
 
Location: Netherlands

Join Date: Jul 2012
Posts: 10
Default

Hi LSC,

I work together with Hans Jansen, he installed the previous version.
Since it is not possible to get to the tutorial pages, I was wondering how to install the newest version. Or can I just copy the adjusted python scripts into the existing folder?

Regards,
Nynke Tuinhof
Tuinhof is offline   Reply With Quote
Old 08-14-2012, 01:59 PM   #13
LSC
Member
 
Location: stanford

Join Date: Jul 2012
Posts: 24
Default

Yes, you just need to copy the scripts over the existing folder.
Quote:
Originally Posted by Tuinhof View Post
Hi LSC,

I work together with Hans Jansen, he installed the previous version.
Since it is not possible to get to the tutorial pages, I was wondering how to install the newest version. Or can I just copy the adjusted python scripts into the existing folder?

Regards,
Nynke Tuinhof
LSC is offline   Reply With Quote
Old 09-14-2012, 01:28 PM   #14
shanebrubaker
Member
 
Location: California

Join Date: Nov 2009
Posts: 13
More Info on LSC

Hi, I am also very interested in LSC. I would like to see the paper and manual if they are available.

Does anyone have time comparisons of LSC vs. PacBioToCA vs. SmrtPipe?
shanebrubaker is offline   Reply With Quote
Old 09-14-2012, 02:08 PM   #15
shanebrubaker
Member
 
Location: California

Join Date: Nov 2009
Posts: 13
Default

I also noticed that you say it corrects the reads to a 5% error rate, but the Schatz work seems to mention a 0.1% error rate. Is there a reason for that?

Thanks,
Shane
shanebrubaker is offline   Reply With Quote
Old 09-14-2012, 02:51 PM   #16
LSC
Member
 
Location: stanford

Join Date: Jul 2012
Posts: 24
Default

Quote:
Originally Posted by shanebrubaker View Post
Hi, I am also very interested in LSC. I would like to see the paper and manual if they are available.

Does anyone have time comparisons of LSC vs. PacBioToCA vs. SmrtPipe?
The paper was accepted last week and is in press now. The preprint should be on the homepage next week.
LSC is offline   Reply With Quote
Old 09-14-2012, 02:54 PM   #17
LSC
Member
 
Location: stanford

Join Date: Jul 2012
Posts: 24
Default

Quote:
Originally Posted by shanebrubaker View Post
I also noticed that you say it corrects the reads to a 5% error rate, but the Schatz work seems to mention a 0.1% error rate. Is there a reason for that?

Thanks,
Shane
In the paper, you will see the error rate drop below 1% when you have enough short-read (SGS) coverage. For regions without any short-read coverage, the error rate remains high, and those regions pull up the average error rate over the whole read set. Thus, the more short reads you have, the better LSC performs.
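The averaging effect can be illustrated with a toy calculation (the numbers below are made up for illustration, not taken from the paper):

```python
# Fraction of long-read bases with enough short-read coverage (assumed)
covered = 0.8
err_covered = 0.01    # corrected regions reach <1% error
err_uncovered = 0.15  # uncovered regions stay near the raw error rate (assumed)

# The reported average is a coverage-weighted mix of the two regimes
avg_err = covered * err_covered + (1 - covered) * err_uncovered
print(avg_err)  # 0.8*0.01 + 0.2*0.15 = 0.038, i.e. ~4% on average
```

So a modest uncovered fraction can dominate the average even when corrected regions are highly accurate.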
LSC is offline   Reply With Quote
Old 09-20-2012, 12:26 AM   #18
ZFHans
Member
 
Location: Leiden

Join Date: Jun 2009
Posts: 10
Default

Hi Shane,

At the moment I am trying to compare LSC vs. PacBioToCA, and, if time and hardware permit, SmrtPipe. I noticed that PacBioToCA reduces the dataset from 1 GB to around 400 MB. I haven't checked the error rate yet. We have 30x coverage in short reads.
I am still struggling with LSC 0.2.1. It runs fine on a test set, but when it runs on the whole short-read set (49 GB) I get a read error in awk (awk: read error (Bad address)). It doesn't fail consistently at one point in the file: sometimes this happens after processing 50 MB, and once it reached 30 GB. I computed md5 checksums of the file several times and they are identical. Any suggestions are appreciated.

Thanks,

Hans Jansen
ZFHans is offline   Reply With Quote
Old 09-20-2012, 06:13 AM   #19
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 979
Default

LSC,

I would really like to use your software, but it is extremely difficult to do so given the complete lack of documentation. The 'How it works?', 'Tutorial', 'Manual' and 'Filters' links on the website are dead. The 'FAQ' has just one line referring to SpliceMap. There isn't even a README file. Yes, I can run the program, but without any documentation I have no idea whether my results are correct or meaningful.

I installed LSC and ran it against a PacBio long read data set consisting of 100,000 reads, totaling 38Mbp. My short read set is 20 million 100bp Illumina reads. I ran the program with default parameters, and the output is 3 files: full_LR_SR.map.fa, uncorrected_LR_SR.map.fa and corrected_LR_SR.map.fa. Each file contains ~30,000 reads; the full file contains ~15Mbp and the other two ~8Mbp each.

What am I to make of these files? Does this output sound normal? Which output file is useful for further analysis?
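To sanity-check output files like the ones described above, a small counter can reproduce read and base tallies (a sketch, not part of LSC; the file name in the example comes from the post above):

```python
def fasta_stats(path):
    """Count sequences and total bases in a fasta file."""
    n_reads = n_bases = 0
    with open(path) as f:
        for line in f:
            if line.startswith(">"):
                n_reads += 1
            else:
                n_bases += len(line.strip())
    return n_reads, n_bases

# Example:
# print(fasta_stats("corrected_LR_SR.map.fa"))
```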
kmcarr is offline   Reply With Quote
Old 09-20-2012, 06:47 AM   #20
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 2,998
Default

I am not sure which journal this paper has been accepted at, but is it not possible by now to post something (perhaps a provisional PDF) at the link ("how it works") that was included in a previous post (and is still showing a "Page not found" error)?

Other links (http://www-stat.stanford.edu/~kinfai/LSC.html) appear to lead to a "Not found" error. This one is not working either (http://www-stat.stanford.edu/~kinfai/LSC_download.html).
GenoMax is offline   Reply With Quote






Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.