SEQanswers

Old 07-30-2012, 01:35 PM   #1
LSC
Member
 
Location: stanford

Join Date: Jul 2012
Posts: 24
LSC - a fast PacBio long read error correction tool.

Hello SEQanswers Community,

Further Info: http://www.stanford.edu/~kinfai/LSC/LSC.html

We at the Wong Lab have developed a new tool for error correction of PacBio data. It has been shown to be very sensitive and can improve PacBio reads to a 5% error rate. It is also very fast: in total, it takes only 10 hours (8 threads) for ~200k subreads, and it needs only 10-15 GB of hard disk space for temporary files.

In its current form it supports PacBio reads and any type of short reads (from any NGS platform). In the current version, you may need novoalign (single-thread version, free for the academic community).

It is designed for the Linux platform. If you use another platform, leave us a note and we'll see what we can do.

Instructions are on the website, which is not yet perfect; if you have any trouble, don't hesitate to leave me a note here.

Please give it a try and let me know if you have any issues. We are actively developing this tool, so we welcome all of your comments and concerns! In particular, we are trying to replace novoalign with another aligner, which could save 50% of the running time (5 hours in the example above). Ease of use is very important to us, so let us know if anything annoys you.

Last edited by LSC; 07-30-2012 at 02:11 PM.
Old 08-07-2012, 04:13 AM   #2
ZFHans
Member
 
Location: Leiden

Join Date: Jun 2009
Posts: 10

Hi LSC,

I'm trying to improve a 1.6 GB genome with PacBio data. Celera read correction is slow, so I welcome your effort. I am trying to run LSC 0.2 but have run into problems.

First, in some of the scripts that make up LSC, the first line is #!/home/stow/swtree/bin/python2.6. Changing this to #!/usr/bin/python got rid of some error messages.

Second, I installed novoalign v2.08 as suggested to do the alignments. In the runLSC.py script, the aligner is called with no option for the output format, so novoalign produces its native format. The next script, however, expects (I assume) the SAM format. So I added -o SAM to the option list in line 207 of runLSC.py (I also had to add the path to novoalign because it would not run), and this got me to the next problem, in convertNAV.py. This script looks at the first character of each line in the nav file, at line 78 and line 127 of the script. In my version of the nav file the header character is @ instead of #, so I changed this. Now the desired .map file is produced, but with only one column of numbers. I know I have short reads aligned, so I think I should have more columns. Could you please comment on this? Below I paste an example of my SAM output, which is different from the example in your script.

Many thanks, Hans Jansen

Code:
@HD	VN:1.0	SO:unsorted
@PG	ID:novoalign	PN:novoalign	VN:V2.08.02	CL:novoalign -r All -F FA -o SAM -d /mnt/scrap_disk/temp2/pseudochr_LR.fa.cps.nix -f /mnt/scrap_disk/temp2/SR.fa.ai.cps
@SQ	SN:Pac1	AS:pseudochr_LR.fa.cps.nix	LN:50000716
@SQ	SN:Pac2	AS:pseudochr_LR.fa.cps.nix	LN:50000772
@SQ	SN:Pac3	AS:pseudochr_LR.fa.cps.nix	LN:50002188
@SQ	SN:Pac4	AS:pseudochr_LR.fa.cps.nix	LN:50000094
@SQ	SN:Pac5	AS:pseudochr_LR.fa.cps.nix	LN:50001433
@SQ	SN:Pac6	AS:pseudochr_LR.fa.cps.nix	LN:50001526
@SQ	SN:Pac7	AS:pseudochr_LR.fa.cps.nix	LN:50001210
@SQ	SN:Pac8	AS:pseudochr_LR.fa.cps.nix	LN:50000056
@SQ	SN:Pac9	AS:pseudochr_LR.fa.cps.nix	LN:50000143
@SQ	SN:Pac10	AS:pseudochr_LR.fa.cps.nix	LN:50002588
@SQ	SN:Pac11	AS:pseudochr_LR.fa.cps.nix	LN:50001867
@SQ	SN:Pac12	AS:pseudochr_LR.fa.cps.nix	LN:50000245
@SQ	SN:Pac13	AS:pseudochr_LR.fa.cps.nix	LN:28473695
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493	16	Pac10	21159148	3	8S67M19S	*	0	0	GAGTATACTCTCATCACATCAGTCAGAGCTGAGAGCTCTGATGAGAGTGACGTCTCAGACAGAGTCAGTGCTCTGATAGCTGACAGTGAGATAG	*	PG:Z:novoalign	AS:i:242	UQ:i:242	NM:i:0	MD:Z:67	CC:Z:Pac2	CP:i:6239884	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:1	IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493	256	Pac2	6239884	3	14S72M8S	*	0	0	CTATCTCACTGTCAGCTATCAGAGCACTGACTCTGTCTGAGACGTCACTCTCATCAGAGCTCTCAGCTCTGACTGATGTGATGAGAGTATACTC	*	PG:Z:novoalign	AS:i:242	UQ:i:242	NM:i:1	MD:Z:32G39	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:2	IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6754:2491	4	*	0	0	*	*	0	0	CTCTATATCATGACGAGCATGTACTATACATAGCTGTGCAGCATCTAGAGTGTATCAGAGCACACAC	*	PG:Z:novoalign	ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6775:2489	4	*	0	0	*	*	0	0	AGTATATCTAGCATAGCTAGCACTCACTGTCATCTGTCATACATACTATATATATGTATATAGCTCTCTGAGCTAGACTGAGACTCTGATCAGACATCATGTATGAGATGTG	*	PG:Z:novoalign	ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6822:2491	4	*	0	0	*	*	0	0	TGATACTATAGTGAGAGATACTACATGATATCACTGCTCTCTG	*	PG:Z:novoalign	ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483	0	Pac10	4706429	2	28M1I16M1I29M1S	*	0	0	ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT	*	PG:Z:novoalign	AS:i:122	UQ:i:122	NM:i:2	MD:Z:73	CC:Z:Pac7	CP:i:31267643	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:1	IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483	256	Pac7	31267643	2	6M1I21M1I47M	*	0	0	ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT	*	PG:Z:novoalign	AS:i:122	UQ:i:122	NM:i:3	MD:Z:43G30	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:2	IH:i:2
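
For readers hitting the same header mismatch, here is a minimal Python sketch of how SAM output like the above can be read: header lines start with @ and are skipped, and unmapped reads (flag bit 0x4) are dropped. The function name parse_sam_alignments is hypothetical and this is not the code in convertNAV.py, only an illustration of the format.

```python
def parse_sam_alignments(lines):
    """Yield (read_name, ref_name, pos, cigar) for each mapped SAM record.

    Hypothetical helper for illustration; not part of LSC.
    """
    for line in lines:
        line = line.rstrip("\n")
        # SAM header lines start with '@' (not '#'); skip them and blank lines.
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        flag = int(fields[1])
        # Flag bit 0x4 marks an unmapped read, which carries no position.
        if flag & 0x4:
            continue
        # Columns: 1 = QNAME, 3 = RNAME, 4 = POS (1-based), 6 = CIGAR.
        yield fields[0], fields[2], int(fields[3]), fields[5]
```

Applied to the output above, this would keep the Pac10, Pac2 and Pac7 alignments and skip the three records with flag 4.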

Last edited by ZFHans; 08-07-2012 at 05:33 AM.
Old 10-29-2012, 02:06 AM   #3
HenrivdGeest
Member
 
Location: Arnhem

Join Date: Feb 2012
Posts: 16

Hello,

I just tried LSC, and it looks good. Some remarks:

- Input can't be FASTQ; at first I tried to run with FASTQ but ran into weird errors and crashes.
- It can't handle multi-line FASTA for PacBio long reads?
- documentation:
"full_LR_SR.map.fa
Although the terminus sequences are corrected, they are concatenated with their corrected sequence (corrected_LR_SR.map.fa) to be a "full" sequence. Thus, this sequence covers the equivalent length as the raw read and is outputted in the file full_LR_SR.map.fa"
I think this should be:
"Although the terminus sequences are uncorrected..."

- I don't understand this option:
" Remove PacBio tails sub reads?"

I use filtered PacBio subreads as input, and the FASTQ header "...s1_p0/16/3441_3479" means that this subread originates from the raw read at positions 3441-3479 and is 38 bp long. Are these tails already removed by secondary filtering? (These tails might have ended up in another subread.)
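
Two of the input-format points above can be sketched in Python. The first function joins wrapped sequence lines so each FASTA record has its sequence on a single line; the second splits a subread name of the form movie/hole/start_end into its parts. Both function names are hypothetical, and the movie name used in the usage note is a placeholder, since the header quoted above is truncated.

```python
def unwrap_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_subread_name(name):
    """Split 'movie/hole/start_end' into (movie, hole, start, end, length).

    Assumes the three-part naming described in the post above.
    """
    movie, hole, span = name.rsplit("/", 2)
    start, end = (int(x) for x in span.split("_"))
    return movie, int(hole), start, end, end - start
```

For example, parse_subread_name("movie_s1_p0/16/3441_3479") gives hole 16, start 3441, end 3479 and a length of 38, which matches the 38 bp reading of the header.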
Old 12-05-2013, 12:18 AM   #4
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126

Hi Hans,

The options given for Novoalign are all wrong.

If you are using Novoalign with LSC you need to use -r Exhaustive not -r All. This will improve results dramatically and should be better than the other aligners.

For 100 bp reads, try:

novoalign_options = -c1 -r Ex 1000 -t 120 -F FA -g 0 -x 20 -o sam

If the read length is different, adjust -t accordingly. On the cerebellum reads used by Kin Fai, results are 40% better with 30% less run time vs. BWA, using LSC 0.3.1.


Best, Colin
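
As a rough illustration of "adjust -t accordingly", the sketch below builds the option string for a given read length. The flags are copied from the post above. The scaling rule is an ASSUMPTION that is not stated in the thread: it keeps -t proportional to read length, at 120 per 100 bp. The function name is hypothetical, so check the Novoalign documentation before relying on the values.

```python
def novoalign_options(read_length):
    """Return a novoalign_options string for the given short-read length.

    ASSUMPTION: -t scales linearly with read length (120 for 100 bp reads).
    The thread only says to adjust -t, not how.
    """
    threshold = round(read_length * 120 / 100)
    return f"-c1 -r Ex 1000 -t {threshold} -F FA -g 0 -x 20 -o sam"
```

Under that assumption, 100 bp reads reproduce the option line quoted above, and 75 bp reads would get -t 90.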
Old 12-11-2013, 01:02 AM   #5
ilyaso
Junior Member
 
Location: Israel

Join Date: Dec 2013
Posts: 2

Hi all,
I have a question for LSC users. I have 5x PacBio coverage of a ~2 Gb genome and roughly 100x Illumina coverage, which I want to use to correct the long reads.
The problem is that the PacBio data is more than 2^32 bases, so bowtie/blasr indexing does not work on it and I can't align the short reads. Does anyone know how to overcome this?
Thanks a lot!
Ilya
Old 12-11-2013, 01:21 AM   #6
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126

Use Novoalign or split your PacBio reads into multiple indexes.
Old 12-11-2013, 02:23 AM   #7
ilyaso
Junior Member
 
Location: Israel

Join Date: Dec 2013
Posts: 2

Thanks for the answer, this sounds helpful.
Quote:
Originally Posted by sparks
Use Novoalign or split your PacBio reads into multiple indexes.
Is it possible to use the LSC pipeline so that it splits the data into multiple chunks, indexes them separately with e.g. bowtie, and then performs the alignment?
Thanks a lot for the help!
Old 12-11-2013, 04:56 PM   #8
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126

I think you could split the PacBio files with a short Perl script. Try to keep subreads together. You then run LSC on each set of PacBio reads and later combine the corrected reads.
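
The splitting idea can be sketched as follows, in Python rather than Perl. This is a minimal sketch, assuming subread names follow the movie/hole/start_end pattern and that subreads from the same hole are adjacent in the input file. The function names and the max_bases parameter are hypothetical.

```python
from itertools import groupby


def zmw_key(header):
    """Return 'movie/hole' for a 'movie/hole/start_end' subread name."""
    parts = header.split("/")
    return "/".join(parts[:2]) if len(parts) >= 3 else header


def split_reads(records, max_bases):
    """Split (header, sequence) records into chunks of at most max_bases.

    Subreads sharing a movie/hole prefix stay in the same chunk. A single
    group larger than max_bases still gets a chunk of its own.
    ASSUMPTION: subreads of one hole are adjacent in the input.
    """
    chunks, current, current_bases = [], [], 0
    for _, group in groupby(records, key=lambda rec: zmw_key(rec[0])):
        group = list(group)
        group_bases = sum(len(seq) for _, seq in group)
        # Start a new chunk if adding this group would exceed the limit.
        if current and current_bases + group_bases > max_bases:
            chunks.append(current)
            current, current_bases = [], 0
        current.extend(group)
        current_bases += group_bases
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be written to its own FASTA file and passed to a separate LSC run, with the corrected reads concatenated afterwards as described above.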
Old 01-06-2014, 05:38 PM   #9
moolder
Junior Member
 
Location: china

Join Date: Mar 2013
Posts: 2
Lsc

Hi, when I run LSC with bowtie2, the index building step fails with an error like:
Error: Reference sequence has more than 2^32-1 characters! Please divide the
reference into batches or chunks of about 3.6 billion characters or less each
and index each independently.

So can anyone tell me how to solve it? Or can I split the PacBio reads file into many small files and run LSC on each separately?

Thank you!
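
To judge in advance whether a reads file will hit this limit, the total base count can be compared against the 2^32-1 figure from the error message. The sketch below is a rough Python illustration with hypothetical function names. The default limit is taken from the message itself, and a smaller value (the message suggests about 3.6 billion) can be passed instead.

```python
INDEX_LIMIT = 2**32 - 1  # character limit quoted in the bowtie2 error above


def count_bases(fasta_lines):
    """Count sequence characters in FASTA lines, ignoring header lines."""
    return sum(len(line.strip()) for line in fasta_lines
               if not line.startswith(">"))


def min_chunks(total_bases, limit=INDEX_LIMIT):
    """Smallest number of chunks so that each holds at most `limit` bases."""
    return -(-total_bases // limit)  # ceiling division on integers
```

As a rough estimate, 5x coverage of a ~2 Gb genome is about 10 billion bases, for which min_chunks(10_000_000_000) gives 3.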
Old 08-21-2015, 07:06 AM   #10
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249

We have just released a new program in the Novo* suite that facilitates PacBio read error correction with Novoalign. We used this strategy, similar to LSC, to assemble the first draft of the pineapple chloroplast genome with our academic collaborators. It would be great to get some feedback on this from the wider group. We're calling it NovoCorrectorLR.