SEQanswers

LSC - a fast PacBio long read error correction tool. (http://seqanswers.com/forums/showthread.php?t=22099)

LSC 07-30-2012 01:35 PM

LSC - a fast PacBio long read error correction tool.
 
Hello SEQanswers Community,

Further Info: http://www.stanford.edu/~kinfai/LSC/LSC.html

We at the Wong Lab have developed a new tool for error correction of PacBio data. It has been shown to be very sensitive and can improve PacBio reads to a 5% error rate. In particular, it is very fast: in total, it takes only 10 hours (8 threads) for ~200k subreads, and it needs only 10-15 GB of hard disk space for temporary files.

In its current form it supports PacBio reads and any type of short reads (from any NGS platform). In the current version, you may need novoalign (the single-threaded version, free for the academic community).

It is designed for the Linux platform. If you use another platform, leave us a note and we'll see what we can do.

Instructions are on the website, which is not perfect yet, so if you have any trouble, don't hesitate to leave me a note here.

Please give it a try and let me know if you have any issues. We are actively developing this tool, so we welcome all of your comments and concerns! In particular, we are trying to replace novoalign with another aligner, which could save 50% of the running time (5 hours in the example above). Ease of use is very important to us, so let us know if anything annoys you.

ZFHans 08-07-2012 04:13 AM

Hi LSC,

I'm trying to improve a 1.6 GB genome with PacBio data. Celera read correction is slow, so I welcome your effort. I am trying to run LSC 0.2 but have run into problems. First, in some of the scripts that make up LSC, the first line is #!/home/stow/swtree/bin/python2.6. Changing this to #!/usr/bin/python got rid of some error messages.
Second, I installed novoalign v2.08 as suggested to do the alignments. In the runLSC.py script the aligner is called with no option for the output format, so novoalign produces its native format. The next script, however, expects (I assume) the SAM format. So I added -o SAM to the option list in line 207 of runLSC.py (I also had to add the path to novoalign because it would not run otherwise), which got me to the next problem, in convertNAV.py. At line 78 and line 127, this script looks at the first character of each line in the nav file. In my version of the nav file the header character is @ instead of #, so I changed this. Now the desired .map file is produced, but with only one column of numbers. I know I have short reads aligned, so I think I should have more columns. Could you please comment on this? Below I paste an example of my SAM output, which is different from the example in your script.

Many thanks, Hans Jansen

Code:

@HD        VN:1.0        SO:unsorted
@PG        ID:novoalign        PN:novoalign        VN:V2.08.02        CL:novoalign -r All -F FA -o SAM -d /mnt/scrap_disk/temp2/pseudochr_LR.fa.cps.nix -f /mnt/scrap_disk/temp2/SR.fa.ai.cps
@SQ        SN:Pac1        AS:pseudochr_LR.fa.cps.nix        LN:50000716
@SQ        SN:Pac2        AS:pseudochr_LR.fa.cps.nix        LN:50000772
@SQ        SN:Pac3        AS:pseudochr_LR.fa.cps.nix        LN:50002188
@SQ        SN:Pac4        AS:pseudochr_LR.fa.cps.nix        LN:50000094
@SQ        SN:Pac5        AS:pseudochr_LR.fa.cps.nix        LN:50001433
@SQ        SN:Pac6        AS:pseudochr_LR.fa.cps.nix        LN:50001526
@SQ        SN:Pac7        AS:pseudochr_LR.fa.cps.nix        LN:50001210
@SQ        SN:Pac8        AS:pseudochr_LR.fa.cps.nix        LN:50000056
@SQ        SN:Pac9        AS:pseudochr_LR.fa.cps.nix        LN:50000143
@SQ        SN:Pac10        AS:pseudochr_LR.fa.cps.nix        LN:50002588
@SQ        SN:Pac11        AS:pseudochr_LR.fa.cps.nix        LN:50001867
@SQ        SN:Pac12        AS:pseudochr_LR.fa.cps.nix        LN:50000245
@SQ        SN:Pac13        AS:pseudochr_LR.fa.cps.nix        LN:28473695
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493        16        Pac10        21159148        3        8S67M19S        *        0        0        GAGTATACTCTCATCACATCAGTCAGAGCTGAGAGCTCTGATGAGAGTGACGTCTCAGACAGAGTCAGTGCTCTGATAGCTGACAGTGAGATAG        *        PG:Z:novoalign        AS:i:242        UQ:i:242        NM:i:0        MD:Z:67        CC:Z:Pac2        CP:i:6239884        ZS:Z:R        ZN:i:2        NH:i:2        HI:i:1        IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493        256        Pac2        6239884        3        14S72M8S        *        0        0        CTATCTCACTGTCAGCTATCAGAGCACTGACTCTGTCTGAGACGTCACTCTCATCAGAGCTCTCAGCTCTGACTGATGTGATGAGAGTATACTC        *        PG:Z:novoalign        AS:i:242        UQ:i:242        NM:i:1        MD:Z:32G39        ZS:Z:R        ZN:i:2        NH:i:2        HI:i:2        IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6754:2491        4        *        0        0        *        *        0        0        CTCTATATCATGACGAGCATGTACTATACATAGCTGTGCAGCATCTAGAGTGTATCAGAGCACACAC        *        PG:Z:novoalign        ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6775:2489        4        *        0        0        *        *        0        0        AGTATATCTAGCATAGCTAGCACTCACTGTCATCTGTCATACATACTATATATATGTATATAGCTCTCTGAGCTAGACTGAGACTCTGATCAGACATCATGTATGAGATGTG        *        PG:Z:novoalign        ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6822:2491        4        *        0        0        *        *        0        0        TGATACTATAGTGAGAGATACTACATGATATCACTGCTCTCTG        *        PG:Z:novoalign        ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483        0        Pac10        4706429        2        28M1I16M1I29M1S        *        0        0        ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT        *        PG:Z:novoalign        AS:i:122        UQ:i:122        NM:i:2        MD:Z:73        CC:Z:Pac7        CP:i:31267643        ZS:Z:R        ZN:i:2        NH:i:2        HI:i:1        IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483        256        Pac7        31267643        2        6M1I21M1I47M        *        0        0        ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT        *        PG:Z:novoalign        AS:i:122        UQ:i:122        NM:i:3        MD:Z:43G30        ZS:Z:R        ZN:i:2        NH:i:2        HI:i:2        IH:i:2
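
For anyone hitting the same header mismatch, here is a minimal Python sketch of reading SAM output like the above: header lines start with @, and unmapped reads carry flag bit 0x4. This is not the code in convertNAV.py; the function name iter_mapped_alignments and the fields it returns are assumptions chosen for illustration, and it assumes the file is tab-separated as SAM normally is.

```python
def iter_mapped_alignments(lines):
    """Yield (read_name, ref_name, pos, cigar) for each mapped SAM record.

    Hypothetical helper for illustration; not part of LSC.
    """
    for line in lines:
        line = line.rstrip("\n")
        # SAM header lines (@HD, @PG, @SQ, ...) start with '@'.
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        # A SAM alignment record has at least 11 mandatory fields.
        if len(fields) < 11:
            continue
        flag = int(fields[1])
        # Bit 0x4 set means the read is unmapped.
        if flag & 0x4:
            continue
        yield fields[0], fields[2], int(fields[3]), fields[5]
```

Usage would be something like `for name, ref, pos, cigar in iter_mapped_alignments(open("reads.sam")): ...`, where reads.sam is a placeholder file name.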


HenrivdGeest 10-29-2012 02:06 AM

Hello,

I just tried LSC, and it looks good.
Some remarks:

- Input can't be FASTQ; at first I tried to run with FASTQ but ran into weird errors and crashes.
- It can't handle multi-line FASTA for PacBio long reads?
- Documentation:
"full_LR_SR.map.fa
Although the terminus sequences are corrected, they are concatenated with their corrected sequence (corrected_LR_SR.map.fa) to be a "full" sequence. Thus, this sequence covers the equivalent length as the raw read and is outputted in the file full_LR_SR.map.fa"
I think this should be:
"Although the terminus sequences are uncorrected..."

- I don't understand this option:
" Remove PacBio tails sub reads?"

I use filtered PacBio subreads as input, and the FASTQ header "...s1_p0/16/3441_3479" means that this subread originates from the raw read at position 3441-3479 and is 38 bp long. Are these tails already removed by secondary filtering? (They might have ended up in another subread.)
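
The subread coordinates in a header like the one above can be checked with a few lines of Python. This is a minimal sketch that assumes headers end in /<hole>/<start>_<end>, as in the example; the function name parse_subread_header is hypothetical and not part of LSC.

```python
def parse_subread_header(header):
    """Return (movie, hole, start, end) from a PacBio subread header.

    Assumes the layout <movie>/<hole>/<start>_<end>; hypothetical helper.
    """
    # Drop a leading FASTA/FASTQ marker and any trailing description.
    name = header.lstrip(">@").split()[0]
    movie, hole, span = name.rsplit("/", 2)
    start, end = (int(x) for x in span.split("_"))
    return movie, int(hole), start, end


# The header is truncated in the post, so it is used here exactly as quoted.
movie, hole, start, end = parse_subread_header("...s1_p0/16/3441_3479")
print(end - start)  # 38
```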

sparks 12-05-2013 12:18 AM

Hi Hans,

The options given for Novoalign are all wrong.

If you are using Novoalign with LSC you need to use -r Exhaustive not -r All. This will improve results dramatically and should be better than the other aligners.

For 100bp reads try

novoalign_options = -c1 -r Ex 1000 -t 120 -F FA -g 0 -x 20 -o sam

If the read length is different, adjust -t accordingly. On the Cerebellum reads used by Kin Fai, results are 40% better with 30% reduced run time vs. BWA, using LSC 0.3.1.


Best, Colin

ilyaso 12-11-2013 01:02 AM

Hi all,
I have a question for LSC users. I have 5x PacBio coverage of a ~2 Gb genome and roughly 100x coverage with Illumina reads that I want to use for correcting the long reads.
The problem is that the PacBio data is larger than 2^32 bases, so bowtie/blasr indexing does not work for it and I can't align the short reads. I am wondering if anyone knows how to overcome this.
Thanks a lot!
Ilya
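
A rough back-of-the-envelope check of the numbers in this post, as a Python sketch. It assumes the indexer's limit is 2^32 - 1 reference characters and treats 5x coverage of a ~2 Gb genome as about 10 billion bases; both figures are approximations for illustration only.

```python
import math

genome_size = 2_000_000_000   # ~2 Gb genome (approximate)
coverage = 5                  # 5x PacBio coverage
index_limit = 2**32 - 1       # assumed maximum reference length per index

total_bases = genome_size * coverage
# Smallest number of separately indexed chunks that each fit under the limit.
min_chunks = math.ceil(total_bases / index_limit)

print(total_bases > index_limit, min_chunks)  # True 3
```

So under these assumptions the long reads would have to be split into at least three separately indexed sets.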

sparks 12-11-2013 01:21 AM

Use Novoalign or split your PacBio reads into multiple indexes.

ilyaso 12-11-2013 02:23 AM

Thanks for the answer, this sounds helpful.
Quote:

Originally Posted by sparks (Post 127083)
Use Novoalign or split your PacBio reads into multiple indexes.

Is it possible to use the LSC pipeline in a way that splits the data into multiple chunks, indexes them separately with e.g. bowtie, and then performs the alignment?
Thanks a lot for the help!

sparks 12-11-2013 04:56 PM

I think you could split the PacBio files with a short perl script. Try and keep subreads together. You then run LSC on each set of PacBio reads and later combine the corrected reads.
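
The splitting idea could look roughly like the following. This is a sketch in Python rather than perl, not a tested part of any LSC workflow: it assumes subread headers of the form <movie>/<hole>/<start>_<end>, so that everything before the last slash identifies the original read, and all function names are made up for illustration. It loads all reads into memory, which may not be practical for very large files.

```python
from collections import defaultdict


def read_fasta(path):
    """Yield (header, sequence) pairs; also accepts multi-line FASTA."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def zmw_key(header):
    """Grouping key: the header without its final /<start>_<end> field (assumed layout)."""
    return header.split()[0].rsplit("/", 1)[0]


def split_by_zmw(records, n_chunks):
    """Split records into n_chunks lists, keeping subreads of one read together."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[zmw_key(header)].append((header, seq))
    chunks = [[] for _ in range(n_chunks)]
    sizes = [0] * n_chunks

    def group_bases(key):
        return sum(len(seq) for _, seq in groups[key])

    # Place the largest groups first, each into the currently smallest chunk.
    for key in sorted(groups, key=group_bases, reverse=True):
        i = sizes.index(min(sizes))
        chunks[i].extend(groups[key])
        sizes[i] += group_bases(key)
    return chunks


def write_chunks(chunks, prefix):
    """Write each chunk to <prefix>.<i>.fa as single-line FASTA."""
    for i, chunk in enumerate(chunks):
        with open(f"{prefix}.{i}.fa", "w") as out:
            for header, seq in chunk:
                out.write(f">{header}\n{seq}\n")
```

For example, `write_chunks(split_by_zmw(read_fasta("LR.fa"), 3), "LR_part")` would produce LR_part.0.fa to LR_part.2.fa (the file names are placeholders); LSC would then be run on each file and the corrected reads concatenated afterwards.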

moolder 01-06-2014 05:38 PM

Lsc
 
Hi, when I run LSC with bowtie2, the index building step fails with an error like:
Error: Reference sequence has more than 2^32-1 characters! Please divide the
reference into batches or chunks of about 3.6 billion characters or less each
and index each independently.

Can anyone tell me how to solve this? Or can I split the PacBio reads file into many small files and run LSC on each separately?

Thank you!

zee 08-21-2015 07:06 AM

We have just released a new program in the Novo* suite that facilitates PacBio read error correction with Novoalign. We used this strategy - similar to LSC - to assemble the first draft of the pineapple chloroplast genome with our academic collaborators. Will be great to get some feedback on this from the wider group. We're calling it NovoCorrectorLR.


All times are GMT -8.