  • LSC - a fast PacBio long read error correction tool.

    Hello SEQanswers Community,

    Further Info: http://www.stanford.edu/~kinfai/LSC/LSC.html

    We at the Wong Lab have developed a new tool for error correction of PacBio data. It has been shown to be very sensitive and can improve PacBio reads to a 5% error rate. It is also very fast: in total, it takes only 10 hours (with 8 threads) for ~200k subreads, and it needs only 10-15 GB of hard disk space for temporary files.

    In its current form it supports PacBio reads and any type of short reads (from any NGS platform). In the current version, you may need novoalign (the single-threaded version, which is free for the academic community).

    It is designed for the Linux platform. If you use another platform leave us a note and we'll see what we can do.

    Instructions are on the website, which is not perfect yet; if you have any trouble, don't hesitate to leave me a note here.

    Please give it a try and let me know if you have any issues. We are actively developing this tool, so we welcome all of your comments and concerns! In particular, we are trying to replace novoalign with another aligner, which could save 50% of the running time (5 hours in the example above). Ease of use is very important to us, so let us know if anything annoys you.
    Last edited by LSC; 07-30-2012, 01:11 PM.

  • #2
    Hi LSC,

    I'm trying to improve a 1.6 GB genome with PacBio data. Celera read correction is slow, so I welcome your effort. I am trying to run LSC 0.2 but have run into problems. First, in some of the scripts that make up LSC the first line is #!/home/stow/swtree/bin/python2.6. Changing this to #!/usr/bin/python got rid of some error messages.
    Second, I installed novoalign v2.08 as suggested to do the alignments. In the runLSC.py script the aligner is called with no option for the output format, so novoalign produces its native format. The next script, however, expects (I assume) the SAM format. So I added -o SAM to the option list in line 207 of runLSC.py (I also had to add the path to novoalign because it would not run), and this got me to the next problem, in convertNAV.py. This script looks at the first character of each line in the nav file, at line 78 and line 127 of the script. In my version of the nav file the header character is @ instead of #, so I changed this. Now the desired .map file is produced, but with only one column of numbers. I know I have short reads aligned, so I think I should have more columns. Could you please comment on this? Below I paste an example of my SAM output, which is different from the example in your script.

    Many thanks, Hans Jansen

    Code:
    @HD	VN:1.0	SO:unsorted
    @PG	ID:novoalign	PN:novoalign	VN:V2.08.02	CL:novoalign -r All -F FA -o SAM -d /mnt/scrap_disk/temp2/pseudochr_LR.fa.cps.nix -f /mnt/scrap_disk/temp2/SR.fa.ai.cps
    @SQ	SN:Pac1	AS:pseudochr_LR.fa.cps.nix	LN:50000716
    @SQ	SN:Pac2	AS:pseudochr_LR.fa.cps.nix	LN:50000772
    @SQ	SN:Pac3	AS:pseudochr_LR.fa.cps.nix	LN:50002188
    @SQ	SN:Pac4	AS:pseudochr_LR.fa.cps.nix	LN:50000094
    @SQ	SN:Pac5	AS:pseudochr_LR.fa.cps.nix	LN:50001433
    @SQ	SN:Pac6	AS:pseudochr_LR.fa.cps.nix	LN:50001526
    @SQ	SN:Pac7	AS:pseudochr_LR.fa.cps.nix	LN:50001210
    @SQ	SN:Pac8	AS:pseudochr_LR.fa.cps.nix	LN:50000056
    @SQ	SN:Pac9	AS:pseudochr_LR.fa.cps.nix	LN:50000143
    @SQ	SN:Pac10	AS:pseudochr_LR.fa.cps.nix	LN:50002588
    @SQ	SN:Pac11	AS:pseudochr_LR.fa.cps.nix	LN:50001867
    @SQ	SN:Pac12	AS:pseudochr_LR.fa.cps.nix	LN:50000245
    @SQ	SN:Pac13	AS:pseudochr_LR.fa.cps.nix	LN:28473695
    ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493	16	Pac10	21159148	3	8S67M19S	*	0	0	GAGTATACTCTCATCACATCAGTCAGAGCTGAGAGCTCTGATGAGAGTGACGTCTCAGACAGAGTCAGTGCTCTGATAGCTGACAGTGAGATAG	*	PG:Z:novoalign	AS:i:242	UQ:i:242	NM:i:0	MD:Z:67	CC:Z:Pac2	CP:i:6239884	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:1	IH:i:2
    ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493	256	Pac2	6239884	3	14S72M8S	*	0	0	CTATCTCACTGTCAGCTATCAGAGCACTGACTCTGTCTGAGACGTCACTCTCATCAGAGCTCTCAGCTCTGACTGATGTGATGAGAGTATACTC	*	PG:Z:novoalign	AS:i:242	UQ:i:242	NM:i:1	MD:Z:32G39	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:2	IH:i:2
    ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6754:2491	4	*	0	0	*	*	0	0	CTCTATATCATGACGAGCATGTACTATACATAGCTGTGCAGCATCTAGAGTGTATCAGAGCACACAC	*	PG:Z:novoalign	ZS:Z:NM
    ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6775:2489	4	*	0	0	*	*	0	0	AGTATATCTAGCATAGCTAGCACTCACTGTCATCTGTCATACATACTATATATATGTATATAGCTCTCTGAGCTAGACTGAGACTCTGATCAGACATCATGTATGAGATGTG	*	PG:Z:novoalign	ZS:Z:NM
    ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6822:2491	4	*	0	0	*	*	0	0	TGATACTATAGTGAGAGATACTACATGATATCACTGCTCTCTG	*	PG:Z:novoalign	ZS:Z:NM
    ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483	0	Pac10	4706429	2	28M1I16M1I29M1S	*	0	0	ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT	*	PG:Z:novoalign	AS:i:122	UQ:i:122	NM:i:2	MD:Z:73	CC:Z:Pac7	CP:i:31267643	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:1	IH:i:2
    ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483	256	Pac7	31267643	2	6M1I21M1I47M	*	0	0	ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT	*	PG:Z:novoalign	AS:i:122	UQ:i:122	NM:i:3	MD:Z:43G30	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:2	IH:i:2
    Last edited by ZFHans; 08-07-2012, 04:33 AM.
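
For readers hitting the same header mismatch: in SAM output, header lines start with @ and every alignment record has at least 11 tab-separated columns. Below is a minimal Python sketch of reading such a file. It is not the code in convertNAV.py; the names `Alignment` and `parse_sam` are made up for illustration, and only a few of the mandatory SAM fields are kept.

```python
from typing import Iterable, Iterator, NamedTuple


class Alignment(NamedTuple):
    """A few of the 11 mandatory SAM columns (hypothetical container)."""
    read_name: str
    flag: int
    ref_name: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    seq: str


def parse_sam(lines: Iterable[str]) -> Iterator[Alignment]:
    """Yield mapped alignments, skipping '@' header lines and unmapped reads."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # blank line or SAM header (@HD, @PG, @SQ, ...)
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # flag bit 0x4 means the read is unmapped
        yield Alignment(
            read_name=fields[0],
            flag=flag,
            ref_name=fields[2],
            pos=int(fields[3]),
            mapq=int(fields[4]),
            cigar=fields[5],
            seq=fields[9],
        )
```

Applied to the output pasted above, this would skip the @HD, @PG and @SQ lines and the three reads with flag 4, and yield the four mapped records.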


    • #3
      Hello,

      I just tried LSC, and it looks good;
      some remarks:

      - Input can't be FASTQ; at first I tried to run with FASTQ but ran into weird errors and crashes.
      - It can't handle multi-line FASTA for PacBio long reads?
      - documentation:
      "full_LR_SR.map.fa
      Although the terminus sequences are corrected, they are concatenated with their corrected sequence (corrected_LR_SR.map.fa) to be a "full" sequence. Thus, this sequence covers the equivalent length as the raw read and is outputted in the file full_LR_SR.map.fa"
      I think this should be:
      "Although the terminus sequences are uncorrected..."

      - I don't understand this option:
      "Remove PacBio tails sub reads?"

      I use filtered PacBio subreads as input, and the FASTQ header "...s1_p0/16/3441_3479" means that this subread originates from the raw read at position 3441-3479 and is 38 bp long. Are these tails already removed by secondary filtering? (These tails might have ended up in another subread.)
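
Two of the remarks above lend themselves to a small Python sketch: turning FASTQ into single-line FASTA records, and reading the start/end coordinates out of a subread name. This is only an illustration, not part of LSC. The function names are hypothetical, the FASTQ reader assumes plain 4-line records (no wrapped sequences), and the name parser assumes the name ends in "/<hole>/<start>_<end>" as in the header quoted above.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def fastq_to_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) from 4-line FASTQ records (assumed unwrapped)."""
    it = iter(handle)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (an incomplete trailing record is dropped)
        header, seq = record[0].rstrip(), record[1].rstrip()
        if not header.startswith("@"):
            raise ValueError("expected '@' at start of FASTQ header: " + header)
        yield header[1:], seq


def parse_subread_name(name: str) -> Tuple[str, int, int, int, int]:
    """Split '<prefix>/<hole>/<start>_<end>' into parts plus the length."""
    prefix, hole, span = name.rsplit("/", 2)
    start, end = (int(x) for x in span.split("_"))
    return prefix, int(hole), start, end, end - start


if __name__ == "__main__":
    # Name copied from the post, elision included.
    print(parse_subread_name("...s1_p0/16/3441_3479"))
```

Writing each pair as ">name" followed by the sequence on one line gives a single-line FASTA file.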


      • #4
        Hi Hans,

        The options given for Novoalign are all wrong.

        If you are using Novoalign with LSC you need to use -r Exhaustive not -r All. This will improve results dramatically and should be better than the other aligners.

        For 100bp reads try

        novoalign_options = -c1 -r Ex 1000 -t 120 -F FA -g 0 -x 20 -o sam

        If the read length is different, adjust -t accordingly. On the cerebellum reads used by Kin Fai, results are 40% better with 30% shorter run time vs. BWA, using LSC 0.3.1.


        Best, Colin


        • #5
          Hi all,
          I have a question for LSC users. I have 5x coverage with PacBio reads of a ~2 Gb genome and something like 100x coverage with Illumina reads that I want to try to use for correcting the long reads.
          The problem is that the PacBio data is more than 2^32 bases, so bowtie/blasr indexing does not work for it and I can't run the alignment of the short reads. I am wondering if anyone knows how to work around this.
          Thanks a lot!
          Ilya
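
To put numbers on this, here is a quick back-of-the-envelope calculation in Python, using only the figures from the post. The per-index limit of 2^32 - 1 characters is an assumption based on the 2^32 figure mentioned above; the actual limit depends on the aligner and version.

```python
import math

genome_size = 2_000_000_000  # ~2 Gb genome, from the post
coverage = 5                 # 5x PacBio coverage, from the post
index_limit = 2**32 - 1      # assumed per-index limit on reference characters

total_bases = genome_size * coverage
chunks_needed = math.ceil(total_bases / index_limit)

print(total_bases)                # 10000000000
print(total_bases > index_limit)  # True
print(chunks_needed)              # 3
```

So roughly 10 billion bases of long reads would need at least three separate indexes under that assumption.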


          • #6
            Use Novoalign or split your PacBio reads into multiple indexes.


            • #7
              Thanks for the answer, this sounds helpful.
              Originally posted by sparks:
              Use Novoalign or split your PacBio reads into multiple indexes.
              Is it possible to use the LSC pipeline in a way that it splits the data into multiple chunks, indexes them separately with e.g. bowtie, and then performs the alignment?
              Thanks a lot for the help!


              • #8
                I think you could split the PacBio files with a short perl script. Try and keep subreads together. You then run LSC on each set of PacBio reads and later combine the corrected reads.
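
The splitting step could look roughly like the sketch below, written in Python rather than Perl. It is an illustration under stated assumptions, not a tested part of LSC: subread names are assumed to follow "<movie>/<hole>/<start>_<end>", subreads from the same hole are assumed to be adjacent in the file, and all function names are hypothetical.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def zmw_key(header: str) -> str:
    """Drop the trailing '<start>_<end>' field (assumed naming scheme)."""
    name = header.split(" ", 1)[0]
    return name.rsplit("/", 1)[0]


def split_by_bases(records: Iterable[Record], max_bases: int) -> Iterator[List[Record]]:
    """Group records into chunks of at most max_bases without splitting a hole.

    A chunk exceeds max_bases only if the subreads of one hole alone do.
    """
    chunk, chunk_bases = [], 0
    for _, group in groupby(records, key=lambda rec: zmw_key(rec[0])):
        group = list(group)
        group_bases = sum(len(seq) for _, seq in group)
        if chunk and chunk_bases + group_bases > max_bases:
            yield chunk
            chunk, chunk_bases = [], 0
        chunk.extend(group)
        chunk_bases += group_bases
    if chunk:
        yield chunk
```

Each chunk would then be written to its own FASTA file and run through LSC separately; since FASTA files can simply be concatenated, combining the corrected reads afterwards should amount to joining the per-chunk output files.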


                • #9
                  LSC

                  Hi, when I run LSC with bowtie2, the index building step fails with an error like:
                  Error: Reference sequence has more than 2^32-1 characters! Please divide the
                  reference into batches or chunks of about 3.6 billion characters or less each
                  and index each independently.

                  So can anyone tell me how to solve it? Or can I split the PacBio reads file into many small files and run LSC separately on each?

                  Thank you!
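
Before splitting, it can help to check how far over the limit a reads file actually is. The Python sketch below counts sequence characters in a FASTA file and compares the total with the 2^32 - 1 figure from the error message. The function name and the file name in the usage note are hypothetical.

```python
from typing import Iterable

BOWTIE2_LIMIT = 2**32 - 1  # limit quoted in the error message above


def count_bases(lines: Iterable[str]) -> int:
    """Count sequence characters in FASTA text, ignoring headers and blanks."""
    total = 0
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def exceeds_limit(n_bases: int, limit: int = BOWTIE2_LIMIT) -> bool:
    """True if a reference of n_bases characters is over the given limit."""
    return n_bases > limit
```

For example, `with open("LR.fa") as fh: print(exceeds_limit(count_bases(fh)))` would report whether a long-read file named LR.fa (a placeholder name) is too large for a single index.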


                  • #10
                    We have just released a new program in the Novo* suite that facilitates PacBio read error correction with Novoalign. We used this strategy, which is similar to LSC, to assemble the first draft of the pineapple chloroplast genome with our academic collaborators. It would be great to get some feedback on this from the wider group. We're calling it NovoCorrectorLR.


                    • #11
                      Hello, I am using LSC 2.0, and both inputs are FASTQ files. However, it returns "===batch count:===
                      Work will begin on 0 batches." and I do not know how to run it correctly.


                      • #12
                        It always fails with "===batch count:===
                        Work will begin on 0 batches." Also, the manual is too simple.
