
  • LSC - a fast PacBio long read error correction tool.

    Hello SEQanswers Community,

    Further Info: http://www.stanford.edu/~kinfai/LSC/LSC.html

    We at the Wong Lab have developed a new tool for error correction of PacBio data. It has been shown to be very sensitive and can improve PacBio reads to a 5% error rate. It is also fast: correcting ~200k subreads takes about 10 hours in total with 8 threads, and it needs only 10-15 GB of hard disk space for temporary files.

    In its current form it supports PacBio reads and short reads from any NGS platform. In the current version, you may need novoalign (the single-threaded version is free for the academic community).

    It is designed for the Linux platform. If you use another platform, leave us a note and we'll see what we can do.

    Instructions are on the website, which is not perfect yet; if you have any trouble, don't hesitate to leave me a note here.

    Please give it a try and let me know if you have any issues. We are actively developing this tool, so we welcome all of your comments and concerns! In particular, we are trying to replace novoalign with another aligner, which could cut the running time by 50% (to 5 hours in the example above). Ease of use is very important to us, so let us know if anything annoys you.
    Last edited by LSC; 07-30-2012, 01:10 PM.

  • #2
    I thought to have a look at the paper mentioned as a preprint, but the link (http://www.stanford.edu/~kinfai/LSC/LSC.pdf) returns a 'Page not found' error...



    • #3
      Sorry, the paper is still in review and I have only just set up the website. I will fix the problem soon.



      • #4
        Hi LSC,

        I'm trying to improve a 1.6 GB genome with PacBio data. Celera read correction is slow, so I welcome your effort. I am trying to run LSC 0.2 but have run into problems. First, in some of the scripts that make up LSC the first line is #!/home/stow/swtree/bin/python2.6. Changing this to #!/usr/bin/python got rid of some error messages.
        Second, I installed novoalign v2.08 as suggested to do the alignments. In the runLSC.py script the aligner is called with no option for the output format, so novoalign produces its native format. The next script, however, expects (I assume) the SAM format. So I added -o SAM to the option list in line 207 of runLSC.py (I also had to add the path to novoalign because it would not run otherwise), which got me to the next problem, in convertNAV.py. At lines 78 and 127 this script looks at the first character of each line in the nav file. In my version of the nav file the header character is @ instead of #, so I changed this. Now the desired .map file is produced, but with only one column of numbers. I know I have short reads aligned, so I think I should have more columns. Could you please comment on this? Below I paste an example of my SAM output, which differs from the example in your script.


        Code:
        @HD	VN:1.0	SO:unsorted
        @PG	ID:novoalign	PN:novoalign	VN:V2.08.02	CL:novoalign -r All -F FA -o SAM -d /mnt/scrap_disk/temp2/pseudochr_LR.fa.cps.nix -f /mnt/scrap_disk/temp2/SR.fa.ai.cps
        @SQ	SN:Pac1	AS:pseudochr_LR.fa.cps.nix	LN:50000716
        @SQ	SN:Pac2	AS:pseudochr_LR.fa.cps.nix	LN:50000772
        @SQ	SN:Pac3	AS:pseudochr_LR.fa.cps.nix	LN:50002188
        @SQ	SN:Pac4	AS:pseudochr_LR.fa.cps.nix	LN:50000094
        @SQ	SN:Pac5	AS:pseudochr_LR.fa.cps.nix	LN:50001433
        @SQ	SN:Pac6	AS:pseudochr_LR.fa.cps.nix	LN:50001526
        @SQ	SN:Pac7	AS:pseudochr_LR.fa.cps.nix	LN:50001210
        @SQ	SN:Pac8	AS:pseudochr_LR.fa.cps.nix	LN:50000056
        @SQ	SN:Pac9	AS:pseudochr_LR.fa.cps.nix	LN:50000143
        @SQ	SN:Pac10	AS:pseudochr_LR.fa.cps.nix	LN:50002588
        @SQ	SN:Pac11	AS:pseudochr_LR.fa.cps.nix	LN:50001867
        @SQ	SN:Pac12	AS:pseudochr_LR.fa.cps.nix	LN:50000245
        @SQ	SN:Pac13	AS:pseudochr_LR.fa.cps.nix	LN:28473695
        ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493	16	Pac10	21159148	3	8S67M19S	*	0	0	GAGTATACTCTCATCACATCAGTCAGAGCTGAGAGCTCTGATGAGAGTGACGTCTCAGACAGAGTCAGTGCTCTGATAGCTGACAGTGAGATAG	*	PG:Z:novoalign	AS:i:242	UQ:i:242	NM:i:0	MD:Z:67	CC:Z:Pac2	CP:i:6239884	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:1	IH:i:2
        ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493	256	Pac2	6239884	3	14S72M8S	*	0	0	CTATCTCACTGTCAGCTATCAGAGCACTGACTCTGTCTGAGACGTCACTCTCATCAGAGCTCTCAGCTCTGACTGATGTGATGAGAGTATACTC	*	PG:Z:novoalign	AS:i:242	UQ:i:242	NM:i:1	MD:Z:32G39	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:2	IH:i:2
        ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6754:2491	4	*	0	0	*	*	0	0	CTCTATATCATGACGAGCATGTACTATACATAGCTGTGCAGCATCTAGAGTGTATCAGAGCACACAC	*	PG:Z:novoalign	ZS:Z:NM
        ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6775:2489	4	*	0	0	*	*	0	0	AGTATATCTAGCATAGCTAGCACTCACTGTCATCTGTCATACATACTATATATATGTATATAGCTCTCTGAGCTAGACTGAGACTCTGATCAGACATCATGTATGAGATGTG	*	PG:Z:novoalign	ZS:Z:NM
        ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6822:2491	4	*	0	0	*	*	0	0	TGATACTATAGTGAGAGATACTACATGATATCACTGCTCTCTG	*	PG:Z:novoalign	ZS:Z:NM
        ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483	0	Pac10	4706429	2	28M1I16M1I29M1S	*	0	0	ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT	*	PG:Z:novoalign	AS:i:122	UQ:i:122	NM:i:2	MD:Z:73	CC:Z:Pac7	CP:i:31267643	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:1	IH:i:2
        ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483	256	Pac7	31267643	2	6M1I21M1I47M	*	0	0	ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT	*	PG:Z:novoalign	AS:i:122	UQ:i:122	NM:i:3	MD:Z:43G30	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:2	IH:i:2
        Could we combine this thread with the same one in the bioinformatics section?

        Many thanks, Hans Jansen
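
For anyone hitting the same interpreter-path errors, here is a minimal sketch of rewriting the first line of a script from Python instead of editing each file by hand. The names fix_shebang and fix_scripts are made up for illustration and are not part of LSC; the default interpreter path is simply the one used in the post above, so adjust it for your system.

```python
import re
from pathlib import Path

# Matches a shebang line that names a Python interpreter, e.g.
# "#!/home/stow/swtree/bin/python2.6".
PYTHON_SHEBANG = re.compile(r"^#!.*python[0-9.]*\s*$")


def fix_shebang(source, interpreter="/usr/bin/python"):
    """Return `source` with a Python shebang on its first line replaced.

    Text whose first line is not a Python shebang is returned unchanged.
    """
    first, sep, rest = source.partition("\n")
    if PYTHON_SHEBANG.match(first):
        return "#!" + interpreter + sep + rest
    return source


def fix_scripts(bin_dir, interpreter="/usr/bin/python"):
    """Apply fix_shebang to every .py file directly inside `bin_dir` (hypothetical helper)."""
    for path in Path(bin_dir).glob("*.py"):
        path.write_text(fix_shebang(path.read_text(), interpreter))
```

Calling fix_scripts on a copy of the LSC bin directory first would be the cautious choice, since the files are rewritten in place.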



        • #5
          I see the paper is not available but I clicked "How it Works" and that link is also broken. Can you fix?

          Link: http://www.stanford.edu/~kinfai/LSC/LSC_howitworks.html



          • #6
            The paper is in review now (almost the final round of revision). I will post it once the final revision is submitted. Sorry for the inconvenience.



            • #7
              Hi Hans Jansen,
              Your feedback is really helpful. Although LSC works well on my computer cluster now, I know something may go wrong when it is run on other systems. Your test is a great example that helps me find the bug.
              1) Your change of the Python path is correct. I will fix it in the coming version.
              2) LSC uses novoalign's native output format instead of the SAM format. Please don't change it; please try again with my original native-format setting. In addition, if BWA or bowtie2 could output ALL possible mappable hits, LSC could save over 50% of its running time by using them (novoalign is somewhat slow). Do you know of any way to make BWA and bowtie2 output all hits (including detailed indel information)?



              • #8
                Hi LSC,

                Thanks for your reply. I'll try the native format again, but could you tell me which version of novoalign you used? Could it be that Novocraft changed something in their native format?

                Thanks,

                Hans



                • #9
                  novoalign (V2.07.10) works well in LSC.



                  • #10
                    Hi LSC,

                    Thanks for your quick reply. If my current run with 2.08 fails, I'll try 2.07.

                    In the meantime I looked at the bowtie2 manual http://bowtie-bio.sourceforge.net/bo...all-alignments and found this mode:

                    -a mode: search for and report all alignments

                    -a mode is similar to -k mode except that there is no upper limit on the number of alignments Bowtie 2 should report. Alignments are reported in descending order by alignment score. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS field. See the SAM specification for details.

                    Some tools are designed with this reporting mode in mind. Bowtie 2 is not! For very large genomes, this mode is very slow.
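
The flag values mentioned in that manual excerpt can be checked with a few bit tests. Below is a minimal sketch, assuming tab-separated SAM records like the ones pasted earlier in this thread. The function name describe_sam_line is hypothetical; the bit values (4, 16, 256) are the ones defined in the SAM specification.

```python
# Flag bits defined by the SAM specification.
FLAG_UNMAPPED = 0x4     # 4: the read has no reported alignment
FLAG_REVERSE = 0x10     # 16: the read aligned to the reverse strand
FLAG_SECONDARY = 0x100  # 256: secondary alignment


def describe_sam_line(line):
    """Return (read_name, reference, labels) for one SAM record.

    Header lines (starting with '@') yield None.
    """
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    name, flag, ref = fields[0], int(fields[1]), fields[2]
    labels = []
    if flag & FLAG_UNMAPPED:
        labels.append("unmapped")
    if flag & FLAG_REVERSE:
        labels.append("reverse")
    if flag & FLAG_SECONDARY:
        labels.append("secondary")
    return name, ref, labels
```

Grouping the returned tuples by read name would give the number of hits reported per short read, which is the information an "all alignments" mode is meant to expose.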

                    Is this of any use to you?

                    Regards,

                    Hans



                    • #11
                      Hi LSC,

                      As it turned out, it was my mistake all along. I'm using Quake-corrected Illumina reads as SR input. The FASTA headers of these reads contain spaces, and that was causing problems in convertNAV.py. I fixed this by removing the spaces from the headers, and now convertNAV.py understands the Novocraft native output format. The pipeline now continues to correct_nonredundant.py but then gives the following error:
                      Traceback (most recent call last):
                        File "/usr/local/LSC_0.2/bin/correct_nonredundant.py", line 280, in <module>
                          n_rep = int(NSR.split('_')[1])
                      IndexError: list index out of range
                      This is probably still a problem of too many fields.
                      Could you indicate what the headers of the input files (both the LR and the SR) should look like?
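
For reference, the header clean-up described above can be done with a few lines of Python. This is a minimal sketch only: the function name sanitize_fasta_headers is made up for illustration, and the header layout that LSC itself expects is not stated in this thread, so the sketch only removes (or replaces) whitespace and makes no other changes.

```python
def sanitize_fasta_headers(lines, replacement=""):
    """Yield FASTA lines with whitespace removed from header lines.

    Header lines start with '>'. Each run of whitespace inside a header is
    replaced by `replacement` (the default "" deletes it). Sequence lines
    pass through unchanged. Header lines always end with a newline.
    """
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            yield ">" + replacement.join(words) + "\n"
        else:
            yield line


def sanitize_fasta_file(src_path, dst_path, replacement=""):
    """Write a cleaned copy of `src_path` to `dst_path` (hypothetical helper)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(sanitize_fasta_headers(src, replacement))
```

Deleting whitespace outright can make two different headers collide, so it may be worth checking that the cleaned names are still unique before running the aligner.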

                      Many thanks in advance,

                      Hans



                      • #12
                        Hi LSC,

                        I work with Hans Jansen, who installed the previous version.
                        Since the tutorial pages are not reachable, I was wondering how to install the newest version.
                        Or can I just copy the adjusted Python scripts to the existing folder?

                        Regards,
                        Nynke Tuinhof



                        • #13
                          Yes, you just need to copy the new scripts over the ones in the existing folder.



                          • #14
                            More Info on LSC

                            Hi, I am also very interested in LSC. I would like to see the paper and manual if they are available.

                            Does anyone have time comparisons of LSC vs. PacBioToCA vs. SmrtPipe?



                            • #15
                              I also noticed that you say it corrects the reads to a 5% error rate, but the Schatz work seems to mention a 0.1% error rate. Is there a reason for that?

                              Thanks,
                              Shane
