  • #16
    Originally posted by shanebrubaker
    Hi, I am also very interested in LSC. I would like to see the paper and manual if they are available.

    Does anyone have time comparisons of LSC vs. PacBioToCA vs. SmrtPipe?
    The paper was accepted last week and is off to print now. The preprint should be on the homepage next week.


    • #17
      Originally posted by shanebrubaker
      I also noticed that you say it corrects the reads to a 5% error rate, but the Schatz work seems to mention a 0.1% error rate. Is there a reason for that?

      Thanks,
      Shane
      In the paper, you will see the error rate go down to <1% when you have enough short-read (SGS) coverage. In regions without any short-read coverage, the error rate stays high, which pulls the average accuracy over the whole data set down. Thus, the more short reads you have, the better LSC performs.
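To see how a modest fraction of uncovered sequence can dominate the average, here is a minimal sketch of the base-weighted mean. The coverage fraction and both error rates are made-up illustrative numbers, not values taken from the LSC paper.

```python
def average_error_rate(covered_fraction, covered_error, uncovered_error):
    """Base-weighted mean error rate over covered and uncovered regions."""
    if not 0.0 <= covered_fraction <= 1.0:
        raise ValueError("covered_fraction must be between 0 and 1")
    uncovered_fraction = 1.0 - covered_fraction
    return covered_fraction * covered_error + uncovered_fraction * uncovered_error


# Hypothetical inputs: 75% of long-read bases have short-read coverage and are
# corrected to 1% error; the remaining 25% keep an assumed 15% error rate.
overall = average_error_rate(
    covered_fraction=0.75, covered_error=0.01, uncovered_error=0.15
)
print(f"overall error rate: {overall:.3f}")
```

With these assumed inputs the uncovered quarter contributes most of the remaining error, which is the effect described above.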


      • #18
        Hi Shane,

        At the moment I am trying to compare LSC vs. PacBioToCA and, if time and hardware permit, SmrtPipe. I noticed that PacBioToCA reduces the dataset from 1 GB to around 400 MB. I haven't checked the error rate yet.
        We have 30x coverage in short reads.
        I am still struggling with LSC 0.2.1. It runs fine on a test set, but when it runs on the whole short read set (49 GB) I get a read error in awk: "awk: read error (Bad address)". It does not consistently occur at the same point in the file: sometimes it happens after processing 50 MB, and once it reached 30 GB. I computed md5 checksums of the file several times and they are the same. Any suggestions are appreciated.

        Thanks,

        Hans Jansen


        • #19
          LSC,

          I would really like to use your software but it is extremely difficult to do so given the complete lack of documentation. The 'How it works?', 'Tutorial', 'Manual' and 'Filters' links on the website are dead. The 'FAQ' has just one line referring to SpliceMap. There isn't even a README file. Yes, I can run the program, but without any documentation I have no idea whether my results are correct or meaningful.

          I installed LSC and ran it against a PacBio long read data set consisting of 100,000 reads, totaling 38 Mbp. My short read set is 20 million 100 bp Illumina reads. I ran the program with default parameters and the output is 3 files: full_LR_SR.map.fa, uncorrected_LR_SR.map.fa and corrected_LR_SR.map.fa. Each file contains ~30,000 reads; the full file contains ~15 Mbp and the other two ~8 Mbp each.

          What am I to make of these files? Does this output sound normal? Which output file is useful for further analysis?
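For anyone wanting to reproduce read and base counts like the ones quoted above, a minimal sketch follows. The helper name `fasta_stats` is hypothetical and not part of LSC; it only counts records and bases and says nothing about which output file is the right one to use.

```python
def fasta_stats(lines):
    """Return (number of reads, total bases) for an iterable of FASTA lines."""
    n_reads = 0
    n_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_reads += 1  # each header line starts a new read
        else:
            n_bases += len(line)  # sequence may be wrapped over several lines
    return n_reads, n_bases


# In-memory demo; for a real file, pass an open file handle instead, e.g.
#   with open("corrected_LR_SR.map.fa") as handle:
#       print(fasta_stats(handle))
demo = [">read1", "ACGTAC", "GT", ">read2", "GGCC"]
reads, bases = fasta_stats(demo)
print(reads, bases)
```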


          • #20
            I am not sure which journal has accepted this paper, but is it not possible by now to post something (perhaps a provisional PDF) at the link ("how it works") that was included in a previous post (and is still showing a "Page not found" error)?

            Other links (http://www-stat.stanford.edu/~kinfai/LSC.html) appear to lead to a "Not found" error. This one is not working either (http://www-stat.stanford.edu/~kinfai/LSC_download.html).


            • #21
              Sorry for the incompleteness of the website. I have been pulled into an emergency project, so I have to postpone the release of the documentation. I hope to have time to finish the manual in a week or so. The paper is on the homepage now. Sorry again for the inconvenience.

              Originally posted by kmcarr
              LSC,

              I would really like to use your software but it is extremely difficult to do so given the complete lack of documentation. The 'How it works?', 'Tutorial', 'Manual' and 'Filters' links on the website are dead. The 'FAQ' has just one line referring to SpliceMap. There isn't even a README file. Yes, I can run the program, but without any documentation I have no idea whether my results are correct or meaningful.

              I installed LSC and ran it against a PacBio long read data set consisting of 100,000 reads, totaling 38 Mbp. My short read set is 20 million 100 bp Illumina reads. I ran the program with default parameters and the output is 3 files: full_LR_SR.map.fa, uncorrected_LR_SR.map.fa and corrected_LR_SR.map.fa. Each file contains ~30,000 reads; the full file contains ~15 Mbp and the other two ~8 Mbp each.

              What am I to make of these files? Does this output sound normal? Which output file is useful for further analysis?


              • #22
                Has anyone had any success in running LSC to correct PacBio data from Gb-sized genomes? I am currently using it to try to correct 6x coverage of a >2 Gb genome with 30x SR data. At the moment it is in the alignment stage on 40 CPUs, but I am finding it difficult to gauge how long the alignment could take.


                • #23
                  LSC: beware the dinucleotide repeats

                  I am working with a ~1Gb genome and using 40X coverage of mer-trimmed Illumina reads. A test run on 100Mb of PacBio sequence took almost 10 days to complete on 40 cpus. As you know, LSC sorts the Illumina reads by sequence, then normalizes the data with "uniq", then splits the reads into several SR.fa.*.cps files according to the number of cpus. Each sub-file is aligned to the PacBio reads in parallel. What I learned in this test run was that 'sort' grouped reads that contained classes of dinucleotide repeats. Thus the split resulted in a few sub-files that were quite rich in CA repeats, GT repeats, etc. Those files required a few more days to complete the Novoalign step while the rest of the cpus sat idle.

                  Next time, I would run a small test set of PacBio reads with SR_uniq.fa and copy the .cps subfiles to a new directory as soon as they are produced, then terminate runLSC. Let's say, hypothetically, that I used 48 cpus and sort/uniq/split resulted in four files that were rich in CA, GT, CT, and GA repeats. I would cat the 44 non-repetitive files, then re-split them into 48 subfiles. Then I'd split each of the four repeat-rich files into 48 subfiles and add them to the non-repetitive files. I'd cat these into a single, new SR_uniq.fa file. The result should be that when LSC runs afresh on the new SR_uniq.fa, the repetitive reads are distributed evenly among the 48 subfiles. That approach relies on only a rough estimate of where the repetitive sequences sit in the original file, and it is inelegant because of my lack of programming skill, but perhaps someone more skilled could find a way to automate the process.
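One possible way to automate the redistribution described above is a stride reorder: write out every n-th read together, so that a later contiguous split into n pieces spreads neighbouring (sorted, repeat-rich) reads across all pieces. This is a sketch of that idea only, not LSC's own code. The function names are hypothetical, it holds all reads in memory, and it assumes LSC splits the file contiguously and does not re-sort it, which is not confirmed in this thread.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA lines; sequences may wrap."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def stride_reorder(records, n_chunks):
    """Reorder so records i, i+n, i+2n, ... end up next to each other."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    return [rec for k in range(n_chunks) for rec in records[k::n_chunks]]


def contiguous_split(records, n_chunks):
    """Split into consecutive pieces (stand-in for a plain line-based split)."""
    size = -(-len(records) // n_chunks)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


# Demo: six sorted reads where the first three are CA-rich.
sorted_reads = list(parse_fasta(
    [">r0", "CACACA", ">r1", "CACACA", ">r2", "CACACC",
     ">r3", "GATTAC", ">r4", "TTGACC", ">r5", "TTGGAA"]
))
reordered = stride_reorder(sorted_reads, 3)
for piece in contiguous_split(reordered, 3):
    print([header for header, _ in piece])
```

In the demo each of the three pieces receives exactly one of the CA-rich reads instead of one piece receiving most of them. When the read count is not a multiple of the number of pieces, the pieces line up only approximately, but the repeats are still spread out.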


                  • #24
                    Thanks for the information. I would be interested to know how you get on with your second attempt. Out of interest, did the nature of your data set allow you to evaluate the corrected reads from your first test?


                    • #25
                      Hello!
                      The names of PacBio long reads must be in the format of the following example: ">m111006_202713_42141_c100202382555500000315044810141104_s1_p0/16/3441_3479".
                      The last two numbers (3441 and 3479 in this example) are the positions of the sub reads.

                      However, my new PacBio data doesn't contain the last two numbers.

                      Example:
                      >m120627_142215_42149_c100335932550000001523020209201251_s1_p0/7
                      >m120627_142215_42149_c100335932550000001523020209201251_s1_p0/9

                      How can I get the last two numbers (the subread positions, 3441 and 3479 in the example above)?
                      thanks


                      • #26
                        Reads in the correct form for LSC are the result of filtering and trimming by smrtpipe. Your read IDs look like those from raw reads before filtering.
                        Last edited by flxlex; 01-24-2013, 04:23 AM. Reason: Clarification
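A quick way to check which form a set of read IDs is in is to test them against the pattern shown in the example above (movie name, hole number, then start_end). The regular expression below is inferred from the two examples in this thread only and the helper name is hypothetical; it is not taken from LSC or smrtpipe.

```python
import re

# movie/hole/start_end, with an optional leading ">" from the FASTA header.
SUBREAD_ID = re.compile(
    r"^>?(?P<movie>m[^/\s]+)/(?P<hole>\d+)/(?P<start>\d+)_(?P<end>\d+)$"
)


def parse_subread_id(header):
    """Return (movie, hole, start, end), or None if positions are missing."""
    fields = header.split()
    if not fields:
        return None
    match = SUBREAD_ID.match(fields[0])  # ignore any description after the ID
    if match is None:
        return None
    return (
        match.group("movie"),
        int(match.group("hole")),
        int(match.group("start")),
        int(match.group("end")),
    )


filtered = ">m111006_202713_42141_c100202382555500000315044810141104_s1_p0/16/3441_3479"
raw = ">m120627_142215_42149_c100335932550000001523020209201251_s1_p0/7"
print(parse_subread_id(filtered))
print(parse_subread_id(raw))
```

IDs for which the helper returns None, like the second one, lack the subread positions.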


                        • #27
                          Has anyone experienced the following error when getting to the writetmp.py stage of the pipeline?

                          Traceback (most recent call last):
                            File "/home/stby/bin/writetmp.py", line 57, in ?
                              SR_cps_dict[readname] = line.strip()
                          MemoryError


                          I do have over 400 GB of memory available.


                          • #28
                            Originally posted by SLB
                            Has anyone experienced the following error when getting to the writetmp.py stage of the pipeline?

                            Traceback (most recent call last):
                              File "/home/stby/bin/writetmp.py", line 57, in ?
                                SR_cps_dict[readname] = line.strip()
                            MemoryError


                            I do have over 400 GB of memory available.
                            Problem solved. It was an issue with the Python version. Although I had specified a newer installation of Python in the runLSC.py script, when it called the writetmp.py script the default Python path pointed to an older version. Something to bear in mind if there are multiple installations of Python on your system.
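As a general illustration of how a wrapper script can avoid this kind of mismatch, the sketch below launches a child Python process with the same interpreter that is running the parent, using `sys.executable` instead of whatever `python` is first on the PATH. This is not how runLSC.py is written; the helper name is hypothetical and the example needs Python 3.7 or newer.

```python
import subprocess
import sys


def run_with_same_interpreter(script_args):
    """Run a Python command line with the interpreter running this code."""
    command = [sys.executable] + list(script_args)
    return subprocess.run(command, capture_output=True, text=True, check=True)


# Ask the child process which Python version it is running under.
result = run_with_same_interpreter(
    ["-c", "import sys; print('.'.join(map(str, sys.version_info[:3])))"]
)
child_version = result.stdout.strip()
parent_version = ".".join(map(str, sys.version_info[:3]))
print("parent:", parent_version, "child:", child_version)
```

Printing the version at the start of each stage, as in the demo, is also a cheap way to spot an unexpected interpreter early.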


                            • #29
                              Paired-End files

                              Hello,

                              I can't seem to find the instructions for Illumina paired-end reads. Should I first combine the two files, or is there a way to list both files in the .cfg file?

                              Also, is Novoalign still required to run LSC?

                              Thanks,


                              WJ


                              • #30
                                I wonder if there is a version of LSC for Mac users.
                                Thanks
