  • #16
    Originally posted by lletourn
    One thing that happened to us was that I forgot to change the LEN_BITS setting in the Celera sources, so any PacBio read longer than 2kb got thrown out. We had less than 0.1% of our reads after correction.
    Can you be a bit more specific?

    - Is this already fixed in the downloadable binaries?
    - Do you have a URL for this issue? (I can't find it on the wiki: http://sourceforge.net/apps/mediawik...tle=PacBioToCA)



    • #17
      Look at the 'Building from Source' section.
      In AS_global.h, you need to change
      #define AS_READ_MAX_NORMAL_LEN_BITS 11

      to something higher.

      There is a precompiled binary called wgs7.0-pacbio, but it segfaults here and I don't know what value they used.

      We just recompiled it.

      I actually don't know what they changed in that build at all. I don't think they'll make it the default soon, because a higher value there takes up more RAM. For people using only short reads, this might not be a good thing.
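To see why 11 bits matches the 2kb cutoff reported above, here is a minimal sketch. It assumes the read length is stored in an unsigned field of that many bits, which is consistent with the symptom but is not checked against the Celera sources; the function names are made up for illustration.

```python
def max_read_length(len_bits: int) -> int:
    """Largest length that fits in an unsigned field of len_bits bits (assumed layout)."""
    return (1 << len_bits) - 1


def bits_needed(read_length: int) -> int:
    """Smallest bit width whose field can hold read_length."""
    return max(1, read_length.bit_length())


# Compare a few candidate values for the #define.
for bits in (11, 13, 15):
    print(bits, max_read_length(bits))
```

With this assumption, 11 bits tops out at 2047 bases, so you would pick the smallest value that covers your longest expected read, e.g. `bits_needed(20000)` for 20kb reads.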



      • #18
        One of the checks I do if I'm running into low coverage in the returned PacBio reads is to do an assembly of just the reads I will eventually error correct with. That is, for your dataset, try to do an assembly using just the Illumina reads.

        Obviously you will likely have a lot more contigs than if you were assembling with error-corrected PacBio long reads, but a quick check is to look at the total size of the assembly. If the assembly with just your Illumina reads is ~100MB, then a first-pass assessment is that you have "good" coverage. If the assembly is much smaller than that, you may not have particularly even coverage. For instance, some areas might have extremely deep coverage and some no coverage at all. I've run into this issue before when I tried to do some complex filtering. Obviously this is not a very rigorous check, and there are other things that might explain your problem, but it can point you toward whether the problem is with the error correction pipeline or with the data you are feeding it.
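The size check described above can be sketched in a few lines. This is only a rough illustration: the function names and the 20% tolerance are assumptions, not anything from the pipeline.

```python
def fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def coverage_check(lines, expected_genome_size, tolerance=0.2):
    """Return (total assembled bases, ratio to expected size, passes?)."""
    total = sum(fasta_lengths(lines))
    ratio = total / expected_genome_size
    # The tolerance is an arbitrary illustrative threshold.
    return total, ratio, ratio >= 1 - tolerance


# Tiny in-memory example; in practice pass an open contig FASTA file handle.
example = [">contig1", "ACGT", "ACGT", ">contig2", "GG"]
print(coverage_check(example, expected_genome_size=10))
```

For a real run you would pass the open Illumina-only contig file and the expected genome size (about 100MB in this thread).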



        • #19
          I did an Illumina-only assembly, and that resulted in a 100MB contig file. And indeed, some areas have higher coverage than others (some areas go down to 1x).

          I am now testing with a single PacBio read and some Illumina reads that map to this read (mapping performed with CLCbio read mapping), and I see where it might go wrong:
          - I get a lot of small fixed PacBio reads (21 with my previously used settings)
          - With all the smallReads adjustments I got 17 new PacBio reads:
          (frgMinLen = 40
          ovlMinLen = 30
          merSize = 9)

          Most of the output reads are very small and no longer than a 454 read. Is it not possible to get the PacBio input read back, but fixed at the positions where possible?



          • #20
            I also have some questions about pacBioToCA. Mine is taking a long time as well. I have tried the spec file above.
            Is there any way to estimate how long it should take?
            Can anyone give me some pointers?
            Also, is it possible to collapse your Illumina reads somehow before correction? Would that help?

            Thanks,
            Shane



            • #21
              I finally read quite a bit on that. The conclusion, as they mention on the RunCA page, is that it depends on your hardware.

              The 3 worst steps are 0-overlaptrim-overlap, 1-overlapper and runPartition.

              If you're running on SGE, it might be a good idea to intentionally set a smaller ovlHashBlockLength (less RAM per job) to generate more jobs and split up the work.

              If you have a lot of cores per node and a lot of RAM, you could increase ovlHashBits, ovlThreads and ovlHashBlockLength.

              If it's a bacterium, though, don't set ovlHashBits too high; it's wasted RAM. The RunCA page explains how to guess the best value.

              Also, a warning: the maximum value for ovlHashBlockLength is 4G. Anything bigger can crash the process or give unexpected results.

              good luck
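As a rough way to reason about the trade-off above, the sketch below estimates how many hash blocks a dataset splits into for a given ovlHashBlockLength. It assumes the block length is counted in bases and that the number of blocks is simply total bases divided by block length; the real job count also depends on other overlapper settings. Reading "4G" as 4e9 is a conservative assumption, and the function name is hypothetical.

```python
# Assumed cap, taken from the "4G" warning above (interpreted conservatively as 4e9 bases).
MAX_HASH_BLOCK_LENGTH = 4 * 10**9


def hash_blocks(total_bases: int, ovl_hash_block_length: int) -> int:
    """Estimate how many hash blocks total_bases splits into (ceiling division)."""
    if ovl_hash_block_length <= 0:
        raise ValueError("ovlHashBlockLength must be positive")
    if ovl_hash_block_length > MAX_HASH_BLOCK_LENGTH:
        raise ValueError("ovlHashBlockLength above the assumed 4G cap")
    return (total_bases + ovl_hash_block_length - 1) // ovl_hash_block_length


# Smaller blocks mean more, lighter jobs; larger blocks mean fewer, heavier ones.
for block in (100_000_000, 1_000_000_000):
    print(block, hash_blocks(37_000_000_000, block))
```

On a cluster you would lean toward the smaller block length to spread the work over more jobs; on one large machine, toward the larger one.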



              • #22
                Hi, I am trying to run pacBioToCA. I am running a small test, with 10 PacBio sequences to correct, against a set of Illumina data that is 37GB. It has now been running for nearly 4 days on a 24-core machine, averaging a load of about 17. It appears to be nearing the end of ~1500 overlapInCore jobs. I used a spec file similar to the one above.

                Is it normal for this to take so long? Is there anyone who could help me try to speed this up? Thanks!



                • #23
                  @HenrivdGeest pacBioToCA v7 splits PacBio reads if there's a coverage gap somewhere. A coming release will keep the full read even if there's a portion without coverage.



                  • #24
                    Some input about error correction

                    We have been working with Sergey on PacBioToCA for some time. First, you can definitely have too much short-read coverage, especially with Illumina reads, where errors are non-random and can confuse the correction, leading to more than one version of a corrected read. That is, at some depth you can get enough copies of the same error to convince the correction routine that there are two different sequences.
                    Generally, no more than 50-70x coverage works better than higher coverage, so you should down-sample.
                    Second, we have had the best luck on microbial genomes using high cutoffs for read length on the PacBio data, usually 6kb or greater (although some strains have worked better with somewhat lower, and some with somewhat higher, cutoffs).
                    Third, until the current version (not sure it is even released yet), Sergey had not incorporated paired-end information into the routine. Each 100 or 150 base read was thus being used directly to try to correct, but mapping those short reads to the 15%-error reads was difficult. 454 data works much better, or CCS on PacBio. I understand that the 2x250 paired reads you can now do on the MiSeq kick ass for error correction when using the version that accounts for paired ends, but we haven't tried it yet as our MiSeq is just now getting the upgrade.
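For the down-sampling advice, a minimal sketch of the arithmetic and a simple random subsample follows. It assumes plain 4-line FASTQ records (no wrapped sequence lines); the function names are illustrative, not part of any tool mentioned in this thread.

```python
import random


def keep_fraction(total_bases: int, genome_size: int, target_coverage: float) -> float:
    """Fraction of reads to keep so that coverage drops to target_coverage."""
    current_coverage = total_bases / genome_size
    return min(1.0, target_coverage / current_coverage)


def downsample_fastq(lines, fraction: float, seed: int = 0):
    """Yield 4-line FASTQ records, keeping each with probability `fraction`."""
    rng = random.Random(seed)
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if rng.random() < fraction:
                yield tuple(record)
            record = []


# Example: 200x of reads on a 5 Mb genome, down-sampled to 50x.
fraction = keep_fraction(total_bases=1_000_000_000, genome_size=5_000_000, target_coverage=50)
print(fraction)
```

For paired-end files, running `downsample_fastq` over both files with the same seed keeps mates together, as long as both files list the pairs in the same order.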



                    • #25
                      Originally posted by jbingham
                      @HenrivdGeest pacBioToCA v7 splits PacBio reads if there's a coverage gap somewhere. A coming release will keep the full read even if there's a portion without coverage.
                      Indeed, I am using the current CVS release, and it has this option as -maxGap.
                      I set it to 300 to allow stretches of up to 300bp without any coverage. It did help: the median length of my fixed PacBio reads went up, but it's still at 800bp, although the PacBio input is about 2.5kb.
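To illustrate why a gap allowance raises the corrected read length, here is a toy model of splitting a read at uncovered stretches. It is not the pacBioToCA implementation; the per-base coverage list, the function name, and the choice to trim uncovered ends are all assumptions made for the example.

```python
def split_at_gaps(coverage, max_gap=0):
    """Return half-open (start, end) segments of a read.

    The read is split wherever a run of zero-coverage bases is longer
    than max_gap; uncovered bases at either end are trimmed.
    """
    segments = []
    start = None          # start of the segment being built
    last_covered = None   # index of the most recent covered base
    for i, depth in enumerate(coverage):
        if depth > 0:
            if start is None:
                start = i
            elif i - last_covered - 1 > max_gap:
                segments.append((start, last_covered + 1))
                start = i
            last_covered = i
    if start is not None:
        segments.append((start, last_covered + 1))
    return segments


# Per-base short-read depth along one hypothetical long read.
depth = [1, 1, 0, 0, 0, 1, 1, 0, 1]
print(split_at_gaps(depth, max_gap=0))
print(split_at_gaps(depth, max_gap=1))
```

In this toy model, raising `max_gap` merges neighbouring pieces, which is the effect reported above when the option was set to 300.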



                      • #26
                        Originally posted by tplsmith
                        We have been working with Sergey on PacBioToCA for some time. First, you can definitely have too much short-read coverage, especially with Illumina reads, where errors are non-random and can confuse the correction, leading to more than one version of a corrected read. That is, at some depth you can get enough copies of the same error to convince the correction routine that there are two different sequences.
                        Generally, no more than 50-70x coverage works better than higher coverage, so you should down-sample.
                        Second, we have had the best luck on microbial genomes using high cutoffs for read length on the PacBio data, usually 6kb or greater (although some strains have worked better with somewhat lower, and some with somewhat higher, cutoffs).
                        Third, until the current version (not sure it is even released yet), Sergey had not incorporated paired-end information into the routine. Each 100 or 150 base read was thus being used directly to try to correct, but mapping those short reads to the 15%-error reads was difficult. 454 data works much better, or CCS on PacBio. I understand that the 2x250 paired reads you can now do on the MiSeq kick ass for error correction when using the version that accounts for paired ends, but we haven't tried it yet as our MiSeq is just now getting the upgrade.
                        Thanks. I am indeed now using only 454 reads. I also have 3kb 454 paired-end data, so that might also help with the long PacBio reads.



                        • #27
                          I am wrestling with optimizing the pacbio.spec file too. I have a machine with 512 GB of RAM and 64 cores. What might be the optimum values for it? Some of the values I tried caused the pipeline to crash at the 0-overlaptrim-overlap step with too little memory. Also, is there some way to restart jobs from the failed stage only and not from the beginning?
                          Farhat Habib



                          • #28
                            Pacbio spec file for high-memory multi-core

                            Hi,

                            Does anybody have specific changes to make to the pacbio.spec file for high-memory multi-core machines? I have a machine with 48 cores and 128GB RAM.

                            I am using 50X of short Illumina data to correct PacBio reads at 30X coverage. I was able to run the pacBioToCA pipeline, but the problem is with the generated pacbio.frg file.

                            My Illumina data is 1.1GB and my PacBio data is 250MB. However, after the correction run the pacbio.frg file was only 750KB and the pacbio.fasta file was 400KB. Something is going wrong and I am not able to figure it out.

                            Any suggestions?

                            Thanks
                            Sagar



                            • #29
                              It may make sense to separate various components of PacBioToCA, run them separately on test files and then write your own optimized pipeline. We are looking into the possibility.

                              http://homolog.us



                              • #30
                                Does pacbioToCA correct raw reads independently?

                                For the same coverage in high-quality reads, will the amount of raw-read input affect the correction? In other words, does pacBioToCA correct each raw read independently?

                                I typically run with a single FASTQ from 2-4 filtered SMRT cells, with 50X-100X of high-quality correction reads, followed by assembly.
