
  • PBJelly novice needs advice

    I have a supernova assembly of 10x genomics data for which I also have 4 smrt cells of PacBio Sequel data. The general workflow of my efforts so far has been:

    supernova (using 10x genomics data)
    SSPACE-LongRead (using pacbio sequel data)
    GapFiller (using 10x genomics data)
    PBJelly (using pacbio sequel data)

    I saw steady improvement of the assembly up through GapFiller, but when I ran PBJelly at default settings the output seemed to be in worse shape than the input. Our guiding metrics were total assembly length (which we expect to be 400Mb) and BUSCO completeness. The GapFiller results looked good at 414Mb total length and 88.8% of core genes found by BUSCO. But the default PBJelly output grew to 550Mb and BUSCO completeness dropped to 82.8%.

    I then tried running PBJelly set to only do internal gap filling to address the overall length. It performed better with this argument set, but the assembly was still too long at 500Mb, and the BUSCO results were still slightly worse than the input at 88.4% (one core gene fewer than was found in the GapFiller results that served as input).

    So I could use some advice on how to tune PBJelly for my project. Are there certain input assembly metrics I can look at to drive my choice of parameters? Any advice would be greatly appreciated.

    Thanks,
    John
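
For tracking the total-length metric mentioned above between pipeline steps, a minimal Python sketch is given here. It reports scaffold count, total length, N50, and the number of gap (N) bases in a FASTA file. The function names `read_fasta` and `assembly_stats` are made up for illustration and are not part of PBJelly, GapFiller, or BUSCO.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def assembly_stats(handle):
    """Return scaffold count, total length, N50 and gap (N) bases."""
    lengths = []
    gap_bases = 0
    for _, seq in read_fasta(handle):
        lengths.append(len(seq))
        gap_bases += seq.upper().count("N")
    total = sum(lengths)
    # N50: length of the scaffold at which the running sum of
    # lengths (longest first) reaches half of the total.
    running, n50 = 0, 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "scaffolds": len(lengths),
        "total_length": total,
        "gap_bases": gap_bases,
        "n50": n50,
    }


if __name__ == "__main__":
    demo = io.StringIO(">a\nACGTNN\nAC\n>b\nACG\n")
    print(assembly_stats(demo))
```

Running this on the assembly before and after each step shows whether added length comes from filled gaps (gap bases shrinking) or from sequence added elsewhere.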

  • #2
    For the PacBio Sequel data, were you using the raw subreads? That's what I would recommend. By default the PacBio subread BAM files give a quality score of Q0 to all bases, and PBJelly needs quality scores for the bases. When I ran PBJelly, I gave all PacBio Sequel subread sequences a quality score of Q30.
    Last edited by Gopo; 10-20-2018, 05:54 AM. Reason: clarity

    • #3
      I am using subreads that were generated from SMRT Link using the 'bam to fastx' method, and they do all appear to be quality 0. Is there some way to force PBJelly to assume all quality values are 30? Or do I need to swap out the quality values myself in the input fastq?
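
One way to swap the quality values yourself is sketched below in Python. It assumes a plain four-line-per-record FASTQ (no wrapped sequence lines) with Phred+33 encoding, where Q30 is the character `?`. The function name `set_uniform_quality` is hypothetical; this is not a PBJelly option.

```python
import io

# Phred+33 encoding: quality Q maps to chr(33 + Q), so Q30 is '?'.
Q30_CHAR = chr(33 + 30)


def set_uniform_quality(src, dst, qual_char=Q30_CHAR):
    """Copy FASTQ records from src to dst with every base quality
    replaced by qual_char. Returns the number of records written.
    Assumes exactly four lines per record."""
    n_records = 0
    while True:
        header = src.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines
        seq = src.readline().rstrip("\r\n")
        plus = src.readline()
        src.readline()  # original quality line, discarded
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("input does not look like 4-line FASTQ")
        dst.write(header.rstrip("\r\n") + "\n")
        dst.write(seq + "\n+\n")
        dst.write(qual_char * len(seq) + "\n")
        n_records += 1
    return n_records


if __name__ == "__main__":
    fastq_in = io.StringIO("@read1\nACGT\n+\n!!!!\n")
    fastq_out = io.StringIO()
    set_uniform_quality(fastq_in, fastq_out)
    print(fastq_out.getvalue(), end="")
```

For real files, open the input and output with `open(path)` and `open(path, "w")` and pass the handles in the same way.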

      • #4
        I have a Gist with the instructions I used for running PBJelly and then correcting indels (not with Pilon but with a variant caller) that should help you. See https://gist.github.com/jelber2/730f...3d5da2c97bedea

        • #5
          I've got PBJelly running now using subreads for which I've swapped in Q30 values. I'm running one instance with all defaults and another using the support stage option -x "--capturedOnly" in the hope of minimizing the expansion of my input genome. I hadn't set a mapping quality filter, which you use in your workflow. I may launch another instance to try that out.

          Your workflow is very interesting. I will look into variant calling (via BBMap) for error correction and compare it to Pilon (which had been my original plan). Thanks for all the suggestions & info!


          John

          • #6
            Here is a script for running Pilon twice quickly by splitting the genome to be corrected into parts, generating intervals for Pilon to speed up its processing, and combining the parts again.

            It assumes you have a file called genome.pilon-0.fasta in whatever you call work-dir.


            It is really only useful if you have many cores on your server (i.e., >4) and probably 100GB RAM.

            Obviously it also depends on the amount of Illumina reads and the size of the genome being corrected (and on the number of scaffolds in the file genome.pilon-0.fasta).
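
The script itself is not reproduced in this thread. As an illustration of the splitting step only, here is a rough Python sketch that groups scaffolds into batches of similar total length, so that each batch could be processed by its own Pilon run. The function name `balance_scaffolds` and the greedy longest-first strategy are assumptions made for this example, not taken from the original script.

```python
import heapq


def balance_scaffolds(lengths, n_parts):
    """Greedily assign scaffolds to n_parts batches, longest first,
    always adding to the batch with the smallest running total.

    lengths: dict mapping scaffold name -> length in bases.
    Returns a list of n_parts lists of scaffold names.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    # Min-heap of (total bases so far, batch index).
    heap = [(0, i) for i in range(n_parts)]
    parts = [[] for _ in range(n_parts)]
    ordered = sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, length in ordered:
        total, idx = heapq.heappop(heap)
        parts[idx].append(name)
        heapq.heappush(heap, (total + length, idx))
    return parts


if __name__ == "__main__":
    # Hypothetical scaffold lengths, for demonstration only.
    demo = {"scaf1": 100, "scaf2": 60, "scaf3": 50, "scaf4": 10}
    for i, batch in enumerate(balance_scaffolds(demo, 2)):
        print(i, batch)
```

With many small scaffolds the batches come out nearly equal in size, which keeps the parallel runs finishing at about the same time.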
