Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • jpummil
    Member
    • Apr 2014
    • 85

    Blasr/Quiver - Best Practice for Current Versions?

    So, first time attempting to use quiver to polish an assembly I created in Canu. All of the software is installed and functional via command line (quiver, pbalign, bax2bam, quiver, arrow, etc). Our systems people installed using pitchfork.

    I have the assembly in .fasta format and I also have 30ea bax.h5 files to utilize. The issue comes when I start looking at blasr which apparently no longer outputs cmp.h5 files that are preferred by pbalign. blasr now defaults to .bam files (nice for standardization I guess) and it isn't obvious that pbalign will utilize .bam as input. Found this little snippet which was somewhat informative:

    GitHub is where people build software. More than 150 million people use GitHub to discover, fork, and contribute to over 420 million projects.


    Ultimately, in short, I simply want to utilize the bax.h5 files with quiver (or arrow) to polish the assembly I did in Canu.
  • rhall
    Senior Member
    • Aug 2012
    • 324

    #2
    Basic workflow using the data you have and the latest versions of software installed using pitchfork:
    bax.h5 -> bax2bam -> reads.bam
    reads.bam + reference.fasta -> pbalign -> aligned_reads.bam
    aligned_reads.bam -> variant_caller (quiver or arrow) -> consensus.fasta

    Comment

    • jpummil
      Member
      • Apr 2014
      • 85

      #3
      Originally posted by rhall View Post
      Basic workflow using the data you have and the latest versions of software installed using pitchfork:
      bax.h5 -> bax2bam -> reads.bam
      reads.bam + reference.fasta -> pbalign -> aligned_reads.bam
      aligned_reads.bam -> variant_caller (quiver or arrow) -> consensus.fasta
      Thanks so much, rhall! Just getting my data set up for a trial now ;-)

      I'll update the thread when I'm done!

      Comment

      • rhall
        Senior Member
        • Aug 2012
        • 324

        #4
        Just let me know if you need any more details on a particular step. If it's a particularly large dataset http://pacificbiosciences.github.io/...coretools.html has info on how to do a lightweight chunking without using the full workflow engine.

        Comment

        • jpummil
          Member
          • Apr 2014
          • 85

          #5
          Originally posted by rhall View Post
          Just let me know if you need any more details on a particular step. If it's a particularly large dataset http://pacificbiosciences.github.io/...coretools.html has info on how to do a lightweight chunking without using the full workflow engine.
          Genome size is about 170Mb....and I have ~65GB of bax.h5 files. BUT...I also have a pretty robust linux server to run it on.

          Comment

          • jpummil
            Member
            • Apr 2014
            • 85

            #6
            Hmmm...apparently some misunderstanding about "movies" on my part.

            Using the following bax2bam command:
            bax2bam Q1133.bax.h5/*.bax.h5 -o movie --subread


            ERROR: multiple movies detected:
            ERROR: m160611_230902_00125_c101003722550000001823226809161614_s1_p0
            ERROR: m160612_032816_00125_c101003722550000001823226809161615_s1_p0
            ERROR: m160612_074931_00125_c101003722550000001823226809161616_s1_p0
            ERROR: m160614_131714_00125_c101010592550000001823230710211662_s1_p0
            ERROR: m160614_194151_00125_c101010592550000001823230710211663_s1_p0
            ERROR: m160615_020159_00125_c101010592550000001823230710211664_s1_p0
            ERROR: m160615_062522_00125_c101010592550000001823230710211665_s1_p0
            ERROR: m160615_104711_00125_c101010592550000001823230710211666_s1_p0
            ERROR: m160615_150655_00125_c101010592550000001823230710211667_s1_p0
            ERROR: m160615_213630_00125_c101009882550000001823230710211620_s1_p0

            Contents of my bax.h5 directory:

            [jpummil@tres-l1 Q1133.bax.h5]$ ls
            m160611_230902_00125_c101003722550000001823226809161614_s1_p0.1.bax.h5
            m160611_230902_00125_c101003722550000001823226809161614_s1_p0.2.bax.h5
            m160611_230902_00125_c101003722550000001823226809161614_s1_p0.3.bax.h5
            m160612_032816_00125_c101003722550000001823226809161615_s1_p0.1.bax.h5
            m160612_032816_00125_c101003722550000001823226809161615_s1_p0.2.bax.h5
            m160612_032816_00125_c101003722550000001823226809161615_s1_p0.3.bax.h5
            m160612_074931_00125_c101003722550000001823226809161616_s1_p0.1.bax.h5
            m160612_074931_00125_c101003722550000001823226809161616_s1_p0.2.bax.h5
            m160612_074931_00125_c101003722550000001823226809161616_s1_p0.3.bax.h5
            m160614_131714_00125_c101010592550000001823230710211662_s1_p0.1.bax.h5
            m160614_131714_00125_c101010592550000001823230710211662_s1_p0.2.bax.h5
            m160614_131714_00125_c101010592550000001823230710211662_s1_p0.3.bax.h5
            m160614_194151_00125_c101010592550000001823230710211663_s1_p0.1.bax.h5
            m160614_194151_00125_c101010592550000001823230710211663_s1_p0.2.bax.h5
            m160614_194151_00125_c101010592550000001823230710211663_s1_p0.3.bax.h5
            m160615_020159_00125_c101010592550000001823230710211664_s1_p0.1.bax.h5
            m160615_020159_00125_c101010592550000001823230710211664_s1_p0.2.bax.h5
            m160615_020159_00125_c101010592550000001823230710211664_s1_p0.3.bax.h5
            m160615_062522_00125_c101010592550000001823230710211665_s1_p0.1.bax.h5
            m160615_062522_00125_c101010592550000001823230710211665_s1_p0.2.bax.h5
            m160615_062522_00125_c101010592550000001823230710211665_s1_p0.3.bax.h5
            m160615_104711_00125_c101010592550000001823230710211666_s1_p0.1.bax.h5
            m160615_104711_00125_c101010592550000001823230710211666_s1_p0.2.bax.h5
            m160615_104711_00125_c101010592550000001823230710211666_s1_p0.3.bax.h5
            m160615_150655_00125_c101010592550000001823230710211667_s1_p0.1.bax.h5
            m160615_150655_00125_c101010592550000001823230710211667_s1_p0.2.bax.h5
            m160615_150655_00125_c101010592550000001823230710211667_s1_p0.3.bax.h5
            m160615_213630_00125_c101009882550000001823230710211620_s1_p0.1.bax.h5
            m160615_213630_00125_c101009882550000001823230710211620_s1_p0.2.bax.h5
            m160615_213630_00125_c101009882550000001823230710211620_s1_p0.3.bax.h5

            So, does one just process the three files per group into a single .bam file, do the next, etc...then repeat the PBalign step per each .bam created?

            Comment

            • rhall
              Senior Member
              • Aug 2012
              • 324

              #7
              Yes the bax2bam should be ran per movie (3 files). PBalign can then be ran using all the data.

              Comment

              • jpummil
                Member
                • Apr 2014
                • 85

                #8
                Originally posted by rhall View Post
                Yes the bax2bam should be ran per movie (3 files). PBalign can then be ran using all the data.
                Thanks rhall! Was just about to post an update when you posted this ;-)

                -rw-r--r-- 1 jpummil jpummil 2.3G Dec 16 12:06 m160611_230902_00125_c101003722550000001823226809161614_s1_p0.1.bax.h5

                -rw-r--r-- 1 jpummil jpummil 2.2G Dec 16 12:06 m160611_230902_00125_c101003722550000001823226809161614_s1_p0.2.bax.h5

                -rw-r--r-- 1 jpummil jpummil 2.2G Dec 16 12:06 m160611_230902_00125_c101003722550000001823226809161614_s1_p0.3.bax.h5

                BECOMES (using bax2bam):

                -rw-rw-r-- 1 jpummil jpummil 1.3G Dec 16 12:18 movie.scraps.bam

                -rw-rw-r-- 1 jpummil jpummil 2.3G Dec 16 12:18 movie.subreads.bam


                Now running PBalign (blasr aligner) of the one lane of converted data to create "Q1133.aligned.bam".

                I'll then see about a Quiver run of that file just to validate the process. Afterwards, I'll go back and tidy things up to process all ten lanes of data, put thru PBalign, then Quiver.

                Comment

                • rhall
                  Senior Member
                  • Aug 2012
                  • 324

                  #9
                  FYI, the scraps file includes all the sequence not in the HQ (high quality) region, and the adapter sequences. The old bax format saved everything with indexes into the usable data. The subreads.bam contains all the usable data.

                  Comment

                  • jpummil
                    Member
                    • Apr 2014
                    • 85

                    #10
                    Used bax2bam to convert all of the bax.h5 files to .bam format...no problem.

                    When I try to use pbalign on the 10ea .bam files, it complains after it appears to read and accept the first three files:

                    pbalign: error: unrecognized arguments: movie_04.subreads.bam movie_05.subreads.bam movie_06.subreads.bam movie_07.subre
                    ads.bam movie_08.subreads.bam movie_09.subreads.bam movie_10.subreads.bam Q1133.092116.GS170M.contigs.fasta Q1133.aligne
                    d.bam

                    PBalign comand as follows:
                    pbalign --nproc 32 *.subreads.bam Q1133.092116.GS170M.contigs.fasta Q1133.aligned.bam

                    *.subreads.bam corresponds to the following filenames:
                    movie_01.subreads.bam
                    thru
                    movie_10.subreads.bam

                    I don't see anything obvious in the pbalign --help details about running multiple .bam files for alignment.

                    My initial tests using just a single lane worked as expected. It's when I get to the task of running 10ea lanes thru pbalign that things go south.

                    I'll keep looking for docs, etc...but any tips are appreciated. It just seems as if the PacBio help documents are a bit lacking on details...

                    Comment

                    • gconcepcion
                      Member
                      • Dec 2010
                      • 68

                      #11
                      Originally posted by jpummil View Post
                      Used bax2bam to convert all of the bax.h5 files to .bam format...no problem.

                      When I try to use pbalign on the 10ea .bam files, it complains after it appears to read and accept the first three files:

                      pbalign: error: unrecognized arguments: movie_04.subreads.bam movie_05.subreads.bam movie_06.subreads.bam movie_07.subre
                      ads.bam movie_08.subreads.bam movie_09.subreads.bam movie_10.subreads.bam Q1133.092116.GS170M.contigs.fasta Q1133.aligne
                      d.bam

                      PBalign comand as follows:
                      pbalign --nproc 32 *.subreads.bam Q1133.092116.GS170M.contigs.fasta Q1133.aligned.bam

                      *.subreads.bam corresponds to the following filenames:
                      movie_01.subreads.bam
                      thru
                      movie_10.subreads.bam

                      I don't see anything obvious in the pbalign --help details about running multiple .bam files for alignment.

                      My initial tests using just a single lane worked as expected. It's when I get to the task of running 10ea lanes thru pbalign that things go south.

                      I'll keep looking for docs, etc...but any tips are appreciated. It just seems as if the PacBio help documents are a bit lacking on details...

                      $ ls *.subreads.bam >input.fofn
                      $ pbalign --nproc 32 input.fofn Q1133.092116.GS170M.contigs.fasta Q1133.aligned.bam


                      pbalign takes individual sequence files as input, or a File of File Names (*.fofn)

                      From pbalign --help:
                      "Input can be a fasta, pls.h5, bas.h5 or ccs.h5 file or a fofn (file of file names). Output can be in SAM or BAM format."
                      Last edited by gconcepcion; 12-19-2016, 04:34 PM.

                      Comment

                      • jpummil
                        Member
                        • Apr 2014
                        • 85

                        #12
                        Originally posted by gconcepcion View Post
                        $ ls *.subreads.bam >input.fofn
                        $ pbalign --nproc 32 input.fofn Q1133.092116.GS170M.contigs.fasta Q1133.aligned.bam


                        pbalign takes individual sequence files as input, or a File of File Names (*.fofn)

                        From pbalign --help:
                        "Input can be a fasta, pls.h5, bas.h5 or ccs.h5 file or a fofn (file of file names). Output can be in SAM or BAM format."
                        Thanks gconcepcion! That definitely did the trick. Ran into a /tmp space issue next. THINK I have it solved by supplying the --tmpDir option in the command line and setting it to the /local_scratch on the compute node as well...almost a terabyte available there. Will know very soon ;-)

                        Onwards!

                        Comment

                        • jpummil
                          Member
                          • Apr 2014
                          • 85

                          #13
                          SUCCESS!

                          Not a huge improvement on the original assembly (using Canu)...but improved gene's found using Quast (comparison attached).

                          I used default parameters in Quiver, not sure tweaking would've changed a whole lot.

                          I'm "officially" on vacation 'til Jan 9th, but will try and toss together a short "How-To" based on the steps I followed (and those tips graciously supplied by rhall and gconcepcion!) so that others just starting to work with PacBio data and tools can do this as well.
                          Attached Files

                          Comment

                          • jpummil
                            Member
                            • Apr 2014
                            • 85

                            #14
                            So, I still need to write up a brief process workflow to describe what I've doe thus far, but thought I'd update anyone following this thread on another step I heard about thru the grapevine (people at PAG this week sending me email).

                            APPARENTLY....many are performing an additional step after the Quiver polishing and running the updated assembly thru BROAD's Pilon tool. Of course, you have to create a new alignment file with the original bax.h5 files (converted to .bams with bax2bam utility) and the Quiver polished assembly....

                            It's running now, so I'll report back (bad or good) when it completes and I can run a quick comparison of the three with Quast.

                            It should be said that I have been just using default settings with most of the tools, so probably some incremental gains to be had by tweaking settings IF you know your data well. That said, it seems developers do a pretty nice job of setting defaults these days!

                            Comment

                            • jpummil
                              Member
                              • Apr 2014
                              • 85

                              #15
                              Results of using Pilon AFTER having used Quiver....

                              So, Pilon just completed after 40+ hours of computation....another small increase in file size from what Quiver provided...

                              -rw-rw-r-- 1 jpummil jpummil 121958163 Jan 20 08:50 Q1133.012017.GS170M.Canu-Base.fasta

                              -rw-rw-r-- 1 jpummil jpummil 122467877 Jan 20 02:16 Q1133.122116.GS170M.Quiver.fasta (+ 509,714 bytes)

                              -rw-rw-r-- 1 jpummil jpummil 122649681 Jan 20 02:16 Q1133.012017.GS170M.Pilon.fasta (+ 181,804 bytes)

                              Still need to do a Quast assessment, maybe sometime today...

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 08:59 AM
                              0 responses
                              13 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...