I've just posted PacBioEDA to GitHub. It is a package of Python scripts for subread-level examination of the output of the PacBio instrument.
From the blurb (on pacbiodevnet, registration required):
Do you want to have a closer look at the data your PacBio RS
instrument is producing? Do you want to see a lower level of detail
than that contained in the reports produced by SMRTanalysis? The
PacBioEDA package lets you do exploratory data analysis at the
subread/region level.
For example: a given read was split into 5 subreads. Each subread
aligned nicely -- but all of them to the reverse strand, rather than
in the +-+-+ sequence you'd expect. Looking further, we see that the
aligned portion of each subread included only the first half of
it. Conjecture: we're miscalling the adapter on one end of the
SMRTbell. Graph the alignment scores for the read against the adapter
sequence. Conjecture confirmed.
(The above problem went away with the 1.2.3 release, which does
a better job of recognising adapters.)
PacBioEDA consists of a set of Python scripts that accept as input a
bas.h5 file, and in some cases the associated cmp.h5 alignments file. The
scripts are run from the command line, and produce either a text file
or a .png plot as output. This is a no-frills package intended for
people who are willing to get dirty with their data.
I've included lots of commentary in the scripts themselves (unlike
most Open Source bioinformatics offerings, where the only comment is
the copyright notice), in the hope that this will help you understand
what the scripts are telling you about your data.
Tom Skelly ([email protected])
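To make the example in the blurb concrete, here is a minimal Python sketch of the strand check it describes: given the strand each subread of a read aligned to, does the pattern alternate as you'd expect for a SMRTbell? This is not taken from PacBioEDA. The function names and the strand lists are made up for illustration, and in practice the strand calls would come from the cmp.h5 alignments.

```python
def strand_pattern(strands):
    """Join per-subread alignment strands into a string such as '+-+-+'."""
    return "".join(strands)


def alternates(strands):
    """Return True if each pair of consecutive subreads aligned to opposite strands."""
    return all(a != b for a, b in zip(strands, strands[1:]))


# Hypothetical strand calls for the five subreads of one read.
expected = ["+", "-", "+", "-", "+"]   # what a well-behaved SMRTbell read looks like
observed = ["-", "-", "-", "-", "-"]   # the problem case: every subread on the reverse strand

for label, strands in (("expected", expected), ("observed", observed)):
    print(label, strand_pattern(strands), "alternates:", alternates(strands))
```

A read that fails this check is a candidate for the closer look described above, for instance plotting its alignment scores against the adapter sequence.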