Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • rhall
    Senior Member
    • Aug 2012
    • 324

    #16
    Hmm. Can you attach the contents of the '# Task 1' in the whitelist and non whitelist filter jobs, <job dir>/workflow/P_Filter/filter_001of0??.sh for a working whitelist it should look something like this:
    Code:
    filterPlsH5.py --debug --filter='MinReadScore=0.80,MinSRL=500,MinRL=100,ReadWhitelist=/home/rhall/projects/reads.list' --trim='True' --outputDir=/home/rhall/projects/data/filtered_regions --outputSummary=/home/rhall/projects/data/filtered_summary.chunk001of064.csv --outputFofn=/home/rhall/projects/data/filtered_regions.chunk001of064.fofn /home/rhall/projects/input.chunk001of064.fofn

    Comment

    • ebioman
      Member
      • Aug 2013
      • 41

      #17
      It looks okay on the first glance:

      Code:
      # Task filter_001of001 commands:
      filterPlsH5.py --filter='MinReadScore=0.80,MinSRL=500,MinRL=500,ReadWhitelist=whitelist.txt' --trim='True' --outputDir=/local/scratch/eschmid/HGAP_chloro/data/filtered_regions --outputSummary=/local/scratch/eschmid/HGAP_chloro/data/filtered_summary.chunk001of001.csv --outputFofn=/local/scratch/eschmid/HGAP_chloro/data/filtered_regions.chunk001of001.fofn /local/scratch/eschmid/HGAP_chloro/input.chunk001of001.fofn || exit $?; echo "Task 0 completed at `date -u`" || exit $?;
      I think I will go again step-wise through their minimal-example and repeat it - maybe I pick up some problem

      **UPDATE:**
      The repetition of the minimal example worked fine, so I checked everything again.
      So, actually I think I discovered where the problem stems from - not sure whether it will actually fix it as well:

      So I took my read:
      Code:
      blasr v4GNfoL0_m130905_221121_42182_c100546852550000001823086111241316_s1_p0.2.bax.h5  Quercus_rubra_NC_020152.complete.fasta -bestn 1 -header -m 1 -nproc 15 -out chloro.align
      And ended up with :
      Code:
      qname tname qstrand tstrand score pctsimilarity tstart tend tlength qstart qend qlength ncells
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54502/0_7747 Ch1 0 0 -373 83.7838 31916 32012 161304 4355 4465 7862 2366
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54556/821_1265 Ch1 0 1 -1530 82.4268 54993 55441 161304 821 1265 7304 14097
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54573/0_2683 Ch1 0 0 -7976 82.8279 54507 56615 161304 272 2659 2812 54303
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54570/5913_10576 Ch1 0 1 -15043 82.133 12405 16513 161304 5983 10530 10661 125713
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54569/2315_3986 Ch1 0 1 -5881 85.9842 104684 106188 161304 2403 3975 10419 33850
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54569/4047_7434 Ch1 0 0 -12456 86.9206 55152 58229 161304 4131 7434 10419 71202
      Now the strange thing, if I check with
      Code:
      h5ls -r v4GNfoL0_m130905_221121_42182_c100546852550000001823086111241316_s1_p0.2.bax.h5
      /PulseData/ConsensusBaseCalls/ZMW/HoleNumber Dataset {54494/Inf}
      The holes in the read actually then started not with 1 but with 54000-something.
      I wondered whether part of the problem might actually stem from the way I did the white-list. I actually got all the filtered *.h5 files as FASTAs and then mapped these against the reference.
      I tried now again, but directly with the *.h5 files against the reference to see whether the outcome differs.
      Last edited by ebioman; 12-17-2013, 03:31 AM.

      Comment

      • ebioman
        Member
        • Aug 2013
        • 41

        #18
        I did now another test with the minimal example from the Whitelist-tutorial and my data in parallel - still the same outcome, failing of filtering


        So, what I did was to filter the example for E. coli reads and mine for chloroplasts:

        Code:
        blasr m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5 ecoli.fasta -bestn 1 -header  -nproc 20 -out ecoli.align
        Code:
        blasr 0xJIqqkV_m130905_172422_42182_c100546852550000001823086111241314_s1_p0.2.bax.h5 Quercus_rubra_NC_020152.complete.fasta -bestn 1 -header  -nproc 20 -out chloro.align
        To keep it as simple as possible I did not filter inverse but used the mapped reads, just used the movie name and hole number:
        Code:
        awk '{print $1}' chloro.align | perl -lpe 's/\/\w+$//' > whitelist.txt
        Code:
        awk '{print $1}' ecoli.align | perl -lpe 's/\/\w+$//' > whitelist.txt
        Then I did the input-prep:
        Code:
        ls /local/scratch/eschmid/HGAP_chloro/0xJIqqkV_m130905_172422_42182_c100546852550000001823086111241314_s1_p0.2.bax.h5 > input.fofn
        Code:
        ls /local/scratch/eschmid/HGAP_minEX/m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5 > input.fofn
        Then I launched them in parallel with the same settings file:

        Code:
        smrtpipe.py --params=settings.xml xml:input.xml
        Guess what happened?

        Correct, the E.coli example worked mine not

        This is reflected in the filter_summary.csv - some of the E.coli reads got the 1,1,1 whereas mine get ALL a 0 at the last position and are not used

        I was hoping that I would discover a kind of mistake going step wise in parallel through it, but wasnt able to spot anything at the moment

        Comment

        • rhall
          Senior Member
          • Aug 2012
          • 324

          #19
          I think I may see the problem, why does your base file start with '0xJIqqkV_' is that added by the sequencing provider? bas / bax files should start with 'm' it is generally not recommended to change the name of h5 files. I haven't check the code but my guess is the root of the read id is being taken from the filename, i.e.
          m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54502
          never has a match in:
          v4GNfoL0_m130905_221121_42182_c100546852550000001823086111241316_s1_p0.2.bax.h5
          You could try removing v4GNfoL0_ from the filename, or adding it to the readID in the whitelist, although I would go with the former, renamed h5 files tend to lead to problems.

          The holes in the read actually then started not with 1 but with 54000-something.
          This is because the bax file is the second of a set of three, every RSII run produces 3 bax.h5 files totaling ~150000 ZMWs, *.1.bax.h5 has ZMWs 1-~50,000 *.2.bax.h5 ~50,000 - ~100,000 etc.

          Comment

          • ebioman
            Member
            • Aug 2013
            • 41

            #20
            @rhall
            You are a life-saver !
            Such a stupid and obvious mistake and I am looking into this problem for a week now !
            Now everything works like charm - it was this particular identifier which is randomly assigned by our company...

            Thanks a lot again for your help!

            Comment

            • rhall
              Senior Member
              • Aug 2012
              • 324

              #21
              No problem. Glad that fixed it.

              Comment

              Latest Articles

              Collapse

              • SEQadmin2
                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                by SEQadmin2


                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                ...
                Yesterday, 10:05 AM
              • SEQadmin2
                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                by SEQadmin2


                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                Introduction

                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                05-22-2026, 06:42 AM
              • SEQadmin2
                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                by SEQadmin2

                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                05-06-2026, 09:04 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by SEQadmin2, Yesterday, 12:03 PM
              0 responses
              19 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, Yesterday, 11:40 AM
              0 responses
              14 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 05-28-2026, 11:40 AM
              0 responses
              29 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 05-26-2026, 10:12 AM
              0 responses
              31 views
              0 reactions
              Last Post SEQadmin2  
              Working...