Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Hmm. Can you attach the contents of the '# Task 1' in the whitelist and non whitelist filter jobs, <job dir>/workflow/P_Filter/filter_001of0??.sh for a working whitelist it should look something like this:
    Code:
    filterPlsH5.py --debug --filter='MinReadScore=0.80,MinSRL=500,MinRL=100,ReadWhitelist=/home/rhall/projects/reads.list' --trim='True' --outputDir=/home/rhall/projects/data/filtered_regions --outputSummary=/home/rhall/projects/data/filtered_summary.chunk001of064.csv --outputFofn=/home/rhall/projects/data/filtered_regions.chunk001of064.fofn /home/rhall/projects/input.chunk001of064.fofn

    Comment


    • #17
      It looks okay on the first glance:

      Code:
      # Task filter_001of001 commands:
      filterPlsH5.py --filter='MinReadScore=0.80,MinSRL=500,MinRL=500,ReadWhitelist=whitelist.txt' --trim='True' --outputDir=/local/scratch/eschmid/HGAP_chloro/data/filtered_regions --outputSummary=/local/scratch/eschmid/HGAP_chloro/data/filtered_summary.chunk001of001.csv --outputFofn=/local/scratch/eschmid/HGAP_chloro/data/filtered_regions.chunk001of001.fofn /local/scratch/eschmid/HGAP_chloro/input.chunk001of001.fofn || exit $?; echo "Task 0 completed at `date -u`" || exit $?;
      I think I will go again step-wise through their minimal-example and repeat it - maybe I pick up some problem

      **UPDATE:**
      The repetition of the minimal example worked fine, so I checked everything again.
      So, actually I think I discovered where the problem stems from - not sure whether it will actually fix it as well:

      So I took my read:
      Code:
      blasr v4GNfoL0_m130905_221121_42182_c100546852550000001823086111241316_s1_p0.2.bax.h5  Quercus_rubra_NC_020152.complete.fasta -bestn 1 -header -m 1 -nproc 15 -out chloro.align
      And ended up with :
      Code:
      qname tname qstrand tstrand score pctsimilarity tstart tend tlength qstart qend qlength ncells
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54502/0_7747 Ch1 0 0 -373 83.7838 31916 32012 161304 4355 4465 7862 2366
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54556/821_1265 Ch1 0 1 -1530 82.4268 54993 55441 161304 821 1265 7304 14097
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54573/0_2683 Ch1 0 0 -7976 82.8279 54507 56615 161304 272 2659 2812 54303
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54570/5913_10576 Ch1 0 1 -15043 82.133 12405 16513 161304 5983 10530 10661 125713
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54569/2315_3986 Ch1 0 1 -5881 85.9842 104684 106188 161304 2403 3975 10419 33850
      m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54569/4047_7434 Ch1 0 0 -12456 86.9206 55152 58229 161304 4131 7434 10419 71202
      Now the strange thing, if I check with
      Code:
      h5ls -r v4GNfoL0_m130905_221121_42182_c100546852550000001823086111241316_s1_p0.2.bax.h5
      /PulseData/ConsensusBaseCalls/ZMW/HoleNumber Dataset {54494/Inf}
      The holes in the read actually then started not with 1 but with 54000-something.
      I wondered whether part of the problem might actually stem from the way I did the white-list. I actually got all the filtered *.h5 files as FASTAs and then mapped these against the reference.
      I tried now again, but directly with the *.h5 files against the reference to see whether the outcome differs.
      Last edited by ebioman; 12-17-2013, 03:31 AM.

      Comment


      • #18
        I did now another test with the minimal example from the Whitelist-tutorial and my data in parallel - still the same outcome, failing of filtering


        So, what I did was to filter the example for E. coli reads and mine for chloroplasts:

        Code:
        blasr m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5 ecoli.fasta -bestn 1 -header  -nproc 20 -out ecoli.align
        Code:
        blasr 0xJIqqkV_m130905_172422_42182_c100546852550000001823086111241314_s1_p0.2.bax.h5 Quercus_rubra_NC_020152.complete.fasta -bestn 1 -header  -nproc 20 -out chloro.align
        To keep it as simple as possible I did not filter inverse but used the mapped reads, just used the movie name and hole number:
        Code:
        awk '{print $1}' chloro.align | perl -lpe 's/\/\w+$//' > whitelist.txt
        Code:
        awk '{print $1}' ecoli.align | perl -lpe 's/\/\w+$//' > whitelist.txt
        Then I did the input-prep:
        Code:
        ls /local/scratch/eschmid/HGAP_chloro/0xJIqqkV_m130905_172422_42182_c100546852550000001823086111241314_s1_p0.2.bax.h5 > input.fofn
        Code:
        ls /local/scratch/eschmid/HGAP_minEX/m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5 > input.fofn
        Then I launched them in parallel with the same settings file:

        Code:
        smrtpipe.py --params=settings.xml xml:input.xml
        Guess what happened?

        Correct, the E.coli example worked mine not

        This is reflected in the filter_summary.csv - some of the E.coli reads got the 1,1,1 whereas mine get ALL a 0 at the last position and are not used

        I was hoping that I would discover a kind of mistake going step wise in parallel through it, but wasnt able to spot anything at the moment

        Comment


        • #19
          I think I may see the problem, why does your base file start with '0xJIqqkV_' is that added by the sequencing provider? bas / bax files should start with 'm' it is generally not recommended to change the name of h5 files. I haven't check the code but my guess is the root of the read id is being taken from the filename, i.e.
          m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54502
          never has a match in:
          v4GNfoL0_m130905_221121_42182_c100546852550000001823086111241316_s1_p0.2.bax.h5
          You could try removing v4GNfoL0_ from the filename, or adding it to the readID in the whitelist, although I would go with the former, renamed h5 files tend to lead to problems.

          The holes in the read actually then started not with 1 but with 54000-something.
          This is because the bax file is the second of a set of three, every RSII run produces 3 bax.h5 files totaling ~150000 ZMWs, *.1.bax.h5 has ZMWs 1-~50,000 *.2.bax.h5 ~50,000 - ~100,000 etc.

          Comment


          • #20
            @rhall
            You are a life-saver !
            Such a stupid and obvious mistake and I am looking into this problem for a week now !
            Now everything works like charm - it was this particular identifier which is randomly assigned by our company...

            Thanks a lot again for your help!

            Comment


            • #21
              No problem. Glad that fixed it.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin




                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                04-22-2024, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Today, 08:47 AM
              0 responses
              12 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              60 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              59 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              54 views
              0 likes
              Last Post seqadmin  
              Working...
              X