Seqanswers Leaderboard Ad

**rhall** · 12-13-2013, 02:01 PM

Hmm. Can you attach the contents of the '# Task 1' in the whitelist and non whitelist filter jobs, <job dir>/workflow/P_Filter/filter_001of0??.sh for a working whitelist it should look something like this:

Code:

filterPlsH5.py --debug --filter='MinReadScore=0.80,MinSRL=500,MinRL=100,ReadWhitelist=/home/rhall/projects/reads.list' --trim='True' --outputDir=/home/rhall/projects/data/filtered_regions --outputSummary=/home/rhall/projects/data/filtered_summary.chunk001of064.csv --outputFofn=/home/rhall/projects/data/filtered_regions.chunk001of064.fofn /home/rhall/projects/input.chunk001of064.fofn

**ebioman** · 12-17-2013, 12:24 AM

It looks okay on the first glance:

Code:

# Task filter_001of001 commands:
filterPlsH5.py --filter='MinReadScore=0.80,MinSRL=500,MinRL=500,ReadWhitelist=whitelist.txt' --trim='True' --outputDir=/local/scratch/eschmid/HGAP_chloro/data/filtered_regions --outputSummary=/local/scratch/eschmid/HGAP_chloro/data/filtered_summary.chunk001of001.csv --outputFofn=/local/scratch/eschmid/HGAP_chloro/data/filtered_regions.chunk001of001.fofn /local/scratch/eschmid/HGAP_chloro/input.chunk001of001.fofn || exit $?; echo "Task 0 completed at `date -u`" || exit $?;

I think I will go again step-wise through their minimal-example and repeat it - maybe I pick up some problem

**UPDATE:**
The repetition of the minimal example worked fine, so I checked everything again.
So, actually I think I discovered where the problem stems from - not sure whether it will actually fix it as well:

So I took my read:

Code:

blasr v4GNfoL0_m130905_221121_42182_c100546852550000001823086111241316_s1_p0.2.bax.h5  Quercus_rubra_NC_020152.complete.fasta -bestn 1 -header -m 1 -nproc 15 -out chloro.align

And ended up with :

Code:

qname tname qstrand tstrand score pctsimilarity tstart tend tlength qstart qend qlength ncells
m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54502/0_7747 Ch1 0 0 -373 83.7838 31916 32012 161304 4355 4465 7862 2366
m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54556/821_1265 Ch1 0 1 -1530 82.4268 54993 55441 161304 821 1265 7304 14097
m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54573/0_2683 Ch1 0 0 -7976 82.8279 54507 56615 161304 272 2659 2812 54303
m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54570/5913_10576 Ch1 0 1 -15043 82.133 12405 16513 161304 5983 10530 10661 125713
m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54569/2315_3986 Ch1 0 1 -5881 85.9842 104684 106188 161304 2403 3975 10419 33850
m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54569/4047_7434 Ch1 0 0 -12456 86.9206 55152 58229 161304 4131 7434 10419 71202

Now the strange thing, if I check with

Code:

h5ls -r v4GNfoL0_m130905_221121_42182_c100546852550000001823086111241316_s1_p0.2.bax.h5
/PulseData/ConsensusBaseCalls/ZMW/HoleNumber Dataset {54494/Inf}

The holes in the read actually then started not with 1 but with 54000-something.
I wondered whether part of the problem might actually stem from the way I did the white-list. I actually got all the filtered *.h5 files as FASTAs and then mapped these against the reference.
I tried now again, but directly with the *.h5 files against the reference to see whether the outcome differs.

**ebioman** · 12-17-2013, 04:48 AM

I did now another test with the minimal example from the Whitelist-tutorial and my data in parallel - still the same outcome, failing of filtering

So, what I did was to filter the example for E. coli reads and mine for chloroplasts:

Code:

blasr m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5 ecoli.fasta -bestn 1 -header  -nproc 20 -out ecoli.align

Code:

blasr 0xJIqqkV_m130905_172422_42182_c100546852550000001823086111241314_s1_p0.2.bax.h5 Quercus_rubra_NC_020152.complete.fasta -bestn 1 -header  -nproc 20 -out chloro.align

To keep it as simple as possible I did not filter inverse but used the mapped reads, just used the movie name and hole number:

Code:

awk '{print $1}' chloro.align | perl -lpe 's/\/\w+$//' > whitelist.txt

Code:

awk '{print $1}' ecoli.align | perl -lpe 's/\/\w+$//' > whitelist.txt

Then I did the input-prep:

Code:

ls /local/scratch/eschmid/HGAP_chloro/0xJIqqkV_m130905_172422_42182_c100546852550000001823086111241314_s1_p0.2.bax.h5 > input.fofn

Code:

ls /local/scratch/eschmid/HGAP_minEX/m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5 > input.fofn

Then I launched them in parallel with the same settings file:

Code:

smrtpipe.py --params=settings.xml xml:input.xml

Guess what happened?

Correct, the E.coli example worked mine not

This is reflected in the filter_summary.csv - some of the E.coli reads got the 1,1,1 whereas mine get ALL a 0 at the last position and are not used

I was hoping that I would discover a kind of mistake going step wise in parallel through it, but wasnt able to spot anything at the moment

**rhall** · 12-17-2013, 06:25 AM

I think I may see the problem, why does your base file start with '0xJIqqkV_' is that added by the sequencing provider? bas / bax files should start with 'm' it is generally not recommended to change the name of h5 files. I haven't check the code but my guess is the root of the read id is being taken from the filename, i.e.
m130905_221121_42182_c100546852550000001823086111241316_s1_p0/54502
never has a match in:
v4GNfoL0_m130905_221121_42182_c100546852550000001823086111241316_s1_p0.2.bax.h5
You could try removing v4GNfoL0_ from the filename, or adding it to the readID in the whitelist, although I would go with the former, renamed h5 files tend to lead to problems.

The holes in the read actually then started not with 1 but with 54000-something.

This is because the bax file is the second of a set of three, every RSII run produces 3 bax.h5 files totaling ~150000 ZMWs, *.1.bax.h5 has ZMWs 1-~50,000 *.2.bax.h5 ~50,000 - ~100,000 etc.

**ebioman** · 12-17-2013, 07:24 AM

@rhall
You are a life-saver !

Such a stupid and obvious mistake and I am looking into this problem for a week now !
Now everything works like charm - it was this particular identifier which is randomly assigned by our company...

Thanks a lot again for your help!

**rhall** · 12-17-2013, 07:37 AM

No problem. Glad that fixed it.

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Today, 08:47 AM	0 responses 12 views 0 likes	Last Post by seqadmin Today, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News