![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Find large indels from PacBio reads? | metheuse | Pacific Biosciences | 1 | 07-01-2013 03:08 PM |
Have a genome assembly, what should I do with 15x Pacbio reads? | lemur2 | Pacific Biosciences | 2 | 10-25-2012 10:06 PM |
Filtering clonal reads | AlexB | 454 Pyrosequencing | 2 | 05-11-2010 01:30 PM |
Filtering SOLiD reads | k-gun12 | Bioinformatics | 8 | 03-12-2010 09:51 PM |
Filtering and excluding reads | dawe | Bioinformatics | 3 | 09-09-2009 09:47 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Switzerland Join Date: Aug 2013
Posts: 41
|
![]()
Hello
I am dealing with a rather common problem which gives me real headaches with PacBio data: How can you filter your reads prior assembly ? Note: I am not talking about subread filtering, quality filtering and so on, but filtering for "contamination". E.g. Assembling a genome and filtering known metagenomes, or filtering only mitochondrial reads and assembling its genome from a sequenced organism ,..... So, what I am looking for is a possibility to filter the *.h5 files and only use certain ones for the assembly. I guess one would have to parse the files, but with the h5 format that is a little bit beyond my scope. For some modules (e.g. Allora) FASTA can be provided, but for other ones not (e.g. HGAP). Anybody ideas ? |
![]() |
![]() |
![]() |
#2 |
Moderator
Location: Oslo, Norway Join Date: Nov 2008
Posts: 415
|
![]()
Have a look here: https://github.com/PacificBioscience...filtering-step. Does that answer your question?
|
![]() |
![]() |
![]() |
#3 |
Member
Location: Switzerland Join Date: Aug 2013
Posts: 41
|
![]()
On the first glance, it seems exactly what I was looking for - thanks !
Will test it over night ![]() |
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: Germany Join Date: Oct 2008
Posts: 415
|
![]()
Great, thanks for the link. The tutorial is even working (!).
|
![]() |
![]() |
![]() |
#5 |
Member
Location: Canada Join Date: Apr 2014
Posts: 26
|
![]()
Hello,
I'm following the whitelist tutorial and have encountered an error where I have to use their python script to generate the whitelist.txt. I'm using my own PacBio data as opposed to the data provided, but I don't think that it should make a difference. When I run the python script, this is the error I encounter: Traceback (most recent call last): File "whitelist.py", line 20, in <module> hole_num = alignment_id_parts[1] IndexError: list index out of range Which I believe means that when the list is generated from splitting the first column at the '/', it is not generating any more than one list element. Is anybody else encountering this issue? Regards, Nick |
![]() |
![]() |
![]() |
#6 |
Senior Member
Location: San Francisco Join Date: Aug 2012
Posts: 322
|
![]()
Can you post an example of your alignment output?
Code:
head <align output file> Thanks. |
![]() |
![]() |
![]() |
#7 |
Member
Location: Canada Join Date: Apr 2014
Posts: 26
|
![]()
I got it working but had to change around my script a bit. I had to change the headers to look like:
Code:
qName/qHole/qAlign tName qStrand tStrand score percentSimilarity tStart tEnd tLength qStart qEnd qLength nCells m130726_012611_42158_c100524682550000001823080609281382_s1_p0/72/2066_4077 gi|556503834|ref|NC_000913.3| 0 0 -828 91.0891 366290 366482 4641652 2118 2314 4346 4145 I changed the code to just include my one movie file: Code:
import sys # NOTE: this script assumes a single movie in the blasr alignment blacklist_alignment = sys.argv[1] num_holes = int(sys.argv[2]) blacklist = set() movie_prefix = 'm130726_012611_42158_c100524682550000001823080609281382_s1_p0' for line in open(blacklist_alignment): if line.startswith('qname'): continue cols = line.split() alignment_id_parts = cols[0].split('/') hole_num = alignment_id_parts[1] blacklist.add(hole_num) for i in range(num_holes): i = str(i) if i not in blacklist: print '/'.join([movie_prefix, i]) Code:
m130726_012611_42158_c100524682550000001823080609281382_s1_p0/0 m130726_012611_42158_c100524682550000001823080609281382_s1_p0/1 m130726_012611_42158_c100524682550000001823080609281382_s1_p0/2 m130726_012611_42158_c100524682550000001823080609281382_s1_p0/3 m130726_012611_42158_c100524682550000001823080609281382_s1_p0/4 ... Code:
<param name="whiteList" label="Read IDs to whitelist"> <value>/home/USERNAME/BACassembly/whitelist.txt</value> </param> In other words, does Smrtpipe have a whitelist function built into it? I would have expected more code to be required. |
![]() |
![]() |
![]() |
#8 |
Senior Member
Location: San Francisco Join Date: Aug 2012
Posts: 322
|
![]()
Yes SMRT Pipe should filter out the reads if passed that parameter. Technically an extra filter is passed to the filtering task, you can check workflow/P_Filter/filter_?of?.sh to make sure the ReadWhilelist parameter is present:
Code:
filterPlsH5.py --debug --filter='MinReadScore=0.80,MinSRL=500,MinRL=100,ReadWhitelist=/home/USERNAME/BACassembly/whitelist.txt ......' |
![]() |
![]() |
![]() |
#9 |
Member
Location: Canada Join Date: Apr 2014
Posts: 26
|
![]()
The whitelisting tutorial online is a good one, but it doesn't specify the format the whitelist would need to be if you wanted to run multiple SMRT cells with their own whitelisted reads.
What format would the whitelist file need to be in, in order to run say 8 smrt cells all of which are decontaminated? "...Note whitelisting in this tutorial assumes that you are working on one cell of data at a time (1x bas.h5 file, or 3x bax.h5 files), it could be expanded to run on multiple cells with some changes to the code." |
![]() |
![]() |
![]() |
#10 |
Senior Member
Location: San Francisco Join Date: Aug 2012
Posts: 322
|
![]()
The whitelist file is in the same format, e.g.
m130726_012611_42158_c100524682550000001823080609281382_s1_p0/0 The cell/file name would just be different. The note in the tutorial is because the tutorial whitelists the reads that do not map to ecoli, therefore the script needs to take the 'inverse' of the reads that map to ecoli, this is only really simple for a single cell. To make the script work on multiple cells it would need to be expanded to keep track of all the cells, the alternative is to generate the whitelist for each cell, then join them together to get a whitelist for running a job on all the cells. |
![]() |
![]() |
![]() |
#11 |
Member
Location: Canada Join Date: Apr 2014
Posts: 26
|
![]()
Awesome, thank you for your reply.
So basically a whitelist encompassing more than one cell could look like: <cell_1>_s1_p0/0 <cell_1>_s1_p0/2 <cell_1>_s1_p0/5 .... <cell_1>_s1_p0/81740 <cell_2>_s1_p0/1 <cell_2>_s1_p0/2 <cell_2>_s1_p0/6 Such that subsequent whitelists are basically appended to the previous one? Regards, Nick |
![]() |
![]() |
![]() |
#12 |
Senior Member
Location: San Francisco Join Date: Aug 2012
Posts: 322
|
![]()
Exactly
![]() |
![]() |
![]() |
![]() |
#13 | |
Member
Location: France, Poitiers Join Date: Sep 2012
Posts: 12
|
![]() Quote:
It seems that the previous link doesn't redirect to the correct page anymore (at least for me). Nonetheless, I found the whitelist tutorial at this location: https://github.com/PacificBioscience...sting-Tutorial Can the whitelist procedure be implemented in SMRT Portal ? It is a very standard filtering step which would benefit to be run in a user-friendly way. Seb. |
|
![]() |
![]() |
![]() |
#14 |
Member
Location: Canada Join Date: Apr 2014
Posts: 26
|
![]()
Indeed, it can be implemented but it requires some XML coding. If you just follow the tutorial, and don't do the last part of it (the SMRTpipe part), and make your own custom protocol using the whitelist filtering parameter, you can specify the path to your whitelist in smrtportal as a parameter in the filtering step and it will use the whitelist to filter your reads
|
![]() |
![]() |
![]() |
#15 |
Member
Location: France, Poitiers Join Date: Sep 2012
Posts: 12
|
![]()
Ok, thanks.
But I need to practice more bas.h5 manipulation and protocol creation first, as I'm very new with PacBio data. Actually I don't have my own data yet, so I'm practicing with tutorials data. By the way, do you know where can I find publicly available cosmid sequencing data under the PacBio format ? |
![]() |
![]() |
![]() |
#16 |
Member
Location: Canada Join Date: Apr 2014
Posts: 26
|
![]()
I don't know about cosmid data, but the tutorial provides you with sample data. It's useful if you're just trying to get familiar with the bas.h5 file type.
Code:
wget http://files.pacb.com/software/hgap/HGAp_BAS_H5_DATA/HGAp_BAS_H5_DATA/BAC/m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5 Nick Last edited by OstanNick; 07-23-2014 at 09:20 AM. |
![]() |
![]() |
![]() |
#17 |
Member
Location: France, Poitiers Join Date: Sep 2012
Posts: 12
|
![]()
Yes, I will use this sample data. BAC and Cosmids are finally quite similar, and the method would be the same.
Thanks, seb. |
![]() |
![]() |
![]() |
#18 |
Senior Member
Location: San Francisco Join Date: Aug 2012
Posts: 322
|
![]()
I just wanted to add, that is very old data (from an original RS machine, not an RS II), it may be better going with something that isn't a BAC or Cosmid, but is current. The data is also missing a metadeta.xml file, so will be almost impossible to import into SMRT Portal.
|
![]() |
![]() |
![]() |
#19 |
Junior Member
Location: USA Join Date: Nov 2013
Posts: 3
|
![]()
Hi, it's been a while since a post to this thread, but I have a question/comment that could help future people looking to filter PacBio libraries of a "metagenomic" nature - especially as the links above redirect to protocols using older versions of the programs. I have used the latest version of the SMRTanalysis (2.3.0) to process (and assemble using RS_HGAP_Assembly.2) my data, and in addition to the .h5 reads, SMRTPortal has produced raw subreads in fasta format. Is this a new output feature? Or would these not work for doing metagenomic filtering with a known "non-target" organism? Thanks!
|
![]() |
![]() |
![]() |
#20 |
Junior Member
Location: Europe Join Date: Jan 2015
Posts: 2
|
![]()
Hi!
As "acgt", I have also to process/assemble metagenomics data with PacBio reads, I follow the next publications (may be useful to some): http://files.pacb.com/pdf/RHall_ASM2...veWorkflow.pdf http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3294205/ The problem is that blasting the preassembled reads to ncbi NT database I find many reads with a partial (in length) best hit. Consequently, I would like to ask: 1) How could I improve the sensitivity of overlapping step (e.g. blasr parameters) in order to improve the consensus generation in terms of production less chimeric preassembled reads? 2) Also should I have to keep the same values for blasr running in the polishing step? Thanks in advance ![]() |
![]() |
![]() |
![]() |
Tags |
filter reads, pacbio |
Thread Tools | |
|
|