SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > Pacific Biosciences



Similar Threads
Thread Thread Starter Forum Replies Last Post
Find large indels from PacBio reads? metheuse Pacific Biosciences 1 07-01-2013 03:08 PM
Have a genome assembly, what should I do with 15x Pacbio reads? lemur2 Pacific Biosciences 2 10-25-2012 10:06 PM
Filtering clonal reads AlexB 454 Pyrosequencing 2 05-11-2010 01:30 PM
Filtering SOLiD reads k-gun12 Bioinformatics 8 03-12-2010 09:51 PM
Filtering and excluding reads dawe Bioinformatics 3 09-09-2009 09:47 AM

Reply
 
Thread Tools
Old 11-26-2013, 03:20 AM   #1
ebioman
Member
 
Location: Switzerland

Join Date: Aug 2013
Posts: 41
Default Filtering PacBio reads

Hello

I am dealing with a rather common problem which gives me real headaches with PacBio data:

How can you filter your reads prior assembly ?

Note: I am not talking about subread filtering, quality filtering and so on, but filtering for "contamination".

E.g. Assembling a genome and filtering known metagenomes, or filtering only mitochondrial reads and assembling its genome from a sequenced organism ,.....

So, what I am looking for is a possibility to filter the *.h5 files and only use certain ones for the assembly. I guess one would have to parse the files, but with the h5 format that is a little bit beyond my scope.
For some modules (e.g. Allora) FASTA can be provided, but for other ones not (e.g. HGAP).

Anybody ideas ?
ebioman is offline   Reply With Quote
Old 11-27-2013, 06:36 AM   #2
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 415
Default

Have a look here: https://github.com/PacificBioscience...filtering-step. Does that answer your question?
flxlex is offline   Reply With Quote
Old 11-27-2013, 07:24 AM   #3
ebioman
Member
 
Location: Switzerland

Join Date: Aug 2013
Posts: 41
Default

On the first glance, it seems exactly what I was looking for - thanks !
Will test it over night
ebioman is offline   Reply With Quote
Old 05-26-2014, 06:46 AM   #4
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 415
Default

Great, thanks for the link. The tutorial is even working (!).
colindaven is offline   Reply With Quote
Old 06-18-2014, 09:03 AM   #5
OstanNick
Member
 
Location: Canada

Join Date: Apr 2014
Posts: 26
Default

Hello,

I'm following the whitelist tutorial and have encountered an error where I have to use their python script to generate the whitelist.txt.

I'm using my own PacBio data as opposed to the data provided, but I don't think that it should make a difference.

When I run the python script, this is the error I encounter:

Traceback (most recent call last):
File "whitelist.py", line 20, in <module>
hole_num = alignment_id_parts[1]
IndexError: list index out of range

Which I believe means that when the list is generated from splitting the first column at the '/', it is not generating any more than one list element.

Is anybody else encountering this issue?

Regards,
Nick
OstanNick is offline   Reply With Quote
Old 06-19-2014, 10:31 AM   #6
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 322
Default

Can you post an example of your alignment output?
Code:
head <align output file>
The script isn't very robust, any deviation in the blasr output will cause issues.
Thanks.
rhall is offline   Reply With Quote
Old 06-19-2014, 10:55 AM   #7
OstanNick
Member
 
Location: Canada

Join Date: Apr 2014
Posts: 26
Default

I got it working but had to change around my script a bit. I had to change the headers to look like:

Code:
qName/qHole/qAlign tName qStrand tStrand score percentSimilarity tStart tEnd tLength qStart qEnd qLength nCells
m130726_012611_42158_c100524682550000001823080609281382_s1_p0/72/2066_4077 gi|556503834|ref|NC_000913.3| 0 0 -828 91.0891 366290 366482 4641652 2118 2314 4346 4145
Because it wasn't properly splitting at the space separations.

I changed the code to just include my one movie file:

Code:
import sys

# NOTE: this script assumes a single movie in the blasr alignment

blacklist_alignment = sys.argv[1]
num_holes = int(sys.argv[2])

blacklist = set()
movie_prefix = 'm130726_012611_42158_c100524682550000001823080609281382_s1_p0'
for line in open(blacklist_alignment):
    if line.startswith('qname'):
        continue

    cols = line.split()

    alignment_id_parts = cols[0].split('/')

    hole_num = alignment_id_parts[1]

    blacklist.add(hole_num)

for i in range(num_holes):
    i = str(i)
    if i not in blacklist:
        print '/'.join([movie_prefix, i])
I got my whitelist.txt output to look like such:

Code:
m130726_012611_42158_c100524682550000001823080609281382_s1_p0/0
m130726_012611_42158_c100524682550000001823080609281382_s1_p0/1
m130726_012611_42158_c100524682550000001823080609281382_s1_p0/2
m130726_012611_42158_c100524682550000001823080609281382_s1_p0/3
m130726_012611_42158_c100524682550000001823080609281382_s1_p0/4

...
Does including the following parameter in the settings.xml file:

Code:
 <param name="whiteList" label="Read IDs to whitelist">
            <value>/home/USERNAME/BACassembly/whitelist.txt</value>
            </param>
Automatically let Smrtpipe know that I want to include those reads as on a whitelist?
In other words, does Smrtpipe have a whitelist function built into it? I would have expected more code to be required.
OstanNick is offline   Reply With Quote
Old 06-19-2014, 11:07 AM   #8
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 322
Default

Yes SMRT Pipe should filter out the reads if passed that parameter. Technically an extra filter is passed to the filtering task, you can check workflow/P_Filter/filter_?of?.sh to make sure the ReadWhilelist parameter is present:
Code:
filterPlsH5.py --debug --filter='MinReadScore=0.80,MinSRL=500,MinRL=100,ReadWhitelist=/home/USERNAME/BACassembly/whitelist.txt ......'
rhall is offline   Reply With Quote
Old 07-08-2014, 11:46 AM   #9
OstanNick
Member
 
Location: Canada

Join Date: Apr 2014
Posts: 26
Default

The whitelisting tutorial online is a good one, but it doesn't specify the format the whitelist would need to be if you wanted to run multiple SMRT cells with their own whitelisted reads.

What format would the whitelist file need to be in, in order to run say 8 smrt cells all of which are decontaminated?


"...Note whitelisting in this tutorial assumes that you are working on one cell of data at a time (1x bas.h5 file, or 3x bax.h5 files), it could be expanded to run on multiple cells with some changes to the code."
OstanNick is offline   Reply With Quote
Old 07-08-2014, 12:06 PM   #10
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 322
Default

The whitelist file is in the same format, e.g.
m130726_012611_42158_c100524682550000001823080609281382_s1_p0/0

The cell/file name would just be different. The note in the tutorial is because the tutorial whitelists the reads that do not map to ecoli, therefore the script needs to take the 'inverse' of the reads that map to ecoli, this is only really simple for a single cell.
To make the script work on multiple cells it would need to be expanded to keep track of all the cells, the alternative is to generate the whitelist for each cell, then join them together to get a whitelist for running a job on all the cells.
rhall is offline   Reply With Quote
Old 07-08-2014, 12:40 PM   #11
OstanNick
Member
 
Location: Canada

Join Date: Apr 2014
Posts: 26
Default

Awesome, thank you for your reply.

So basically a whitelist encompassing more than one cell could look like:

<cell_1>_s1_p0/0
<cell_1>_s1_p0/2
<cell_1>_s1_p0/5
....
<cell_1>_s1_p0/81740
<cell_2>_s1_p0/1
<cell_2>_s1_p0/2
<cell_2>_s1_p0/6

Such that subsequent whitelists are basically appended to the previous one?

Regards,
Nick
OstanNick is offline   Reply With Quote
Old 07-08-2014, 01:31 PM   #12
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 322
Default

Exactly
rhall is offline   Reply With Quote
Old 07-22-2014, 09:49 PM   #13
seb.lees
Member
 
Location: France, Poitiers

Join Date: Sep 2012
Posts: 12
Default New link for the whitelist tutorial ?

Quote:
Originally Posted by flxlex View Post
Have a look here: https://github.com/PacificBioscience...filtering-step. Does that answer your question?
Hi all,

It seems that the previous link doesn't redirect to the correct page anymore (at least for me).

Nonetheless, I found the whitelist tutorial at this location:

https://github.com/PacificBioscience...sting-Tutorial

Can the whitelist procedure be implemented in SMRT Portal ? It is a very standard filtering step which would benefit to be run in a user-friendly way.

Seb.
seb.lees is offline   Reply With Quote
Old 07-22-2014, 09:57 PM   #14
OstanNick
Member
 
Location: Canada

Join Date: Apr 2014
Posts: 26
Default

Indeed, it can be implemented but it requires some XML coding. If you just follow the tutorial, and don't do the last part of it (the SMRTpipe part), and make your own custom protocol using the whitelist filtering parameter, you can specify the path to your whitelist in smrtportal as a parameter in the filtering step and it will use the whitelist to filter your reads
OstanNick is offline   Reply With Quote
Old 07-22-2014, 10:45 PM   #15
seb.lees
Member
 
Location: France, Poitiers

Join Date: Sep 2012
Posts: 12
Default

Ok, thanks.

But I need to practice more bas.h5 manipulation and protocol creation first, as I'm very new with PacBio data.

Actually I don't have my own data yet, so I'm practicing with tutorials data. By the way, do you know where can I find publicly available cosmid sequencing data under the PacBio format ?
seb.lees is offline   Reply With Quote
Old 07-23-2014, 09:11 AM   #16
OstanNick
Member
 
Location: Canada

Join Date: Apr 2014
Posts: 26
Default

I don't know about cosmid data, but the tutorial provides you with sample data. It's useful if you're just trying to get familiar with the bas.h5 file type.

Code:
wget http://files.pacb.com/software/hgap/HGAp_BAS_H5_DATA/HGAp_BAS_H5_DATA/BAC/m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5
Cheers
Nick

Last edited by OstanNick; 07-23-2014 at 09:20 AM.
OstanNick is offline   Reply With Quote
Old 07-23-2014, 10:35 PM   #17
seb.lees
Member
 
Location: France, Poitiers

Join Date: Sep 2012
Posts: 12
Default

Yes, I will use this sample data. BAC and Cosmids are finally quite similar, and the method would be the same.

Thanks,
seb.
seb.lees is offline   Reply With Quote
Old 07-24-2014, 09:50 AM   #18
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 322
Default

I just wanted to add, that is very old data (from an original RS machine, not an RS II), it may be better going with something that isn't a BAC or Cosmid, but is current. The data is also missing a metadeta.xml file, so will be almost impossible to import into SMRT Portal.
rhall is offline   Reply With Quote
Old 01-29-2015, 08:11 AM   #19
acgt
Junior Member
 
Location: USA

Join Date: Nov 2013
Posts: 3
Default

Hi, it's been a while since a post to this thread, but I have a question/comment that could help future people looking to filter PacBio libraries of a "metagenomic" nature - especially as the links above redirect to protocols using older versions of the programs. I have used the latest version of the SMRTanalysis (2.3.0) to process (and assemble using RS_HGAP_Assembly.2) my data, and in addition to the .h5 reads, SMRTPortal has produced raw subreads in fasta format. Is this a new output feature? Or would these not work for doing metagenomic filtering with a known "non-target" organism? Thanks!
acgt is offline   Reply With Quote
Old 03-04-2015, 08:04 AM   #20
damianosmel
Junior Member
 
Location: Europe

Join Date: Jan 2015
Posts: 2
Default

Hi!

As "acgt", I have also to process/assemble metagenomics data with PacBio reads, I follow the next publications (may be useful to some):

http://files.pacb.com/pdf/RHall_ASM2...veWorkflow.pdf

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3294205/

The problem is that blasting the preassembled reads to ncbi NT database I find many reads with a partial (in length) best hit.

Consequently, I would like to ask:
1) How could I improve the sensitivity of overlapping step (e.g. blasr parameters) in order to improve the consensus generation in terms of production less chimeric preassembled reads?
2) Also should I have to keep the same values for blasr running in the polishing step?

Thanks in advance
damianosmel is offline   Reply With Quote
Reply

Tags
filter reads, pacbio

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:05 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO