SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > Pacific Biosciences



Similar Threads
Thread Thread Starter Forum Replies Last Post
BAC DNA extraction for PacBio juckdnarocks Pacific Biosciences 3 04-04-2013 11:50 AM
De novo assembly of PacBio with short Illumina data boetsie Pacific Biosciences 1 10-06-2012 12:35 PM
Mapping vs De Novo & Masking or not Eric Fournier General 0 07-05-2011 07:49 AM
PubMed: A De Novo Expression Profiling of Anopheles funestus, Malaria Vector in Afric Newsbot! Literature Watch 0 03-03-2011 02:00 AM
De Novo plant Bac sequencing JDL De novo discovery 11 03-18-2009 04:01 PM

Reply
 
Thread Tools
Old 05-23-2013, 02:22 AM   #1
Champi
Member
 
Location: Japan

Join Date: Dec 2011
Posts: 17
Default BAC vector sequece masking for de novo assembly using PacBio C2

Dear all,

I have PacBio reads from a pool of BACs (eventually, 100X coverage taking into account an average of 120 Kb of insert size in the BAC), and am having problems to find a decent solution to mask the vector sequences. The vector is 8 Kb long, and given the fact that a considerable proportion of my reads are longer than 8 Kb, that means that I will have in many cases the vector sequence in the middle, and the sequences I am interested in at the ends of the reads...

My idea is to use SSAHA2 as recommended in the MIRA manual, and then try to assemble with MIRA (all this after correcting the reads using the PacBioToCA pipeline, which I got to run it without any problem). However, how would MIRA use a read with a sequence masked in the middle? Would it use the two extremes as independent reads (wanted effect)? Or would join the two extremes (unwanted effect)?

I have tried to assemble without masking the vector, but since it's the same vector for all the BACs, I am getting problems and to many contigs (including quimaeras...)

Thanks a lot!

Last edited by Champi; 05-23-2013 at 04:37 AM.
Champi is offline   Reply With Quote
Old 05-23-2013, 03:25 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

How about using "crossmatch" from Phred/Phrap to mask the vector. It is part of the Phred/Phrap suite (http://www.phrap.org/).

Repeat masker may work too http://www.repeatmasker.org/

You would have to edit the masked sequence out before doing assembly.

Last edited by GenoMax; 05-23-2013 at 03:27 AM.
GenoMax is offline   Reply With Quote
Old 05-24-2013, 10:06 AM   #3
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

If I were faced with this situation (which I may be very soon) I would write a quick little program to slice the reads up when masked in the middle (I started to write it, then wasn't sure if you wanted FASTA or FASTQ splitting; both quite doable).

You might also post this to the MIRA mailing list; Bastien likes knowing how scientists are using MIRA & he sometimes knows some cool tricks that are thinly documented (or not yet released).
krobison is offline   Reply With Quote
Old 05-24-2013, 09:57 PM   #4
Champi
Member
 
Location: Japan

Join Date: Dec 2011
Posts: 17
Default

Quote:
Originally Posted by krobison View Post
If I were faced with this situation (which I may be very soon) I would write a quick little program to slice the reads up when masked in the middle (I started to write it, then wasn't sure if you wanted FASTA or FASTQ splitting; both quite doable).
Thank you very much krobison. I was thinking on writing a perl script to do that when found non-masked sequences at both sides of masked one in a read, splitting that read into reads a and b. However, my level of Perl is (below) very beginner , and it will take time, although I think it's a good exercise. I'm working with fastq, what I suppose will make the thing double complicated, but will try.

Quote:
Originally Posted by krobison View Post
You might also post this to the MIRA mailing list; Bastien likes knowing how scientists are using MIRA & he sometimes knows some cool tricks that are thinly documented (or not yet released).
Will do for sure! Thank you for letting me know

Cheers
Champi is offline   Reply With Quote
Old 05-25-2013, 10:41 AM   #5
mchaisso
Member
 
Location: Seattle, WA

Join Date: Apr 2008
Posts: 84
Default

Hi, we address the same problem in our lab.

There are two solutions, one if you are doing things by hand, the other if you are going to incorporate this into a production pipeline and use SMRTPipe to handle assembly for you. Both need access to the original bas.h5 file.

For the first method
# Create a file of the usable subreads:
# this program is included in the blasr distribution under pbihdfutils
pls2fasta m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5 reads.fasta -trimByRegion

# Next map the reads to the vector using blasr. allowing for multiple hits
blasr reads.fasta vector.fasta -m 4 -bestn 10 -out alignments.tovector.m4

# use your own filtering program to cut out the masked sequences
filterSequences.py reads.fasta alignments.tovector.m4 reads.split.fasta

... do the rest of your assembly with the split file.

If you want to use smrtpipe to do all the heavy lifting, you need to do some modifications to the original bas.h5 file (ideally a copy of it, and more ideally a file that encodes a single dataset from this file).

So, a little, ok, a long, background about PacBio sequences if you are new to it.

A pacbio "read" will contain 3 parts: a low quality prefix, a usable middle portion, and a low quality suffix. The reason read is quoted is because the middle portion may contain one or more passes over the sequencing template which are often called subreads so you may consider a read to be the consensus of the subreads, or the subreads themselves. The low quality prefix and suffix exist because there is noise at the beginning and ending of a movie that are recorded and interpreted as bases and called.

The length of the low quality prefix/suffix are different for every read, and may even be 1000s of bases. However, the basecalling is VERY accurate at detecting just what are the low quality prefix and suffix sequences, and there is an annotation of this described next. And don't worry, the statistics on all the 10kb reads are after subtracting off the prefix and suffix, so the numbers are not inflated.

Tucked in the bas.h5 file is a dataset called a RegionTable:
>h5ls -r data/Analysis_Results/m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5
/ Group
/PulseData Group
/PulseData/BaseCalls Group
/PulseData/BaseCalls/Basecall Dataset {468462122/Inf}
/PulseData/BaseCalls/DeletionQV Dataset {468462122/Inf}
...
/PulseData/Regions

This table contains a description of the segments of a read, namely the coordinates of the start and end of the high quality middle, and the coordinates of the adapter and insert for every ZMW in the dataset.

Once you get over the headache of hdf5 coding, you can do nice things like make a dataset self documenting. In this instance, the description of the columns is part of the dataset:
ATTRIBUTE "ColumnNames" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 5 ) / ( H5S_UNLIMITED ) }
DATA {
(0): "HoleNumber", "Region type index", "Region start in bases",
(3): "Region end in bases", "Region score"
}
}

The region types are enumerated, which is also described in the dataset:

ATTRIBUTE "RegionTypes" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 3 ) / ( H5S_UNLIMITED ) }
DATA {
(0): "Adapter", "Insert", "HQRegion"
}
}

Looking at the first few lines of the region table yields this:

h5dump -d /PulseData/Regions data/Analysis_Results/m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5
HDF5 "data/Analysis_Results/m130517_113100_42134_c10052345255000000182308330928
1340_s1_p0.bas.h5" {
DATASET "/PulseData/Regions" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 197012, 5 ) / ( H5S_UNLIMITED, 5 ) }
DATA {
(0,0): 0, 1, 0, 426, -1,
(1,0): 0, 2, 0, 0, 0,
(2,0): 1, 1, 0, 507, -1,
(3,0): 1, 2, 0, 0, 0,
(4,0): 2, 1, 0, 523, -1,
(5,0): 2, 2, 0, 0, 0,
(6,0): 3, 1, 0, 354, -1,
(7,0): 3, 2, 0, 0, 0,
(8,0): 4, 1, 0, 15159, -1,
(9,0): 4, 1, 15211, 22459, -1,
(10,0): 4, 0, 15159, 15211, 750,
(11,0): 4, 2, 17273, 22458, 894,

You can see reads 0-3 have no usable sequence, which is expected due to the Poisson loading of the smrtcell.

Read 4 has a high-quality segment from 17273 - 22458. Paradoxically, it is possible for the low quality prefix and annotated insert regions to overlap.


Ok, now finally to the vector screening. This was mainly set up by our bioinformatics specialist with a little feedback from me. We use blasr to map the original fasta reads to the vector sequence, allowing for multiple hits:
blasr m130517_113100_42134_c100523452550000001823083309281340_s1_p0.fasta vector.fa -m 4 -bestn 10 -out alignment.m4 -header

The next step requires some python coding. HDF is extremely painful to use in C/C++, but pretty easy just a bit slow to use in python. If you modify the rows in the RegionTable to reflect the masked off sequence and write this back out, you can feed the modified bas.h5 files into smrtpipe, and assembly away, sans-vector.

Finally, an even better solution is to write out an entirely new file that contains only the region table m130517_113100_42134_c100523452550000001823083309281340_s1_p0.rgn.h5. Smrtpipe allows one to ignore the region table included in the bas.h5 file, and use the custom region table in this file, for scenarios just like this masking one here.


This post now sums up more text than I've ever posted on seqanswers since starting. Enjoy.

-mark
mchaisso is offline   Reply With Quote
Old 05-26-2013, 04:33 PM   #6
Champi
Member
 
Location: Japan

Join Date: Dec 2011
Posts: 17
Default

@mchaisso, that is an overwhelming amount of information

Since I don't have access to the original bas.h5 files (the provider gave me already filtered .fastq files), I will go for the first solution as I said just in my previous post: align the BAC vector sequence to the (already corrected) reads using SSAHA2 (although I think I could use blasr as you say) and then use the information from the output to split the reads with an in-house script and proceed with the resulting split-reads file.

If I get the original bas.h5 files (which I will ask the provider for), I will try also the second option that you propose.

Thank you very very much!!!
Champi is offline   Reply With Quote
Old 05-27-2013, 01:59 AM   #7
skycreative
Member
 
Location: GuangXi China

Join Date: Jan 2010
Posts: 27
Default

I like this way, and if there is a seq contain the vector sequence in the middle,
it is also can be splited as paired reads for assembly.
Quote:
Originally Posted by GenoMax View Post
How about using "crossmatch" from Phred/Phrap to mask the vector. It is part of the Phred/Phrap suite (http://www.phrap.org/).

Repeat masker may work too http://www.repeatmasker.org/

You would have to edit the masked sequence out before doing assembly.
skycreative is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:08 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO