![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
BAC DNA extraction for PacBio | juckdnarocks | Pacific Biosciences | 3 | 04-04-2013 12:50 PM |
De novo assembly of PacBio with short Illumina data | boetsie | Pacific Biosciences | 1 | 10-06-2012 01:35 PM |
Mapping vs De Novo & Masking or not | Eric Fournier | General | 0 | 07-05-2011 08:49 AM |
PubMed: A De Novo Expression Profiling of Anopheles funestus, Malaria Vector in Afric | Newsbot! | Literature Watch | 0 | 03-03-2011 03:00 AM |
De Novo plant Bac sequencing | JDL | De novo discovery | 11 | 03-18-2009 05:01 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Japan Join Date: Dec 2011
Posts: 17
|
![]()
Dear all,
I have PacBio reads from a pool of BACs (eventually, 100X coverage taking into account an average of 120 Kb of insert size in the BAC), and am having problems to find a decent solution to mask the vector sequences. The vector is 8 Kb long, and given the fact that a considerable proportion of my reads are longer than 8 Kb, that means that I will have in many cases the vector sequence in the middle, and the sequences I am interested in at the ends of the reads... My idea is to use SSAHA2 as recommended in the MIRA manual, and then try to assemble with MIRA (all this after correcting the reads using the PacBioToCA pipeline, which I got to run it without any problem). However, how would MIRA use a read with a sequence masked in the middle? Would it use the two extremes as independent reads (wanted effect)? Or would join the two extremes (unwanted effect)? ![]() I have tried to assemble without masking the vector, but since it's the same vector for all the BACs, I am getting problems and to many contigs (including quimaeras...) Thanks a lot! Last edited by Champi; 05-23-2013 at 05:37 AM. |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,088
|
![]()
How about using "crossmatch" from Phred/Phrap to mask the vector. It is part of the Phred/Phrap suite (http://www.phrap.org/).
Repeat masker may work too http://www.repeatmasker.org/ You would have to edit the masked sequence out before doing assembly. Last edited by GenoMax; 05-23-2013 at 04:27 AM. |
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: Boston area Join Date: Nov 2007
Posts: 747
|
![]()
If I were faced with this situation (which I may be very soon) I would write a quick little program to slice the reads up when masked in the middle (I started to write it, then wasn't sure if you wanted FASTA or FASTQ splitting; both quite doable).
You might also post this to the MIRA mailing list; Bastien likes knowing how scientists are using MIRA & he sometimes knows some cool tricks that are thinly documented (or not yet released). |
![]() |
![]() |
![]() |
#4 | ||
Member
Location: Japan Join Date: Dec 2011
Posts: 17
|
![]() Quote:
![]() Quote:
Cheers |
||
![]() |
![]() |
![]() |
#5 |
Member
Location: Seattle, WA Join Date: Apr 2008
Posts: 84
|
![]()
Hi, we address the same problem in our lab.
There are two solutions, one if you are doing things by hand, the other if you are going to incorporate this into a production pipeline and use SMRTPipe to handle assembly for you. Both need access to the original bas.h5 file. For the first method # Create a file of the usable subreads: # this program is included in the blasr distribution under pbihdfutils pls2fasta m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5 reads.fasta -trimByRegion # Next map the reads to the vector using blasr. allowing for multiple hits blasr reads.fasta vector.fasta -m 4 -bestn 10 -out alignments.tovector.m4 # use your own filtering program to cut out the masked sequences filterSequences.py reads.fasta alignments.tovector.m4 reads.split.fasta ... do the rest of your assembly with the split file. If you want to use smrtpipe to do all the heavy lifting, you need to do some modifications to the original bas.h5 file (ideally a copy of it, and more ideally a file that encodes a single dataset from this file). So, a little, ok, a long, background about PacBio sequences if you are new to it. A pacbio "read" will contain 3 parts: a low quality prefix, a usable middle portion, and a low quality suffix. The reason read is quoted is because the middle portion may contain one or more passes over the sequencing template which are often called subreads so you may consider a read to be the consensus of the subreads, or the subreads themselves. The low quality prefix and suffix exist because there is noise at the beginning and ending of a movie that are recorded and interpreted as bases and called. The length of the low quality prefix/suffix are different for every read, and may even be 1000s of bases. However, the basecalling is VERY accurate at detecting just what are the low quality prefix and suffix sequences, and there is an annotation of this described next. And don't worry, the statistics on all the 10kb reads are after subtracting off the prefix and suffix, so the numbers are not inflated. Tucked in the bas.h5 file is a dataset called a RegionTable: >h5ls -r data/Analysis_Results/m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5 / Group /PulseData Group /PulseData/BaseCalls Group /PulseData/BaseCalls/Basecall Dataset {468462122/Inf} /PulseData/BaseCalls/DeletionQV Dataset {468462122/Inf} ... /PulseData/Regions This table contains a description of the segments of a read, namely the coordinates of the start and end of the high quality middle, and the coordinates of the adapter and insert for every ZMW in the dataset. Once you get over the headache of hdf5 coding, you can do nice things like make a dataset self documenting. In this instance, the description of the columns is part of the dataset: ATTRIBUTE "ColumnNames" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SIMPLE { ( 5 ) / ( H5S_UNLIMITED ) } DATA { (0): "HoleNumber", "Region type index", "Region start in bases", (3): "Region end in bases", "Region score" } } The region types are enumerated, which is also described in the dataset: ATTRIBUTE "RegionTypes" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SIMPLE { ( 3 ) / ( H5S_UNLIMITED ) } DATA { (0): "Adapter", "Insert", "HQRegion" } } Looking at the first few lines of the region table yields this: h5dump -d /PulseData/Regions data/Analysis_Results/m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5 HDF5 "data/Analysis_Results/m130517_113100_42134_c10052345255000000182308330928 1340_s1_p0.bas.h5" { DATASET "/PulseData/Regions" { DATATYPE H5T_STD_I32LE DATASPACE SIMPLE { ( 197012, 5 ) / ( H5S_UNLIMITED, 5 ) } DATA { (0,0): 0, 1, 0, 426, -1, (1,0): 0, 2, 0, 0, 0, (2,0): 1, 1, 0, 507, -1, (3,0): 1, 2, 0, 0, 0, (4,0): 2, 1, 0, 523, -1, (5,0): 2, 2, 0, 0, 0, (6,0): 3, 1, 0, 354, -1, (7,0): 3, 2, 0, 0, 0, (8,0): 4, 1, 0, 15159, -1, (9,0): 4, 1, 15211, 22459, -1, (10,0): 4, 0, 15159, 15211, 750, (11,0): 4, 2, 17273, 22458, 894, You can see reads 0-3 have no usable sequence, which is expected due to the Poisson loading of the smrtcell. Read 4 has a high-quality segment from 17273 - 22458. Paradoxically, it is possible for the low quality prefix and annotated insert regions to overlap. Ok, now finally to the vector screening. This was mainly set up by our bioinformatics specialist with a little feedback from me. We use blasr to map the original fasta reads to the vector sequence, allowing for multiple hits: blasr m130517_113100_42134_c100523452550000001823083309281340_s1_p0.fasta vector.fa -m 4 -bestn 10 -out alignment.m4 -header The next step requires some python coding. HDF is extremely painful to use in C/C++, but pretty easy just a bit slow to use in python. If you modify the rows in the RegionTable to reflect the masked off sequence and write this back out, you can feed the modified bas.h5 files into smrtpipe, and assembly away, sans-vector. Finally, an even better solution is to write out an entirely new file that contains only the region table m130517_113100_42134_c100523452550000001823083309281340_s1_p0.rgn.h5. Smrtpipe allows one to ignore the region table included in the bas.h5 file, and use the custom region table in this file, for scenarios just like this masking one here. This post now sums up more text than I've ever posted on seqanswers since starting. Enjoy. -mark |
![]() |
![]() |
![]() |
#6 |
Member
Location: Japan Join Date: Dec 2011
Posts: 17
|
![]()
@mchaisso, that is an overwhelming amount of information
![]() Since I don't have access to the original bas.h5 files (the provider gave me already filtered .fastq files), I will go for the first solution as I said just in my previous post: align the BAC vector sequence to the (already corrected) reads using SSAHA2 (although I think I could use blasr as you say) and then use the information from the output to split the reads with an in-house script and proceed with the resulting split-reads file. If I get the original bas.h5 files (which I will ask the provider for), I will try also the second option that you propose. Thank you very very much!!! |
![]() |
![]() |
![]() |
#7 | |
Member
Location: GuangXi China Join Date: Jan 2010
Posts: 27
|
![]()
I like this way, and if there is a seq contain the vector sequence in the middle,
it is also can be splited as paired reads for assembly. Quote:
|
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|