**OstanNick** · 07-23-2014, 08:11 AM

I don't know about cosmid data, but the tutorial provides you with sample data. It's useful if you're just trying to get familiar with the bas.h5 file type.

Code:

wget http://files.pacb.com/software/hgap/HGAp_BAS_H5_DATA/HGAp_BAS_H5_DATA/BAC/m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5

Cheers
Nick

**seb.lees** · 07-23-2014, 09:35 PM

Yes, I will use this sample data. BACs and cosmids are ultimately quite similar, so the method should be the same.

Thanks,
seb.

**rhall** · 07-24-2014, 08:50 AM

I just wanted to add that this is very old data (from an original RS machine, not an RS II). It may be better to go with something that isn't a BAC or cosmid but is current. The data is also missing a metadata.xml file, so it will be almost impossible to import into SMRT Portal.

**acgt** · 01-29-2015, 08:11 AM

Hi, it's been a while since a post to this thread, but I have a question/comment that could help future people looking to filter PacBio libraries of a "metagenomic" nature - especially as the links above redirect to protocols using older versions of the programs. I have used the latest version of the SMRTanalysis (2.3.0) to process (and assemble using RS_HGAP_Assembly.2) my data, and in addition to the .h5 reads, SMRTPortal has produced raw subreads in fasta format. Is this a new output feature? Or would these not work for doing metagenomic filtering with a known "non-target" organism? Thanks!

**damianosmel** · 03-04-2015, 08:04 AM

Hi!

Like "acgt", I also have to process and assemble metagenomic data from PacBio reads. I am following these publications (they may be useful to some):

http://files.pacb.com/pdf/RHall_ASM2014_InteractiveWorkflow.pdf

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3294205/

The problem is that when I BLAST the preassembled reads against the NCBI nt database, many reads have a best hit that covers only part of their length.

Consequently, I would like to ask:
1) How could I improve the sensitivity of the overlapping step (e.g. blasr parameters) so that consensus generation produces fewer chimeric preassembled reads?
2) Should I keep the same blasr values in the polishing step?

Thanks in advance
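
Editorial note: one way to quantify the "partial best hit" observation above is to compute, for each preassembled read, the fraction of its length covered by its best hit. The following is a minimal Python sketch and not part of the original post. It assumes the BLAST results were written as tab-separated lines with five columns (qseqid, sseqid, qstart, qend, qlen) and that the first line listed for each query is its best hit. The function names and the 0.5 threshold are illustrative only.

```python
import csv
import io


def best_hit_coverage(handle):
    """Return {query_id: fraction of the query covered by its first-listed hit}.

    Assumed column layout (an assumption, not from the thread):
    qseqid, sseqid, qstart, qend, qlen
    """
    coverage = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, _sseqid, qstart, qend, qlen = row[:5]
        if qseqid in coverage:
            continue  # keep only the first (assumed best) hit per query
        aligned = abs(int(qend) - int(qstart)) + 1
        coverage[qseqid] = aligned / int(qlen)
    return coverage


def partial_hits(coverage, threshold=0.5):
    """Return the sorted query IDs whose best hit covers less than `threshold`."""
    return sorted(q for q, frac in coverage.items() if frac < threshold)


# Tiny in-memory example with made-up read and subject names.
example = (
    "read1\tsubjA\t1\t900\t1000\n"
    "read1\tsubjB\t1\t100\t1000\n"
    "read2\tsubjC\t101\t400\t3000\n"
)
cov = best_hit_coverage(io.StringIO(example))
flagged = partial_hits(cov)
print(cov, flagged)
```

Reads flagged this way are only candidates for closer inspection. A short best hit can also mean that the database lacks a close relative, so low coverage alone does not prove a chimera.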

**rhall** · 03-04-2015, 10:17 AM

Hi,
Last question first.
2. The blasr parameters for the preassembly and polishing are completely independent.

I'm not sure I fully understand question 1. Given the default parameters (HGAP.3), the chance of correcting a randomly created biological chimera is very low. I can probably recommend some blasr parameters, but the recommendation will depend on what exactly is happening. Could you give a little more detail?

Is the sample a non-amplified (not WGA, MDA) shotgun sample?
Are you using HGAP.3 for the assembly?
Are you sure that the preassembly is generating chimeric corrected reads? I.e., do you have a perfect reference for what is being sequenced?
What do the final assembled reads look like? Are the chimeric reads used in the assembly?

**damianosmel** · 03-05-2015, 08:28 AM

Dear rhall,

Thank you very much for your reply and willingness to help us out.

Let me specify more: we are sequencing nematode DNA (roughly 31% GC content) extracted from the soil of infected plants (tomato; similar GC content); no amplification step.

We use HGAP.3. We have checked a part of our assembled reads with BLAST vs nrDB(nt) to get a picture of the key taxonomic groups: we find a lot of proteobacteria (GC content varying from 40-70%), down to arthropods and several other organisms. For a good portion of the assembled reads used as BLAST input, we only get low partial coverage in the BLAST result, potentially indicating that the parameters originally selected as basis for the pre-assembly step could be improved in order to avoid having any chimera.

Unfortunately, we don’t have a perfect reference. Based on the final assembly, in some cases we see that only a small fraction of a long read hits, e.g., the tomato genome (which has been sequenced), which is why we want to make sure we are strict in the pre-assembly step. In fact, we’d ultimately like to use the combination of GC content, coverage and taxonomy information to limit the reads that we select for a subsequent partial nematode genome assembly. Limiting this in the pre-assembly, whose results we screen by BLAST, would be very helpful.

Possibly you would have a suggestion about the parameter selection to avoid potential chimera creation in this metagenome sequencing situation.

Thank you very much!
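
Editorial note: the GC-content part of the read selection described above can be sketched in a few lines of Python. This is a minimal illustration and not the poster's actual method. The FASTA parser, the function names and the 0.15 to 0.40 window are assumptions chosen for the example. As the post itself notes, nematode and tomato have similar GC content, so a GC window can at best separate them from the higher-GC proteobacteria and would need to be combined with coverage and taxonomy information.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """Return the GC fraction of a sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def within_gc_window(records, low, high):
    """Return the names of records whose GC fraction lies in [low, high]."""
    return [name for name, seq in records if low <= gc_fraction(seq) <= high]


# Tiny in-memory example with made-up reads.
example = [">r1 some description", "ATATATGC", "AT", ">r2", "GCGCGCAT"]
selected = within_gc_window(read_fasta(example), 0.15, 0.40)
print(selected)
```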

**rhall** · 03-05-2015, 09:51 AM

Sorry, a few more questions. HGAP.3 parameters are set such that artificial chimera generation is extremely unlikely. I have used HGAP.3 (default parameters) with multiple metagenomic samples and have never had this problem, so it's probably best to diagnose where the chimeras are coming from before trying different parameters by brute force.
Are you blasting the preassembled reads ('corrected.fasta') or the assembled reads ('polished_assembly.fasta')? If it is the preassembled reads, then it is possible to increase the coverage requirement for preassembly ('Minimum Coverage For Correction'). The default is 6. Increasing it will reduce the preassembled yield, but it is already unlikely that 6 reads would support a chimeric read that has no biological root cause. Chimera generation in the OLC step would be affected by the overlap error rate in the CA spec file, but that setting is already extremely conservative.

One option is to take the set of reads that contain possible chimeras and resequence all the data against these reads to see if the chimeras have raw read support.

**cascoamarillo** · 04-20-2015, 11:19 AM

Hi,
First contact with PacBio here. I have the filtered_subreads in fastq and fasta format, and the subreads without the spike-in control in fasta (I'll get all the raw data later). Apparently, the PacBio pipeline does not remove the control reads from the fastq files (that is what they told me). So what's the best way to get subreads in fastq without the control reads? Thanks

*I don't work with SMRT Portal.

**rhall** · 04-20-2015, 12:06 PM

Who provided the filtered_subreads.fasta/q? It is possible to filter out the spike-in control when these are generated.
Do I understand correctly that you have subreads in fasta format without the control? Could you use this as a reference to extract the fastq reads from the filtered_subreads.fastq file?
What are the files being used as input to? If it's assembly, simply leave them in; the control will assemble out.
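
Editorial note: the suggestion above, using the control-free FASTA as a reference for which FASTQ records to keep, can be sketched in Python as follows. This is a minimal sketch rather than an official PacBio procedure. It assumes that read names are identical in both files and that the FASTQ uses four lines per record. All function names are illustrative.

```python
def fasta_ids(fasta_lines):
    """Collect the read names (first word after '>') from FASTA lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def filter_fastq(fastq_lines, keep_ids):
    """Yield (header, seq, plus, qual) for records whose name is in keep_ids.

    Assumes four lines per record (an assumption about the input file).
    """
    it = iter(fastq_lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        rest = [next(it, None) for _ in range(3)]
        if not header.startswith("@") or None in rest:
            raise ValueError("malformed or truncated FASTQ record near: " + header)
        seq, plus, qual = (s.rstrip("\r\n") for s in rest)
        fields = header[1:].split()
        if fields and fields[0] in keep_ids:
            yield header, seq, plus, qual


# Tiny in-memory example with made-up read names.
fasta = [">readA/0_100\n", "ACGT\n", ">readC/0_50\n", "GGCC\n"]
fastq = [
    "@readA/0_100\n", "ACGT\n", "+\n", "IIII\n",
    "@control/0_80\n", "TTTT\n", "+\n", "IIII\n",
    "@readC/0_50\n", "GGCC\n", "+\n", "IIII\n",
]
kept = list(filter_fastq(fastq, fasta_ids(fasta)))
print([record[0] for record in kept])
```

With real data, the same two functions could be fed open file handles instead of lists, and each kept record written back out as four lines. Both are consumed as streams, so only the set of read names is held in memory.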

**cascoamarillo** · 04-20-2015, 12:55 PM

Originally posted by rhall

Who provided the filtered_subreads.fasta/q

The sequencing core facility where our sample was sent.

Originally posted by rhall

Do I understand correctly that you have subreads in fasta format without the control? could you use this as a reference to extract the fastq reads from the filtered_subreads.fastq file.

That's a good point to start with.

Originally posted by rhall

What are the files being used as input to? If it's assembly, simply leave them in, the control will assemble out.

Will it? I haven't done anything yet; the idea is to wait for the Illumina reads and make a hybrid assembly. In the meantime, I'll try to play with the subreads and make a draft assembly... SMRT pipes/protocols are killing me!

Thanks for your reply!

**rhall** · 04-20-2015, 01:10 PM

It is highly unlikely that any of your real sequence is shared with the control, so I wouldn't worry about leaving it in. Just remember to filter it out of the assembled contigs before you submit to NCBI (I can point to at least one submission that didn't). How big is the assembly? An AMI is certainly an option for using SMRT Pipe with small data sets: https://github.com/PacificBioscience...MRT-Portal-AMI
Otherwise try http://wgs-assembler.sourceforge.net...index.php/PBcR or https://github.com/PacificBioscience...lcon_manual.md to assemble the PacBio data.

**cascoamarillo** · 04-20-2015, 01:26 PM

Originally posted by rhall

It is highly unlikely that any of your real sequence is shared with the control, so I wouldn't worry about leaving it in, just remember to filter it out of the assembled contigs before you submit to ncbi (I can point to at least one submission that didn't). How big is the assembly? an AMI is certainly an option for using SMRT Pipe with small data sets https://github.com/PacificBioscience...MRT-Portal-AMI
Otherwise try http://wgs-assembler.sourceforge.net...index.php/PBcR or https://github.com/PacificBioscience...lcon_manual.md to assemble the PacBio data.

The genome size is ca. 250 Mb (eukaryote), so I'm not sure whether SMRT Portal will support it. I have launched AMIs before, but I would rather work on our local computer cluster (Linux only). I'll give PBcR and FALCON a try.
Thanks

**GenoMax** · 04-20-2015, 01:32 PM

If your local cluster uses SGE (and your local admins are willing) then install SMRTportal locally. Certain steps of PacBio analysis are best done via portal (and you are going to get raw data). FYI: SMRTportal no longer supports hybrid assemblies so that part you will have to do outside of the portal as @rhall has already posted.

**cascoamarillo** · 04-20-2015, 03:53 PM

Originally posted by GenoMax

If your local cluster uses SGE (and your local admins are willing) then install SMRTportal locally. Certain steps of PacBio analysis are best done via portal (and you are going to get raw data). FYI: SMRTportal no longer supports hybrid assemblies so that part you will have to do outside of the portal as @rhall has already posted.

I thought it was going to be more complicated; thanks to the SMRT analysis wiki and good tips from my system administrator, I was able to run SMRTportal locally. But... is there a way to import the filtered_subreads fastq file for further analysis?
