
**HenrivdGeest** · 07-05-2012, 04:25 AM

Originally posted by lletourn

One thing that happened to us was that I forgot to change the LEN_BIT in the Celera sources, so any pacbio read longer than 2 kb got thrown out. We had less than 0.1% of our reads left after correction.

Can you be a bit more specific?

- Is this already fixed in the downloadable binaries?
- Do you have a URL for this issue? (I can't find it on the wiki: http://sourceforge.net/apps/mediawik...tle=PacBioToCA)

**lletourn** · 07-05-2012, 04:38 AM

Look at the 'Building from Source' section.
In AS_global.h you need to change
#define AS_READ_MAX_NORMAL_LEN_BITS 11
to something higher.
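To get a feel for which value to pick, here is a minimal Python sketch. It assumes the read length is stored in an unsigned field of that many bits, which is consistent with the 2 kb cutoff reported above but is not checked against the Celera sources. The helper names are made up for illustration.

```python
def max_read_length(len_bits: int) -> int:
    """Largest length an unsigned field of len_bits bits can hold."""
    return (1 << len_bits) - 1


def min_bits_for(read_length: int) -> int:
    """Smallest bit width whose unsigned range covers read_length."""
    return max(1, read_length.bit_length())


# Show the length limit for a few candidate settings.
for bits in (11, 12, 13, 14, 15):
    print(bits, max_read_length(bits))
```

Under that assumption, 11 bits caps reads at 2047 bases, and min_bits_for(longest_read) gives the smallest setting that would keep your longest read.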

There is a precompiled binary called wgs7.0-pacbio but it segfaults here and I don't know what they put as a value.

We just recompiled it.

I actually don't know at all what they changed in that build. I don't think they'll make it the default soon, because a higher value there takes up more RAM. For people using only short reads, that might not be a good thing.

**jpearl01** · 07-05-2012, 07:36 AM

One of the checks I do if I'm running into low coverage in the returned pacbio reads is to do an assembly of just the reads I will eventually error correct with. i.e. for your dataset try to do an assembly using just the illumina reads.

Obviously you will likely have a lot more contigs than if you were assembling with error corrected pacbio long reads, but a quick check is to look at the total size of the assembly. If the assembly with just your illumina reads is ~100MB, then a first pass assessment is that you have "good" coverage. If the assembly is much smaller than that, your coverage may not be particularly even. For instance, some areas might have extremely deep coverage and some no coverage at all. I've run into this before when I tried to do some complex filtering. Obviously this is not a very rigorous check, and other things might explain your results, but it can tell you whether your problem is with the error correction pipeline or with the data you are feeding into it.
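A rough way to script this check in Python is sketched below. The function names, the file name in the usage note, and the 20% tolerance are all assumptions for illustration; only the idea of comparing total assembly size with the expected genome size comes from the post.

```python
def total_assembly_size(fasta_path):
    """Sum the lengths of all sequences in a FASTA file."""
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            # Header lines start with '>'; everything else is sequence.
            if line and not line.startswith(">"):
                total += len(line)
    return total


def size_looks_complete(assembled_bp, expected_bp, tolerance=0.2):
    """True if the assembly reaches at least (1 - tolerance) of the expected size.

    The tolerance is an arbitrary illustrative threshold, not a recommendation.
    """
    return assembled_bp >= (1 - tolerance) * expected_bp
```

For example, size_looks_complete(total_assembly_size("illumina_contigs.fasta"), 100_000_000) would flag an Illumina-only assembly that falls well short of a ~100MB genome.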

**HenrivdGeest** · 07-05-2012, 08:31 AM

I did an Illumina-only assembly, and that resulted in a 100MB contig file. And indeed, some areas have higher coverage than others (some areas down to 1x).

I am now testing with a single pacbio read and some Illumina reads that map to this read (mapping performed with CLCbio read mapping), and I see where it might go wrong:
- I get a lot of small fixed pacbio reads (21 with my previous settings)
- with all the smallReads adjustments I got 17 new pacbio reads.
(frgMinLen = 40
ovlMinLen = 30
merSize = 9)

Most of the output reads are very small and no longer than a 454 read. Is it not possible to get the pacbio input read back, but fixed at the positions where possible?

**shanebrubaker** · 09-14-2012, 09:32 AM

I also have some questions about pacbiotoca. Mine is taking a long time as well. I have tried the spec file above.
Is there any way to estimate how long it should take?
Can anyone give me some pointers?
Also, is it possible to collapse the Illumina reads somehow before correction? Would that help?

Thanks,
Shane

**lletourn** · 09-14-2012, 10:31 AM

I finally read quite a bit on that. The conclusion, as they mention on the RunCA page, is that it depends on your hardware.

The 3 worst steps are 0-overlaptrim-overlap, 1-overlapper and runPartition.

If you're running on SGE, it might be a good idea to intentionally give less RAM to ovlHashBlockLength to generate more jobs and split up the work.

If you have a lot of cores per node and a lot of RAM, you could bring up ovlHashBits, ovlThreads and ovlHashBlockLength.

Although if it's a bacterium, don't set ovlHashBits too high; it's wasted RAM. On RunCA they explain how to guess the best value.

Also a warning: the maximum value for ovlHashBlockLength is 4G. Anything bigger can crash the process or give unexpected results.
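A small Python sketch for sanity-checking a spec file against that limit before launching a run follows. It assumes the spec uses plain "key = value" lines with "#" comments and that "4G" means 4 * 10**9; both are assumptions, and the values in the example are made up for illustration, not tuned recommendations.

```python
# "4G" from the post above, read here as 4 * 10**9 (an assumption; it may mean 2**32).
MAX_OVL_HASH_BLOCK_LENGTH = 4 * 10**9


def parse_spec(text):
    """Parse 'key = value' lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def hash_block_length_ok(settings):
    """True if ovlHashBlockLength is absent or within the stated limit."""
    value = settings.get("ovlHashBlockLength")
    return value is None or int(value) <= MAX_OVL_HASH_BLOCK_LENGTH


# Hypothetical values, for illustration only.
example_spec = """
ovlThreads = 8
ovlHashBits = 24
ovlHashBlockLength = 500000000
"""
print(hash_block_length_ok(parse_spec(example_spec)))
```

The same parse_spec helper can be pointed at the contents of a real pacbio.spec file to catch an oversized value before the overlap jobs are submitted.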

good luck

**shanebrubaker** · 09-17-2012, 09:01 AM

Hi, I am trying to run pacbiotoca. I am running a small test, with 10 pacbio sequences to correct, against a set of Illumina data that is 37GB. It has now been running for nearly 4 days, on a 24-core machine, averaging a load of about 17. It appears to be nearing the end of doing ~1500 overlapInCore jobs. I used a spec file similar to the one above.

I am wondering if this is normal or if it should take this long. Is there anyone who could help me try to speed this up? Thanks!

**jbingham** · 09-18-2012, 06:18 AM

@HenrivdGeest pacBioToCA v7 splits PacBio reads if there's a coverage gap somewhere. A coming release will keep the full read even if there's a portion without coverage.

**tplsmith** · 09-20-2012, 01:39 PM

Some input about error correction

We have been working with Sergey on the PacBioToCA for some time. First, you can definitely have too much coverage of short reads, especially illumina reads where errors are non-random and can confuse the correction leading to more than one version of a corrected read. That is, at some depth you can get enough of the same error to convince the correction routine that there are two different sequences.
Generally, no more than 50-70x coverage works better than higher coverage, so you should down-sample if you have more.
Second, we have had the best luck on microbial genomes using high cutoffs for read length on the PacBio data, usually 6kb or greater (although some strains have worked better with somewhat lower, and some with somewhat higher, cutoffs).
Third, until the current version (not sure it is even released yet) Sergey had not incorporated paired end information into the routine. Each 100 or 150 base read was thus being used directly to try and correct, but mapping those short reads to the 15% error reads was difficult. 454 data works much better, or CCS on PacBio. I understand that the 2x250 paired reads you can now do on the MiSeq kick ass for error correction when using the version that accounts for paired ends, but haven't yet tried it as our MiSeq is just now getting the upgrade.
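The down-sampling arithmetic from the first point can be sketched in Python as below. The function names and the example numbers (a 100 Mb genome with 20 Gb of short reads, down-sampled to 60x) are illustrative assumptions, not figures from anyone's actual run.

```python
def coverage(total_read_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_read_bases / genome_size


def downsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep to reach the target coverage (never above 1)."""
    return min(1.0, target_coverage / current_coverage)


# Illustrative numbers: 20 Gb of short reads over a 100 Mb genome is 200x.
current = coverage(20_000_000_000, 100_000_000)
print(current, downsample_fraction(current, 60))
```

With paired-end data, the same fraction should be applied to read pairs rather than to individual reads so that mates stay together.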

**HenrivdGeest** · 09-21-2012, 01:18 AM

Originally posted by jbingham

@HenrivdGeest pacBioToCA v7 splits PacBio reads if there's a coverage gap somewhere. A coming release will keep the full read even if there's a portion without coverage.

Indeed, I am using the current CVS release, and it has this option as -maxGap.
I set it to 300 to allow stretches of up to 300 bp without any coverage. It did help: the median read length of my fixed pacbio reads went up, but it's still only 800 bp, although the pacbio input is about 2.5 kb.

**HenrivdGeest** · 09-21-2012, 01:20 AM

Originally posted by tplsmith

We have been working with Sergey on the PacBioToCA for some time. First, you can definitely have too much coverage of short reads, especially illumina reads where errors are non-random and can confuse the correction leading to more than one version of a corrected read. That is, at some depth you can get enough of the same error to convince the correction routine that there are two different sequences.
Generally, no more than 50-70x coverage works better than higher coverage, so you should down-sample if you have more.
Second, we have had the best luck on microbial genomes using high cutoffs for read length on the PacBio data, usually 6kb or greater (although some strains have worked better with somewhat lower, and some with somewhat higher, cutoffs).
Third, until the current version (not sure it is even released yet) Sergey had not incorporated paired end information into the routine. Each 100 or 150 base read was thus being used directly to try and correct, but mapping those short reads to the 15% error reads was difficult. 454 data works much better, or CCS on PacBio. I understand that the 2x250 paired reads you can now do on the MiSeq kick ass for error correction when using the version that accounts for paired ends, but haven't yet tried it as our MiSeq is just now getting the upgrade.

Thanks. I am indeed now using only 454 reads. I also have 3 kb 454 paired-end reads, so that might also help with the long pacbio reads.

**Farhat** · 10-11-2012, 12:45 AM

I am wrestling with optimizing the pacbio.spec file too. I have a machine with 512 GB RAM and 64 cores. What might be the optimum values for it? Some of the values I tried caused the pipeline to crash at the 0-overlaptrim-overlap stage with too little memory. Also, is there some way to restart jobs from the failed stage only and not from the beginning?

**sagarutturkar** · 02-02-2013, 07:59 PM

Pacbio spec file for high-memory multi-core

Hi,

Does anybody have specific changes to make to the pacbio.spec file for high-memory multi-core machines? I have a machine with 48 cores and 128GB RAM.

I am using 50X of short illumina data to correct pacbio reads with 30X coverage. I was able to run the pacBioToCA pipeline, but the problem is with the generated pacbio.frg file.

My illumina data is 1.1GB and the pacbio data is 250MB. However, after the correction run the pacbio.frg file was only 750KB and the pacbio.fasta file was 400KB. Something is going wrong and I am not able to figure out what.

Any suggestions?

Thanks
Sagar

**samanta** · 02-05-2013, 12:04 PM

It may make sense to separate various components of PacBioToCA, run them separately on test files and then write your own optimized pipeline. We are looking into the possibility.

http://www.homolog.us/blogs/2013/02/05/pacbiotoca-for-error-correcting-pacbio-reads/

**haig** · 04-24-2013, 11:15 AM

Does pacbioToCA correct raw reads independently?

For the same coverage in high quality reads, will the amount of raw read input affect the correction? Does pacbioToCA correct raw reads independently?

I typically run with a single fastq from 2-4 filtered SMRTcells with 50X-100X of high quality correction reads followed by assembly.
