**lletourn** · 05-24-2012, 04:50 PM

I'm also running into speed issues. My 50x coverage run was estimated to take 4 weeks to complete.

I would be surprised if all the people using it are ready to wait that long :-)

**krobison** · 05-25-2012, 05:11 AM

pacBioToCA is quite a time sink, but I did find the results useful.

I do wonder if it would be more efficient to assemble the short read (or CCS) data alone very conservatively and then feed the contigs plus unused reads into PacBioToCA. An awful lot of the assembly is not altered by the PacBio reads, but those regions must eat a lot of time.

**GenoMax** · 05-25-2012, 05:23 AM

If you have short reads (Illumina reads, for example) for your sample in addition to the PacBio data, then consider the following.

PacBio just released a new version (v.1.3.1) of their SMRTAnalysis suite. A new error correction module (P_ErrorCorrection) is included.

The description of the P_ErrorCorrection module in the manual says: "This module takes as input long reads and short reads in standard formats, aligns the short reads to the long reads, and outputs a corrected version of the long reads."

If your data used v.2 chemistry (and you have short read data), then it may be worthwhile to re-analyze your data using the new version of the SMRTAnalysis package with the new error correction module.

**lletourn** · 05-25-2012, 06:11 AM

We will be upgrading our SMRT Portal from 1.3.0 to 1.3.1 soon (in a few days), but I would really like to have pacbioToCA work in a reasonable amount of time, for 2 reasons:

1- Many people who do not have access to SMRT Portal use it successfully. As a sequencing center, it's good to be able to offer this open solution. (On another note, since others don't complain much about speed, I suspect the problem is on our end, perhaps a setting, but I don't know which one.)

2- PacBio will be adding pacbioToCA in 1.3.3, so we might as well get familiar with it now.

**jpearl01** · 05-25-2012, 10:34 AM

I was eventually able to get an error correction pipeline working with pacbioToCA. Basically (with the help of the PacBio folks), what solved my issue was an updated pacbio.spec file. Once I had exchanged mine for the one Pacific Biosciences had modified, the error correction took *way* less time: on the machine I originally posted about, it was done in under an hour, IIRC in less than half an hour. Much improved. Assembly using Celera actually took longer than the error correction. Again, this is a pretty small genome (2MB), so YMMV.

I have also heard that the error correction pipeline in the updated PacBio software works very well for some people. However, I've heard that it only works for small genomes (i.e. <10MB), whereas the Celera pipeline can handle much larger data. I don't know when/if that is going to change, but it was recommended that we use the Celera pipeline if we were ever going to sequence "big" genomes. Since we do plan to, Celera it was.

If people are still interested I can post the new pacbio.spec that worked well for me.

~josh

**lletourn** · 05-25-2012, 10:42 AM

Yes, yes please do!

**jpearl01** · 05-25-2012, 10:45 AM

Actually here it is:

Code:

stopAfter=overlapper

# original asm settings
utgErrorRate = 0.25
utgErrorLimit = 4.5

cnsErrorRate = 0.25
cgwErrorRate = 0.25
ovlErrorRate = 0.25

merSize=14

merylMemory = 128000
merylThreads = 16

ovlStoreMemory = 8192

# grid info
useGrid = 0 
scriptOnGrid = 0
frgCorrOnGrid = 0
ovlCorrOnGrid = 0

sge = -A assembly
sgeScript = -pe threads 16
sgeConsensus = -pe threads 1
sgeOverlap = -pe threads 2
sgeFragmentCorrection = -pe threads 2
sgeOverlapCorrection = -pe threads 1

#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 2
ovlHashBlockLength = 20000000
ovlRefBlockSize =  50000000

# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000

frgCorrThreads = 2
frgCorrBatchSize = 100000

ovlCorrBatchSize = 100000

# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 4

ovlStoreMemory = 8192

ovlConcurrency = 8

cnsConcurrency = 8

merOverlapperThreads = 3 
merOverlapperSeedConcurrency = 3
merOverlapperExtendConcurrency = 3

frgCorrConcurrency = 2
ovlCorrConcurrency = 4 
cnsConcurrency = 4

A lot of this is Greek to me. I tried going through and wrestling it out of the documentation, but the documentation won. Basically, because I have 16 logical processors on that machine, that's what I used for several of the "thread" options. Other than that... *shrug* I'm sure there are Celera experts here who can parse this.
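Several keys in the posted file are assigned more than once (for example `merylThreads` is set to 16 and later to 4, and `cnsConcurrency` to 8 and later to 4). As a rough aid for reading such a file, here is a minimal Python sketch that lists repeated keys. The parser is an illustration written for this thread, not part of Celera Assembler, and it makes no claim about which of the repeated values the assembler actually uses; the excerpt below copies only a few lines from the file above.

```python
from collections import defaultdict

def parse_spec(text):
    """Collect every value assigned to each key, in file order."""
    values = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()].append(value.strip())
    return dict(values)

def repeated_keys(values):
    """Return only the keys that were assigned more than once."""
    return {k: v for k, v in values.items() if len(v) > 1}

# Excerpt of the spec file posted above (not the full file).
SPEC_EXCERPT = """
merylMemory = 128000
merylThreads = 16
ovlStoreMemory = 8192
# non-Grid settings
merylMemory = 128000
merylThreads = 4
ovlStoreMemory = 8192
ovlConcurrency = 8
cnsConcurrency = 8
cnsConcurrency = 4
"""

if __name__ == "__main__":
    for key, vals in repeated_keys(parse_spec(SPEC_EXCERPT)).items():
        print(key, vals)
```

Running this against a full spec file (read with `open(path).read()`) is a quick way to spot settings that silently conflict before launching a long job.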

**GenoMax** · 05-25-2012, 11:02 AM

Going from 8+ days to less than an hour is pretty spectacular.

Originally posted by jpearl01:

Once I had exchanged mine for the one Pacific Biosciences had modified, the error correction took *way* less time, using the machine I originally posted about it was done in under an hour. IIRC less than half an hour.

~josh

**jpearl01** · 05-25-2012, 11:44 AM

I have to admit, I was pretty skeptical at first when they said the time for the error correction pipeline could be vastly reduced (it had been running for 8 days when I first posted, and I let it run for another week before I finally cancelled it). I assume that it wasn't actually doing anything, or rather that whatever it was doing was not progressing the pipeline (it was running at 100% on all 16 CPUs the entire time, so whatever "nothing" it was doing, it was doing a lot of it). Anyway, a half-hour error correction was pretty much beyond my dreams at that point, so I was rather pleased.

@lletourn Could you let us know if the .spec file worked for you, and how long it took you to error-correct?

**lletourn** · 05-25-2012, 11:50 AM

Of course! I am really looking forward to running this this weekend.

**lletourn** · 05-26-2012, 05:02 AM

I just noticed: why the stopAfter=overlapper?

Did they tell you to run something manually?

**jpearl01** · 05-28-2012, 08:05 AM

I don't believe so, unless the program would normally go directly into the assembler, which I did run manually. Other than that, I just let it do its thing. At the end of this process I think there is a 9_terminator folder that holds the results. But I didn't enter anything else manually into the error correction analysis. Just that spec file.

**HenrivdGeest** · 07-05-2012, 03:44 AM

I also tried to run the pacbioToCA pipeline, but in our case it was initially going to take 200 days to complete (extrapolated, of course!). It turned out that we had a huge E. coli contamination, making the coverage of that genome over 5000x. Once we got rid of the E. coli Illumina reads, it ran in 14 hours. (I also modified the pacbio.spec file, somewhat like the one shown by jpearl01.)
But the results were not promising:
Input (for a 100MB 'genome'):
400MB PacBio data (est. 4x coverage)
100M Illumina reads (est. 100x coverage)
Output:
43MB clean PacBio data (est. <1x coverage)
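The coverage estimates above follow from dividing total bases by target size. A minimal Python sketch of that arithmetic, assuming "MB" here means megabases of sequence rather than file size (the post does not say which):

```python
def coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by target size."""
    return total_bases / genome_size

GENOME = 100e6        # 100MB target, as stated in the post
RAW_PACBIO = 400e6    # 400MB of raw PacBio data
CORRECTED = 43e6      # 43MB of corrected PacBio data

raw_cov = coverage(RAW_PACBIO, GENOME)
corrected_cov = coverage(CORRECTED, GENOME)
retained = CORRECTED / RAW_PACBIO  # fraction of raw bases surviving correction

if __name__ == "__main__":
    print(f"raw coverage:       {raw_cov:.2f}x")
    print(f"corrected coverage: {corrected_cov:.2f}x")
    print(f"bases retained:     {retained:.1%}")
```

Under that assumption, roughly a tenth of the raw PacBio bases survive correction, which is what pulls the estimate from 4x down to below 1x.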

Is there something we can look into?

I also want to try the P_ErrorCorrection module of the SMRT Portal software, but I read in this thread that it might not be capable of handling a 100MB genome.

P.S. Our 100MB genome is not a real genome; it is the total amount of scattered (1000 pieces) sequenced genomic area.

**lletourn** · 07-05-2012, 04:15 AM

I was able to lower the time to about 24 hours:
1- I only use 50x of Illumina or CCS reads.
2- I modified the .spec file a bit. We have large-memory machines, so I changed settings to load as much as possible.

Also, try to launch as many processes as you can and limit the number of threads. BUT each process uses up its own set amount of memory, whereas threads share memory. The problem is that some steps of pacbioToCA are single-threaded, so having more processes goes faster.
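The trade-off described here can be sketched as a small calculation: the number of concurrent processes is capped both by cores (each process takes its thread count) and by memory (each process takes its own allotment). The function and the memory figures below are hypothetical illustrations, not values from any poster's machine; only the 16 cores and the `ovlThreads = 2` / `ovlConcurrency = 8` pairing come from the spec file posted earlier in the thread.

```python
def max_concurrency(total_cores, total_mem_mb, threads_per_proc, mem_per_proc_mb):
    """Largest number of simultaneous processes that fits in both CPU and RAM."""
    by_cpu = total_cores // threads_per_proc   # threads inside a process share its memory
    by_mem = total_mem_mb // mem_per_proc_mb   # each process needs its own allotment
    # Always allow one process, even if the memory estimate says it will be tight.
    return max(1, min(by_cpu, by_mem))

if __name__ == "__main__":
    # 16 cores with 2 threads per process allows 8 processes if memory permits.
    print(max_concurrency(16, 64000, 2, 8000))
    # With half the (hypothetical) memory, RAM becomes the limit.
    print(max_concurrency(16, 32000, 2, 8000))
```

With 16 logical processors and 2 threads per overlap job, the CPU-side limit is 8, which matches the `ovlConcurrency = 8` in the posted file, though the thread does not confirm that this was how the value was chosen.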

One thing that happened to us was that I forgot to change the LEN_BIT in the Celera sources, so any PacBio read longer than 2kb got thrown out. We had less than 0.1% of our reads after correction.

After the change we kept about 70% of our bases.
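The post does not say what value LEN_BIT had or was changed to. Assuming it is the number of bits used to store a read length (an interpretation, not something confirmed in this thread), the ~2kb cutoff is consistent with an 11-bit field. A short Python sketch of that arithmetic:

```python
def max_read_length(len_bits):
    """Longest read length representable in an unsigned field of len_bits bits."""
    return (1 << len_bits) - 1

def bits_needed(read_length):
    """Smallest bit width that can represent the given read length."""
    return read_length.bit_length()

if __name__ == "__main__":
    # ASSUMPTION: 11 is a guess chosen because it matches the ~2kb cutoff.
    print(max_read_length(11))
    # A hypothetical 10 kb read would need a wider field.
    print(bits_needed(10000))
```

Whatever the actual constant is, the point stands that a compile-time length limit in the assembler sources has to be raised before long PacBio reads can pass through.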


pacBioToCa .spec file options
