**Torst** · 02-22-2010, 06:12 PM

Originally posted by lcollado

A few of my lab co-workers and I are interested in working with some GAIIx data from a bacterial genome. And well, we want to start from the bottom up hence why right now we want to evaluate whether to use a base caller different to the SolexaPipeline (I think it was 1.4).

My advice is that you would be better off spending your time on the downstream analysis rather than fiddling at the edges of the base-calling stage. Alternative base callers are unlikely to provide significant improvements over the GAPipeline - perhaps small increases in quality and yield, around the 5% mark. You are analysing bacterial genomes, so the genome size is typically under 10 Mbp, and it appears that going beyond 100x coverage gains little. Chances are you'll have more than enough good reads, and improving them slightly won't affect the downstream results.
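To put rough numbers on the coverage argument, here is a minimal sketch. The genome size, read length and read count below are illustrative assumptions, not figures from this thread; only the 5% gain comes from the estimate above.

```python
def coverage(n_reads, read_length_bp, genome_size_bp):
    """Mean depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp

# Illustrative assumptions (hypothetical values, not taken from the thread):
GENOME_SIZE_BP = 5_000_000   # a mid-sized bacterial genome
READ_LENGTH_BP = 36          # assumed single-end read length
N_READS = 15_000_000         # assumed number of usable reads

baseline = coverage(N_READS, READ_LENGTH_BP, GENOME_SIZE_BP)

# A 5% gain in read yield from an alternative base caller
# (integer arithmetic avoids floating-point rounding of the read count).
improved = coverage(N_READS * 105 // 100, READ_LENGTH_BP, GENOME_SIZE_BP)

print(f"baseline: {baseline:.1f}x, with 5% more reads: {improved:.1f}x")
```

Under these assumptions the baseline is already above 100x, so the extra 5% moves the depth from about 108x to about 113x, which is unlikely to change an assembly or a mapping result.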

**Simon Anders** · 02-24-2010, 07:04 AM

I agree with Torst.

It is also worth pointing out that Illumina made major improvements to the base caller in the most recent release of the SolexaPipeline.

Before that, Bustard (the base caller) was in fact considered a weak point. Somebody looked through its code (can't find the reference at the moment) and found that it used quite naive algorithms, which left a lot of room for improvement. Hence, it was comparatively easy to develop something better. Since the recent major overhaul of Bustard, it might be much harder to compete with it, and I imagine that it now has the lead again over third-party tools.

For us, this update made quite a dramatic difference: shortly before the release of the new pipeline version, we did some yeast RNA-Seq runs on our GAIIx and got (if I remember correctly) c. 13 million reads per lane passing the chastity filter. As we still had the images, we ran the analysis again after installing the new version of the pipeline and now have more than 19 million reads per lane.
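As a quick check on the size of that jump, a small sketch using the two approximate per-lane counts quoted above (both are remembered, rounded figures, so the result is only indicative):

```python
def relative_increase(before, after):
    """Fractional gain of `after` relative to `before`."""
    return (after - before) / before

reads_old = 13_000_000  # c. 13 million reads per lane, old pipeline version
reads_new = 19_000_000  # lower bound: "more than 19 million" after the update

gain = relative_increase(reads_old, reads_new)
print(f"at least roughly {gain:.0%} more reads passing the filter")
```

That is a far larger gain than the few percent typically expected from swapping in a third-party base caller.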

Simon

**drio** · 02-24-2010, 08:45 AM

Originally posted by Simon Anders

I agree with Torst.
Before that, Bustard (the base caller) was in fact considered a weak point. Somebody looked through its code (can't find the reference at the moment) and found that it used quite naive algorithms, which left a lot of room for improvement. Hence, it was comparatively easy to develop something better. Since the recent major overhaul of Bustard, it might be much harder to compete with it, and I imagine that it now has the lead again over third-party tools.
Simon

Any chance you can find that reference? Where did you read it?

**Simon Anders** · 02-24-2010, 09:00 AM

Actually, yes.

It is this one:

Nava Whiteford: The Solexa Pipeline

http://sgenomics.org/mediawiki/upload/8/80/Pipeline.pdf

Note that it covers an old version of the pipeline (the report is dated Dec 2008!). As I said, a lot was changed recently.

Simon

**lcollado** · 02-24-2010, 10:08 AM

Thanks for the replies and advice ^_^

The general, hmm, dilemma (if the word fits) is that our boss wants us to focus on a 2nd project (de novo assembly) rather than on this project, which is about TSS/operons (similar to this recent paper). But well, I feel uncomfortable working on, say, the "3rd lvl" without knowing that the 1st (base calling) and 2nd lvl (mapping) are solid. Knowing that they are solid doesn't mean that we'll re-do the work, but that we'd at least know more about them and understand them better. Also, de novo assembly is much better explored than TSS/operons in bacteria (I feel that way), therefore TSS/operons are more relevant. If we understand more about all the "lvls" we might find something interesting or at least learn more in the end, which is something I feel is open from taking a look at the above paper.

For discussion's sake, our boss argues that even if you could get 50% (yes, just for discussion) more data from the 1st 2 lvls on the TSS/operons, it isn't really worth it as we are working with a bacterium. The way I see it, if we could get that much more data, we should get more biological data (real data) than noise, as the ratio between them would favor this. Say, 40% more data and 10% extra noise, or 30% and 20% in a "bad" case. With more overall data, we could separate "real" data from noise a bit more easily in the following lvls.

I guess that in the end what I need is a good enough guide for the 1st lvl (2nd one is quite popular) that would be enough to feel comfortable with the Pipeline without having to spend the time to get into the kinks of base calling.

From your posts, I infer that you agree with our boss.

Thank you!
Leonardo

**Torst** · 02-24-2010, 03:08 PM

Leonardo

Originally posted by lcollado

we want to start from the bottom up hence why right now we want to evaluate whether to use a base caller different to the SolexaPipeline (I think it was 1.4).

I just re-read your post and realised you were using GAPipeline 1.4. That is quite an old version, and in fact the jump from 1.3 to 1.4 was a BIG improvement in the algorithm. Most people are using 1.5 now, and migrating to 1.6 (which only works with the latest chemistry).

This only reinforces my advice to focus on downstream analysis given the data you have. If you were really serious about your strategy of going to the lowest levels first, you would work with biochemists to improve the whole sequencing-by-synthesis procedure! :-)

**lcollado** · 03-01-2010, 10:32 AM

Thanks for your feedback ^_^

And well, the bottom up for me starts at base calling because I'm not into chemistry. I do get your point though.

**cgb** · 03-01-2010, 11:37 AM

Originally posted by Simon Anders

Actually, yes.

It is this one:

Nava Whiteford: The Solexa Pipeline

http://sgenomics.org/mediawiki/upload/8/80/Pipeline.pdf

Note that it covers an old version of the pipeline (the report is dated Dec 2008!). As I said, a lot was changed recently.

Simon

Almost. In fact it is quite hard to make huge improvements, as Nava showed in SWIFT. The whole platform has been revised in parallel to the pipeline. This includes changes to the optics, affecting pixels per cluster and signal/noise (as well as the number of clusters per image), and tweaks to chemistry reagents, protocols and dyes, improving signal/noise and reducing phasing signals. All of these have a big impact on raw data quality and subsequent base calls. I have wondered what a 1.6 pipeline would make of GA1 data and what a 1.3 pipeline would make of GAIIx data. I doubt the differences can all be ascribed to software.

Would you use the SolexaPipeline or an external base caller?