
**westerman** · 01-11-2012, 08:20 AM

Haven't had a chance to work with ECC reads yet. However, I am wondering whether a close manual inspection of several of the reads reveals anything. In other words, how well is the conversion of color-space to base-space working?

**NestorNotabilis** · 01-11-2012, 08:21 AM

Further observations:

On exploring those reads that map in colour-space but not in ECC base-space, I'm finding that the ECC reads are very poorly translated. Presumably the Exact Call Chemistry correction has failed, leading to a translation that is no better than a crude translation from colour-space to base-space without a reference.

For example take the following colour-space read:
T122031333003330220120103210013100000300000010001031100201003003010321201003

This is represented as the following ECC base-space read:
GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC

Whilst the ECC base-space read is unmapped by Bowtie, the colour-space read is mapped by Bowtie to the following location:

bs ref:__GAGGCATATTTATAAGAACTTGGCTGAAATGTCAAATTTGGCATTCACTATGTGAACTTACTATGCTACATGCTC
cs ref:_T122031333003330220120103212003112100300103130211233111201203123313231131322
Read:___T122031333003330220120103210013100000300000010001031100201003003010321201003
_________**************************_*_**___*****_*___*__*_*_*__***_**__*_*___*__*___

Thus effectively only the first 26 bases match.

Comparing the reference region identified by Bowtie with the ECC read reveals the full extent of the mistranslation (assuming, of course, that Bowtie is correct in its mapping):

ECC:_______GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC
reference:__AGGCATATTTATAAGAACTTGGCTGAAATGTCAAATTTGGCATTCACTATGTGAACTTACTATGCTACATGCTC
____________*************************_______***_______**___________*__________*_*___**

In other words the ECC read is no better than a crude without-reference translation:

ECC:____GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC
crude:__GAGGCATATTTATAAGAACTTGGCTGGGTACCCCCCGGGGGGGTTTTGGCACCCTTGGGCCCGGTTAGTCCAAAT
________****************************_______________***_*__******_______*_**________

It is debatable whether Bowtie should have mapped the colour-space read in the first place: it misleadingly reports an exact match along the full length of the read, with a base-space translation identical to the reference. Nevertheless, it failed to report the ECC base-space mapping (arguably a more accurate result given the quality of the read).

Lifescope also reports a mapping to the same region but accurately indicates only a partial mapping:

537_110_1030 16 chr5 41830653 7 49H26M * 0 0 CAGCCAAGTTCTTATAAATATGCCTC 2JJJJJJJJJJJJJJJJJJJJJJJJJ RG:Z:WGP_S001_1X NH:i:2 CM:i:0 NM:i:0 CQ:Z:LLLLLLLLLLLLLLLLLLLLLLKLL,%,%(,%%%1,%%(%,%%%1%%%%7%%%%((%%%1%%%%1%%%(%%%%%% CS:Z:T122031333003330220120103210013100000300000010001031100201003003010321201003
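The Lifescope record can be checked against the ECC read with a short Python sketch. It splits the CIGAR string, reads the strand from the FLAG (bit 0x10 marks a reverse-strand alignment in the SAM specification) and reverse-complements SEQ back to read orientation. The helper names are hypothetical and not part of Lifescope or any SAM library.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def parse_cigar(cigar):
    """Split a CIGAR string such as '49H26M' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Fields copied from the SAM record above
flag, cigar, seq = 16, "49H26M", "CAGCCAAGTTCTTATAAATATGCCTC"
# ECC base-space read for the same bead, as posted earlier
ecc = "GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC"

ops = parse_cigar(cigar)
aligned = sum(n for n, op in ops if op in "M=X")  # bases placed on the reference
clipped = sum(n for n, op in ops if op in "SH")   # bases clipped from the read
# Flag bit 0x10 set: SEQ is given on the reverse strand, so flip it back
read_orientation = reverse_complement(seq) if flag & 0x10 else seq
print(aligned, clipped, ecc.startswith(read_orientation))
```

For this record the aligned portion is 26 bases with 49 hard-clipped, and the reverse complement of SEQ is the first 26 bases of the ECC read.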

So the overall conclusion is this: base-space reads generated by ECC chemistry include a proportion of reads badly translated from colour-space, and these bad translations lead to the poorer mapping performance observed compared with the equivalent colour-space mapping. Using these ECC-generated base-space reads with third-party mappers is therefore problematic, making poorer use of the available information than would be the case with colour-space mappings (if treated intelligently...). In the limited number of cases examined so far, Lifescope appears to make the best use of the available information.

**NestorNotabilis** · 01-11-2012, 08:21 AM

> Originally posted by westerman
>
> Haven't had a chance to work with ECC reads yet. However, I am wondering whether a close manual inspection of several of the reads reveals anything. In other words, how well is the conversion of color-space to base-space working?

Ah, you have pre-empted my subsequent post!

**westerman** · 01-11-2012, 08:35 AM

> Originally posted by NestorNotabilis
>
> Ah, you have pre-empted my subsequent post!

It does look like the ECC failed, at least in that read. Is it possible to post both the CS read plus the ECC points? In other words, something like:

T12203C13330T033...

It would be interesting to see where the ECC slipped up.

As I said I haven't actually worked with ECC so I am not sure what type of output can be created.

**brentp** · 01-11-2012, 08:47 AM

My understanding was that, even if ECC does a good job, an aligner will by definition do a better job of converting to base-space using a dynamic programming approach from the alignment, since it has more information. So you'll lose information by using ECC base-space rather than colorspace.
Am I missing something?

I did a comparison of mappers (excluding lifescope) here:

https://github.com/brentp/bowfast/tree/master/aligner-compare

I'll add an ECC base-space mapping at some point since it can show specificity, not just the number of reads mapped.

**NestorNotabilis** · 01-11-2012, 09:01 AM

> Originally posted by westerman
>
> It does look like the ECC failed, at least in that read. Is it possible to post both the CS read plus the ECC points? In other words, something like:
>
> T12203C13330T033...
>
> It would be interesting to see where the ECC slipped up.
>
> As I said I haven't actually worked with ECC so I am not sure what type of output can be created.

Have yet to come across that sort of output - AB's script, convertFromXSQ.sh, outputs only 3 files per library: csfasta, QV.qual for colourspace and fastq for basespace. I'm unclear whether the information you refer to is extractable from the XSQ files - does anyone know how to do this?

Yes, it is clear that, at least for this read, the ECC failed (and I am extrapolating from this that ECC failure was responsible for the poorer mapping generally). I'm wondering whether this level of failure is to be expected for an ECC run or is specific to this particular run and, if the latter, how one might be alerted to a likely ECC failure (I am not aware of any metrics available to make this assessment). Does anyone have any thoughts on this?

**NestorNotabilis** · 01-11-2012, 09:12 AM

> Originally posted by brentp
>
> My understanding was that, even if ECC does a good job, by definition, an aligner will do a better job converting to basespace using dynamic programming approach from the alignment since it has more information. So, you'll lose information by using ECC base-space rather than colorspace.
> Am I missing something?

That sounds like a reasonable assessment and, if true, it does raise the question of the value of the base-space fastq. ECC clearly improves the Lifescope mapping, as evidenced by the 91.1% mapping success (compared with ~80% expected for a colour-space mapping of this sort), but how exploitable is this ECC information by current third-party mappers?

**colindaven** · 01-12-2012, 12:30 AM

My understanding is that ECC data should be particularly useful if there is no reference available, i.e. for de novo assembly. This is consistent with brentp's observation that a colour-space reference should help the aligner to correctly align the colours much more than the extra ECC spacing information would.

Perhaps, as seen with the improved Lifescope mapping rates, there is room for improvement for the other third party aligners such as Bowtie(2) and NovoalignCS.

Brentp - thanks for the analysis and images. How do you explain the difference in results between SOLiD 3 and 5500, e.g. Novoalign as the top aligner for SOLiD 3 but worst for 5500?

Has anyone tried ECC de novo? As far as I know, colour-space data are not pretty in de novo assembly.

**pmiguel** · 01-12-2012, 04:32 AM

> Originally posted by NestorNotabilis
>
> Further observations:
>
> on exploring those reads that map in colour-space but not in ECC base-space, I'm finding that the ECC reads are very poorly translated - presumably the Exact Call Chemistry correction has failed leading to a bad translation that is no better than a crude translation from colour-space to base-space without a reference.
>
> For example take the following colour-space read:
> T122031333003330220120103210013100000300000010001031100201003003010321201003

Could you post the .QUAL record for this .cs read? One possibility is that the bead was sparsely templated, leading to reasonably high quality in the early ligations, but noise after a few ligation cycles.

--
Phillip

**NestorNotabilis** · 01-12-2012, 05:04 AM

> Originally posted by pmiguel
>
> Could you post the .QUAL record for this .cs read? One possibility is that the bead was sparsely templated, leading to reasonably high quality in the early ligations, but noise after a few ligation cycles.
>
> --
> Phillip

Sure. The info is as follows:

>537_110_1030_F3
T122031333003330220120103210013100000300000010001031100201003003010321201003

>537_110_1030_F3
43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 42 43 43 11 4 11 4 7 11 4 4 4 16 11 4 4 7 4 11 4 4 4 16 4 4 4 4 22 4 4 4 4 7 7 4 4 4 16 4 4 4 4 16 4 4 4 7 4 4 4 4 4 4

I've had a quick look at a few other examples (not posted) and yes, similarly poor QVs are reported for the later ligations. Thanks for the useful insight!
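To screen more reads for the same pattern, a small Python sketch along these lines could locate where the QVs collapse. The threshold of 20 is an arbitrary assumption for illustration, and the function name is hypothetical; the QV line is the one posted above.

```python
def first_low_quality(qvs, threshold=20):
    """Return the 0-based index of the first QV below threshold, or None."""
    for i, q in enumerate(qvs):
        if q < threshold:
            return i
    return None


def mean(values):
    """Arithmetic mean, returning None for an empty list."""
    return sum(values) / len(values) if values else None


# QV record for 537_110_1030_F3, copied from the .qual entry above
qual_line = (
    "43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 42 43 43 "
    "11 4 11 4 7 11 4 4 4 16 11 4 4 7 4 11 4 4 4 16 4 4 4 4 22 4 4 4 4 7 7 4 4 4 "
    "16 4 4 4 4 16 4 4 4 7 4 4 4 4 4 4"
)
qvs = [int(x) for x in qual_line.split()]

cut = first_low_quality(qvs)
if cut is None:
    print("no QV below threshold")
else:
    # Mean QV before and after the first low-quality colour call
    print(cut, mean(qvs[:cut]), mean(qvs[cut:]))
```

Applied across a whole .qual file, the distribution of the cut position would show whether the drop clusters at one ligation cycle or is spread out.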

**pmiguel** · 01-12-2012, 05:45 AM

Nice to be right. I guess. Still, even though I gave an explanation for the result, I am mystified by it. The SOLiD collects data from every 5th CS position via successive ligations starting from a primer. After collecting a set of these (15 for 75-base reads), all the nascent strands are melted off the template and a new primer, with a different offset, is annealed. (ECC adds an additional, more sparse, set of these ligations to allow some additional error detection/correction.)

So my question is why the stark difference after base 25? A 1000x drop in estimated sequence accuracy. It seems too much to ask to believe that 5 different sets of ligations suddenly started producing bad data right at ligation 6. Seems more likely that this occurred a single time: in the ECC ligation. If the ECC goes bad at ligation 6 for many of the beads, what happens to the sequence? Or, can low quality ECC bases impact high quality non-ECC bases?
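The size of the drop can be put into numbers with a quick Python sketch, assuming the QVs in the .qual file are Phred-scaled, i.e. error probability = 10^(-QV/10). That scaling is an assumption here, and the function names are illustrative.

```python
def phred_to_error(q):
    """Estimated error probability for a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def fold_change(q_high, q_low):
    """How many times more likely an error is at q_low than at q_high."""
    return phred_to_error(q_low) / phred_to_error(q_high)


# QV 43 is typical of the early colour calls; 11 and 4 of the later ones
for q in (43, 11, 4):
    print(q, phred_to_error(q))

print(fold_change(43, 11), fold_change(43, 4))
```

Under that assumption, going from QV 43 to QV 11 is roughly a 1,600-fold increase in estimated error probability, and QV 43 to QV 4 is closer to 8,000-fold.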

I guess I am at the same point as Rick -- wanting to see the ECC read separate from the non-ECC ligations. But I have not looked at the XSQ format, so I don't even know if this is possible.

--
Phillip

**brentp** · 01-12-2012, 05:46 AM

> Originally posted by colindaven
>
> Brentp - thanks for the analysis and images. How do you explain the difference in results between Solid 3 and 5500, eg Novoalign as the top aligner for Solid 3 but worst for 5500 ?

Well, the SOLiD 3 has shorter reads and higher error rates than the 5500. So the different aligners are probably tuned for one or the other. You see Novoalign as the top aligner, but I think BFAST does well in both cases as long as you have a mapping quality cutoff.

Good point about de novo. Though it'd be hard to make a case for de novo with 2*50 with SOLiD when you can do 2*100+ on illumina.

**pmiguel** · 01-12-2012, 06:49 AM

> Originally posted by brentp
>
> Good point about de novo. Though it'd be hard to make a case for de novo with 2*50 with SOLiD when you can do 2*100+ on illumina.

2*60 for the 5500. But those require Mate End libraries -- substantially more difficult to make than Paired End libraries.

So, I would agree. You would probably want to use an Illumina for de novo. SOLiD, if it still has a strong use case, would be for expression analysis. Or -- well, reasonably short (~3 kb or less) SOLiD mate end libraries are pretty robust. Might want to use SOLiD reads for ME components of a de novo assembly.

--
Phillip

**JueFish** · 03-21-2012, 08:24 AM

Great discussion. We have noticed a similar issue with our first run of SOLiD 5500 data: poor mapping of ECC data vs. Lifescope, drop-off of quality metrics after the first 20-25 bp of reads, etc. We'll have to look further and see if we are observing the same mapping phenomena. I do have one question, though: is everyone here working with model systems (e.g. human, mouse)? We have run a few data sets through so far (human transcriptome, non-model fish transcriptome, and some RIP-seq data) and, while observing the same issue with quality, have had trouble implementing Lifescope for non-model RNA-seq analysis, as we have a reference transcriptome but no reference genome. Has anyone done non-model work using only a reference transcriptome? If so, I'd love to ask you a few questions about how you got Lifescope to work.

Cheers,
Nate


SOLiD 5500xl EEC basespace vs colourspace mapping
