SEQanswers

Old 01-11-2012, 02:41 AM   #1
NestorNotabilis
Member
 
Location: Cardiff

Join Date: Dec 2011
Posts: 19
Default SOLiD 5500xl ECC basespace vs colourspace mapping

Hi, have finally got my hands on some ECC SOLiD data. Encouragingly it mapped really well with Lifescope (91.1% reads mapped, 75 bp reads mapped to hg19). But with other mappers it is a very different story...

Using AB's convertFromXSQ.sh script it's possible to extract both the ECC-generated basespace reads and the underlying colourspace reads. However, when these are mapped separately using third-party mappers, I consistently find that the basespace reads map much MORE POORLY than their respective colourspace sequences.

For example, comparing Lifescope (v2.5) with BWA (v. 5.9) and Bowtie1 (v0.12.7), I get the following results:

Outcome of mapping 142,102,059 ECC generated 75 bp reads:
--------------------------------------------------------------
Lifescope --- 129,530,059 reads mapped (91.1%)
BWA CS --- 77,511,121 reads mapped (54.5%)
BWA BS --- 55,231,727 reads mapped (38.9%)
Bowtie CS --- 105,120,949 reads mapped (74.0%)
Bowtie BS --- 65,657,080 reads mapped (46.2%)
--------------------------------------------------------------
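The percentages in the table can be recomputed from the raw counts. The snippet below is a minimal sketch of that arithmetic; the function and variable names are illustrative only. Note that the Lifescope count works out at roughly 91.15%, so it may print as 91.2% rather than the 91.1% quoted, depending on whether the figure is rounded or truncated.

```python
# Read counts copied from the table above.
TOTAL_READS = 142_102_059
MAPPED = {
    "Lifescope": 129_530_059,
    "BWA CS": 77_511_121,
    "BWA BS": 55_231_727,
    "Bowtie CS": 105_120_949,
    "Bowtie BS": 65_657_080,
}


def mapping_rate(mapped, total=TOTAL_READS):
    """Return the percentage of reads that mapped."""
    return 100.0 * mapped / total


def cs_minus_bs(mapper):
    """Percentage-point gap between colourspace and basespace mapping."""
    cs = mapping_rate(MAPPED[f"{mapper} CS"])
    bs = mapping_rate(MAPPED[f"{mapper} BS"])
    return cs - bs


if __name__ == "__main__":
    for name, mapped in MAPPED.items():
        print(f"{name:<10} {mapping_rate(mapped):5.1f}%")
    for mapper in ("BWA", "Bowtie"):
        gap = cs_minus_bs(mapper)
        print(f"{mapper}: colourspace ahead by {gap:.1f} percentage points")
```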

i.e., both BWA and Bowtie map the ECC basespace version of the reads considerably more poorly than the equivalent colourspace version.

These differences can also be seen in the figure below, which summarises coverage distributions for the various mappings. Note that for BWA the mean coverage varies whilst the median stays largely the same, suggesting that, at least for BWA mapping, there are fewer extremes in coverage over the genome. For Bowtie the distributions are quite different: colourspace is clearly better than its ECC basespace equivalent, i.e., more reads are mapping throughout the chromosome. In fact the Bowtie colourspace distribution, at least for chr 1, is not markedly different from that of Lifescope.



Has anyone else been handling ECC-generated SOLiD data and seen the same phenomenon? Does anyone have any suggestions as to why this is happening? Naively, I had assumed the ECC basespace output would, if anything, be more accurate than its corresponding colourspace reads. Certainly that improvement IS seen with Lifescope (v2.5), but not with alternative mappers (or at least not with the two considered so far).
Attached Images
File Type: png ECC_query.001.png (61.4 KB, 60 views)
Old 01-11-2012, 08:20 AM   #2
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Haven't had a chance to work with ECC reads yet. However, I am wondering whether a close manual inspection of several of the reads reveals anything? In other words, how well is the conversion of color-space to base-space working?
Old 01-11-2012, 08:21 AM   #3
NestorNotabilis
Member
 
Location: Cardiff

Join Date: Dec 2011
Posts: 19
Default

Further observations:

On exploring those reads that map in colour-space but not in ECC base-space, I'm finding that the ECC reads are very poorly translated - presumably the Exact Call Chemistry correction has failed, leading to a bad translation that is no better than a crude translation from colour-space to base-space without a reference.

For example take the following colour-space read:
T122031333003330220120103210013100000300000010001031100201003003010321201003

This is represented as the following ECC base-space read:
GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC

Whilst the ECC base-space read is unmapped by Bowtie, the colour-space read is mapped by Bowtie to the following location:

bs ref:__GAGGCATATTTATAAGAACTTGGCTGAAATGTCAAATTTGGCATTCACTATGTGAACTTACTATGCTACATGCTC
cs ref:_T122031333003330220120103212003112100300103130211233111201203123313231131322
Read:___T122031333003330220120103210013100000300000010001031100201003003010321201003
_________**************************_*_**___*****_*___*__*_*_*__***_**__*_*___*__*___



Thus effectively only the first 26 bases match.

Comparing the reference region identified by Bowtie with the ECC read reveals the full extent of the mistranslation (assuming, of course, that Bowtie is correct in its mapping...)

ECC:_______GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC
reference:__AGGCATATTTATAAGAACTTGGCTGAAATGTCAAATTTGGCATTCACTATGTGAACTTACTATGCTACATGCTC
____________*************************_______***_______**___________*__________*_*___**


In other words the ECC read is no better than a crude without-reference translation:

ECC:____GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC
crude:__GAGGCATATTTATAAGAACTTGGCTGGGTACCCCCCGGGGGGGTTTTGGCACCCTTGGGCCCGGTTAGTCCAAAT
________****************************_______________***_*__******_______*_**________
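For anyone wanting to reproduce the "crude" translation, the following is a minimal sketch of reference-free colour-space decoding. It assumes the standard two-base encoding, where with A=0, C=1, G=2, T=3 each colour is the XOR of two adjacent base codes, and it assumes the read contains no missing calls ("."). The function names are illustrative, not part of any AB tool.

```python
# Reference-free colour-space decoding.
# With A=0, C=1, G=2, T=3, a colour is the XOR of two adjacent base codes,
# so the next base is (previous base XOR colour).
BASES = "ACGT"


def decode_colourspace(cs_read):
    """Naively translate a csfasta read (primer base + colours) to bases.

    The leading primer base only seeds the decoding and is not part of
    the returned sequence. Assumes every colour is one of 0, 1, 2, 3.
    """
    prev = BASES.index(cs_read[0])
    out = []
    for colour in cs_read[1:]:
        prev ^= int(colour)
        out.append(BASES[prev])
    return "".join(out)


def match_line(a, b):
    """Mark agreeing positions with '*' and disagreeing ones with '_'."""
    return "".join("*" if x == y else "_" for x, y in zip(a, b))


if __name__ == "__main__":
    read = ("T1220313330033302201201032100131000003000000"
            "10001031100201003003010321201003")
    ecc = ("GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAA"
           "CTTTCGATACCCTTCCCGGACGCTAAGAGCTTC")
    crude = decode_colourspace(read)
    print(ecc)
    print(crude)
    print(match_line(ecc, crude))
```

Because every base depends on all the colours before it, a single wrong colour call changes every base downstream, which is why the crude translation and the reference diverge so badly after the first 26 positions.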


It is debatable whether Bowtie should have mapped the colour-space read in the first place (it misleadingly reports an exact match along the full length of the read, with a base-space translation identical to the reference). Nevertheless, it failed to report the ECC base-space mapping (arguably a more accurate result given the quality of the read).

Lifescope also reports a mapping to the same region but accurately indicates only a partial mapping:

537_110_1030 16 chr5 41830653 7 49H26M * 0 0 CAGCCAAGTTCTTATAAATATGCCTC 2JJJJJJJJJJJJJJJJJJJJJJJJJ RG:Z:WGP_S001_1X NH:i:2 CM:i:0 NM:i:0 CQ:Z:LLLLLLLLLLLLLLLLLLLLLLKLL,%,%(,%%%1,%%(%,%%%1%%%%7%%%%((%%%1%%%%1%%%(%%%%%% CS:Z:T122031333003330220120103210013100000300000010001031100201003003010321201003
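The partial mapping can be read straight off the SAM fields: the CIGAR `49H26M` says 26 bases aligned and 49 were hard-clipped, and flag 16 means the sequence is stored reverse-complemented. The following is a small sketch using only the Python standard library; the helper names are made up for illustration and this is not how Lifescope itself processes records.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def summarise(flag, cigar, seq):
    """Return (aligned bases, clipped bases, sequence in read orientation)."""
    ops = parse_cigar(cigar)
    aligned = sum(n for n, op in ops if op in "M=X")
    clipped = sum(n for n, op in ops if op in "SH")
    # Flag bit 0x10 (16): SEQ is reverse-complemented relative to the read.
    read_seq = reverse_complement(seq) if flag & 16 else seq
    return aligned, clipped, read_seq


if __name__ == "__main__":
    # FLAG, CIGAR and SEQ taken from the Lifescope record above.
    print(summarise(16, "49H26M", "CAGCCAAGTTCTTATAAATATGCCTC"))
```

Put back into read orientation, the 26 aligned bases are the start of the read, so the 49 clipped bases correspond to its low-quality 3' end.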

So, the overall conclusion is this: base-space reads generated by ECC chemistry include a proportion of reads badly translated from colour-space, and these bad translations lead to the poorer mapping performance observed compared with the equivalent colour-space mapping. Using these ECC-generated basespace reads with third-party mappers is therefore problematic, making poorer use of the available information than would be the case with colour-space mappings (if treated intelligently...). In the limited number of cases examined so far, Lifescope appears to make the best use of the available information.
Old 01-11-2012, 08:21 AM   #4
NestorNotabilis
Member
 
Location: Cardiff

Join Date: Dec 2011
Posts: 19
Default

Quote:
Originally Posted by westerman View Post
Haven't had a chance to work with ECC reads yet. However, I am wondering whether a close manual inspection of several of the reads reveals anything? In other words, how well is the conversion of color-space to base-space working?
Ah, you have pre-empted my subsequent post!
Old 01-11-2012, 08:35 AM   #5
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by NestorNotabilis View Post
Ah, you have pre-empted my subsequent post!
It does look like the ECC failed, at least in that read. Is it possible to post both the CS read plus the ECC points? In other words, something like:


T12203C13330T033...

It would be interesting to see where the ECC slipped up.

As I said I haven't actually worked with ECC so I am not sure what type of output can be created.
Old 01-11-2012, 08:47 AM   #6
brentp
Member
 
Location: salt lake city, UT

Join Date: Apr 2010
Posts: 72
Default

My understanding was that, even if ECC does a good job, an aligner will by definition do a better job of converting to basespace using a dynamic programming approach from the alignment, since it has more information. So, you'll lose information by using ECC base-space rather than colorspace.
Am I missing something?
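One way to see the information loss is to compare a single miscalled colour in both spaces. The sketch below assumes the standard two-base XOR encoding (A=0, C=1, G=2, T=3) and introduces a hypothetical error at colour 10. An aligner working in colour-space sees one mismatch against the encoded reference, whereas a naive base-space translation differs at every base from that point on. The function names are illustrative only.

```python
BASES = "ACGT"


def encode(primer, seq):
    """Encode a base sequence as colours, seeded by a primer base."""
    prev = BASES.index(primer)
    colours = []
    for base in seq:
        cur = BASES.index(base)
        colours.append(str(prev ^ cur))
        prev = cur
    return primer + "".join(colours)


def decode(cs_read):
    """Naive reference-free translation from colours back to bases."""
    prev = BASES.index(cs_read[0])
    out = []
    for colour in cs_read[1:]:
        prev ^= int(colour)
        out.append(BASES[prev])
    return "".join(out)


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # First 20 bases of the reference region discussed in this thread.
    reference = "GAGGCATATTTATAAGAACT"
    cs_ref = encode("T", reference)
    # Hypothetical miscall: change the 10th colour to a different value.
    i = 10
    wrong = str((int(cs_ref[i]) + 1) % 4)
    cs_read = cs_ref[:i] + wrong + cs_ref[i + 1:]
    print("colour-space mismatches:", mismatches(cs_ref, cs_read))
    print("base-space mismatches:  ", mismatches(reference, decode(cs_read)))
```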

I did a comparison of mappers (excluding lifescope) here:
https://github.com/brentp/bowfast/tr...ligner-compare

I'll add an ECC base-space mapping at some point since it can show specificity, not just the number of reads mapped.
Old 01-11-2012, 09:01 AM   #7
NestorNotabilis
Member
 
Location: Cardiff

Join Date: Dec 2011
Posts: 19
Default

Quote:
Originally Posted by westerman View Post
It does look like the ECC failed, at least in that read. Is it possible to post both the CS read plus the ECC points? In other words, something like:


T12203C13330T033...

It would be interesting to see where the ECC slipped up.

As I said I haven't actually worked with ECC so I am not sure what type of output can be created.
Have yet to come across that sort of output - AB's script, convertFromXSQ.sh, outputs only 3 files per library: csfasta, QV.qual for colourspace and fastq for basespace. I'm unclear whether the information you refer to is extractable from the XSQ files - does anyone know how to do this?

Yes, it is clear that, at least for this read, the ECC failed (and I am extrapolating from this that ECC failure was responsible for the poorer mapping generally). I'm wondering whether this level of failure is to be expected for an ECC run or is specific to this particular run and, if the latter, how one might be alerted to a likely ECC failure (I am not aware of any metrics available to make this assessment). Does anyone have any thoughts on this?
Old 01-11-2012, 09:12 AM   #8
NestorNotabilis
Member
 
Location: Cardiff

Join Date: Dec 2011
Posts: 19
Default

Quote:
Originally Posted by brentp View Post
My understanding was that, even if ECC does a good job, an aligner will by definition do a better job of converting to basespace using a dynamic programming approach from the alignment, since it has more information. So, you'll lose information by using ECC base-space rather than colorspace.
Am I missing something?
That sounds like a reasonable assessment - and, if true, it does raise the question of the value of the basespace fastq. ECC clearly improves the Lifescope mapping, as evidenced by the 91.1% mapping success (compared with ~80% expected for a colourspace mapping of this sort) - but how exploitable is this ECC information by current third-party mappers?
Old 01-12-2012, 12:30 AM   #9
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 415
Default

My understanding is that ECC data should be particularly useful if there is no reference available, i.e. for de novo assembly. This is consistent with brentp's observation that a colour space reference should help the aligner to correctly align the colours much more than the extra ECC spacing information would.

Perhaps, as seen with the improved Lifescope mapping rates, there is room for improvement for the other third party aligners such as Bowtie(2) and NovoalignCS.

Brentp - thanks for the analysis and images. How do you explain the difference in results between Solid 3 and 5500, e.g. Novoalign as the top aligner for Solid 3 but worst for 5500?

Has anyone tried ECC de novo? As far as I know colour space data are not pretty in de novo assembly.
Old 01-12-2012, 04:32 AM   #10
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by NestorNotabilis View Post
Further observations:

On exploring those reads that map in colour-space but not in ECC base-space, I'm finding that the ECC reads are very poorly translated - presumably the Exact Call Chemistry correction has failed, leading to a bad translation that is no better than a crude translation from colour-space to base-space without a reference.

For example take the following colour-space read:
T122031333003330220120103210013100000300000010001031100201003003010321201003
Could you post the .QUAL record for this .cs read? One possibility is that the bead was sparsely templated, leading to reasonably high quality in the early ligations, but noise after a few ligation cycles.

--
Phillip
Old 01-12-2012, 05:04 AM   #11
NestorNotabilis
Member
 
Location: Cardiff

Join Date: Dec 2011
Posts: 19
Default

Quote:
Originally Posted by pmiguel View Post
Could you post the .QUAL record for this .cs read? One possibility is that the bead was sparsely templated, leading to reasonably high quality in the early ligations, but noise after a few ligation cycles.

--
Phillip
Sure. The info is as follows:

>537_110_1030_F3
T122031333003330220120103210013100000300000010001031100201003003010321201003


>537_110_1030_F3
43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 42 43 43 11 4 11 4 7 11 4 4 4 16 11 4 4 7 4 11 4 4 4 16 4 4 4 4 22 4 4 4 4 7 7 4 4 4 16 4 4 4 4 16 4 4 4 7 4 4 4 4 4 4


I've had a quick look at a few other examples (not posted) and yes, similarly poor QVs are reported for the later ligations. Thanks for the useful insight!
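To put numbers on the drop-off, QVs can be converted to estimated error probabilities with the usual Phred relation p = 10^(-Q/10): Q43 is about 5 in 100,000, whereas Q4 is about 0.4. The following is a rough sketch that locates the first low-quality call in the record above; the threshold of 20 is an arbitrary choice for illustration, and the helper names are not from any SOLiD tool.

```python
def phred_to_error(q):
    """Convert a Phred quality value to an estimated error probability."""
    return 10 ** (-q / 10)


def first_low_quality(qvs, threshold=20):
    """Return the 0-based index of the first QV below threshold, or None."""
    for i, q in enumerate(qvs):
        if q < threshold:
            return i
    return None


def decode_phred33(qual):
    """Decode a Phred+33 ASCII quality string (as in the SAM CQ tag)."""
    return [ord(c) - 33 for c in qual]


# QVs for read 537_110_1030_F3, copied from the .qual record above.
QVS = [int(x) for x in (
    "43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 42 "
    "43 43 11 4 11 4 7 11 4 4 4 16 11 4 4 7 4 11 4 4 4 16 4 4 4 4 22 4 4 4 "
    "4 7 7 4 4 4 16 4 4 4 4 16 4 4 4 7 4 4 4 4 4 4"
).split()]

if __name__ == "__main__":
    cut = first_low_quality(QVS)
    print("first call below Q20 at 0-based index:", cut)
    print("error probability at Q43:", phred_to_error(43))
    print("error probability at Q4: ", phred_to_error(4))
```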
Old 01-12-2012, 05:45 AM   #12
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Nice to be right. I guess. Still, even though I gave an explanation for the result, I am mystified by it. The SOLiD collects data from every 5th CS position via successive ligations starting from a primer. After collecting a set of these (15 for 75 base reads), all the nascent strands are melted off the template and a new primer, with a different offset, is annealed. (ECC adds an additional, more sparse, set of these ligations to allow some additional error detection/correction.)

So my question is why the stark difference after base 25? A 1000x drop in estimated sequence accuracy. It seems too much to ask to believe that 5 different sets of ligations suddenly started producing bad data right at ligation 6. Seems more likely that this occurred a single time: in the ECC ligation. If the ECC goes bad at ligation 6 for many of the beads, what happens to the sequence? Or, can low quality ECC bases impact high quality non-ECC bases?
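The "ligation 6" reasoning can be made concrete with a deliberately simplified model. It assumes five primer rounds, each reading every 5th colour position, so that the n-th ligation of each round together covers positions 5(n-1)+1 to 5n. Which primer reads which offset, and the extra ECC round, are ignored here; this is a sketch of the bookkeeping only, not a description of the instrument software.

```python
READ_LENGTH = 75
PRIMER_ROUNDS = 5  # one round per primer offset (simplifying assumption)


def ligation_cycle(position):
    """Ligation cycle (1-based) that reads a 1-based colour position."""
    return (position - 1) // PRIMER_ROUNDS + 1


def positions_in_cycle(cycle):
    """All 1-based colour positions read during a given ligation cycle."""
    start = (cycle - 1) * PRIMER_ROUNDS + 1
    return list(range(start, start + PRIMER_ROUNDS))


if __name__ == "__main__":
    print("cycles per primer round:", ligation_cycle(READ_LENGTH))
    print("position 25 is read in cycle", ligation_cycle(25))
    print("position 26 is read in cycle", ligation_cycle(26))
    print("cycle 6 covers positions", positions_in_cycle(6))
```

Under this model a quality drop starting at position 26 would need all five primer rounds to degrade at their sixth ligation, which is the coincidence being questioned above.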

I guess I am at the same point as Rick -- wanting to see the ECC read separate from the non-ECC ligations. But I have not looked at the .xsq format, so I don't even know if this is possible.

--
Phillip
Old 01-12-2012, 05:46 AM   #13
brentp
Member
 
Location: salt lake city, UT

Join Date: Apr 2010
Posts: 72
Default

Quote:
Originally Posted by colindaven View Post
Brentp - thanks for the analysis and images. How do you explain the difference in results between Solid 3 and 5500, e.g. Novoalign as the top aligner for Solid 3 but worst for 5500?
.
Well, the SOLiD 3 has shorter reads and higher error rates than SOLiD 5. So, the different aligners are probably tuned for 1 or the other. You see novoalign as the top aligner, but I think BFAST does well in both cases as long as you have a mapping quality cutoff.

Good point about de novo. Though it'd be hard to make a case for de novo with 2*50 with SOLiD when you can do 2*100+ on illumina.
Old 01-12-2012, 06:49 AM   #14
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by brentp View Post
Good point about de novo. Though it'd be hard to make a case for de novo with 2*50 with SOLiD when you can do 2*100+ on illumina.
2*60 for the 5500. But those require Mate End libraries -- substantially more difficult to make than Paired End libraries.

So, I would agree. You would probably want to use an Illumina for de novo. SOLiD, if it still has a strong use case, would be for expression analysis. Or -- well, reasonably short (~3 kb or less) SOLiD mate end libraries are pretty robust. Might want to use SOLiD reads for ME components of a de novo assembly.

--
Phillip
Old 03-21-2012, 09:24 AM   #15
JueFish
Member
 
Location: Connecticut

Join Date: May 2010
Posts: 42
Default

Great discussion. We have noticed a similar issue with our first run of SOLiD 5500 data: poor mapping of ECC data vs. Lifescope, drop-off of quality metrics after the first 20-25 bp of reads, etc. We'll have to look further and see if we are observing the same mapping phenomena. I do have one question, though: is everyone here working with model systems (e.g. human, mouse, etc.)? We have run a few data sets through so far (human transcriptome, non-model fish transcriptome, and some RIP-seq data) and, while observing the same issue with quality, have had trouble implementing Lifescope for non-model RNA-seq analysis, as we have a reference transcriptome but no reference genome. Has anyone done non-model work using only a reference transcriptome? If so, I'd love to ask you a few questions about how you got Lifescope to work.

Cheers,
Nate
Old 11-14-2012, 04:26 AM   #16
bigfoot
Junior Member
 
Location: france

Join Date: Feb 2012
Posts: 1
Default

Dear Nate,

I am actually considering doing some RNAseq analysis on a non-model organism (with a reference transcriptome, but no genome) using SOLiD data. Did you finally succeed in mapping your SOLiD data to your reference transcriptome?

Cheers,

Marie
Old 10-02-2013, 11:59 AM   #17
JueFish
Member
 
Location: Connecticut

Join Date: May 2010
Posts: 42
Default

Marie,

Sorry for the long delay, but I lost track of this thread. Yes, I did finish a bunch of this work, so if you still have any specific questions about what we did, please either add to this thread or drop me a line. Basically, we used SOLiD data in both de novo assembly and subsequent RNA-seq and other analyses.

Cheers,
Nate
Old 10-15-2013, 01:05 PM   #18
tonup69
Associate Professor
 
Location: Memphis, TN

Join Date: Apr 2011
Posts: 20
Default

I am trying to do RNAseq on SOLiD data now and would like to do the work in Galaxy. I did notice the poor quality scores for the .csfastq and the .qual files generated from the xsq files. In fact, when I filtered out reads below a quality score of 20 I only got back 0.67% of the data! Should I go ahead with the untrimmed csfastq files or should I change the trim parameters?
Old 10-16-2013, 09:27 AM   #19
twaddlac
Member
 
Location: Pittsburgh, PA

Join Date: Feb 2011
Posts: 49
Default

I would try them both. Every data set is different so I'm not sure how either will change the data but it won't hurt to try. In my experience, the filtered SOLiD data yielded better results.
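As a rough illustration of why a strict whole-read filter can discard almost everything while 3' trimming keeps usable data, here is a hedged sketch of the two strategies. It is not the Galaxy tool's implementation; the function names, the Q20 threshold and the minimum length of 20 are assumptions chosen for the example.

```python
def passes_filter(qvs, min_q=20):
    """Whole-read filter: keep the read only if every call is >= min_q."""
    return all(q >= min_q for q in qvs)


def trim_3prime(cs_read, qvs, min_q=20, min_len=20):
    """Trim a colour-space read at the first call below min_q.

    cs_read is the primer base followed by colours; qvs holds one value
    per colour. Returns (trimmed read, trimmed qvs), or None if fewer
    than min_len colours survive.
    """
    keep = 0
    for q in qvs:
        if q < min_q:
            break
        keep += 1
    if keep < min_len:
        return None
    # +1 keeps the leading primer base alongside the surviving colours.
    return cs_read[:keep + 1], qvs[:keep]


if __name__ == "__main__":
    # Toy read with a good 5' end and a poor 3' end (made-up values).
    read = "T1220313"
    quals = [43, 43, 43, 43, 4, 11, 4]
    print(passes_filter(quals))
    print(trim_3prime(read, quals, min_len=3))
```

For reads like the one shown earlier in the thread, with 25 or so good calls followed by noise, a whole-read filter rejects the read, while trimming keeps the informative 5' portion.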