#1
Member
Location: Cardiff
Join Date: Dec 2011
Posts: 19
Hi, I have finally got my hands on some ECC SOLiD data. Encouragingly, it mapped really well with Lifescope (91.1% of reads mapped; 75 bp reads mapped to hg19). But with other mappers it is a very different story...

Using AB's convertFromXSQ.sh script it is possible to extract both the ECC-generated base-space reads and the underlying colour-space reads. However, when these are mapped separately using third-party mappers, I consistently find that the base-space reads map much MORE POORLY than their respective colour-space sequences. For example, comparing Lifescope (v2.5) with BWA (v0.5.9) and Bowtie1 (v0.12.7), I get the following results:

Outcome of mapping 142,102,059 ECC-generated 75 bp reads:
--------------------------------------------------------------
Lifescope --- 129,530,059 reads mapped (91.1%)
BWA CS --- 77,511,121 reads mapped (54.5%)
BWA BS --- 55,231,727 reads mapped (38.9%)
Bowtie CS --- 105,120,949 reads mapped (74.0%)
Bowtie BS --- 65,657,080 reads mapped (46.2%)
--------------------------------------------------------------

i.e., both BWA and Bowtie map the ECC base-space version of the reads considerably more poorly than the equivalent colour-space reads.

These differences can also be seen in the figure below summarising coverage distributions for the various mappings. (Note how, for BWA, mean coverage varies whilst the median stays largely the same, suggesting that there are fewer extremes in coverage over the genome. For Bowtie the distributions are quite different: colour-space is clearly better than its ECC base-space equivalent, i.e., more reads are mapping throughout the chromosome. In fact the distribution, at least for chr 1, is not markedly different from that of Lifescope.)

Has anyone else been handling ECC-generated SOLiD data and seen the same phenomenon? Does anyone have any suggestions as to why this is happening? Naively, I had assumed the ECC base-space output would, if anything, be more accurate than its corresponding colour-space reads. Certainly that improvement IS seen with Lifescope (v2.5), but not with alternative mappers (or at least not with the two considered so far).
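For anyone wanting to re-derive the rates from the counts in this post, here is a minimal Python sketch. It uses only the read counts quoted above; the dictionary and function names are made up for illustration. One thing it shows is that Lifescope's rate works out at roughly 91.15%, so the 91.1% quoted appears to be truncated rather than rounded.

```python
# Read counts exactly as quoted in the post; the names are illustrative only.
TOTAL_READS = 142_102_059

MAPPED = {
    "Lifescope": 129_530_059,
    "BWA CS": 77_511_121,
    "BWA BS": 55_231_727,
    "Bowtie CS": 105_120_949,
    "Bowtie BS": 65_657_080,
}


def percent_mapped(count, total=TOTAL_READS):
    """Percentage of all reads that were mapped (unrounded)."""
    return 100.0 * count / total


def cs_minus_bs(aligner, mapped=MAPPED):
    """Percentage points lost when the same aligner is given ECC base space."""
    cs_rate = percent_mapped(mapped[aligner + " CS"])
    bs_rate = percent_mapped(mapped[aligner + " BS"])
    return cs_rate - bs_rate


if __name__ == "__main__":
    for name, count in MAPPED.items():
        print(f"{name:<10} {count:>12,} mapped ({percent_mapped(count):.2f}%)")
    for aligner in ("BWA", "Bowtie"):
        print(f"{aligner}: {cs_minus_bs(aligner):.1f} points lower in base space")
```

Comparing the colour-space and base-space rates per aligner, rather than across aligners, keeps the comparison to the same software and settings.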
#2
Rick Westerman
Location: Purdue University, Indiana, USA
Join Date: Jun 2008
Posts: 1,104
Haven't had a chance to work with ECC reads yet. However, I am wondering whether a close manual inspection of several of the reads reveals anything. In other words, how well is the conversion of color-space to base-space working?
#3
Member
Location: Cardiff
Join Date: Dec 2011
Posts: 19
Further observations:
On exploring those reads that map in colour-space but not in ECC base-space, I'm finding that the ECC reads are very poorly translated. Presumably the Exact Call Chemistry correction has failed, leading to a translation that is no better than a crude translation from colour-space to base-space without a reference. For example, take the following colour-space read:

T122031333003330220120103210013100000300000010001031100201003003010321201003

This is represented as the following ECC base-space read:

GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC

Whilst the ECC base-space read is unmapped by Bowtie, the colour-space read is mapped by Bowtie to the following location:

bs ref:__GAGGCATATTTATAAGAACTTGGCTGAAATGTCAAATTTGGCATTCACTATGTGAACTTACTATGCTACATGCTC
cs ref:_T122031333003330220120103212003112100300103130211233111201203123313231131322
Read:___T122031333003330220120103210013100000300000010001031100201003003010321201003
_________**************************_*_**___*****_*___*__*_*_*__***_**__*_*___*__*___

Thus effectively only the first 26 positions match. Comparing the reference region identified by Bowtie with the ECC read reveals the full extent of the mistranslation (assuming, of course, that Bowtie is correct in its mapping...):

ECC:_______GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC
reference:__AGGCATATTTATAAGAACTTGGCTGAAATGTCAAATTTGGCATTCACTATGTGAACTTACTATGCTACATGCTC
____________*************************_______***_______**___________*__________*_*___**

In other words, the ECC read is no better than a crude without-reference translation:

ECC:____GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC
crude:__GAGGCATATTTATAAGAACTTGGCTGGGTACCCCCCGGGGGGGTTTTGGCACCCTTGGGCCCGGTTAGTCCAAAT
________****************************_______________***_*__******_______*_**________

It is debatable whether Bowtie should have mapped the colour-space read in the first place (it also misleadingly reports an exact match along the full length of the read, with a base-space translation identical to the reference). Nevertheless, it failed to report the ECC base-space mapping (arguably a more accurate result given the quality of the read). Lifescope also reports a mapping to the same region but accurately indicates only a partial mapping:

537_110_1030 16 chr5 41830653 7 49H26M * 0 0 CAGCCAAGTTCTTATAAATATGCCTC 2JJJJJJJJJJJJJJJJJJJJJJJJJ RG:Z:WGP_S001_1X NH:i:2 CM:i:0 NM:i:0 CQ:Z:LLLLLLLLLLLLLLLLLLLLLLKLL,%,%(,%%%1,%%(%,%%%1%%%%7%%%%((%%%1%%%%1%%%(%%%%%% CS:Z:T122031333003330220120103210013100000300000010001031100201003003010321201003

So, the overall conclusion is this: base-space reads generated by ECC chemistry include a proportion of reads that are badly translated from colour-space, and these bad translations lead to the poorer mapping performance observed compared with the equivalent colour-space mapping. Using these ECC-generated base-space reads with third-party mappers is therefore problematic, making poorer use of the available information than would be the case with colour-space mappings (if treated intelligently...). In the limited number of cases examined so far, Lifescope appears to make the best use of the available information.
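The "crude" without-reference translation used above can be reproduced with a few lines of Python. This is a minimal sketch, not the code used for the post: it assumes the standard SOLiD two-base encoding (with A=0, C=1, G=2, T=3, a colour is the XOR of two adjacent bases) and assumes the read contains only the colours 0-3 (no missing calls such as "."). The function names are invented for illustration.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3: a colour is the XOR of two adjacent bases


def decode_colourspace(cs_read):
    """Naive colour-space to base-space translation (no reference, no correction).

    cs_read is the primer base followed by the colour calls, e.g. "T1220...".
    Each base depends on the previous one, so a single wrong colour
    corrupts every base downstream of it.
    """
    current = BASES.index(cs_read[0])
    decoded = []
    for colour in cs_read[1:]:
        current ^= int(colour)  # next base = previous base XOR colour
        decoded.append(BASES[current])
    return "".join(decoded)


def encode_colourspace(primer_base, bases):
    """Inverse operation: base sequence to primer base plus colours."""
    previous = BASES.index(primer_base)
    colours = []
    for base in bases:
        code = BASES.index(base)
        colours.append(str(previous ^ code))
        previous = code
    return primer_base + "".join(colours)


def shared_prefix_length(a, b):
    """Number of leading characters two sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Read and ECC translation copied from the post above.
CS_READ = "T122031333003330220120103210013100000300000010001031100201003003010321201003"
ECC_READ = "GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC"

if __name__ == "__main__":
    crude = decode_colourspace(CS_READ)
    print(crude)
    print(shared_prefix_length(ECC_READ, crude))
```

Because every decoded base depends on all the colours before it, this kind of translation is only trustworthy up to the first miscalled colour, which is why aligning in colour space against a colour-space reference is more forgiving.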
#4
Member
Location: Cardiff
Join Date: Dec 2011
Posts: 19
Ah, you have pre-empted my subsequent post!
#5
Rick Westerman
Location: Purdue University, Indiana, USA
Join Date: Jun 2008
Posts: 1,104
It does look like the ECC failed, at least in that read. Is it possible to post both the CS read plus the ECC points? In other words, something like:
T12203C13330T033...
It would be interesting to see where the ECC slipped up. As I said, I haven't actually worked with ECC, so I am not sure what type of output can be created.
#6
Member
Location: salt lake city, UT
Join Date: Apr 2010
Posts: 72
My understanding was that, even if ECC does a good job, an aligner will by definition do a better job of converting to basespace using a dynamic programming approach from the alignment, since it has more information. So you'll lose information by using ECC base-space rather than colorspace.
Am I missing something? I did a comparison of mappers (excluding Lifescope) here: https://github.com/brentp/bowfast/tr...ligner-compare
I'll add an ECC base-space mapping at some point since it can show specificity, not just the number of reads mapped.
#7
Member
Location: Cardiff
Join Date: Dec 2011
Posts: 19
Quote:
Yes, it is clear that, at least for this read, the ECC failed (and I am extrapolating from this that ECC failure was responsible for the poorer mapping generally). I'm wondering whether this level of failure is to be expected for an ECC run or is specific to this particular run and, if the latter, how one might be alerted to a likely ECC failure (I am not aware of any metrics available to make this assessment). Does anyone have any thoughts on this?
#8
Member
Location: Cardiff
Join Date: Dec 2011
Posts: 19
Quote:
#9
Senior Member
Location: Germany
Join Date: Oct 2008
Posts: 415
My understanding is that ECC data should be particularly useful if there is no reference available, i.e. for de novo assembly. This is consistent with brentp's observation that a colour-space reference should help the aligner to correctly align the colours much more than the extra ECC spacing information would.
Perhaps, as seen with the improved Lifescope mapping rates, there is room for improvement in the other third-party aligners such as Bowtie(2) and NovoalignCS.
Brentp - thanks for the analysis and images. How do you explain the difference in results between SOLiD 3 and 5500, e.g. Novoalign as the top aligner for SOLiD 3 but the worst for 5500?
Has anyone tried ECC de novo? As far as I know, colour-space data are not pretty in de novo assembly.
#10
Senior Member
Location: Purdue University, West Lafayette, Indiana
Join Date: Aug 2008
Posts: 2,317
Quote:
-- Phillip
#11
Member
Location: Cardiff
Join Date: Dec 2011
Posts: 19
Quote:
>537_110_1030_F3
T122031333003330220120103210013100000300000010001031100201003003010321201003
>537_110_1030_F3
43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 42 43 43 11 4 11 4 7 11 4 4 4 16 11 4 4 7 4 11 4 4 4 16 4 4 4 4 22 4 4 4 4 7 7 4 4 4 16 4 4 4 4 16 4 4 4 7 4 4 4 4 4 4

I've had a quick look at a few other examples (not posted) and yes, similarly poor QVs are reported for the later ligations. Thanks for the useful insight!
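To put numbers on how sharp the drop is, the QVs above can be converted to error probabilities. This is a rough sketch that assumes the SOLiD QVs are on the usual Phred scale (error probability = 10^(-QV/10)); the threshold of 20 and the function names are arbitrary choices for illustration.

```python
# QVs for read 537_110_1030_F3, copied from the post above.
QV_LINE = (
    "43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 42 43 43 "
    "11 4 11 4 7 11 4 4 4 16 11 4 4 7 4 11 4 4 4 16 4 4 4 4 22 "
    "4 4 4 4 7 7 4 4 4 16 4 4 4 4 16 4 4 4 7 4 4 4 4 4 4"
)


def phred_to_error(qv):
    """Error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def mean_error(qvs):
    """Average per-position error probability over a list of QVs."""
    return sum(phred_to_error(q) for q in qvs) / len(qvs)


def first_index_below(qvs, threshold):
    """0-based index of the first QV below threshold (len(qvs) if none)."""
    for index, qv in enumerate(qvs):
        if qv < threshold:
            return index
    return len(qvs)


qvs = [int(token) for token in QV_LINE.split()]
split = first_index_below(qvs, 20)
fold_change = mean_error(qvs[split:]) / mean_error(qvs[:split])

if __name__ == "__main__":
    print(f"high-quality positions before the drop: {split}")
    print(f"mean error before: {mean_error(qvs[:split]):.2e}")
    print(f"mean error after:  {mean_error(qvs[split:]):.2e}")
    print(f"fold change in estimated error: {fold_change:.0f}x")
```

Averaging error probabilities rather than QVs avoids overstating the quality of the tail, since QVs are on a log scale.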
#12
Senior Member
Location: Purdue University, West Lafayette, Indiana
Join Date: Aug 2008
Posts: 2,317
Nice to be right. I guess. Still, even though I gave an explanation for the result, I am mystified by it. The SOLiD collects data from every 5th CS position via successive ligations starting from a primer. After collecting a set of these (15 for 75-base reads), all the nascent strands are melted off the template and a new primer, with a different offset, is annealed. (ECC adds an additional, sparser set of these ligations to allow some additional error detection/correction.)
So my question is: why the stark difference after base 25? That is a 1000x drop in estimated sequence accuracy. It seems too much to believe that 5 different sets of ligations suddenly started producing bad data right at ligation 6. It seems more likely that this occurred a single time: in the ECC ligation. If the ECC goes bad at ligation 6 for many of the beads, what happens to the sequence? Or, can low-quality ECC bases impact high-quality non-ECC bases?
I guess I am at the same point that Rick is: wanting to see the ECC read separate from the non-ECC ligations. But I have not looked at the .xsq format, so I don't even know if this is possible.
-- Phillip
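The argument above can be made concrete with a deliberately simplified model of the ligation scheme. It assumes only what the post states (each ligation advances 5 positions, 15 ligations per primer for a 75-position read) and further assumes that colour position p is read in ligation cycle ceil(p/5). It ignores primer offsets into the adapter and says nothing about how the ECC ligations are interleaved, so treat it as an illustration of the reasoning, not a description of the instrument.

```python
PROBE_STEP = 5    # positions advanced per ligation (from the post)
READ_LENGTH = 75  # colour positions per read


def ligation_cycle(position, step=PROBE_STEP):
    """1-based ligation cycle that reads a 1-based colour position (simplified)."""
    return (position - 1) // step + 1


def positions_in_cycle(cycle, step=PROBE_STEP):
    """Colour positions read during one cycle, one per primer round (simplified)."""
    first = (cycle - 1) * step + 1
    return list(range(first, first + step))


if __name__ == "__main__":
    print("ligations per primer:", ligation_cycle(READ_LENGTH))
    print("position 25 -> cycle", ligation_cycle(25))
    print("position 26 -> cycle", ligation_cycle(26))
    print("positions read in cycle 6:", positions_in_cycle(6))
```

Under this model the last good position (25) closes cycle 5 and the first bad one (26) opens cycle 6, with positions 26-30 each coming from a different primer round. That is what makes a simultaneous failure of all five rounds look implausible.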
#13
Member
Location: salt lake city, UT
Join Date: Apr 2010
Posts: 72
Quote:
Good point about de novo. Though it'd be hard to make a case for de novo with 2*50 on SOLiD when you can do 2*100+ on Illumina.
#14
Senior Member
Location: Purdue University, West Lafayette, Indiana
Join Date: Aug 2008
Posts: 2,317
Quote:
So, I would agree. You would probably want to use an Illumina for de novo. SOLiD, if it still has a strong use case, would be for expression analysis. Or, well, reasonably short (~3 kb or less) SOLiD mate-end libraries are pretty robust. You might want to use SOLiD reads for the ME components of a de novo assembly.
-- Phillip
#15
Member
Location: Connecticut
Join Date: May 2010
Posts: 42
Great discussion. We have noticed a similar issue with our first run of SOLiD 5500 data: poor mapping of ECC data vs. Lifescope, drop-off of quality metrics after the first 20-25 bp of reads, etc. We'll have to look further and see if we are observing the same mapping phenomena.
I do have one question, though: is everyone here working with model systems (i.e. human, mouse, etc.)? We have run a couple of data sets through so far (human transcriptome, non-model fish transcriptome, and some RIP-seq data) and, while observing the same issue with quality, have had trouble implementing Lifescope for non-model RNA-seq analysis, as we have a reference transcriptome but no reference genome. Has anyone done non-model work using only a reference transcriptome? If so, I'd love to ask you a few questions about how you got Lifescope to work.
Cheers, Nate
#16
Junior Member
Location: france
Join Date: Feb 2012
Posts: 1
Dear Nate,
I am actually considering doing some RNA-seq analysis on a non-model organism (with a reference transcriptome, but no genome) using SOLiD data. Did you finally succeed in mapping your SOLiD data to your reference transcriptome?
Cheers, Marie
#17
Member
Location: Connecticut
Join Date: May 2010
Posts: 42
Marie,
Sorry for the long delay, but I lost track of this thread. Yes, I did finish a bunch of this work, so if you still have any specific questions about what we did, please either add to this thread or drop me a line. Basically, we used SOLiD data in both de novo assembly and subsequent RNA-seq and other analyses.
Cheers, Nate
#18
Associate Professor
Location: Memphis, TN
Join Date: Apr 2011
Posts: 20
I am trying to do RNA-seq on SOLiD data now and would like to do the work in Galaxy. I did notice the poor quality scores in the .csfastq and .qual files generated from the xsq files. In fact, when I filtered out reads below a quality score of 20, I only got back 0.67% of the data! Should I go ahead with the untrimmed csfastq files or should I change the trim parameters?
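The post doesn't say which Galaxy tool or filtering rule gave the 0.67% figure, so the following is only a sketch of why the choice of rule matters. It applies two hypothetical rules to the read and QVs posted in #11: discarding a read if any QV is below 20, versus trimming the read at the first QV below 20. Function names and the threshold are illustrative, and whether a downstream mapper accepts variable-length colour-space reads is a separate question not covered here.

```python
def passes_strict_filter(qvs, threshold=20):
    """Hypothetical rule: keep a read only if every QV meets the threshold."""
    return all(q >= threshold for q in qvs)


def trim_3prime(cs_read, qvs, threshold=20):
    """Hypothetical rule: cut the read at the first QV below the threshold.

    cs_read starts with the primer base, so it is one character longer
    than the list of QVs.
    """
    keep = next((i for i, q in enumerate(qvs) if q < threshold), len(qvs))
    return cs_read[: keep + 1], qvs[:keep]


# Read 537_110_1030_F3 and its QVs, as posted in #11.
CS_READ = "T122031333003330220120103210013100000300000010001031100201003003010321201003"
QVS = [43] * 22 + [42, 43, 43] + [
    int(token)
    for token in (
        "11 4 11 4 7 11 4 4 4 16 11 4 4 7 4 11 4 4 4 16 4 4 4 4 22 "
        "4 4 4 4 7 7 4 4 4 16 4 4 4 4 16 4 4 4 7 4 4 4 4 4 4"
    ).split()
]

if __name__ == "__main__":
    print("survives strict filter:", passes_strict_filter(QVS))
    trimmed_read, trimmed_qvs = trim_3prime(CS_READ, QVS)
    print("trimmed read:", trimmed_read)
    print("colours kept:", len(trimmed_qvs))
```

For a read like this one, a whole-read filter throws everything away, while trimming keeps the high-quality leading colours.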
#19
Member
Location: Pittsburgh, PA
Join Date: Feb 2011
Posts: 49
I would try them both. Every data set is different so I'm not sure how either will change the data but it won't hurt to try. In my experience, the filtered SOLiD data yielded better results.