  • SOLiD 5500xl ECC basespace vs colourspace mapping

    Hi, have finally got my hands on some ECC SOLiD data. Encouragingly it mapped really well with Lifescope (91.1% reads mapped, 75 bp reads mapped to hg19). But with other mappers it is a very different story...

    Using AB's convertFromXSQ.sh script it's possible to extract both the ECC-generated basespace reads and the underlying colourspace reads. However, when the two are mapped separately using alternative third-party mappers, I consistently find that the basespace reads map much more POORLY than their respective colourspace sequences.

    For example comparing Lifescope (v2.5) with BWA (v. 5.9) and Bowtie1 (v0.12.7) I get the following results:

    Outcome of mapping 142,102,059 ECC generated 75 bp reads:
    --------------------------------------------------------------
    Lifescope --- 129,530,059 reads mapped (91.1%)
    BWA CS --- 77,511,121 reads mapped (54.5%)
    BWA BS --- 55,231,727 reads mapped (38.9%)
    Bowtie CS --- 105,120,949 reads mapped (74.0%)
    Bowtie BS --- 65,657,080 reads mapped (46.2%)
    --------------------------------------------------------------

    i.e., both BWA and Bowtie map the ECC basespace version of the reads considerably more poorly than the equivalent colourspace version.
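
For anyone who wants to re-derive the percentages, here is a minimal sketch that recomputes them from the read counts in the table above. The counts are copied from the post; the function and variable names are just illustrative.

```python
# Read counts copied from the table above.
TOTAL = 142_102_059
MAPPED = {
    "Lifescope": 129_530_059,
    "BWA CS": 77_511_121,
    "BWA BS": 55_231_727,
    "Bowtie CS": 105_120_949,
    "Bowtie BS": 65_657_080,
}


def percent_mapped(mapped, total=TOTAL):
    """Percentage of reads mapped."""
    return 100.0 * mapped / total


def cs_minus_bs(mapper):
    """Percentage-point gap between colour-space and base-space mapping."""
    cs = percent_mapped(MAPPED[f"{mapper} CS"])
    bs = percent_mapped(MAPPED[f"{mapper} BS"])
    return cs - bs


if __name__ == "__main__":
    for name, count in MAPPED.items():
        print(f"{name:10s} {percent_mapped(count):6.2f}%")
    for mapper in ("BWA", "Bowtie"):
        print(f"{mapper}: CS minus BS = {cs_minus_bs(mapper):.1f} points")
```

The gap function makes the point of the table explicit: for both third-party mappers the colourspace run maps a larger share of the same reads.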

    These differences can also be seen in the figure below summarising coverage distributions for the various mappings. (Note how, for BWA, mean coverage varies whilst the median stays largely the same, suggesting that there are fewer extremes in coverage over the genome. For Bowtie the distributions are quite different: colourspace is clearly better than its ECC basespace equivalent, i.e., more reads map throughout the chromosome. In fact the distribution, at least for chr 1, is not markedly different from that of Lifescope.)



    Has anyone else been handling ECC-generated SOLiD data and seen the same phenomenon? Does anyone have any suggestions as to why this is happening? Naively, I had assumed the ECC basespace output would, if anything, be more accurate than its corresponding colourspace reads. Certainly that improvement IS seen with Lifescope (v2.5), but not with alternative mappers (or at least not with the two considered so far).

  • #2
    Haven't had a chance to work with ECC reads yet. However, I am wondering whether a close manual inspection of several of the reads reveals anything. In other words, how well is the conversion of color-space to base-space working?



    • #3
      Further observations:

      on exploring those reads that map in colour-space but not in ECC base-space, I'm finding that the ECC reads are very poorly translated - presumably the Exact Call Chemistry correction has failed leading to a bad translation that is no better than a crude translation from colour-space to base-space without a reference.

      For example take the following colour-space read:
      T122031333003330220120103210013100000300000010001031100201003003010321201003

      This is represented as the following ECC base-space read:
      GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC

      Whilst the ECC base-space read is unmapped by Bowtie, the colour-space read is mapped by Bowtie to the following location:

      bs ref:__GAGGCATATTTATAAGAACTTGGCTGAAATGTCAAATTTGGCATTCACTATGTGAACTTACTATGCTACATGCTC
      cs ref:_T122031333003330220120103212003112100300103130211233111201203123313231131322
      Read:___T122031333003330220120103210013100000300000010001031100201003003010321201003
      _________**************************_*_**___*****_*___*__*_*_*__***_**__*_*___*__*___



      Thus effectively only the first 26 bases match.
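
The match line can be reproduced with a few lines of code. This is a minimal sketch; the two strings are the colour-space reference and read from the alignment above, and the helper names are my own.

```python
def match_line(a, b, match="*", mismatch="_"):
    """Mark each position where two equal-length strings agree."""
    return "".join(match if x == y else mismatch for x, y in zip(a, b))


def matching_prefix_length(a, b):
    """Number of leading positions at which a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Colour-space reference and read as posted above (leading T is the primer base).
CS_REF = "T122031333003330220120103212003112100300103130211233111201203123313231131322"
CS_READ = "T122031333003330220120103210013100000300000010001031100201003003010321201003"

if __name__ == "__main__":
    # Drop the primer base so only colour calls are compared.
    print(match_line(CS_REF[1:], CS_READ[1:]))
    print(matching_prefix_length(CS_REF[1:], CS_READ[1:]))
```

Note that scattered matches after the first mismatch are expected by chance alone, since there are only four colours.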

      Comparing the reference region, identified by Bowtie, with the ECC read the full extent of the mistranslation is revealed (assuming of course that Bowtie is correct in its mapping...)

      ECC:_______GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC
      reference:__AGGCATATTTATAAGAACTTGGCTGAAATGTCAAATTTGGCATTCACTATGTGAACTTACTATGCTACATGCTC
      ____________*************************_______***_______**___________*__________*_*___**


      In other words the ECC read is no better than a crude without-reference translation:

      ECC:____GAGGCATATTTATAAGAACTTGGCTGGGGCAAAAAAAAAAAACTTTCGATACCCTTCCCGGACGCTAAGAGCTTC
      crude:__GAGGCATATTTATAAGAACTTGGCTGGGTACCCCCCGGGGGGGTTTTGGCACCCTTGGGCCCGGTTAGTCCAAAT
      ________****************************_______________***_*__******_______*_**________
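
For reference, the "crude" translation is just the standard two-base colour decoding applied from the primer base onwards. A minimal sketch, assuming the usual SOLiD encoding (A=0, C=1, G=2, T=3, next base = previous base XOR colour); the function names are illustrative and not part of any AB tool.

```python
BASES = "ACGT"  # index of each base doubles as its 2-bit code


def decode_colourspace(cs_read):
    """Naively translate a csfasta read (primer base + colour digits) to bases.

    No reference and no error correction: every colour is taken at face value.
    Reads containing '.' (missing colour calls) will raise ValueError.
    """
    prev = BASES.index(cs_read[0])
    out = []
    for colour in cs_read[1:]:
        prev ^= int(colour)
        out.append(BASES[prev])
    return "".join(out)


def encode_colourspace(primer_base, bases):
    """Inverse operation: bases back to a colour string with its primer base."""
    prev = BASES.index(primer_base)
    colours = []
    for base in bases:
        cur = BASES.index(base)
        colours.append(str(prev ^ cur))
        prev = cur
    return primer_base + "".join(colours)


CS_READ = "T122031333003330220120103210013100000300000010001031100201003003010321201003"

if __name__ == "__main__":
    print(decode_colourspace(CS_READ))
```

Running this on the read above should give the "crude" sequence shown in the comparison, which is a quick way to check a handful of other reads by hand.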


      It is debatable whether Bowtie should have mapped the colour-space read in the first place (it misleadingly reports an exact match along the full length of the read, with a base-space translation identical to the reference). Nevertheless, it did not report a mapping for the ECC base-space read (arguably a more accurate result given the quality of the read).

      Lifescope also reports a mapping to the same region but accurately indicates only a partial mapping:

      537_110_1030 16 chr5 41830653 7 49H26M * 0 0 CAGCCAAGTTCTTATAAATATGCCTC 2JJJJJJJJJJJJJJJJJJJJJJJJJ RG:Z:WGP_S001_1X NH:i:2 CM:i:0 NM:i:0 CQ:Z:LLLLLLLLLLLLLLLLLLLLLLKLL,%,%(,%%%1,%%(%,%%%1%%%%7%%%%((%%%1%%%%1%%%(%%%%%% CS:Z:T122031333003330220120103210013100000300000010001031100201003003010321201003
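
To read that record: a minimal sketch that pulls apart the CIGAR, the strand flag and the colour qualities. It assumes standard SAM conventions (flag bit 0x10 for reverse strand, the usual CIGAR operators) and Phred+33 encoding for the CQ tag; the helper names are my own.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operator) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def unclipped_read_bases(cigar):
    """Read bases that take part in the alignment (M, I, =, X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MI=X")


def original_read_length(cigar):
    """Read length including soft- and hard-clipped bases."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MISH=X")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def decode_qualities(qual, offset=33):
    """ASCII quality string to integer QVs (offset is an assumption)."""
    return [ord(c) - offset for c in qual]


if __name__ == "__main__":
    flag, cigar, seq = 16, "49H26M", "CAGCCAAGTTCTTATAAATATGCCTC"
    print("reverse strand:", bool(flag & 0x10))
    print("aligned bases:", unclipped_read_bases(cigar),
          "of", original_read_length(cigar))
    print("read orientation:", reverse_complement(seq))
```

Because the read is on the reverse strand, the 49 hard-clipped bases listed first in the CIGAR correspond to the end of the original read, so the 26 aligned bases are its first 26.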

      So, the overall conclusion is this: base-space reads generated by ECC chemistry include a proportion of reads badly translated from colour-space, and these bad translations lead to the poorer mapping performance observed compared with the equivalent colour-space mapping. Using these ECC-generated basespace reads with third-party mappers is therefore problematic, making poorer use of the available information than would be the case with colour-space mappings (if treated intelligently...). In the limited number of cases examined so far, Lifescope appears to make the best use of the available information.



      • #4
        Originally posted by westerman
        Haven't had a chance to work with ECC reads yet. However, I am wondering whether a close manual inspection of several of the reads reveals anything. In other words, how well is the conversion of color-space to base-space working?
        Ah, you have pre-empted my subsequent post!



        • #5
          Originally posted by NestorNotabilis
          Ah, you have pre-empted my subsequent post!
          It does look like the ECC failed, at least in that read. Is it possible to post both the CS read plus the ECC points? In other words something like:


          T12203C13330T033...

          It would be interesting to see where the ECC slipped up.

          As I said I haven't actually worked with ECC so I am not sure what type of output can be created.



          • #6
            My understanding was that, even if ECC does a good job, by definition, an aligner will do a better job converting to basespace using dynamic programming approach from the alignment since it has more information. So, you'll lose information by using ECC base-space rather than colorspace.
            Am I missing something?
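
One way to see why a reference-free translation loses information: in a naive decoding, a single miscalled colour changes every base after it, whereas in colour space it remains a single mismatch that an aligner can absorb. A minimal sketch, assuming the standard two-base encoding; the example read and helper names are made up for illustration.

```python
BASES = "ACGT"


def decode(cs_read):
    """Naive colour-space to base-space translation (no reference)."""
    prev = BASES.index(cs_read[0])
    out = []
    for colour in cs_read[1:]:
        prev ^= int(colour)
        out.append(BASES[prev])
    return "".join(out)


def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


def with_colour_error(cs_read, position, new_colour):
    """Replace the colour at 1-based `position` (index 0 is the primer base)."""
    if cs_read[position] == new_colour:
        raise ValueError("new colour must differ from the original call")
    return cs_read[:position] + new_colour + cs_read[position + 1:]


if __name__ == "__main__":
    true_read = "T12203133300333022012"           # hypothetical 20-colour read
    bad_read = with_colour_error(true_read, 6, "2")
    print("colour mismatches:", mismatches(true_read[1:], bad_read[1:]))
    print("base mismatches:  ", mismatches(decode(true_read), decode(bad_read)))
```

With one wrong colour at position 6 of 20, the colour strings differ at one position but the decoded bases differ from position 6 to the end.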

            I did a comparison of mappers (excluding lifescope) here:
            brentp/bowfast on GitHub (run bowtie then bfast on colorspace reads).


            I'll add an ECC base-space mapping at some point, since it can show specificity, not just the number of reads mapped.



            • #7
              Originally posted by westerman
              It does look like the ECC failed, at least in that read. Is it possible to post both the CS read plus the ECC points? In other words something like:


              T12203C13330T033...

              It would be interesting to see where the ECC slipped up.

              As I said I haven't actually worked with ECC so I am not sure what type of output can be created.
              Have yet to come across that sort of output - AB's script, convertFromXSQ.sh, outputs only 3 files per library: csfasta, QV.qual for colourspace and fastq for basespace. I'm unclear whether the information you refer to is extractable from the XSQ files - does anyone know how to do this?

              Yes, it is clear that, at least for this read, the ECC failed (and I am extrapolating from this that ECC failure was responsible for the poorer mapping generally). I'm wondering whether this level of failure is to be expected for an ECC run or is specific to this particular run and, if the latter, how one might be alerted to a likely ECC failure (I am not aware of any metrics available to make this assessment). Does anyone have any thoughts on this?



              • #8
                Originally posted by brentp
                My understanding was that, even if ECC does a good job, by definition, an aligner will do a better job converting to basespace using dynamic programming approach from the alignment since it has more information. So, you'll lose information by using ECC base-space rather than colorspace.
                Am I missing something?
                That sounds like a reasonable assessment and, if true, it does raise the question of the value of the basespace fastq. ECC clearly improves the Lifescope mapping, as evidenced by the 91.1% mapping success (compared with ~80% expected for a colourspace mapping of this sort), but how exploitable is this ECC information by current third-party mappers?



                • #9
                  My understanding is that ECC data should be particularly useful if there is no reference available, i.e. for de novo assembly. This is consistent with brentp's observation that a colour-space reference should help the aligner to correctly align the colours much more than the extra ECC spacing information would.

                  Perhaps, as seen with the improved Lifescope mapping rates, there is room for improvement for the other third-party aligners such as Bowtie(2) and NovoalignCS.

                  Brentp - thanks for the analysis and images. How do you explain the difference in results between SOLiD 3 and 5500, e.g. Novoalign as the top aligner for SOLiD 3 but the worst for 5500?

                  Has anyone tried ECC de novo? As far as I know, colour-space data are not pretty in de novo assembly.



                  • #10
                    Originally posted by NestorNotabilis
                    Further observations:

                    on exploring those reads that map in colour-space but not in ECC base-space, I'm finding that the ECC reads are very poorly translated - presumably the Exact Call Chemistry correction has failed leading to a bad translation that is no better than a crude translation from colour-space to base-space without a reference.

                    For example take the following colour-space read:
                    T122031333003330220120103210013100000300000010001031100201003003010321201003
                    Could you post the .QUAL record for this .cs read? One possibility is that the bead was sparsely templated, leading to reasonably high quality in the early ligations, but noise after a few ligation cycles.

                    --
                    Phillip



                    • #11
                      Originally posted by pmiguel
                      Could you post the .QUAL record for this .cs read? One possibility is that the bead was sparsely templated, leading to reasonably high quality in the early ligations, but noise after a few ligation cycles.

                      --
                      Phillip
                      Sure. The info is as follows:

                      >537_110_1030_F3
                      T122031333003330220120103210013100000300000010001031100201003003010321201003


                      >537_110_1030_F3
                      43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 42 43 43 11 4 11 4 7 11 4 4 4 16 11 4 4 7 4 11 4 4 4 16 4 4 4 4 22 4 4 4 4 7 7 4 4 4 16 4 4 4 4 16 4 4 4 7 4 4 4 4 4 4


                      I've had a quick look at a few other examples (not posted) and yes, similarly poor QVs are reported for the later ligations. Thanks for the useful insight!
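
To put numbers on that, here is a minimal sketch that converts the QVs posted above to error probabilities and finds where the quality first collapses. It assumes the QVs are ordinary Phred scores (error = 10^(-Q/10)); the threshold of 20 and the function names are my own choices.

```python
def phred_to_error(q):
    """Estimated error probability for a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def first_low_quality_position(qvs, threshold=20):
    """1-based position of the first QV below threshold, or None."""
    for position, q in enumerate(qvs, start=1):
        if q < threshold:
            return position
    return None


def mean_error(qvs):
    return sum(phred_to_error(q) for q in qvs) / len(qvs)


# QVs for 537_110_1030_F3 as posted above, five per row for readability.
QVS = [
    43, 43, 43, 43, 43,
    43, 43, 43, 43, 43,
    43, 43, 43, 43, 43,
    43, 43, 43, 43, 43,
    43, 43, 42, 43, 43,
    11, 4, 11, 4, 7,
    11, 4, 4, 4, 16,
    11, 4, 4, 7, 4,
    11, 4, 4, 4, 16,
    4, 4, 4, 4, 22,
    4, 4, 4, 4, 7,
    7, 4, 4, 4, 16,
    4, 4, 4, 4, 16,
    4, 4, 4, 7, 4,
    4, 4, 4, 4, 4,
]

if __name__ == "__main__":
    drop = first_low_quality_position(QVS)
    print("first position below Q20:", drop)
    print("mean error before drop:", mean_error(QVS[:drop - 1]))
    print("mean error from drop on:", mean_error(QVS[drop - 1:]))
```

Running the same check over a whole .qual file would show how many beads share this pattern, which might serve as a rough per-run warning sign.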



                      • #12
                        Nice to be right. I guess. Still, even though I gave an explanation for the result, I am mystified by it. The SOLiD collects data from every 5th CS position via successive ligations starting from a primer. After collecting a set of these (15 for 75 base reads), all the nascent strands are melted off the template and a new primer, with a different offset, is annealed. (ECC adds an additional, more sparse, set of these ligations to allow some additional error detection/correction.)

                        So my question is why the stark difference after base 25? A 1000x drop in estimated sequence accuracy. It seems too much to ask to believe that 5 different sets of ligations suddenly started producing bad data right at ligation 6. Seems more likely that this occurred a single time: in the ECC ligation. If the ECC goes bad at ligation 6 for many of the beads, what happens to the sequence? Or, can low quality ECC bases impact high quality non-ECC bases?
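
To make the "ligation 6" arithmetic concrete, a minimal sketch of the position-to-cycle bookkeeping, assuming only what is described above (each primer round samples every 5th position, 15 ligations per round for 75-colour reads). Which primer offset reads which position within a block of five is not modelled here, and the function names are hypothetical.

```python
STEP = 5                   # each primer round samples every 5th position
LIGATIONS_PER_PRIMER = 15  # for 75-colour reads, as described above


def ligation_cycle(position, step=STEP):
    """1-based ligation cycle covering a 1-based colour position."""
    return (position - 1) // step + 1


def positions_in_cycle(cycle, step=STEP):
    """Colour positions read at a given ligation cycle, across all primer rounds."""
    first = (cycle - 1) * step + 1
    return list(range(first, first + step))


if __name__ == "__main__":
    for pos in (25, 26, 75):
        print(f"position {pos} -> ligation cycle {ligation_cycle(pos)}")
    print("cycle 6 covers positions", positions_in_cycle(6))
```

Under this bookkeeping, a quality drop starting right after position 25 means all five primer rounds went bad at the same cycle, which is the coincidence being questioned.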

                        I guess I am at the same point as Rick: wanting to see the ECC read separate from the non-ECC ligations. But I have not looked at the XSQ format, so I don't even know if this is possible.

                        --
                        Phillip



                        • #13
                          Originally posted by colindaven
                          Brentp - thanks for the analysis and images. How do you explain the difference in results between SOLiD 3 and 5500, e.g. Novoalign as the top aligner for SOLiD 3 but the worst for 5500?
                          Well, the SOLiD 3 has shorter reads and higher error rates than the 5500. So, the different aligners are probably tuned for one or the other. You see novoalign as the top aligner, but I think BFAST does well in both cases as long as you have a mapping quality cutoff.

                          Good point about de novo. Though it'd be hard to make a case for de novo with 2*50 with SOLiD when you can do 2*100+ on illumina.



                          • #14
                            Originally posted by brentp
                            Good point about de novo. Though it'd be hard to make a case for de novo with 2*50 with SOLiD when you can do 2*100+ on illumina.
                            2*60 for the 5500. But those require Mate End libraries -- substantially more difficult to make than Paired End libraries.

                            So, I would agree. You would probably want to use an Illumina for de novo. SOLiD, if it still has a strong use case, would be for expression analysis. Or -- well, reasonably short (~3 kb or less) SOLiD mate end libraries are pretty robust. Might want to use SOLiD reads for ME components of a de novo assembly.

                            --
                            Phillip



                            • #15
                              Great discussion. We have noticed a similar issue with our first run of SOLiD 5500 data: poor mapping of ECC data vs. Lifescope, drop-off of quality metrics after the first 20-25 bp of reads, etc. We'll have to look further and see if we are observing the same mapping phenomena. I do have one question, though: is everyone working with model systems here (i.e. human, mouse, etc.)? We have run a few data sets through so far (human transcriptome, non-model fish transcriptome, and some RIP-seq data) and, while observing the same issue with quality, have had trouble implementing Lifescope for non-model RNA-seq analysis, as we have a reference transcriptome but no reference genome. Has anyone done non-model work using only a reference transcriptome? If so, I'd love to ask you a few questions about how you got Lifescope to work.

                              Cheers,
                              Nate
