  • Color-space/SOLID assembly issues?

    Hello all,

    Our lab is about to undertake a large transcriptome sequencing project where we will be doing both de novo assembly of transcriptomes and RNA-seq comparisons across and within species, and we are trying to make some decisions on which tech to use. We have direct access to the new SOLiD 5500 and are considering using that for our project (as opposed to the Illumina HiSeq, which would probably be a little more expensive and have longer turn-around times). If we run on this machine, we have the option of getting back a variety of different types of data: short read color-space, short read in base-space (new feature), and paired-end short read in color-space. From what I understand, I am inclined to favor the last option, as it seems paired-end info will help a lot with ambiguities in assembly and mapping to any subsequently identified isoforms. However, I often hear non-specific grumblings and frustrations with color-space data and skepticism about color-space assembly in particular (though color-space obviously has advantages for anything involving SNPs), and I am concerned that getting data back in color-space may pose problems down the line. Does anyone have advice or comments that might be of use?

    Thanks in advance.

  • #2
    I've been mainly working with ABI data. If you want to do a de novo transcriptome assembly, I would not recommend SOLiD reads, even if they are paired-end.

    Comment


    • #3
      Assembly in color-space per se is no worse than assembly in base-space. It is just a different way of designating information. At the very least you can always convert color-space reads into the dreaded "double-encoded" pseudo-base-space format.
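
For anyone who has not seen double-encoding, here is a minimal sketch in Python. It assumes the usual 0/1/2/3 to A/C/G/T substitution and, as one common preprocessing choice, drops the primer base and the first color (which depends on the primer). The function name and the choice to drop the first color are assumptions for illustration, not something taken from any particular tool in this thread.

```python
# Conventional double-encoding table: color digit -> pseudo-base.
# "." marks a missing color call and is mapped to N.
DOUBLE_ENCODE = {"0": "A", "1": "C", "2": "G", "3": "T", ".": "N"}

def double_encode(cs_read):
    """Turn a color-space read such as 'T201322301332' into pseudo-bases.

    The leading primer base and the first color are skipped (an assumption
    made here; some pipelines keep the first color).
    """
    colors = cs_read[2:]
    return "".join(DOUBLE_ENCODE[c] for c in colors)

print(double_encode("T201322301332"))  # ACTGGTACTTG
```

The output is not real sequence. The letters are just colors wearing a base-space costume so that base-space tools will accept them, and the result still has to be converted back properly at the end.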

      The grumbles you are likely getting arise from:

      (a) unfamiliarity with color-space

      (b) wanting to use familiar tools that only work in base-space

      (c) short reads; this has gotten better with the 5500, but even now the SOLiD paired-end reads are much shorter than Illumina's ... and shorter is almost always worse for assembly.

      (d) perceived poor-quality reads from the SOLiD, which color-space compensates for in mapping but which can play havoc (maybe!) in assembly

      (e) The eventual conversion of an assembly done in color-space into base-space can lead to potential problems due to what I call a 'frame-shift' in the conversion process.

      I don't have much experience with the 5500 but have done a lot of work on the SOLiD 4. While mapping worked reasonably well, assembly never seemed to work very well. We also had problems, even in mapping, with the second (short) paired-end read; it did not seem to add that much value. But the longer reads from the 5500 should help in assembly.

      Personally -- and this is without much information to back it up -- for assembly I would go with single-end 75bp short read in color-space/base-space. I.e., the work is done in color-space with base-space information available. Do the assembly work in color-space and then in the final base-space conversion use the base-space information to make sure you do not go "off-frame". Paired-end is useful and can help detect the splice sites ... but for assembly work I would want that base-space information. For mapping I'd go with paired-ends. For de-novo genome assembly I'd go with fragments plus, if possible, mate-pairs.

      If someone else comes up with better advice then please use that! I look forward to getting some 5500 data to play around with but until then I am working from theory instead of practice.

      Comment


      • #4
        Originally posted by damiankao View Post
        I've been mainly working with ABI data. If you want to do a de novo transcriptome assembly, I would not recommend SOLiD reads, even if they are paired-end.
        And why not? I can guess but it would have been nice to have a reason.

        Color-space issues?

        Length issues?

        Quality issues?

        Cost issues?

        Number of reads issue? .... unlikely for the SOLiD!

        Comment


        • #5
          Isn't ECC essentially cs assembly...
          and what about running their SAET first in cs, then assembling.

          Comment


          • #6
            The problem with paired end SOLiD chemistry is that the reverse reads suck.

            They were off to a promising start initially because they combined the release of paired end chemistry with a dramatic overall drop in price. The reverse reads were 35 nt, not 50 nt like the forward reads, but as long as your instrument was not oversubscribed, the extra 4 days they took to run were not bad, and they gave pretty good mapping rates (>40%).

            I don't know exactly what happened, but after the first few batches of PE chemistry, the quality tanked for reverse reads. On top of that ABI had major supply bottlenecks at the time. Anyway as of the last run I did (early summer) the reverse reads were of such low quality that they contributed little to the overall data set.

            I have no idea how well PE reads will look on the 5500XL. I don't intend to use them. ECC is not available for reverse chemistry.

            The ECC chemistry is a different story. You still should do assemblies and mapping in color space. But all the conversion to base space issues should be gone.

            My tendency is to agree with damiankao -- not for de novo transcriptomes. But that is just the difference between SR and PE assemblies. So, it would work, just not as well as PE data would. And, again, I have not seen how the reverse chemistry is working since early summer. Could be all the issues are worked out. Still not great for de novo because there would be no ECC to anchor the color to base space conversion.

            By the way, the distinction between PE and ME (mate end) is important here. ABI uses the same, good, chemistry for both ME reads.

            --
            Phillip

            Comment


            • #7
              Originally posted by SeqAA View Post
              Isn't ECC essentially cs assembly...
              and what about running their SAET first in cs, then assembling.
              My understanding -- and once again understand that I haven't actually worked with a 5500-generated data set -- is that the ECC (Exact Call Chemistry) will give a corrected base-space read in addition to the color-space reads. That correction is not an assembly. What I expect that machine to give me is a 75-base fragment that is extremely accurate because every 5 bases the color-space starts fresh.

              In other words on the SOLiD-4 I would get 50-base reads that look like:

              T201322301332....

              When I translate that to base-space, the trailing end (3') might deviate from reality due to quality degradation and the consequent 'frame-shift' that derails color-space conversion. As I previously mentioned, this doesn't matter in mapping because the known reference pulls the conversion back on track, but it is problematic for assembly.
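
To make the 'frame-shift' concrete, here is a small Python sketch of color-to-base decoding using the standard two-base encoding (A=0, C=1, G=2, T=3, color = XOR of adjacent bases). The read is the truncated example above; the miscalled color is a made-up error for illustration.

```python
BASES = "ACGT"  # index of each base is its 2-bit code: A=0, C=1, G=2, T=3

def decode(cs_read):
    """Decode a color-space read (primer base followed by color digits).

    Each color is the XOR of the codes of two adjacent bases, so every
    decoded base depends on the one decoded before it.
    """
    prev = BASES.index(cs_read[0])
    out = []
    for color in cs_read[1:]:
        prev ^= int(color)
        out.append(BASES[prev])
    return "".join(out)

good = "T201322301332"
bad = good[:5] + "0" + good[6:]  # hypothetical miscall of the fifth color

good_bases = decode(good)
bad_bases = decode(bad)
print(good_bases)  # CCATCTAACGCT
print(bad_bases)   # CCATTCGGTATC

mismatches = [i for i, (g, b) in enumerate(zip(good_bases, bad_bases)) if g != b]
print(mismatches)  # every position from the miscall to the 3' end
```

One wrong color out of twelve leaves the first four bases intact and corrupts all eight bases after it. In mapping, the reference lets the aligner recognise the lone color mismatch as an error; in a naive de novo conversion there is nothing to do that.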

              On the SOLiD-5500 I *think* I will get a similar looking 75-base read:


              T23122102020....

              Plus a base-space read. The manual says:

              If an ECC primer round has been performed, the XSQ output also includes the sequence information in base space, in addition to color space.
              I am hoping that, because of the ECC, the color-space is also re-calibrated every 5 bases. Then I can be almost guaranteed that I can go from color-space to base-space with 100% accuracy. The 3' end will be near perfect. Plus I will have all of the advantages of color-space.

              --------------

              SAET could be used but with ECC I don't think that it will be that useful. SAET basically does correction depending on the other reads in the data set. Why have two correction methods? SAET itself, unless they have improved it, is a dreadfully slow program. It was meant for bacterial genome sizes and not the larger datasets that we seem to use these days.


              --------------

              Once again, not having played around with an actual 5500 data-set my comments are theory not practice.

              Comment


              • #8
                Save yourself a lot of trouble. Go with Illumina HiSeq. The reads are longer, the paired-end reads are longer. If you are doing de novo assembly of transcriptomes, then longer read length will help.

                The LifeScope software is buggy. This is from experience and frustration with working with ABI. You'll spend more time/money in the long run than if you go with Illumina.

                Comment


                • #9
                  Well let's be honest. To get the best assembly, you probably want to pair GAIIx PE long reads, with some 454 data. I haven't touched an ECC dataset to know if the high Q scores will make a difference. I would agree you should still try to assemble in colourspace.

                  Comment


                  • #10
                    PE 5500xl transcriptome datasets are available through solidsoftwaretools (http://solidsoftwaretools.com/gf/project/5500wtdataset/).

                    I think you should compare (cost wise) SOLiD PE to HiSeq SE 100 bp reads. HiSeq will give you more reads per lane, higher accuracy and perhaps more importantly more assembly options.

                    Comment


                    • #11
                      Originally posted by westerman View Post
                      And why not? I can guess but it would have been nice to have a reason.

                      Color-space issues?

                      Length issues?

                      Quality issues?

                      Cost issues?

                      Number of reads issue? .... unlikely for the SOLiD!
                      I find doing a reference assembly to be satisfactory for confirming annotations or maybe extending previous annotations, but not really great for a completely new reference assembly. It might just be how we prepared the libraries though.

                      I've tried doing de novo assembly with SOLiD 4 50bp single and paired-end reads before with clc cell, velvet, abyss and none of them gave me any decent results. I have run them through SAET. To be fair, I've only done them with single libraries (~20 mil reads). The results with single libraries were so poor that I didn't bother with more reads. Perhaps that was a mistake.

                      I can't really pinpoint the reason for the poor assembly results. Statistical methods that work on nucleotide reads do not work on color space reads? Nucleotide alignments are relatively straightforward, but colorspace alignments are completely different. The usual mismatch rules don't apply. I understand the color space assemblers do try to account for that, but just changing the penalties doesn't seem like the right way to go.

                      For our last transcriptome assembly, we ended up combining a de novo 454 assembly with a reference ABI assembly. I've run our ABI reads on the new Tophat + Cufflinks software and that seems to be working pretty well actually.

                      We just started running our 5500. Maybe it will give us better results. I would be interested to see any papers that have successfully de novo assembled an ABI SOLiD data set.
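
As an illustration of why the usual mismatch rules don't carry over, the Python sketch below encodes two made-up sequences that differ by a single base and compares their colors. The sequences, the primer base, and the function names are all hypothetical.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; a color is the XOR of adjacent base codes

def encode(seq, primer="T"):
    """Encode a base sequence into color digits, starting from a primer base."""
    prev = BASES.index(primer)
    colors = []
    for base in seq:
        cur = BASES.index(base)
        colors.append(str(prev ^ cur))
        prev = cur
    return "".join(colors)

def color_diffs(a, b):
    """Positions at which two equal-length color strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

ref = "ACGGTCA"
snp = "ACGATCA"  # single base change (G -> A) at index 3

print(encode(ref))  # 3130121
print(encode(snp))  # 3132321
print(color_diffs(encode(ref), encode(snp)))  # [3, 4]
```

A real single-base difference shows up as two adjacent color changes, while an isolated single color change cannot be explained by a SNP and is more likely a miscall. So a color-space aligner or assembler has to score pairs of adjacent colors together, which is more than re-weighting a per-position mismatch penalty.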

                      Comment


                      • #12
                        Originally posted by damiankao View Post
                        I've tried doing de novo assembly with SOLiD 4 50bp single and paired-end reads before with clc cell, velvet, abyss and none of them gave me any decent results. I have run them through SAET. To be fair, I've only done them with single libraries (~20 mil reads). The results with single libraries were so poor that I didn't bother with more reads. Perhaps that was a mistake.
                        [...]
                        I can't really pinpoint the reason for the poor assembly results. Statistical methods that work on nucleotide reads do not work on color space reads? Nucleotide alignments are relatively straightforward, but colorspace alignments are completely different. The usual mismatch rules don't apply. I understand the color space assemblers do try to account for that, but just changing the penalties doesn't seem like the right way to go.
                        Alright, critical issue here: library construction. That is, it isn't the instrument at fault, it is the library construction kit.

                        The SOLiD Whole Transcriptome library construction kit is not designed with de novo assembly in mind. For RNAseq (counts) it performs well and is strand-specific.

                        However it relies on a double stranded RNAse (RNAseIII) for RNA fragmentation. Ambion (who designed the kit) jiggered the reaction conditions to allow it to fragment ssRNA, but it is highly biased in its cleavage preferences.

                        If you used a less-biased fragmentation method, or just used another kit altogether, then de novo transcriptome assemblies from SOLiD, I am sure, would work fine.

                        --
                        Phillip

                        Comment


                        • #13
                          I really appreciate everyone's thoughts on this matter. Really interesting discussion. Brings up a couple more questions for me though.

                          1) In response to Rick's comments:

                          "Do the assembly work in color-space and then in the final base-space conversion use the base-space information to make sure you do not go "off-frame". Paired-end is useful and can help detect the splice sites ... but for assembly work I would want that base-space information. For mapping I'd go with paired-ends. For de-novo genome assembly I'd go with fragments plus, if possible, mate-pairs."

                          How do you make sure you're not "off-frame"? Seems difficult since it'd be a de novo assembly. Is there a symptomatic signal for that kind of "frame-shift"? Also, I hear the mate-pair library construction is a challenge, anyone tried it before? Seems like if I did that though, I'd lose all the advantages of paired end info for identifying alternative spliced transcripts (although I agree about not being too sure about how useful that extra 35 bp read is - not much of an insert as well).

                          2) In response to Phillip's comments:

                          Couple things. First, I think SOLiD has a new transcriptome library prep using "chemical hydrolysis" which they suggest for de novo-type applications, so that might be promising. They still suggest using the enzymatic approach for transcript quantification. Phillip, do you have any more detail on how the cleavage is biased? A pointer to a reference would be fine if it's too much trouble to write out. Overall though, to be honest, I'd rather not have to use a separate method for both de novo assembly and RNA-seq if possible. Also, thanks for the thoughts on the reverse read chemistry. We were led to believe (either by our own misguided hopes or ABI, we can't remember) that the paired end runs would use ECC as well and were disappointed (perhaps "jaded" is a better term to use) to find out that wasn't the case. I'll actually be in contact with our rep and we are doing a bunch of training soon, so I'll be sure to harass them about whether or not they've been able to offer any improvements on that product.

                          3) In response to Damiankao:

                          I've done some similar things playing around with velvet and oases trying to assemble libraries of similar read size from some previous preliminary runs on the SOLiD 4. To be honest, I wonder if they are even big enough to do much assembly. I suppose for common transcripts they should work, but the differences in coverage between those and less common transcripts are gonna make it a little rough, I think. We had planned on using a de novo 454 assembly in one species and RNA-seq data in other closely related species aligned to that reference, but I was worried about too much species-specific divergence from the reference introducing all kinds of weird issues in our analysis (particularly things like novel alternatively spliced product misidentification or too much SNP variation between species messing up the mapping, etc.). I've noticed that our mapping between the most distantly related species (using Bowtie defaults - I'm still playing with options) is around 40%, which isn't terrible to be honest. Haven't run it through the cufflinks pipeline yet, but will be interested to do so and see if we can pick up tissue and species specific differences. I'm still a little curious about your assemblies though: did you use the SAET and ASiD tools to get everything in base-space at the end? I haven't yet and was wondering what people thought of them.

                          4) In response to others:

                          Great points about Illumina, especially about flexibility of the data and what you're able to get back in terms of paired end data, but I still wonder about the quality of data I'm getting from them. Last I looked, their literature said that 10-20% of their data was below Q30 so error rates seem to be quite high. Anyone want to ring in about that? Seems like that'd be a big issue in short-read de novo assembly. I'm also curious about LifeScope and if it's useful at all. Bioscope was abysmal as far as I am concerned so I was hoping for an improvement this time around. One more thing on the bioinformatics side: has anyone done a de novo assembly with 454 and SOLiD color-space? What'd you use and how'd that go, if so?
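
For reference when weighing those Q30 figures, a Phred score Q corresponds to a per-base error probability of 10^(-Q/10). The Python sketch below does that arithmetic for a hypothetical 100-base read; the quality profile is invented for illustration and is not taken from any vendor literature.

```python
def phred_to_error(q):
    """Per-base error probability implied by Phred score q."""
    return 10 ** (-q / 10)

def expected_errors(quals):
    """Expected number of miscalled bases in a read with these Phred scores."""
    return sum(phred_to_error(q) for q in quals)

for q in (20, 30, 40):
    print(q, phred_to_error(q))

# Hypothetical read: 85 bases at Q30 and 15 bases at Q20.
read_quals = [30] * 85 + [20] * 15
print(expected_errors(read_quals))
```

Under these assumed numbers the read carries roughly a quarter of an expected error in total, so "below Q30" on its own does not say much until you know how far below and where in the read.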

                          As I mentioned before, we have some training and such for the 5500 coming up soon and you folks have given me a bunch of interesting topics to address with them. If you have anything else you're curious about, pass it on and I'll see if I can't get any more info out of the folks from Life.

                          Comment


                          • #14
                            Originally posted by JueFish View Post
                            1) In response to Rick's comments:

                            "Do the assembly work in color-space and then in the final base-space conversion use the base-space information to make sure you do not go "off-frame". Paired-end is useful and can help detect the splice sites ... but for assembly work I would want that base-space information. For mapping I'd go with paired-ends. For de-novo genome assembly I'd go with fragments plus, if possible, mate-pairs."

                            How do you make sure you're not "off-frame"? Seems difficult since it'd be a de novo assembly.
                            I probably should have emphasized "... go with fragments with ECC plus ...". No way would I do de-novo without the ECC. ECC will 'drag you back on track' every 5 bases. That should be enough to make sure that the assembly is correct.

                            Is there a symptomatic signal for that kind of "frame-shift"?
                            Not that I know of. Although I suppose you could look for excessive stop codons or other markers that your sequence is no longer making sense.

                            Also, I hear the mate-pair library construction is a challenge, anyone tried it before? Seems like if I did that though, I'd lose all the advantages of paired end info for identifying alternative spliced transcripts (although I agree about not being too sure about how useful that extra 35 bp read is - not much of an insert as well).
                            My comment about mate-pairs was for genome assembly. I agree that for transcript projects mate-pair distance would be too large. Since SOLiD paired-end doesn't provide ECC, fragment with ECC is your only option, in my opinion.

                            ========

                            ... If you have anything else you're curious about, pass it on and I'll see if I can't get any more info out of the folks from Life.
                            Well, I suppose the major question is "have you done a de novo assembly with 5500 data and how well did it turn out?" I am not sure if anyone here on SeqAnswers has done so -- SOLiD 4, yes, but not SOLiD 5500 with 75bp ECC reads -- thus we are just throwing out speculation. Unfortunately the data sets that Chipper references are "... the mapped colorspace output from LifeScope™ v2.0 using default parameters ..." and therefore are not useful to determine how well a de novo assembly would work.

                            Comment


                            • #15
                              Originally posted by JueFish View Post
                              I really appreciate everyone's thoughts on this matter. Really interesting discussion. Brings up a couple more questions for me though.

                              1) In response to Rick's comments:

                              [...]

                              How do you make sure you're not "off-frame"? Seems difficult since it'd be a de novo assembly. Is there a symptomatic signal for that kind of "frame-shift"? Also, I hear the mate-pair library construction is a challenge, anyone tried it before? Seems like if I did that though, I'd lose all the advantages of paired end info for identifying alternative spliced transcripts (although I agree about not being too sure about how useful that extra 35 bp read is - not much of an insert as well).
                              As long as you are in color space, there is no "off-frame". The "frame" issue comes about during deconvolution to base space. Each base deconvoluted serves as the key for deconvolution of the base following it. So a single error propagates 3'-ward. At the very least the ECC chemistry would restrict this propagation to short segments.

                              After assembly, a savvy conversion routine should be able to drastically limit, if not eliminate altogether, conversion framing issues at sufficient read depth. As to whether such a routine has been written, I do not know.
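
To show the kind of routine meant here, below is a toy Python sketch: take a majority vote per column across color reads that are already stacked on the same contig, and only then convert the consensus to base space. It is purely hypothetical. It assumes the reads are pre-aligned and of equal length, that the first base of the contig is known from somewhere (for example an ECC base-space call), and it ignores ties and quality values.

```python
from collections import Counter

BASES = "ACGT"  # A=0, C=1, G=2, T=3; a color is the XOR of adjacent base codes

def consensus_colors(aligned_reads):
    """Majority color in each column of pre-aligned, equal-length color reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )

def decode(first_base, colors):
    """Convert a string of color transitions to bases, given the first base."""
    prev = BASES.index(first_base)
    out = [first_base]
    for color in colors:
        prev ^= int(color)
        out.append(BASES[prev])
    return "".join(out)

# Three made-up reads covering the same stretch; two carry one miscall each.
reads = ["130121", "130321", "100121"]

print(decode("A", reads[1]))                 # single read: wrong from the miscall onward
print(decode("A", consensus_colors(reads)))  # consensus first, then convert
```

With depth, the miscalls get outvoted while still in color space, so the single conversion at the end never sees them. A real routine would also need to handle indels, ties and uneven coverage.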
                              Originally posted by JueFish View Post
                              2) In response to Phillip's comments:

                              Couple things. First, I think SOLiD has a new transcriptome library prep using "chemical hydrolysis" which they suggest for de novo-type applications, so that might be promising.
                              I did notice that the protocol had been overhauled, but not that particular aspect of it. Interesting. However, my point was that anyone trying to assemble pre-SOLiD 5500XL RNAseq data is probably dealing with data generated via RNAseIII-fragmented libraries.
                              Originally posted by JueFish View Post
                              They still suggest using the enzymatic approach for transcript quantification. Phillip, do you have any more detail on how the cleavage is biased?
                              No. If you look at how the reads map in a viewer, you can see that the coverage is not even. But as to its nature, I would hypothesize that RNAseIII's natural role as a dsRNA nuclease would tend to bias it towards regions of secondary structure.
                              Originally posted by JueFish View Post
                              A pointer to a reference would be fine if it's too much trouble to write out. Overall though, to be honest, I'd rather not have to use a separate method for both de novo assembly and RNA-seq if possible. Also, thanks for the thoughts on the reverse read chemistry. We were led to believe (either by our own misguided hopes or ABI, we can't remember) that the paired end runs would use ECC as well and were disappointed (perhaps "jaded" is a better term to use) to find out that wasn't the case. I'll actually be in contact with our rep and we are doing a bunch of training soon, so I'll be sure to harass them about whether or not they've been able to offer any improvements on that product.
                              Good luck on that. It seems to barely work at all; adding another (ECC) twist to it is probably not going to get onto the agenda.
                              Originally posted by JueFish View Post
                              [...]
                              4) In response to others:

                              Great points about Illumina, especially about flexibility of the data and what you're able to get back in terms of paired end data, but I still wonder about the quality of data I'm getting from them. Last I looked, their literature said that 10-20% of their data was below Q30 so error rates seem to be quite high. Anyone want to ring in about that? Seems like that'd be a big issue in short-read de novo assembly.
                              Illumina data assembles fine. If your assembler does not use quality values, trim your data prior to assembly.
                              Is SOLiD data actually less erroneous? Most claims of this sort seem to derive from the fact that with the color space it should be. But this seems facile to me. The raw quality values of a run are no better than an Illumina run.
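
On the trimming suggestion, here is about the simplest possible version as a Python sketch: chop bases off the 3' end until one meets a minimum Phred score. The function name, the threshold of 20 and the example read are assumptions for illustration; real trimmers typically use sliding windows or running sums instead.

```python
def trim_3prime(seq, quals, min_q=20):
    """Drop bases from the 3' end until a base with quality >= min_q is reached.

    seq is the read sequence and quals the matching list of Phred scores.
    Low-quality bases in the middle of the read are left alone.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

seq, quals = trim_3prime("ACGTACGT", [35, 34, 30, 31, 28, 12, 25, 8])
print(seq, quals)  # ACGTACG [35, 34, 30, 31, 28, 12, 25]
```

Only the final Q8 base is removed here; the Q12 base stays because a better base follows it. That is the main weakness of trailing-only trimming.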

                              --
                              Phillip

                              Comment
