  • merge two Trinity transcriptome assemblies into one

    Hi All

    I have got two transcriptomes from two populations of a Drosophila species, both assembled using Trinity.
    How can I merge those into one final comprehensive transcriptome?
    Please note that I have never used Trinity myself yet, but I am willing to do it if required.

    Thanks for your help!

    Cheers
    Dam

  • #2
    The best way is to go back to square one. Combine the reads from both populations and run Trinity on the combination.


    • #3
      That would be my best advice too.

      As the assembly is based on the reads you're feeding to Trinity, it is probably more appropriate to combine all your reads and make a new assembly.
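The pooled-reads suggestion above can be sketched as a small Python helper that builds a single Trinity command covering both populations. This is only a sketch under assumptions: the file names, output directory, memory and CPU values are hypothetical, and the flags shown (`--seqType`, `--left`, `--right`, `--max_memory`, `--CPU`, `--output`) follow recent Trinity releases, which accept comma-separated file lists. Older releases used a different memory option, so check `Trinity --help` for the installed version.

```python
import shlex


def trinity_pooled_command(left_files, right_files, out_dir="trinity_pooled",
                           max_memory="50G", cpu=8):
    """Build one Trinity command that assembles several paired-end libraries together.

    All default values are placeholders, not recommendations.
    """
    if len(left_files) != len(right_files):
        raise ValueError("each left-read file needs a matching right-read file")
    return [
        "Trinity",
        "--seqType", "fq",
        # Trinity takes multiple libraries as comma-separated lists (no spaces).
        "--left", ",".join(left_files),
        "--right", ",".join(right_files),
        "--max_memory", max_memory,
        "--CPU", str(cpu),
        "--output", out_dir,
    ]


# Hypothetical read files for the two populations.
cmd = trinity_pooled_command(
    ["popA_1.fq", "popB_1.fq"],
    ["popA_2.fq", "popB_2.fq"],
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.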

      Also, if you're working on Drosophila, you might want to have a look at the genome-guided pipeline offered by Trinity. I'm not working on Drosophila, but it seems to me that you could easily use a well-annotated Drosophila genome to improve your assembly.
      Here's the link: http://trinityrnaseq.sourceforge.net...d_trinity.html


      • #4
        Yup, I would also combine the reads and redo the Trinity assembly.
        If that isn't an option for some reason, then you can try cd-hit-est to group transcripts that are the same between both assemblies.
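As a rough illustration of the cd-hit-est fallback, the helper below builds a clustering command for a FASTA file that already contains both assemblies. The file names and the 0.95 identity threshold are assumptions chosen for the example, not values from this thread; `-n 10` is the word size the CD-HIT user guide suggests for thresholds of 0.95 and above, and `-M 0` lifts the memory limit.

```python
def cd_hit_est_command(merged_fasta, out_fasta, identity=0.95,
                       threads=4, memory_mb=0):
    """Build a cd-hit-est command that clusters near-identical transcripts.

    identity, threads and memory_mb are illustrative defaults only.
    """
    if not 0.95 <= identity <= 1.0:
        # Lower thresholds need a smaller word size (-n); keep the sketch simple.
        raise ValueError("this sketch only covers identity thresholds >= 0.95")
    return [
        "cd-hit-est",
        "-i", merged_fasta,      # input: both assemblies in one FASTA file
        "-o", out_fasta,         # output: one representative per cluster
        "-c", str(identity),     # sequence identity threshold
        "-n", "10",              # word size suited to thresholds 0.95-1.0
        "-T", str(threads),
        "-M", str(memory_mb),    # 0 means no memory limit
    ]


# Hypothetical file names.
cmd = cd_hit_est_command("both_assemblies.fasta", "both_assemblies.cdhit.fasta")
print(" ".join(cmd))
```

Transcripts from the two populations that diverge by more than the chosen threshold will stay in separate clusters, so the threshold is worth tuning against how close the strains are.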


        • #5
          You might get lucky, but depending on how close your strains are you might also get an assembly that is worse than either of the two individual assemblies, if you use the pooled approach.


          • #6
            Thanks guys!
            Last comment from rskr concerns me actually.

            Wouldn't it be better to combine transcripts from the two pools by matching their translations, i.e. using protein sequences?
            What tools would allow me to do that?
            Thanks again


            • #7
              While a combined assembly might be slightly worse at a few divergent loci, you will at least get an assembly that will allow you to do some comparisons. Merging two independent assemblies will run into most of the same problems as producing a single assembly, plus some extra ones.

              If I were doing the same on a species with a poorly annotated genome, I would try a merged assembly and also try grouping the two assemblies using cd-hit-est (as for comparing translations of the two, you could use tblastx). Then I would pick the method that gives the 'best' results.

              But Drosophila is a fantastically annotated species, why not map against the genome using a gff file for gene locations?


              • #8
                Hi Jeremy

                Unfortunately there is no sequenced genome for this species yet.
                The closest sequenced genome is D. pseudoobscura, which is already 9% divergent (just at coding sequences).
                Thanks.
                I will try both ways then.


                • #9
                  I've tried assembling combined reads vs. assembling separate reads for several species, and combining reads has always given a more contiguous and accurate assembly.

                  If you have > 200 million combined reads, I strongly recommend using Trinity In-Silico Normalization, which will give a good assembly in much shorter run time when compared to non-normalized assembly.


                  • #10
                    If you have two completed assemblies, the easiest thing is to merge them with an overlap-layout-consensus (OLC) assembly rather than going back to square one as others have suggested.

                    I would use cap3 after concatenating the two FASTA files together, and then use GapFiller with the full set of reads to see if you can improve contiguity any further. Redoing the whole assembly is more likely to lead to more assembly artefacts.
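One practical detail when concatenating two Trinity FASTA files is that both assemblies may reuse the same contig identifiers, which would collide in the merged file. A minimal sketch of a concatenation step that tags each record with its population of origin is shown below; the `popA`/`popB` prefixes and the example identifiers are made up for illustration.

```python
import io


def tag_fasta(handle, prefix):
    """Yield FASTA lines with `prefix` prepended to every record identifier."""
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield f">{prefix}_{line[1:]}"
        elif line:
            yield line


def merge_assemblies(named_handles, out_handle):
    """Write all records to out_handle; return the number of records written."""
    count = 0
    for prefix, handle in named_handles:
        for line in tag_fasta(handle, prefix):
            if line.startswith(">"):
                count += 1
            out_handle.write(line + "\n")
    return count


# Tiny in-memory stand-ins for the two assembly files (hypothetical records).
pop_a = io.StringIO(">comp0_c0_seq1 len=8\nACGTACGT\n")
pop_b = io.StringIO(">comp0_c0_seq1 len=6\nTTGGCC\n")
merged = io.StringIO()
n_records = merge_assemblies([("popA", pop_a), ("popB", pop_b)], merged)
print(merged.getvalue(), end="")
```

With real files, the handles would come from `open(...)`, and the merged file could then be given to CAP3 (for example `cap3 merged.fasta`), which writes its contigs and singlets next to the input file.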


                    • #11
                      @Blahah404. I am not sure if your point applies to RNAseq (aka transcriptome) projects. For genome projects, sure, combine assemblies in order to increase contig length; ideally you would end up with chromosome size contigs. But for RNAseq projects we generally have enough read depth to make full length contigs. What we lack are the rare transcripts and alternative splicing. For that we need as many reads as possible so that the data is not lost in the noise. Thus combining read sets is a good idea.

                      To put it in very simple terms: assume that there is a rare transcript that is expressed once in sample A and once in sample B. The assembly process might very well throw away that rare transcript because it is indistinguishable from noise (i.e., spurious machine-error reads that are found only once). However, combine the two data sets and that rare transcript will be found twice, thus bringing it out from the noise.
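The rare-transcript argument can be illustrated with a toy k-mer count. This is a deliberately simplified model, not Trinity's actual filtering: it assumes an assembler that discards every k-mer seen fewer than `MIN_COUNT` times, and the reads, `K` and `MIN_COUNT` are invented for the example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kept_kmers(reads, k, min_count):
    """Return the k-mers that survive a minimum-count filter."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n >= min_count}


K, MIN_COUNT = 5, 2          # toy values, chosen only for the illustration
rare = "ACGTTGCA"            # one read from a rare transcript, seen once per sample
sample_a = ["GGGGGCCCCC", "GGGGGCCCCC", rare]
sample_b = ["TTTTTAAAAA", "TTTTTAAAAA", rare]
rare_kmers = set(kmer_counts([rare], K))

kept_a = kept_kmers(sample_a, K, MIN_COUNT)
kept_b = kept_kmers(sample_b, K, MIN_COUNT)
kept_pooled = kept_kmers(sample_a + sample_b, K, MIN_COUNT)

print("rare k-mers kept in A alone:", bool(rare_kmers & kept_a))
print("rare k-mers kept in B alone:", bool(rare_kmers & kept_b))
print("rare k-mers kept when pooled:", rare_kmers <= kept_pooled)
```

In this model the rare read falls below the threshold in each sample on its own and clears it only once the two read sets are pooled.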

                      @Dampor. You could use the protein translations to combine data sets. It is probably superior to combining nucleotide assemblies. But, as above, you may lose the 'power' to resolve low-expression transcripts and may not be able to determine alternative splicing.


                      • #12
                        @westerman yes, I'm talking about de-novo transcriptome assemblies. In most cases assemblers produce full-length contigs for only a fraction (~40-60%) of transcripts that are represented, at least in our tests with plant species (we've got ~960 species sequenced for 1KP). Fragmentation is a problem, and post-assembly OLC and gapfilling improves quantification accuracy as well as the ability to analyse UTRs.

                        I agree with you that alternative splicing and low-abundance transcript information could be lost with the strategy I suggested - whether that matters depends on the purpose of the assembly. But I disagree that a crude pooling of the reads is the best strategy. By pooling a larger set of reads you also pool the errors - doubling sequencing depth increases the number of true positive assemblies up to a point, and increases false positives too (e.g. false chimeras and false-bubble isoforms). You can't distinguish novel, low abundance isoforms from high-abundance errors.

                        An intermediate strategy with the benefits of both would be to do the OLC + gapfiller merge as I suggested, then to pool the reads and filter out pairs mapping concordantly to the merged contigs. That leaves you with the set of reads that was not included in the original assembly. You could then do a second merged assembly of the contigs with the unused reads, preserving the contigs you've already assembled and harnessing any pooled gain in abundance for transcripts that were too low abundance in either sample to assemble the first time round.
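The read-filtering step of this intermediate strategy could look roughly like the sketch below, which scans SAM records from mapping the pooled reads to the merged contigs and returns the names of pairs that did not map concordantly. It assumes the aligner sets the SAM "properly paired" FLAG bit (0x2) for concordant pairs; the example records are fabricated purely to exercise the function.

```python
import io

PAIRED = 0x1        # SAM FLAG bit: template has multiple segments
PROPER_PAIR = 0x2   # SAM FLAG bit: each segment properly aligned per the aligner


def unused_read_names(sam_lines):
    """Return names of read pairs with no concordant alignment to the contigs."""
    seen, concordant = set(), set()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seen.add(name)
        if flag & PAIRED and flag & PROPER_PAIR:
            concordant.add(name)
    return seen - concordant


# Made-up SAM records: pair1 concordant, pair2 unmapped, pair3 split across contigs.
example_sam = io.StringIO(
    "@HD\tVN:1.6\n"
    "pair1\t99\tcontig1\t1\t60\t4M\t=\t10\t13\tACGT\tIIII\n"
    "pair1\t147\tcontig1\t10\t60\t4M\t=\t1\t-13\tACGT\tIIII\n"
    "pair2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    "pair2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    "pair3\t65\tcontig1\t1\t60\t4M\tcontig2\t5\t0\tACGT\tIIII\n"
    "pair3\t129\tcontig2\t5\t60\t4M\tcontig1\t1\t0\tACGT\tIIII\n"
)
leftover = unused_read_names(example_sam)
print(sorted(leftover))
```

The returned names would then be used to pull the matching pairs out of the original FASTQ files for the second round of assembly.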

                        edit: of course, if you're doing a reference-guided assembly, or a de-novo assembly for a species with a reference genome, you have the luxury of not worrying too much about artefacts because you can identify them using the genome. My comments only apply for de-novo assembly of a species with no available genome sequence.
                        Last edited by Blahah404; 09-23-2013, 11:37 AM.


                        • #13
                          @Blahah404: 960 species is an order of magnitude more than I have seen come through our lab, so I will defer to your experience. I'll give your CAP3/GapFiller method a try sometime and see how it compares to a full de novo assembly.


                          • #14
                            Thank you guys,

                            I have decided to try both approaches.
                            My results will follow in this thread.

                            @westerman, also, what tool will allow me to combine data sets using protein translations?

                            Cheers


                            • #15
                              Originally posted by Dampor:
                              Thank you guys,
                              @westerman, also, what tool will allow me to combine data sets using protein translations?

                              Not sure. I've never tried it. But it seemed like an interesting approach.
