  • In Silico read normalization prior to de novo assembly

    I am about to use Trinity for de novo transcriptome assembly prior to differential expression analyses.

    I have 11 individuals (5 control, 6 treated) with 3 tissue types = 33 samples with ~20 million ~80bp single-end reads each (after trimming and QC).... so that's about 660 million single end reads!

    In order to reduce what is likely to be a LONG Trinity run, would you suggest using Trinity's normalization script or something similar (e.g. khmer) prior to assembly?
    Or should I just use a small subset of samples to make the assembly?

    I don't know how much individual genetic variability there is so I'm worried that using a subset for assembly will miss rarer transcripts.

    Does anyone here have any experience with normalization? Are there any downsides to this method over using a subset of samples?

    Any advice or experiences much appreciated!
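    If you do consider the subset route, one cheap pilot is to randomly subsample a fixed number of reads from every sample rather than dropping whole samples, so each individual is still represented. A minimal Python sketch, assuming plain 4-line FASTQ records (the file names and the 2-million-read target are made up for illustration):

    Code:
    import random

    def subsample_fastq(path, n_keep, seed=42):
        """Reservoir-sample n_keep records from a 4-line FASTQ file."""
        random.seed(seed)
        reservoir = []
        with open(path) as fh:
            # zip() over the same handle pulls 4 consecutive lines per read
            for i, record in enumerate(zip(fh, fh, fh, fh)):
                if i < n_keep:
                    reservoir.append(record)
                else:
                    j = random.randint(0, i)
                    if j < n_keep:
                        reservoir[j] = record
        return reservoir

    # Hypothetical sample files; keep 2 million reads from each.
    for sample in ["control_1.fq", "treated_1.fq"]:
        kept = subsample_fastq(sample, 2_000_000)
        with open(sample + ".sub.fq", "w") as out:
            out.writelines(line for record in kept for line in record)

    Note that uniform subsampling thins rare transcripts along with abundant ones, which is exactly what normalization tries to avoid, so this is more a quick pilot than a substitute for it.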

  • #2
    It seems Trinity's in silico read normalization hasn't been published.



    • #3
      the following links may be helpful:

      DigiNorm on Paired-end samples


      What is digital normalization, anyway?
      I'm out at a Cloud Computing for the Human Microbiome Workshop and I've been trying to convince people of the importance of digital normalization....


      Digital normalization of short-read shotgun data
      We just posted a pre-submission paper to arXiv.org: A single pass approach to reducing sampling variation, removing errors, and scaling de novo...


      Basic Digital Normalization


      A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data
      Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified single-cell genomes, and metagenomes has enabled investigation of a wide range of organisms and ecosystems. However, sampling variation in short-read data sets and high sequencing error rates of modern sequencers present many new computational challenges in data interpretation. These challenges have led to the development of new classes of mapping tools and de novo assemblers. These algorithms are challenged by the continued improvement in sequencing throughput. We here describe digital normalization, a single-pass computational algorithm that systematizes coverage in shotgun sequencing data sets, thereby decreasing sampling variation, discarding redundant data, and removing the majority of errors. Digital normalization substantially reduces the size of shotgun data sets and decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting content of the generated contigs. We apply digital normalization to the assembly of microbial genomic data, amplified single-cell genomic data, and transcriptomic data. Our implementation is freely available for use and modification.


      What does Trinity's In Silico normalization do?
      This post can be referenced and cited at the following DOI: http://dx.doi.org/10.6084/m9.figshare.98198. For a few months, the Trinity list was...
      Last edited by pengchy; 10-16-2013, 12:06 AM.
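
      For anyone skimming, the core idea behind those links: a read is kept only if the median count of its k-mers seen so far is below a coverage cutoff, so over-represented regions get thinned while rare ones are retained. A rough single-pass sketch of that idea in Python (k = 20 and a cutoff of 20 are the commonly quoted defaults; khmer itself uses a probabilistic count-min sketch rather than the exact dictionary used here, so treat this as illustration only):

      Code:
      from statistics import median

      K = 20        # k-mer size
      CUTOFF = 20   # keep a read only if its median k-mer count so far is below this
      counts = {}   # exact counts for illustration; khmer uses a probabilistic sketch

      def keep_read(seq):
          """Single-pass decision: keep the read only if it adds new coverage."""
          kmers = [seq[i:i + K] for i in range(len(seq) - K + 1)]
          if not kmers:
              return False
          if median(counts.get(k, 0) for k in kmers) >= CUTOFF:
              return False                  # region already well covered; drop read
          for k in kmers:
              counts[k] = counts.get(k, 0) + 1
          return True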



      • #4
        Thanks pengchy!

        I have read all of those already and was wondering if anyone has any experience with their own data?

        It has also been suggested to me to take all the reads from 1 individual (all tissue types) and assemble as there may be too much ambiguity with using multiple individuals.

        The samples are clutch mates (frogs), but not inbred lines, so there will be some variability and heterozygosity between them. But the question with that option is: which individual? Control or treated?

        Any thoughts from you knowledgeable lot on seqanswers much appreciated!!



        • #5
          Hi Amy,

          I am preparing to do this work too and will be glad to exchange experiences with you here when I finish the test.

          best,
          pch



          • #6
            Great thanks!

            I am running Trinity's method at the moment (I would have liked to use Titus Brown's more efficient version of Trinity's method, but I'm waiting for that to be installed) - it's been running for 2 days now.

            I gave it 100 GB RAM and 10 CPUs - which seems to have been OK for Jellyfish reading k-mer occurrences. It's been writing the .stats file for a looooong time now, but it's not maxing out the memory and is only using 1 CPU.
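
            For reference, a run like that is launched with the normalization script bundled with Trinity. The script and flag names below are my assumptions based on Trinity releases from around that time, so check the usage output of your own installation before copying anything:

            Code:
            import subprocess

            # Assumed invocation of Trinity's bundled normalization script;
            # verify the script name and flags against your own installation.
            cmd = [
                "insilico_read_normalization.pl",
                "--seqType", "fq",             # FASTQ input
                "--JM", "100G",                # Jellyfish memory, matching the 100 GB above
                "--max_cov", "30",             # target maximum k-mer coverage
                "--single", "all_samples.fq",  # hypothetical pooled single-end reads
                "--CPU", "10",
                "--output", "normalized_reads",
            ]
            subprocess.run(cmd, check=True)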



            • #7
              I find that Trinity's normalization takes a long time to run. It almost defeats the purpose of normalization in the first place. Days of run time -- yep. We need to fix this some day.



              • #8
                If anyone is interested:

                Trinity normalization on ~854 million reads took about 2 days on a high-memory machine (I gave it 300 GB memory and 40 cores).

                Got it down from 854 million to just 66 million reads!



                • #9
                  That's great that you got the number of reads reduced, but how is that reduction expected to improve performance on Trinity? Will it cut the time down considerably (enough to justify normalization)?

                  Best regards and great to find others working on similar projects!



                  • #10
                    Well, that's the part I'd like others' experiences on! This is my first de novo assembly.

                    Assembling the normalized reads (using the same number of cores, memory, etc.) took less than a day. I'm running the full 854 million now to see how the assemblies compare - it's been going for 2 days already.

                    It was also suggested that I try assembling using all tissues from 1 individual (but be careful in downstream analysis, as this individual will map back better to the assembly), since variability between individuals could create ambiguity in the assembly.

                    I tried this: all-samples normalized assembly N50 = 1596; single-individual assembly N50 = 2029. Bowtie mapping of reads back to the assembly: normalized = 80.65%, individual = 79.02%. I'm currently BLASTing to see which annotates better.

                    Does anyone have any other thoughts on how to test which is the "best" assembly?
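
                    One thing that helps when comparing the two is to recompute stats like N50 straight from each contig FASTA, so both assemblies are measured exactly the same way. A small Python sketch (the FASTA path is just a placeholder):

                    Code:
                    def n50(lengths):
                        """Length L such that contigs of length >= L cover half the assembly."""
                        lengths = sorted(lengths, reverse=True)
                        half = sum(lengths) / 2
                        running = 0
                        for length in lengths:
                            running += length
                            if running >= half:
                                return length

                    def contig_lengths(fasta_path):
                        """Yield the length of each sequence in a FASTA file."""
                        length = 0
                        with open(fasta_path) as fh:
                            for line in fh:
                                if line.startswith(">"):
                                    if length:
                                        yield length
                                    length = 0
                                else:
                                    length += len(line.strip())
                        if length:
                            yield length

                    print(n50(contig_lengths("Trinity.fasta")))  # placeholder path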



                    • #11
                      I'm sorry I don't have any answers for you. I've got ~270 million reads, so I'm not doing the normalization step for this run, but I will continue to watch this thread to see how your experiment comes out in the end. I'll be posting a question about installation of Trinity with regard to Jellyfish... feel free to take a peek, maybe it's something you encountered?



                      • #12
                        In the end, the assembly of the full set of reads took only about 3 days - so 2 days of normalizing plus 1 day of assembly amounts to little or no saving in time.

                        The full-read assembly gave rise to only marginally more contigs (~455,000 vs ~447,000 from normalized reads) and a lower N50 (1227 vs 1596).

                        I think Titus Brown's version of Trinity's method (which unfortunately I have not yet been able to get installed on our machines) probably does make normalizing worth it for a read set of my size.



                        • #13
                          FYI for those with access to more computing power - I used 24 cores and 119 GB on my 270 million reads without normalization and finished in 1 day.
                          But it may also have gone a bit faster because I ran it with the --min_kmer_cov 2 parameter.



                          • #14
                            map back to the assembly

                            Hello everybody, interesting discussion.
                            Here we used Trinity on 10 samples: 5 tissues from 2 animals, one sick and one not. Total reads were >600M, and on a 'big machine' (sorry, not sure of the RAM and cores) it took <3 days.

                            The problem is that when I map the reads back to the contigs as suggested, only 30% map back!! Any clue? Does anyone else have this problem? Is this an issue, or is it normal due to tissue diversity?

                            Thanks for your help!
                            eppi
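
                            One quick check on a number like 30% is to recount the mapping rate directly from the alignment records rather than relying on the aligner's summary. A minimal sketch that tallies primary records in a SAM file (the file name is a placeholder):

                            Code:
                            def mapping_rate(sam_path):
                                """Fraction of primary SAM records whose reads are mapped."""
                                total = mapped = 0
                                with open(sam_path) as fh:
                                    for line in fh:
                                        if line.startswith("@"):   # header line
                                            continue
                                        flag = int(line.split("\t")[1])
                                        if flag & 0x100:           # secondary alignment
                                            continue
                                        total += 1
                                        if not flag & 0x4:         # 0x4 = read unmapped
                                            mapped += 1
                                return mapped / total if total else 0.0

                            print(f"{mapping_rate('reads_vs_contigs.sam'):.1%}")  # placeholder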



                            • #15
                              Hi eppi,

                              I'm afraid I don't have that problem but out of interest, how many transcripts and components did you get?

                              I have produced ~447,000 transcripts (~350,000 components) - this seems far too many. I'm worried it's from pooling tissues and individuals together for assembly. Has anyone else got such a large number?? Any suggestions on how to reduce redundancy?
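
                              One way to see how much of that count is isoform/fragment redundancy rather than distinct loci is to tally transcripts per component from the FASTA headers. The sketch below assumes the compN_cM_seqK header style Trinity used at the time (check your own headers first):

                              Code:
                              from collections import Counter

                              def transcripts_per_component(fasta_path):
                                  """Count transcripts per component from >compN_cM_seqK headers."""
                                  per_component = Counter()
                                  with open(fasta_path) as fh:
                                      for line in fh:
                                          if line.startswith(">"):
                                              name = line[1:].split()[0]             # e.g. comp1234_c0_seq2
                                              component = name.rsplit("_seq", 1)[0]  # -> comp1234_c0
                                              per_component[component] += 1
                                  return per_component

                              counts = transcripts_per_component("Trinity.fasta")  # placeholder path
                              print(len(counts), "components;",
                                    sum(v > 1 for v in counts.values()), "with more than one transcript")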

