  • Ray Meta: scalable de novo metagenome assembly and profiling

    Ray Meta: scalable de novo metagenome assembly and profiling
    Genome Biology 2012, 13:R122 doi:10.1186/gb-2012-13-12-r122

    Voluminous parallel sequencing datasets, especially metagenomic experiments, require distributed computing for de novo assembly and taxonomic profiling. Ray Meta is a massively distributed metagenome assembler that is coupled with Ray Communities, which profiles microbiomes based on uniquely-colored k-mers. It can accurately assemble and profile a three billion read metagenomic experiment representing 1,000 bacterial genomes of uneven proportions in 15 hours with 1,024 processor cores, using only 1.5 GB per core. The software will facilitate the processing of large and complex datasets, and will help in generating biological insights on specific environments. Ray Meta is open source and available at http://denovoassembler.sf.net.

  • #2
    Ray Meta

    How do I include genomes other than the bacteria that are found in the NCBI-taxonomy directory that your script generates? I could drop the fasta file into a folder however...

    Is there an easy way to include the taxonomy information about the genomes I add? You added Human in the paper, but if I wanted to include multiple species whose taxonomy is known, do I have to do this manually, or is there a tool that can help me achieve this?

    Also, I am interested in not just obtaining the abundances but also assigning the scaffolds to particular species or other level in the taxonomy. Does Ray output the scaffold to taxon information somewhere?

    One last question.
    If I have an assembly from, say, Trinity, can I run the assembly through Ray-Meta and have it return abundances based on the transcripts themselves? How dependent is the algorithm on the assembly having been done beforehand? Can I feed Ray-Meta a kmer graph?


    Thanks and really excited to use this tool.
    Last edited by severin; 02-20-2013, 01:08 PM.



    • #3
      Hi,

      Originally posted by severin View Post
      How do I include genomes other than the bacteria that are found in the NCBI-taxonomy directory that your script generates?

      Genome-to-Taxon.tsv has 2 columns (tab-separated): GenBankIdentifier taxonIdentifier.

      Both are integers.

      So you need to append entries to this file.

      See https://github.com/sebhtml/ray/blob/...n/Taxonomy.txt
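Appending entries can be scripted. The following is a minimal sketch, not part of Ray: the function name `append_genome_taxon` is hypothetical, and only the two-column, tab-separated integer layout comes from the description above.

```python
from pathlib import Path


def append_genome_taxon(tsv_path, pairs):
    """Append (GenBank identifier, taxon identifier) rows, skipping duplicates.

    Hypothetical helper, not part of Ray. Returns the number of rows added.
    """
    path = Path(tsv_path)
    text = path.read_text() if path.exists() else ""
    existing = {tuple(line.split("\t")) for line in text.splitlines() if line}

    new_rows = []
    for genbank_id, taxon_id in pairs:
        # Both columns must be integers; int() raises ValueError otherwise.
        row = (str(int(genbank_id)), str(int(taxon_id)))
        if row not in existing:
            existing.add(row)
            new_rows.append(row)

    with path.open("a") as handle:
        # Make sure the first appended row starts on its own line.
        if text and not text.endswith("\n"):
            handle.write("\n")
        for row in new_rows:
            handle.write("\t".join(row) + "\n")
    return len(new_rows)
```

A call such as `append_genome_taxon("NCBI-taxonomy/Genome-to-Taxon.tsv", [(123456, 9606)])` would add one row, where 123456 is a made-up GenBank identifier and 9606 is the NCBI taxon identifier for human.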

      Originally posted by severin View Post
      I could drop the fasta file into a folder however...
      Indeed, sequences deposited in directories that you provide to Ray with the -search option
      will be picked up by Ray Communities plugins.

      Originally posted by severin View Post

      Is there an easy way to include the taxonomy information about the genomes I add?
      No, you need to add one line for each relationship you desire.

      Originally posted by severin View Post
      You added Human in the paper, but if I wanted to include multiple species whose taxonomy is known, do I have to do this manually, or is there a tool that can help me achieve this?
      Well, because what people want to add in this system can come from various sources (not
      just NCBI), it's hard to devise a tool that will be usable and portable for all these sources.

      So I guess your best bet is to write a small tool that does it for you so that you
      don't have to do it manually.
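Such a small tool could look like the sketch below. It assumes that the added FASTA files use the older NCBI header style `>gi|<number>|...` and that this number is the GenBank identifier expected in Genome-to-Taxon.tsv; both assumptions should be checked against Taxonomy.txt. The function names and the example header are hypothetical.

```python
import re

# Assumed header style: ">gi|<GenBank identifier>|..." (older NCBI FASTA files).
GI_PATTERN = re.compile(r"^>gi\|(\d+)\|")


def genbank_ids_from_fasta(lines):
    """Return the GI numbers found in the FASTA header lines, in order."""
    ids = []
    for line in lines:
        match = GI_PATTERN.match(line)
        if match:
            ids.append(int(match.group(1)))
    return ids


def rows_for_fasta(lines, taxon_id):
    """Pair every GI number in one FASTA file with a known taxon identifier."""
    return [(gi, int(taxon_id)) for gi in genbank_ids_from_fasta(lines)]


if __name__ == "__main__":
    example = [
        ">gi|123456|ref|NC_000000.1| hypothetical genome",
        "ACGTACGT",
    ]
    # Prints one tab-separated row per sequence, ready to append to the file.
    for gi, taxon in rows_for_fasta(example, 9606):
        print(f"{gi}\t{taxon}")
```

Headers that do not follow the assumed style are skipped silently, so sequences from other sources would still need their own mapping.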

      If you think that this should be a service provided by Ray, you can file a ticket on GitHub.


      Originally posted by severin View Post

      Also, I am interested in not just obtaining the abundances but also assigning the scaffolds to particular species or other level in the taxonomy. Does Ray output the scaffold to taxon information somewhere?
      The system will identify contigs for you on the basis of the sequences provided with the -search
      options.

      Files:

      Code:
      RayMicrobiomeAnalysis/
          BiologicalAbundances/
              _DenovoAssembly/
                  Contigs.tsv
                  *.CoverageData.xml

              _Coloring/
              _Frequencies/

              NCBI-bacteria-directory/
                  ContigIdentifications.tsv
                  _Files.tsv
                  SequenceAbundances.xml

              NCBI-viruses-directory/
                  ContigIdentifications.tsv
                  _Files.tsv
                  SequenceAbundances.xml
      See https://github.com/sebhtml/ray/blob/...Abundances.txt
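To collect the contig identification files after a run, something like the following sketch may help. It assumes that each directory passed to -search gets its own subdirectory under BiologicalAbundances/ inside the output directory, as the listing above suggests; the function name is hypothetical and the columns of ContigIdentifications.tsv are not interpreted here.

```python
from pathlib import Path


def contig_identification_files(output_dir):
    """Map each searched directory name to its ContigIdentifications.tsv path.

    Assumed layout: <output_dir>/BiologicalAbundances/<search-dir>/ContigIdentifications.tsv
    """
    base = Path(output_dir) / "BiologicalAbundances"
    found = {}
    for path in sorted(base.glob("*/ContigIdentifications.tsv")):
        found[path.parent.name] = path
    return found


if __name__ == "__main__":
    # "RayMicrobiomeAnalysis" is the output directory name used in the listing.
    for name, path in contig_identification_files("RayMicrobiomeAnalysis").items():
        print(name, path)
```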

      Originally posted by severin View Post


      One last question.
      If I have an assembly from, say, Trinity, can I run the assembly through Ray-Meta and have it return abundances based on the transcripts themselves?
      This is a feature that a sizable number of people at my institution want too:
      having Ray build the de Bruijn graph from sequences assembled with other tools,
      to benefit from other capabilities like Ray Communities.

      The Ray C++ API for messages actually supports this, but the plugins that build the de Bruijn graph
      (namely plugin_SequencesLoader, plugin_KmerAcademyBuilder and plugin_VerticesExtractor) are
      working only on reads at the moment.

      Originally posted by severin View Post

      How dependent is the algorithm on the assembly having been done beforehand?
      It's independent. The quantification algorithms work on a colored de Bruijn graph
      and do not really use assembled paths for these computations (aside from what's in
      the files for contig identification, obviously).
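As a toy illustration of quantification on colored k-mers (not Ray's actual implementation), the sketch below colors each reference k-mer with the genomes that contain it and counts only the read k-mers that carry a single color. It ignores reverse complements and coverage normalization, and all names and sequences are hypothetical.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the set of k-mers in a sequence (forward strand only)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def color_kmers(references, k):
    """Map each k-mer to the set of reference names (colors) containing it."""
    colors = defaultdict(set)
    for name, sequence in references.items():
        for kmer in kmers(sequence, k):
            colors[kmer].add(name)
    return colors


def profile(reads, references, k):
    """Count read k-mer occurrences that are colored by exactly one reference."""
    colors = color_kmers(references, k)
    counts = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            owners = colors.get(read[i:i + k])
            if owners is not None and len(owners) == 1:
                (name,) = owners
                counts[name] += 1
    return dict(counts)


if __name__ == "__main__":
    references = {"genomeA": "ACGTACGGA", "genomeB": "TTGACGGAT"}
    reads = ["ACGTAC", "ACGGAT", "TTGACG"]
    # ACGG and CGGA occur in both genomes, so they do not contribute.
    print(profile(reads, references, k=4))
```

No contig appears anywhere in this computation, which is the sense in which the quantification does not depend on assembled paths.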

      Originally posted by severin View Post

      Can I feed Ray-Meta a kmer graph?
      No, this is not possible at the moment.
      But that's something that could be implemented as Ray (and ABySS too)
      supports the Ray Cloud Browser kmer graph format.

      The file format is like this:

      map.csv (ASCII) (called kmers.txt in Ray)

      The file is tab-separated, any line starting with a '#' is a comment.


      A line looks like this.

      GCGGTTATGCTTGCGTCCACCGTAAGTTCGGATTCAGACTTAATCAAAGGTTTTAACAAAGCGCTGGCAACCCCACGGCGGGGGTATTCAG;47;T;G

      See https://github.com/sebhtml/Ray-Cloud...Map-format.txt
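A line of this file could be read with a sketch like the one below. The sample line above is delimited by semicolons, so the sketch splits on `;`. Reading the four fields as sequence, coverage depth, parent nucleotides and child nucleotides is an assumption to verify against Map-format.txt, and the names `KmerRecord` and `parse_kmer_line` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class KmerRecord(NamedTuple):
    sequence: str
    coverage: int
    parents: List[str]   # assumed: nucleotides leading to parent k-mers
    children: List[str]  # assumed: nucleotides leading to child k-mers


def parse_kmer_line(line: str) -> Optional[KmerRecord]:
    """Parse one line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    sequence, coverage, parents, children = line.split(";")
    # Several nucleotides in one field are assumed to be space-separated.
    return KmerRecord(sequence, int(coverage), parents.split(), children.split())


if __name__ == "__main__":
    sample = (
        "GCGGTTATGCTTGCGTCCACCGTAAGTTCGGATTCAGACTTAATCAAAGGTTTTAACAAAG"
        "CGCTGGCAACCCCACGGCGGGGGTATTCAG;47;T;G"
    )
    print(parse_kmer_line(sample))
```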


      If you did not know about Ray Cloud Browser, it allows end users to interactively skim processed genomics data with energy.

      Demo: http://browser.cloud.boisvert.info/c...location=13000

      All you need to get started is a kmer graph and fasta sequences (with Ray: kmers.txt and Contigs.fasta).

      Regarding kmer graphs (you mentioned that in your question):

      Originally posted by severin View Post

      Thanks and really excited to use this tool.
      We are also very excited to have end users adopting our highly scalable methods for genomics.



      • #4
        estimates of composition

        Thanks for the quick reply. As I am working with these features more I am curious about the following.

        What does ray do with contigs and scaffolds it cannot assign to a taxon?

        Are they included in the composition analysis?



        • #5
          Originally posted by severin View Post
          Thanks for the quick reply. As I am working with these features more I am curious about the following.

          What does ray do with contigs and scaffolds it cannot assign to a taxon?

          Are they included in the composition analysis?
          The composition analysis is performed on the colored de Bruijn graph, not on contigs.


          See our Genome Biology paper



          • #6
            Nice tool

            Sebastien,

            This really is a nice tool. Sorry to bombard you with so many questions, but I would like to know the limitations of the tools I am using. In some of my runs, not all the contigs are assigned to a species. Wouldn't this lead to a misrepresentation of what is present in the sample?

            How hard would it be to also output the relationship between contig and Taxonomic level? ... Order family genus etc

            ie contig-001 Micrococcineae

            In other cases every contig is assigned. In that case, how do we determine the quality of a match to a bacterium or virus, if those are the genomes we are using, when the contig actually belongs to a eukaryote? I.e., possible misassignment due to the limited number of genomes in the search.

            Finally, how does kmer length affect the ability to assign a contig to a species/taxonomic group? Have you looked at this?

            Thanks for all your help on this.

            Regards,

            Andrew



            • #7
              Originally posted by severin View Post
              Sebastien,

              This really is a nice tool. Sorry to bombard you with so many questions, but I would like to know the limitations of the tools I am using.

              In some of my runs, not all the contigs are assigned to a species. Wouldn't this lead to a misrepresentation of what is present in the sample?
              Do you mean that the percentage of unknown life forms is underrepresented?

              Originally posted by severin View Post

              How hard would it be to also output the relationship between contig and Taxonomic level? ... Order family genus etc
              It's just a matter of adding the code in the right place.

              Originally posted by severin View Post

              ie contig-001 Micrococcineae

              In other cases every contig is assigned. In that case, how do we determine the quality of a match to a bacterium or virus, if those are the genomes we are using, when the contig actually belongs to a eukaryote? I.e., possible misassignment due to the limited number of genomes in the search.
              If you search for a virus, and a given mammal genome contains all the sequences
              of the virus and this mammal genome is not provided to Ray Communities, then yes, Ray
              will tell you that it's from a virus.

              If you provide Ray Communities with the virus genome and the mammal genome, then the
              software will look for those kmers that are not in common, if any.
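The effect of a missing host genome can be shown with a toy sketch (hypothetical sequences and function names, not Ray's code): the k-mers that can discriminate a target are those absent from every other reference provided.

```python
def kmers(sequence, k):
    """Return the set of k-mers in a sequence (forward strand only)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def discriminating_kmers(target, others, k):
    """Return the k-mers of `target` that occur in none of the `others`."""
    result = kmers(target, k)
    for other in others:
        result -= kmers(other, k)
    return result


if __name__ == "__main__":
    virus = "ACGGAT"
    host = "TTACGGATCC"  # hypothetical host that contains the whole virus

    # Host not provided: every viral k-mer looks specific to the virus.
    print(discriminating_kmers(virus, [], k=4))
    # Host provided: no k-mer is left to attribute the signal to the virus alone.
    print(discriminating_kmers(virus, [host], k=4))
```

In the second call the set is empty, so the viral signal can no longer be attributed to the virus alone.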

              Originally posted by severin View Post

              Finally, how does kmer length affect the ability to assign a contig to a species/taxonomic group?
              Longer kmers are more specific.

              Allowing mismatches would allow sensitive kmer search with large kmers. Mismatches
              are not implemented at the moment.
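A toy sketch of this trade-off (hypothetical sequences, exact matching only, forward strand only): with a single mismatch between a query and a reference, short k-mers still match in part, while longer k-mers that all span the mismatch no longer match at all.

```python
def kmers(sequence, k):
    """Return the set of k-mers in a sequence (forward strand only)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_fraction(query, reference, k):
    """Fraction of distinct query k-mers found exactly in the reference."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    return len(query_kmers & kmers(reference, k)) / len(query_kmers)


if __name__ == "__main__":
    reference = "ACGTAGCTTGA"
    query = "ACGTACCTTGA"  # one mismatch relative to the reference (G -> C)
    for k in (3, 6):
        print(k, shared_kmer_fraction(query, reference, k))
```

With these sequences, 6 of the 9 distinct 3-mers of the query are still found in the reference, but none of its 6-mers are, because every 6-mer window covers the mismatch.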

              Originally posted by severin View Post
              Have you looked at this?
              Not a lot, honestly.

              Originally posted by severin View Post

              Thanks for all your help on this.

              Regards,

              Andrew



              • #8
                lots of searching

                Hi again.

                I was wondering if there is a way to restart a search if the run is terminated prematurely.

                I am running Ray Meta with all genomes from NCBI. I have a sample that contains multiple eukaryotic and microbial transcriptomes of unknown origin.
                I have 256 cores on this, and it takes about 3 hours to assemble the genome, but it takes more than 21 hours to load the genomes I want to search. I get the impression that checkpoints do not include the Ray Meta analysis. Is it possible that this could be included in the checkpoints?


                Andrew



                • #9
                  Originally posted by severin View Post
                  Hi again.

                  I was wondering if there is a way to restart a search if the run is terminated prematurely.

                  I am running Ray Meta with all genomes from NCBI. I have a sample that contains multiple eukaryotic and microbial transcriptomes of unknown origin.
                  I have 256 cores on this, and it takes about 3 hours to assemble the genome, but it takes more than 21 hours to load the genomes I want to search. I get the impression that checkpoints do not include the Ray Meta analysis. Is it possible that this could be included in the checkpoints?


                  Andrew
                  What is your command ?



                  • #10
                    command

                    Originally posted by seb567 View Post
                    What is your command ?
                    mpirun -np 256 Ray-v2.1.0/Ray -k 41 -read-write-checkpoints checkpoints -one-color-per-file -search ./6b/ftp.ncbi.nih.gov/genomes/EURKARYOTES/ -search ./6b/ftp.ncbi.nih.gov/genomes/Viruses -search ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/Bacteria ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT -search ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/HUMAN_MICROBIOM/Bacteria -search ./6b/ftp.ncbi.nih.gov/genomes/Fungi -with-taxonomy ./4/NCBI-taxonomy/Genome-to-Taxon.tsv ./4/NCBI-taxonomy/TreeOfLife-Edges.tsv ./4/NCBI-taxonomy/Taxon-Names.tsv -i ./TrimmedFiles/Combined.data.Trmatic.sorted.keep.pe.fasta -s ./TrimmedFiles/Combined.data.Trmatic.sorted.keep.se.fasta



                    • #11
                      Originally posted by severin View Post
                      mpirun -np 256 Ray-v2.1.0/Ray -k 41 -read-write-checkpoints checkpoints -one-color-per-file -search ./6b/ftp.ncbi.nih.gov/genomes/EURKARYOTES/ -search ./6b/ftp.ncbi.nih.gov/genomes/Viruses -search ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/Bacteria ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT -search ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/HUMAN_MICROBIOM/Bacteria -search ./6b/ftp.ncbi.nih.gov/genomes/Fungi -with-taxonomy ./4/NCBI-taxonomy/Genome-to-Taxon.tsv ./4/NCBI-taxonomy/TreeOfLife-Edges.tsv ./4/NCBI-taxonomy/Taxon-Names.tsv -i ./TrimmedFiles/Combined.data.Trmatic.sorted.keep.pe.fasta -s ./TrimmedFiles/Combined.data.Trmatic.sorted.keep.se.fasta
                      Is the standard output file still being updated ?

                      Also, the -read-write-checkpoints option does not do anything after the scaffolding.



                      • #12
                        Originally posted by severin View Post
                        Hi again.

                        I was wondering if there is a way to restart a search if the run is terminated prematurely.

                        I am running Ray Meta with all genomes from NCBI. I have a sample that contains multiple eukaryotic and microbial transcriptomes of unknown origin.
                        I have 256 cores on this, and it takes about 3 hours to assemble the genome, but it takes more than 21 hours to load the genomes I want to search. I get the impression that checkpoints do not include the Ray Meta analysis. Is it possible that this could be included in the checkpoints?


                        Andrew
                        Hi,

                        I checked the logs, this was fixed on 2012-09-27.

                        The change is already available to all users with the development version of Ray.

                        The last stable version of Ray is v2.1.0, which was released on 2012-10-30.

                        Which version are you using ?



                        • #13
                          Originally posted by seb567 View Post
                          Hi,

                          I checked the logs, this was fixed on 2012-09-27.

                          The change is already available to all users with the development version of Ray.

                          The last stable version of Ray is v2.1.0, which was released on 2012-10-30.

                          Which version are you using ?
                          I am using Ray v2.1.0. Where do I download the development version?

                          Ray --version
                          Ray version 2.1.0
                          License for Ray: GNU General Public License version 3
                          RayPlatform version: 1.1.0
                          License for RayPlatform: GNU Lesser General Public License version 3

                          MAXKMERLENGTH: 99
                          KMER_U64_ARRAY_SIZE: 4
                          Maximum coverage depth stored by CoverageDepth: 4294967295
                          MAXIMUM_MESSAGE_SIZE_IN_BYTES: 4000 bytes
                          FORCE_PACKING = n
                          ASSERT = n
                          HAVE_LIBZ = n
                          HAVE_LIBBZ2 = n
                          CONFIG_PROFILER_COLLECT = n
                          CONFIG_CLOCK_GETTIME = n
                          __linux__ = y
                          _MSC_VER = n
                          __GNUC__ = y
                          RAY_32_BITS = n
                          RAY_64_BITS = y
                          MPI standard version: MPI 2.1
                          MPI library: Open-MPI 1.6.1
                          Compiler: GNU gcc/g++ Intel(R) C++ g++ 4.4 mode



                          • #14
                            Originally posted by severin View Post
                            I am using Ray v2.1.0. Where do I download the development version?

                            Ray --version
                            Ray version 2.1.0
                            License for Ray: GNU General Public License version 3
                            RayPlatform version: 1.1.0
                            License for RayPlatform: GNU Lesser General Public License version 3

                            MAXKMERLENGTH: 99
                            KMER_U64_ARRAY_SIZE: 4
                            Maximum coverage depth stored by CoverageDepth: 4294967295
                            MAXIMUM_MESSAGE_SIZE_IN_BYTES: 4000 bytes
                            FORCE_PACKING = n
                            ASSERT = n
                            HAVE_LIBZ = n
                            HAVE_LIBBZ2 = n
                            CONFIG_PROFILER_COLLECT = n
                            CONFIG_CLOCK_GETTIME = n
                            __linux__ = y
                            _MSC_VER = n
                            __GNUC__ = y
                            RAY_32_BITS = n
                            RAY_64_BITS = y
                            MPI standard version: MPI 2.1
                            MPI library: Open-MPI 1.6.1
                            Compiler: GNU gcc/g++ Intel(R) C++ g++ 4.4 mode
                            To get the development version:

                            Code:
                            git clone https://github.com/sebhtml/ray.git
                            git clone https://github.com/sebhtml/RayPlatform.git
                            cd ray
                            make
                            ./Ray -version



                            • #15
                              read-write checkpoints

                              Originally posted by seb567 View Post
                              To get the development version:

                              Code:
                              git clone https://github.com/sebhtml/ray.git
                              git clone https://github.com/sebhtml/RayPlatform.git
                              cd ray
                              make
                              ./Ray -version

                              So when you say it is fixed in the development version, does that mean the -read-write-checkpoints option now goes beyond the scaffolding step?

                              Thanks
