#1 |
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
|
![]()
Ray Meta: scalable de novo metagenome assembly and profiling
Genome Biology 2012, 13:R122 doi:10.1186/gb-2012-13-12-r122
Voluminous parallel sequencing datasets, especially metagenomic experiments, require distributed computing for de novo assembly and taxonomic profiling. Ray Meta is a massively distributed metagenome assembler that is coupled with Ray Communities, which profiles microbiomes based on uniquely-colored k-mers. It can accurately assemble and profile a three billion read metagenomic experiment representing 1,000 bacterial genomes of uneven proportions in 15 hours with 1,024 processor cores, using only 1.5 GB per core. The software will facilitate the processing of large and complex datasets, and will help in generating biological insights on specific environments. Ray Meta is open source and available at http://denovoassembler.sf.net.
#2 |
Genome Informatics Facility
Location: Iowa @isugif Join Date: Sep 2009
Posts: 105
|
![]()
How do I include genomes other than the bacteria found in the NCBI-taxonomy directory that your script generates? I could drop the FASTA file into a folder; however...
Is there an easy way to include the taxonomy information for the genomes I add? You added Human in the paper, but if I wanted to include multiple species for which the taxonomy is known, do I have to do this manually, or is there a tool that can help me achieve this?
Also, I am interested not just in obtaining the abundances but also in assigning the scaffolds to a particular species or other level in the taxonomy. Does Ray output the scaffold-to-taxon information somewhere?
One last question. If I have an assembly from, say, Trinity, can I run the assembly through Ray Meta and have it return abundances based on the transcripts themselves? How dependent is the algorithm on having done the assembly itself? Can I feed Ray Meta a kmer graph?
Thanks, and really excited to use this tool.
Last edited by severin; 02-20-2013 at 01:08 PM.
#3 | |||||
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
|
![]()
Hi,
Quote:
Genome-to-Taxon.tsv has 2 tab-separated columns: GenBankIdentifier and taxonIdentifier. Both are integers. So you need to append entries to this file. See https://github.com/sebhtml/ray/blob/...n/Taxonomy.txt
Indeed, sequences deposited in directories that you provide to Ray with the -search option will be picked up by the Ray Communities plugins.
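Appending entries to Genome-to-Taxon.tsv can be scripted rather than done by hand. The following is only a minimal sketch in Python, assuming the two-column, tab-separated integer layout described in this reply; the function name and the identifiers used with it are hypothetical and not part of Ray.

```python
from pathlib import Path


def _ends_with_newline(path):
    """Return True if the file is empty or its last byte is a newline."""
    with path.open("rb") as handle:
        handle.seek(0, 2)  # jump to the end of the file
        if handle.tell() == 0:
            return True
        handle.seek(-1, 2)  # step back to the last byte
        return handle.read(1) == b"\n"


def append_genome_taxon_entries(tsv_path, entries):
    """Append (GenBankIdentifier, taxonIdentifier) pairs to a two-column TSV.

    Hypothetical helper: it only assumes the layout described in the post
    (two integer columns separated by a tab).
    """
    path = Path(tsv_path)
    # Avoid gluing the first new row onto an unterminated last line.
    needs_newline = path.exists() and not _ends_with_newline(path)
    with path.open("a", encoding="ascii") as handle:
        if needs_newline:
            handle.write("\n")
        for genbank_id, taxon_id in entries:
            # int() rejects non-integer identifiers before anything is written
            # for that row.
            handle.write(f"{int(genbank_id)}\t{int(taxon_id)}\n")
```

A call such as `append_genome_taxon_entries("NCBI-taxonomy/Genome-to-Taxon.tsv", [(123456789, 9606)])` would add one row; the GenBank identifier here is a placeholder, so substitute the real identifiers of the sequences you add.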
Quote:
just NCBI), it's hard to devise a tool that will be usable and portable for all these sources. So I guess your best bet is to write a small tool that does it for you so that you don't have to do it manually. If you think that this should be a service provided by Ray, you can file a ticket at https://github.com/sebhtml/ray/issues/new
Quote:
options. Files:
Code:
RayMicrobiomeAnalysis/
  BiologicalAbundances/
    _DenovoAssembly/
      Contigs.tsv
      *.CoverageData.xml
    _Coloring/
    _Frequencies/
    NCBI-bacteria-directory/
      ContigIdentifications.tsv
      _Files.tsv
      SequenceAbundances.xml
    NCBI-viruses-directory/
      ContigIdentifications.tsv
      _Files.tsv
      SequenceAbundances.xml
Quote:
that Ray provides a feature to build the de Bruijn graph from assembled sequences (with other tools) to benefit from other capabilities like Ray Communities. The Ray C++ API for messages actually supports this, but the plugins that build the de Bruijn graph (namely plugin_SequencesLoader, plugin_KmerAcademyBuilder and plugin_VerticesExtractor) work only on reads at the moment.
It's independent. The quantification algorithms work on a colored de Bruijn graph, but they do not really use assembled paths for these computations (aside from what's in the files for contig identification, obviously).
No, this is not possible at the moment. But that's something that could be implemented, as Ray (and ABySS too) supports the Ray Cloud Browser kmer graph format. The file format is like this:
map.csv (ASCII) (called kmers.txt in Ray)
The file is tab-separated; any line starting with a '#' is a comment. A line looks like this:
GCGGTTATGCTTGCGTCCACCGTAAGTTCGGATTCAGACTTAATCAAAGGTTTTAACAAAGCGCTGGCAACCCCACGGCGGGGGTATTCAG;47;T;G
See https://github.com/sebhtml/Ray-Cloud...Map-format.txt
If you did not know about Ray Cloud Browser, it allows end users to interactively skim processed genomics data with energy. Demo: http://browser.cloud.boisvert.info/c...location=13000
All you need to get started is a kmer graph and fasta sequences (with Ray: kmers.txt and Contigs.fasta).
Regarding kmer graphs (you mentioned that in your question):
We are also very excited to have end users adopting our highly scalable methods for genomics.
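For anyone who wants to read the kmer graph file from a script, here is a rough Python sketch of a line parser. It is based only on the example line shown in this reply, which separates its fields with semicolons, so the delimiter is left as a parameter. The field names (sequence, coverage, parents, children) are my assumption about what the four fields mean; check Map-format.txt before relying on them.

```python
from typing import NamedTuple, Optional


class KmerRecord(NamedTuple):
    sequence: str   # the k-mer itself
    coverage: int   # assumed: coverage depth of the k-mer
    parents: str    # assumed: nucleotides leading into the k-mer
    children: str   # assumed: nucleotides leading out of the k-mer


def parse_kmer_line(line: str, delimiter: str = ";") -> Optional[KmerRecord]:
    """Parse one line of a kmers.txt / map.csv file.

    Returns None for blank lines and for comment lines starting with '#'.
    The delimiter defaults to ';' because that is what the example line
    in the post uses.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    sequence, coverage, parents, children = line.split(delimiter)
    return KmerRecord(sequence, int(coverage), parents, children)


def read_kmer_file(path: str, delimiter: str = ";"):
    """Yield a KmerRecord for every data line of the file at path."""
    with open(path, encoding="ascii") as handle:
        for line in handle:
            record = parse_kmer_line(line, delimiter)
            if record is not None:
                yield record
```

Iterating with `for record in read_kmer_file("kmers.txt"):` streams the file one line at a time, which matters because these files can be very large.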
#4 |
Genome Informatics Facility
Location: Iowa @isugif Join Date: Sep 2009
Posts: 105
|
![]()
Thanks for the quick reply. As I am working with these features more, I am curious about the following.
What does Ray do with contigs and scaffolds it cannot assign to a taxon? Are they included in the composition analysis?
#5 | |
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
|
![]() Quote:
See our Genome Biology paper |
#6 |
Genome Informatics Facility
Location: Iowa @isugif Join Date: Sep 2009
Posts: 105
|
![]()
Sebastien,
This really is a nice tool. Sorry to bombard you with so many questions, but I would like to know the limitations of the tools I am using.
In some of my runs, not all the contigs are assigned to a species. In that case, wouldn't this lead to a misrepresentation of what is present in the sample?
How hard would it be to also output the relationship between contig and taxonomic level (order, family, genus, etc.)? For example:
contig-001 Micrococcineae
In other cases every contig is assigned. In that case, how do we determine the quality of a match to a bacterium or virus, if those are the genomes we are using, when in actuality the contig belongs to a eukaryote? That is, a possible misassignment due to the limited number of genomes in the search.
Finally, how does kmer length affect the ability to assign a contig to a species or taxonomic group? Have you looked at this?
Thanks for all your help on this.
Regards,
Andrew
#7 | ||||
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
|
![]() Quote:
Quote:
Quote:
of the virus and this mammal genome is not provided to Ray Communities, then yes, Ray will tell you that it's from a virus. If you provide Ray Communities with the virus genome and the mammal genome, then the software will look for those kmers that are not in common, if any.
Quote:
Allowing mismatches would allow sensitive kmer search with large kmers. Mismatches are not implemented at the moment.
Not a lot, honestly.
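To make the idea of "kmers that are not in common" concrete, here is a small Python illustration. It is not Ray's implementation (Ray works on a distributed colored de Bruijn graph); it is a toy sketch with plain sets, exact matches only, forward strand only, and hypothetical function names.

```python
def kmers(sequence, k):
    """Return the set of all k-mers (forward strand only) in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def distinguishing_kmers(references, k):
    """Map each reference name to the k-mers that occur in no other reference.

    references: dict of name -> sequence string.
    Toy illustration: exact matching, no reverse complements, no mismatches.
    """
    kmer_sets = {name: kmers(seq, k) for name, seq in references.items()}
    unique = {}
    for name, own in kmer_sets.items():
        shared_elsewhere = set()
        for other_name, other in kmer_sets.items():
            if other_name != name:
                shared_elsewhere |= other
        # Keep only the k-mers that no other reference contains.
        unique[name] = own - shared_elsewhere
    return unique
```

With only the virus in `references`, every one of its kmers counts as distinguishing, which is why a sequence shared with an unlisted mammal genome would still be reported as viral. Adding the mammal genome removes the shared kmers from both sets. A larger k usually leaves more kmers unique to each reference, but an exact-match search then tolerates less divergence between the sample and the reference.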
#8 |
Genome Informatics Facility
Location: Iowa @isugif Join Date: Sep 2009
Posts: 105
|
![]()
Hi again.
I was wondering if there is a way to restart a search if the run is terminated prematurely. I am running Ray Meta with all genomes from NCBI. I have a sample that contains multiple eukaryotic and microbial transcriptomes of unknown origin. I have 256 cores on this, and it takes about 3 hours to assemble the genome, but it takes more than 21 hours to load the genomes I want to search.
I get the impression that checkpoints do not include the Ray Meta analysis. Is it possible that this could be included in the checkpoints?
Andrew
#9 | |
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
|
![]() Quote:
#10 |
Genome Informatics Facility
Location: Iowa @isugif Join Date: Sep 2009
Posts: 105
|
![]()
mpirun -np 256 Ray-v2.1.0/Ray -k 41 \
  -read-write-checkpoints checkpoints \
  -one-color-per-file \
  -search ./6b/ftp.ncbi.nih.gov/genomes/EURKARYOTES/ \
  -search ./6b/ftp.ncbi.nih.gov/genomes/Viruses \
  -search ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/Bacteria ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT \
  -search ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/HUMAN_MICROBIOM/Bacteria \
  -search ./6b/ftp.ncbi.nih.gov/genomes/Fungi \
  -with-taxonomy ./4/NCBI-taxonomy/Genome-to-Taxon.tsv ./4/NCBI-taxonomy/TreeOfLife-Edges.tsv ./4/NCBI-taxonomy/Taxon-Names.tsv \
  -i ./TrimmedFiles/Combined.data.Trmatic.sorted.keep.pe.fasta \
  -s ./TrimmedFiles/Combined.data.Trmatic.sorted.keep.se.fasta
#11 | |
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
|
![]() Quote:
Also, the -read-write-checkpoints option does not do anything after the scaffolding. |
#12 | |
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
|
![]() Quote:
I checked the logs; this was fixed on 2012-09-27. The change is already available to all users with the development version of Ray. The last stable version of Ray is v2.1.0, which was released on 2012-10-30. Which version are you using?
#13 | |
Genome Informatics Facility
Location: Iowa @isugif Join Date: Sep 2009
Posts: 105
|
![]() Quote:
Ray --version
Ray version 2.1.0
License for Ray: GNU General Public License version 3
RayPlatform version: 1.1.0
License for RayPlatform: GNU Lesser General Public License version 3
MAXKMERLENGTH: 99
KMER_U64_ARRAY_SIZE: 4
Maximum coverage depth stored by CoverageDepth: 4294967295
MAXIMUM_MESSAGE_SIZE_IN_BYTES: 4000 bytes
FORCE_PACKING = n
ASSERT = n
HAVE_LIBZ = n
HAVE_LIBBZ2 = n
CONFIG_PROFILER_COLLECT = n
CONFIG_CLOCK_GETTIME = n
__linux__ = y
_MSC_VER = n
__GNUC__ = y
RAY_32_BITS = n
RAY_64_BITS = y
MPI standard version: MPI 2.1
MPI library: Open-MPI 1.6.1
Compiler: GNU gcc/g++ Intel(R) C++ g++ 4.4 mode
#14 | |
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
|
![]() Quote:
Code:
git clone git://github.com/sebhtml/ray.git
git clone git://github.com/sebhtml/RayPlatform.git
cd ray
make
./Ray -version
#15 | |
Genome Informatics Facility
Location: Iowa @isugif Join Date: Sep 2009
Posts: 105
|
![]() Quote:
So when you say it is fixed in the development version, does that mean the read-write checkpoints will go beyond the scaffolding process? Thanks
#16 | |
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
|
![]() Quote:
However, that's a feature that could be added. |
#17 | |
Genome Informatics Facility
Location: Iowa @isugif Join Date: Sep 2009
Posts: 105
|
![]() Quote:
icpc: command line warning #10159: invalid argument for option '-std'
CXX code/plugin_KmerAcademyBuilder/Kmer.o
icpc: command line warning #10159: invalid argument for option '-std'
CXX code/plugin_Library/LibraryPeakFinder.o
icpc: command line warning #10159: invalid argument for option '-std'
CXX code/plugin_Library/LibraryWorker.o
icpc: command line warning #10159: invalid argument for option '-std'
CXX code/plugin_Library/Library.o
icpc: command line warning #10159: invalid argument for option '-std'
CXX code/plugin_MachineHelper/MachineHelper.o
icpc: command line warning #10159: invalid argument for option '-std'
CXX code/plugin_MessageProcessor/MessageProcessor.o
icpc: command line warning #10159: invalid argument for option '-std'
CXX code/plugin_Mock/Parameters.o
icpc: command line warning #10159: invalid argument for option '-std'
code/plugin_Mock/Parameters.cpp(2129): warning #68: integer conversion resulted in a change of sign
uint64_t value=-1;
^
If I run the makefile without the -std=c++98 option, the Ray program crashes during the step that follows "Selection of optimal read markers":
[node195:41872] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x33bb21ec5d]
[node195:41872] [11] Ray() [0x469429]
[node195:41872] *** End of error message ***
[node193:49049] 8 more processes have sent help message help-odls-default.txt / odls-default:could-not-kill
==> BATCH_OUTPUT.ray4 <==
[-9] ------> AAAAAAAAATGTGCCTTCGTTTCAAGTTCTATTCATTCTAC
[-8] ------> AAAAAAAATGTGCCTTCGTTTCAAGTTCTATTCATTCTACG
[-7] ------> AAAAAAATGTGCCTTCGTTTCAAGTTCTATTCATTCTACGA
[-6] ------> AAAAAATGTGCCTTCGTTTCAAGTTCTATTCATTCTACGAC
[-5] ------> AAAAATGTGCCTTCGTTTCAAGTTCTATTCATTCTACGACC
[-4] ------> AAAATGTGCCTTCGTTTCAAGTTCTATTCATTCTACGACCT
[-3] ------> AAATGTGCCTTCGTTTCAAGTTCTATTCATTCTACGACCTC
[-2] ------> AATGTGCCTTCGTTTCAAGTTCTATTCATTCTACGACCTCA
[-1] ------> ATGTGCCTTCGTTTCAAGTTCTATTCATTCTACGACCTCAA
[0] ------> TGTGCCTTCGTTTCAAGTTCTATTCATTCTACGACCTCAAC
I see someone else had the same error, but I didn't see a resolution for it: http://www.mail-archive.com/denovoas.../msg00317.html
#18 | |
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
|
![]() Quote:
I will fix this. Maybe for the v2.2.0 release, but it will probably appear in the v2.2.1 release later. Last edited by seb567; 03-15-2013 at 08:59 AM. Reason: added use case |
#19 | |
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
|
![]()
Hi,
I did a test with the Intel compiler and everything went fine. Code:
icpc: command line warning #10159: invalid argument for option '-std' Quote:
#20 |
Junior Member
Location: France Join Date: Sep 2012
Posts: 2
|
![]()
Hi Sebastian, and thanks for developing Ray. I am working on a sponge metagenome (Ion Torrent) and I am trying to set up Ray for taxonomy and communities.
I am trying to set up the files for the latest version of Greengenes (2012_08) and have parsed the information in the fasta file to the same format as 2011_01. I am trying to manually run the script
Paper-Replication-2012 / Build-Input-Files-for-GreenGenes-Taxonomy / main.sh
and have one question regarding fasta files for Ray Taxonomy and Communities. I have noticed that for the NCBI taxonomy the script
Paper-Replication-2012 / Build-Input-Files-for-NCBI-Taxonomy / CreateRayInputStructures.sh
creates a single fasta file for each genome. My question is whether those reference fasta files are just a concatenation of all .fna files associated with any given genome (so that there are multiple IDs and accessions associated with a given "genome"). This becomes an issue for draft genomes (lots of scaffolds) or eukaryotic chromosomes, which I will have to "manually merge".
Actually, after I double-checked the CreateRayInputStructures.sh script it seems to be the case, but would you please confirm it?
Marcelino
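If the per-genome reference files are indeed plain concatenations, as suspected above (this has not been confirmed in the thread), the "manual merge" could be sketched in Python as follows. The function name and the directory layout (one directory per genome holding its .fna files) are assumptions made for illustration, not something taken from CreateRayInputStructures.sh.

```python
from pathlib import Path


def merge_fna_files(genome_dir, output_path):
    """Concatenate every .fna file in genome_dir into one multi-FASTA file.

    Hypothetical helper: assumes one directory per genome, with each
    scaffold or chromosome stored in its own .fna file. Headers are kept
    unchanged, so the merged file still carries every original ID.
    Returns the number of files merged.
    """
    fna_files = sorted(Path(genome_dir).glob("*.fna"))
    with open(output_path, "w", encoding="ascii") as out:
        for fna in fna_files:
            with fna.open(encoding="ascii") as handle:
                # Stream line by line so large chromosomes are never
                # loaded into memory in one piece.
                for line in handle:
                    # Terminate a final unterminated line so the next
                    # header starts on a line of its own.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(fna_files)
```

For example, `merge_fna_files("Bacteria_DRAFT/SomeGenome", "SomeGenome.fasta")` (hypothetical paths) writes one file per genome that can then be placed in a directory passed to -search. Write the output outside `genome_dir`, or with an extension other than .fna, so a second run does not pick it up as input.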