**severin** · 02-20-2013, 01:06 PM

Ray Meta

How do I include genomes other than the bacteria that are found in the NCBI-taxonomy directory that your script generates? I could drop the fasta file into a folder however...

Is there an easy way to include the taxonomy information about the genomes I add? You added Human in the paper, but if I wanted to include multiple species whose taxonomy is known, do I have to do this manually, or is there a tool that can help me achieve this?

Also, I am interested in not just obtaining the abundances but also assigning the scaffolds to particular species or other level in the taxonomy. Does Ray output the scaffold to taxon information somewhere?

One last question.
If I have an assembly from, say, Trinity, can I run the assembly through Ray-Meta and have it return abundances based on the transcripts themselves? How dependent is the algorithm on the assembly having been done beforehand? Can I feed Ray-Meta a kmer graph?

Thanks and really excited to use this tool.

**seb567** · 02-21-2013, 09:41 AM

Hi,

Originally posted by severin

How do I include genomes other than the bacteria that are found in the NCBI-taxonomy directory that your script generates?

Genome-to-Taxon.tsv has 2 columns (tab-separated): GenBankIdentifier taxonIdentifier.

Both are integers.

So you need to append entries to this file.

See https://github.com/sebhtml/ray/blob/...n/Taxonomy.txt
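As a rough illustration of appending entries, here is a minimal Python sketch. The function name `append_genome_taxon` and the example identifiers are hypothetical and not part of Ray; the only thing taken from the description above is the two-column, tab-separated, integer layout.

```python
from pathlib import Path


def append_genome_taxon(tsv_path, pairs):
    """Append (GenBank identifier, taxon identifier) pairs to the TSV file.

    Pairs already present in the file are skipped. Returns the number of
    lines actually written.
    """
    # Validate first: both columns must be integers, as stated above.
    rows = [(str(int(genbank_id)), str(int(taxon_id)))
            for genbank_id, taxon_id in pairs]

    path = Path(tsv_path)
    existing = set()
    needs_newline = False
    if path.exists():
        text = path.read_text()
        # Avoid gluing the first new row onto an unterminated last line.
        needs_newline = bool(text) and not text.endswith("\n")
        for line in text.splitlines():
            fields = line.split("\t")
            if len(fields) == 2:
                existing.add((fields[0], fields[1]))

    written = 0
    with path.open("a") as handle:
        if needs_newline:
            handle.write("\n")
        for row in rows:
            if row in existing:
                continue
            handle.write("\t".join(row) + "\n")
            existing.add(row)
            written += 1
    return written


if __name__ == "__main__":
    # 123456 is a placeholder GenBank identifier; 9606 is the NCBI taxon
    # identifier for Homo sapiens.
    append_genome_taxon("Genome-to-Taxon.tsv", [(123456, 9606)])
```

Skipping pairs that are already present keeps the file from growing if the helper is run twice on the same input.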

Originally posted by severin

I could drop the fasta file into a folder however...

Indeed, sequences deposited in directories that you provide to Ray with the -search option
will be picked up by Ray Communities plugins.

Originally posted by severin

Is there an easy way to include the taxonomy information about the genomes I add?

No, you need to add one line for each relationship you desire.

Originally posted by severin

You added Human in the paper, but if I wanted to include multiple species whose taxonomy is known, do I have to do this manually, or is there a tool that can help me achieve this?

Well, because what people want to add in this system can come from various sources (not
just NCBI), it's hard to devise a tool that will be usable and portable for all these sources.

So I guess your best bet is to write a small tool that does it for you so that you
don't have to do it manually.
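One possible shape for such a small tool, sketched in Python under two assumptions that are not confirmed in this thread: that the FASTA headers use the NCBI `>gi|<number>|...` style, and that the GenBank identifier column corresponds to that GI number. The taxon identifier has to come from you; `genome_taxon_rows` is a hypothetical name.

```python
import re

# Assumed header style: ">gi|12345|ref|NC_000000.1| description"
GI_PATTERN = re.compile(r"^>gi\|(\d+)\|")


def genome_taxon_rows(fasta_lines, taxon_id):
    """Yield 'GenBankIdentifier<TAB>taxonIdentifier' for each gi| header."""
    taxon = int(taxon_id)
    for line in fasta_lines:
        match = GI_PATTERN.match(line)
        if match:
            yield f"{int(match.group(1))}\t{taxon}"


if __name__ == "__main__":
    # Made-up FASTA content; in practice, pass an open file handle instead.
    example = [">gi|123|ref|X| example sequence\n", "ACGT\n"]
    for row in genome_taxon_rows(example, 9606):
        print(row)
```

The printed rows can then be appended to Genome-to-Taxon.tsv. Headers that do not match the assumed pattern are silently skipped, so it is worth checking the row count against the number of sequences.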

If you think that this should be a service provided by Ray, you can file a ticket at

https://github.com/sebhtml/ray/issues/new

Originally posted by severin

Also, I am interested in not just obtaining the abundances but also assigning the scaffolds to particular species or other level in the taxonomy. Does Ray output the scaffold to taxon information somewhere?

The system will identify contigs for you on the basis of the sequences provided with the -search
options.

Files:

Code:

RayMicrobiomeAnalysis/
    BiologicalAbundances/
        _DenovoAssembly/
            Contigs.tsv
            *.CoverageData.xml

        _Coloring/
        _Frequencies/

        NCBI-bacteria-directory/
            ContigIdentifications.tsv
            _Files.tsv
            SequenceAbundances.xml

        NCBI-viruses-directory/
            ContigIdentifications.tsv
            _Files.tsv
            SequenceAbundances.xml

See https://github.com/sebhtml/ray/blob/...Abundances.txt
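To collect the per-directory identification files from a finished run, something like the following Python sketch could be used. It assumes that each search directory sits directly under `BiologicalAbundances/` inside the output directory, and that header or comment lines start with `#`; neither is confirmed here, and both function names are hypothetical.

```python
from pathlib import Path


def identification_files(output_dir):
    """Map each search directory name to its ContigIdentifications.tsv path."""
    root = Path(output_dir) / "BiologicalAbundances"
    return {
        path.parent.name: path
        for path in sorted(root.glob("*/ContigIdentifications.tsv"))
    }


def count_rows(tsv_path):
    """Count non-empty lines that do not start with '#' (assumed comments)."""
    with open(tsv_path) as handle:
        return sum(1 for line in handle
                   if line.strip() and not line.startswith("#"))


if __name__ == "__main__":
    # "RayMicrobiomeAnalysis" is the output directory name from the listing.
    for name, path in identification_files("RayMicrobiomeAnalysis").items():
        print(name, count_rows(path))
```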

Originally posted by severin

One last question.
If I have an assembly from, say, Trinity, can I run the assembly through Ray-Meta and have it return abundances based on the transcripts themselves?

Quite a few people at my institution want this too: a feature in Ray that builds the de Bruijn
graph from sequences assembled with other tools, so that they can benefit from other
capabilities like Ray Communities.

The Ray C++ API for messages actually supports this, but the plugins that build the de Bruijn graph
(namely plugin_SequencesLoader, plugin_KmerAcademyBuilder and plugin_VerticesExtractor)
work only on reads at the moment.

Originally posted by severin

How dependent is the algorithm on the assembly having been done beforehand?

It's independent. The quantification algorithms work on a colored de Bruijn graph.
But it does not really use assembled paths for these computations (aside from what's in
files for contig identification obviously).

Originally posted by severin

Can I feed Ray-Meta a kmer graph?

No, this is not possible at the moment.
But that's something that could be implemented as Ray (and ABySS too)
supports the Ray Cloud Browser kmer graph format.

The file format is like this:

map.csv (ASCII) (called kmers.txt in Ray)

The file is tab-separated, any line starting with a '#' is a comment.

A line looks like this.

GCGGTTATGCTTGCGTCCACCGTAAGTTCGGATTCAGACTTAATCAAAGGTTTTAACAAAGCGCTGGCAACCCCACGGCGGGGGTATTCAG;47;T;G

See https://github.com/sebhtml/Ray-Cloud...Map-format.txt
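For anyone who wants to read such a file, here is a small Python parsing sketch. Note that the sample line above is delimited with `;` even though the description says tab-separated, so the delimiter is a parameter that defaults to `;`. The field names (`sequence`, `coverage`, `parents`, `children`) are my guesses from the sample line, not names taken from the format document.

```python
from typing import NamedTuple, Optional


class KmerRecord(NamedTuple):
    sequence: str
    coverage: int
    parents: str   # assumed meaning: nucleotides of incoming edges
    children: str  # assumed meaning: nucleotides of outgoing edges


def parse_kmer_line(line: str, delimiter: str = ";") -> Optional[KmerRecord]:
    """Parse one line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    sequence, coverage, parents, children = line.split(delimiter)
    return KmerRecord(sequence, int(coverage), parents, children)


def read_kmers(path: str, delimiter: str = ";"):
    """Yield one KmerRecord per data line of the file."""
    with open(path) as handle:
        for line in handle:
            record = parse_kmer_line(line, delimiter)
            if record is not None:
                yield record
```

A line with a different number of fields raises `ValueError`, which is usually what you want when the delimiter guess is wrong.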

If you did not know about Ray Cloud Browser, it allows end users to interactively skim processed genomics data with energy.

Demo: http://browser.cloud.boisvert.info/c...location=13000

All you need to get started is a kmer graph and fasta sequences (with Ray: kmers.txt and Contigs.fasta).

Regarding kmer graphs (you mentioned that in your question):

Originally posted by severin

Thanks and really excited to use this tool.

We are also very excited to have end users adopting our highly scalable methods for genomics.

**severin** · 02-21-2013, 10:54 AM

estimates of composition

Thanks for the quick reply. As I am working with these features more I am curious about the following.

What does Ray do with contigs and scaffolds it cannot assign to a taxon?

Are they included in the composition analysis?

**seb567** · 02-21-2013, 11:23 AM

Originally posted by severin

Thanks for the quick reply. As I am working with these features more I am curious about the following.

What does Ray do with contigs and scaffolds it cannot assign to a taxon?

Are they included in the composition analysis?

The composition analysis is performed on the colored de Bruijn graph, not on contigs.

See our Genome Biology paper
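To make the distinction concrete, here is a toy Python sketch of counting on colored kmers rather than on contigs. It is an illustration only and not Ray's actual algorithm: it uses the forward strand only, counts raw kmer observations, and the labels `ambiguous` and `uncolored` are names I made up.

```python
from collections import Counter


def kmers(sequence, k):
    """Return all kmers of a sequence (forward strand only, for brevity)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_composition(reads, references, k):
    """Count read kmer observations per reference 'color'.

    A kmer found in exactly one reference counts toward that reference,
    a kmer found in several references counts as 'ambiguous', and a kmer
    found in none counts as 'uncolored'.
    """
    colors = {}
    for name, sequence in references.items():
        for kmer in set(kmers(sequence, k)):
            colors.setdefault(kmer, set()).add(name)

    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            owners = colors.get(kmer, set())
            if len(owners) == 1:
                counts[next(iter(owners))] += 1
            elif owners:
                counts["ambiguous"] += 1
            else:
                counts["uncolored"] += 1
    return counts
```

In this toy version nothing depends on whether a read ended up in a contig: every kmer observed in the reads is tallied, including those that match no reference.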

**severin** · 02-26-2013, 08:35 AM

Nice tool

Sebastien,

This really is a nice tool. Sorry to bombard you with so many questions, but I would like to know the limitations of the tools I am using. In some of my runs, not all the contigs are assigned to a species. Wouldn't this lead to a misrepresentation of what is present in the sample?

How hard would it be to also output the relationship between each contig and a taxonomic level (order, family, genus, etc.)?

e.g. contig-001 Micrococcineae

In other cases every contig is assigned. In that case, how do we determine the quality of a match to a bacterium or virus, if those are the genomes we are using, when the contig actually belongs to a eukaryote? That is, a possible misassignment due to the limited number of genomes in the search.

Finally, how does kmer length affect the ability to assign a contig to a species or taxonomic group? Have you looked at this?

Thanks for all your help on this.

Regards,

Andrew

**seb567** · 02-26-2013, 05:48 PM

Originally posted by severin

Sebastien,

This really is a nice tool. Sorry to bombard you with so many questions, but I would like to know the limitations of the tools I am using.

In some of my runs, not all the contigs are assigned to a species. Wouldn't this lead to a misrepresentation of what is present in the sample?

Do you mean that the percentage of unknown life forms is underrepresented?

Originally posted by severin

How hard would it be to also output the relationship between each contig and a taxonomic level (order, family, genus, etc.)?

It's just a matter of adding the code in the right place.

Originally posted by severin

e.g. contig-001 Micrococcineae

In other cases every contig is assigned. In that case, how do we determine the quality of a match to a bacterium or virus, if those are the genomes we are using, when the contig actually belongs to a eukaryote? That is, a possible misassignment due to the limited number of genomes in the search.

If you search for a virus, and a given mammal genome contains all the sequences
of the virus and this mammal genome is not provided to Ray Communities, then yes, Ray
will tell you that it's from a virus.

If you provide Ray Communities with the virus genome and the mammal genome, then the
software will look for those kmers that are not in common, if any.
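The idea of looking for kmers that are not in common can be sketched in a few lines of Python. This is only an illustration of the set logic, not Ray Communities code: it ignores reverse complements, and the function names and toy sequences are made up.

```python
def kmer_set(sequence, k):
    """Return the set of kmers in a sequence (forward strand only)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def distinguishing_kmers(genomes, k):
    """For each genome, return the kmers that occur in no other genome."""
    sets = {name: kmer_set(seq, k) for name, seq in genomes.items()}
    unique = {}
    for name, own in sets.items():
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = own - others
    return unique


if __name__ == "__main__":
    # Toy case: the 'virus' sequence is fully contained in the 'mammal' one.
    genomes = {"virus": "ACGTACGT", "mammal": "TTACGTACGTTT"}
    for name, found in distinguishing_kmers(genomes, 4).items():
        print(name, sorted(found))
```

In the toy case the virus ends up with no distinguishing kmers at all, which matches the "if any" caveat above: a virus entirely contained in a provided host genome cannot be told apart from that host by exact kmers alone.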

Originally posted by severin

Finally, How does kmer length affect ability to assign a contig to a species/taxonomic group?

Longer kmers are more specific.

Allowing mismatches would allow sensitive kmer search with large kmers. Mismatches
are not implemented at the moment.
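A quick way to see the trade-off is to measure how many of a query's kmers still match a reference exactly after a single substitution. The Python sketch below is a toy experiment with made-up sequences and a hypothetical function name; it is not how Ray searches.

```python
def matching_fraction(query, reference, k):
    """Fraction of the query's kmers found exactly in the reference."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    query_kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    if not query_kmers:
        return 0.0
    hits = sum(1 for kmer in query_kmers if kmer in ref_kmers)
    return hits / len(query_kmers)


if __name__ == "__main__":
    reference = "AAAACCCCGGGG"
    query = "AAAACCTCGGGG"  # one substitution relative to the reference
    for k in (4, 8):
        print(k, matching_fraction(query, reference, k))
```

Every kmer that overlaps the substituted position stops matching, so with exact matching a single difference costs up to k kmers. With k = 8 on this 12-base toy sequence no kmer survives, while with k = 4 more than half do.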

Originally posted by severin

Have you looked at this?

Not a lot, honestly.

**severin** · 03-13-2013, 09:49 AM

lots of searching

Hi again.

I was wondering if there is a way to restart a search if the run is terminated prematurely.

I am running Ray Meta with all genomes from NCBI. I have a sample that contains multiple eukaryotic and microbial transcriptomes of unknown origin.
I have 256 cores on this, and it takes about 3 hours to assemble the genome, but it takes more than 21 hours to load the genomes I want to search. I get the impression that checkpoints do not include the Ray Meta analysis. Is it possible that this could be included in the checkpoints?

Andrew

**seb567** · 03-13-2013, 09:54 AM

What is your command?

**severin** · 03-13-2013, 10:01 AM

command

Originally posted by seb567

What is your command?

mpirun -np 256 Ray-v2.1.0/Ray -k 41 \
 -read-write-checkpoints checkpoints \
 -one-color-per-file \
 -search ./6b/ftp.ncbi.nih.gov/genomes/EURKARYOTES/ \
 -search ./6b/ftp.ncbi.nih.gov/genomes/Viruses \
 -search ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/Bacteria ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT \
 -search ./6b/GIF_2c/ftp.ncbi.nih.gov/genomes/HUMAN_MICROBIOM/Bacteria \
 -search ./6b/ftp.ncbi.nih.gov/genomes/Fungi \
 -with-taxonomy ./4/NCBI-taxonomy/Genome-to-Taxon.tsv ./4/NCBI-taxonomy/TreeOfLife-Edges.tsv ./4/NCBI-taxonomy/Taxon-Names.tsv \
 -i ./TrimmedFiles/Combined.data.Trmatic.sorted.keep.pe.fasta \
 -s ./TrimmedFiles/Combined.data.Trmatic.sorted.keep.se.fasta

**seb567** · 03-14-2013, 05:44 AM

Is the standard output file still being updated?

Also, the -read-write-checkpoints option does not do anything after the scaffolding.

**seb567** · 03-14-2013, 06:15 AM

Originally posted by severin

Hi again.

I was wondering if there is a way to restart a search if the run is terminated prematurely.

I am running Ray Meta with all genomes from NCBI. I have a sample that contains multiple eukaryotic and microbial transcriptomes of unknown origin.
I have 256 cores on this, and it takes about 3 hours to assemble the genome, but it takes more than 21 hours to load the genomes I want to search. I get the impression that checkpoints do not include the Ray Meta analysis. Is it possible that this could be included in the checkpoints?

Andrew

Hi,

I checked the logs, this was fixed on 2012-09-27.

The change is already available to all users with the development version of Ray.

The last stable version of Ray is v2.1.0, which was released on 2012-10-30.

Which version are you using?

**severin** · 03-14-2013, 06:39 AM

Originally posted by seb567

Hi,

I checked the logs, this was fixed on 2012-09-27.

The change is already available to all users with the development version of Ray.

The last stable version of Ray is v2.1.0, which was released on 2012-10-30.

Which version are you using?

I am using Ray v2.1.0. Where do I download the development version?

Ray --version
Ray version 2.1.0
License for Ray: GNU General Public License version 3
RayPlatform version: 1.1.0
License for RayPlatform: GNU Lesser General Public License version 3

MAXKMERLENGTH: 99
KMER_U64_ARRAY_SIZE: 4
Maximum coverage depth stored by CoverageDepth: 4294967295
MAXIMUM_MESSAGE_SIZE_IN_BYTES: 4000 bytes
FORCE_PACKING = n
ASSERT = n
HAVE_LIBZ = n
HAVE_LIBBZ2 = n
CONFIG_PROFILER_COLLECT = n
CONFIG_CLOCK_GETTIME = n
__linux__ = y
_MSC_VER = n
__GNUC__ = y
RAY_32_BITS = n
RAY_64_BITS = y
MPI standard version: MPI 2.1
MPI library: Open-MPI 1.6.1
Compiler: GNU gcc/g++ Intel(R) C++ g++ 4.4 mode

**seb567** · 03-14-2013, 07:16 AM

Originally posted by severin

I am using Ray v2.1.0. Where do I download the development version?

To get the development version:

Code:

git clone https://github.com/sebhtml/ray.git
git clone https://github.com/sebhtml/RayPlatform.git
cd ray
make
./Ray -version

**severin** · 03-14-2013, 08:28 AM

read-write checkpoints

Originally posted by seb567

To get the development version:

Code:

git clone https://github.com/sebhtml/ray.git
git clone https://github.com/sebhtml/RayPlatform.git
cd ray
make
./Ray -version

So when you say it is fixed in the development version, does that mean the read-write checkpoints will go beyond the scaffolding process?

Thanks

Ray Meta: scalable de novo metagenome assembly and profiling