  • Mother of Ribosomal Databases?

    Is the M5rna database available on the MG-RAST server a non-redundant database of ribosomal genes built by combining SILVA, Greengenes, and RDP?

    Do you know on which FTP site I can find this database, so that I can use it locally?
    Last edited by Bachbioinfo; 12-12-2013, 11:23 AM.
    __Bach__

  • #2
    ftp://ftp.metagenomics.anl.gov/data/
    Last edited by rhinoceros; 11-20-2013, 08:36 AM.
    savetherhino.org



    • #3
      Thanks!

      For MD5nr, the publication link is here: http://www.biomedcentral.com/1471-2105/13/141

      It seems they did the same for the ribosomal databases (md5rna)
      (ftp://ftp.metagenomics.anl.gov/data/...rent/md5rna.gz)
      __Bach__



      • #4
        16S microbial

        Hello,

        I have another question, please.

        Does the database at "ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz" contain the same information as "ftp://ftp.metagenomics.anl.gov/data/MD5nr/20130801/md5rna.gz"?

        The README for the NCBI BLAST databases gives no information on how this db was constructed or what it contains.

        Many thanks
        Last edited by Bachbioinfo; 12-09-2013, 07:11 AM.
        __Bach__



        • #5
          NCBI's 16S db is tiny: it contains about 7k near-full-length bacterial and a few hundred near-full-length archaeal SSU sequences. m5rna, on the other hand, is Greengenes, SILVA and RDP (maybe something else too?) combined, and contains something like 3.5M SSU/LSU sequences of various lengths.

          The alphanumeric characters you see are md5 checksums. See here.
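
          To illustrate why the headers look like that: an md5 checksum is a 32-character hexadecimal digest of its input, so identical sequences collapse to the same ID no matter which source database they came from. A minimal sketch, assuming GNU coreutils `md5sum` is installed; `seq_md5` is a hypothetical helper name, and the normalisation it applies (strip whitespace, lower-case) is an assumption rather than a statement of what M5nr actually does:

```shell
# Hypothetical helper: md5 digest of a sequence string.
# Assumption: whitespace is stripped and the sequence is lower-cased before
# hashing; check the M5nr documentation for the normalisation it really uses.
seq_md5() {
    printf '%s' "$1" | tr -d ' \n\r\t' | tr '[:upper:]' '[:lower:]' | md5sum | cut -d' ' -f1
}

# The same sequence written two different ways gives the same 32-character ID.
seq_md5 "ACGTACGT"
seq_md5 "acgt acgt"
```

          If the IDs computed this way do not match the ones in md5rna for a known sequence, the normalisation assumed above is probably the part to revisit.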
          Last edited by rhinoceros; 12-09-2013, 07:26 AM.
          savetherhino.org



          • #6
            Thanks for pointing out this non-redundant 16S database. I am trying to use it, but I am having some difficulty with the md5 checksums. My goal is to align my reads to the database and process them with MEGAN. However, MEGAN needs IDs in order to look up the taxonomic name. For example, my current GreenGenes file looks like this:

            HTML Code:
            >AF068820.2 hydrothermal vent clone VC2.1 Arc13 k__Archaea; p__Euryarchaeota; c__Thermoplasmata; o__Thermoplasmatales; f__Aciduliprofundaceae; otu_204
            Here, the ID 'AF068820.2' is important.

            The header of the md5rna file looks like;

            HTML Code:
            >0000175eddb4b05d0bd52467315668ac
            As rhinoceros pointed out, there is some information about the md5 checksums here: http://blog.metagenomics.anl.gov/m5nr-api/

            and after some searching I found this;

            HTML Code:
            http://blog.metagenomics.anl.gov/m5tools-pl-the-m5nr-database-command-line-tool/
            Two questions:
            - First, I can't find the tool 'm5tools.pl' on the FTP site. Can someone provide it?
            - Second, with this tool, can I regenerate the 'original' GreenGenes header, i.e. with the ID 'AF068820', or at least the taxonomy ID of the organism? In the examples I saw this, which could help me:



            But if I do this with my md5 ID, I get no results;
            http://api.metagenomics.anl.gov/m5nr/md5/0000175eddb4b05d0bd52467315668ac

            Thanks in advance,
            Boetsie



            • #7
              The m5tools script is available here, at least, as "m5nr-tools.pl".

              There's API documentation here too. I don't know why your query doesn't work; are you sure it's a valid checksum?

              You could probably also get taxonomic annotations from the map files available here. You just need to apply join and sort to the right columns of the right files.
              Last edited by rhinoceros; 12-11-2013, 01:03 PM.
              savetherhino.org



              • #8
                Thank you for pointing me to the m5nr-tools.pl script. However, if I take the first two md5 sums of the md5rna database:
                HTML Code:
                grep -m 2 ">" md5rna
                >000000bce90ad07d3161ffac8cea5874
                >0000029042cc6c69f2b830142508acb1
                and search for them in the map file:

                HTML Code:
                grep "000000bce90ad07d3161ffac8cea5874" /data/testfolder/metagenomics_pipeline_test/16s/non-redundant-database/md5_rna_map
                000000bce90ad07d3161ffac8cea5874        16      3385    2304
                grep "0000029042cc6c69f2b830142508acb1" /data/testfolder/metagenomics_pipeline_test/16s/non-redundant-database/md5_rna_map
                0000029042cc6c69f2b830142508acb1        16      3385    382680
                Both have '16' as the database, which is RDP (if 16 corresponds to the 'source'). So I tried to find them in the RDP database:


                HTML Code:
                perl MG-RAST-Tools-master/tools/bin/m5nr-tools.pl --api http://kbase.us/services/communities/1 --option annotation --source RDP --md5 000000bce90ad07d3161ffac8cea5874,0000029042cc6c69f2b830142508acb1
                S003289208      000000bce90ad07d3161ffac8cea5874        16S ribosomal RNA       Acinetobacter lwoffi
                I get only one hit.

                Since this did not work, and is probably very slow anyway, I am now trying to work with the map files.
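
                A sketch of how the map-file lookup could be done for many IDs in one pass, rather than with one grep per checksum. It assumes that the FASTA headers consist only of the md5 (as in the output above) and that the md5 appears in the map file as plain text; `demo.fasta` and `demo_map` are small stand-in files created here for illustration, not the real downloads:

```shell
# Work in a scratch directory with tiny stand-in files. The first two map
# rows are the ones shown above; the third row is made up for the demo.
workdir=$(mktemp -d)
printf '>000000bce90ad07d3161ffac8cea5874\nACGT\n>0000029042cc6c69f2b830142508acb1\nTTGA\n' > "$workdir/demo.fasta"
printf '000000bce90ad07d3161ffac8cea5874\t16\t3385\t2304\n0000029042cc6c69f2b830142508acb1\t16\t3385\t382680\nffffffffffffffffffffffffffffffff\t16\t1\t1\n' > "$workdir/demo_map"

# 1. Pull the IDs out of the FASTA headers (strip the leading '>').
grep '^>' "$workdir/demo.fasta" | sed 's/^>//' > "$workdir/ids.txt"

# 2. Fetch every map line that contains one of those IDs as a fixed string.
grep -F -f "$workdir/ids.txt" "$workdir/demo_map" > "$workdir/hits.txt"

cat "$workdir/hits.txt"
```

                Here `-F` makes grep treat each ID as a literal string instead of a regular expression, and `-f` reads the whole list of IDs from a file.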

                Thank you rhinoceros
                Boetsie



                • #9
                  The syntax of the sort and join combination you'll be using will be something like:

                  Code:
                  join -1 2 -2 1 -o 2.1,1.3 <(sort -k2,2 file1) <(sort -k1,1 file2)
                  This looks for matches between column 2 of file1 and column 1 of file2, and outputs column 1 of file2 and column 3 of file1. (Obviously, you'll first need to figure out which columns are relevant in your files.) In my experience, this kind of join and sort combination is very fast and works well for huge multimillion-row tables.
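
                  To make the column bookkeeping concrete, here is a tiny worked example of the same pattern. The two tables and their contents are invented for illustration (they are not the real MG-RAST map files), and temporary files are used instead of process substitution so that the sketch also runs in a plain POSIX shell:

```shell
# sort and join must agree on collation order, so pin the locale.
export LC_ALL=C
workdir=$(mktemp -d)

# file1: made-up table with the join key (an md5-like ID) in column 2.
printf 'rowA\tmd5_bbb\ttaxon_2\nrowB\tmd5_aaa\ttaxon_1\nrowC\tmd5_ccc\ttaxon_3\n' > "$workdir/file1"
# file2: made-up list of IDs of interest, with the key in column 1.
printf 'md5_ccc\nmd5_aaa\n' > "$workdir/file2"

# join needs both inputs sorted on their join field.
sort -k2,2 "$workdir/file1" > "$workdir/file1.sorted"
sort -k1,1 "$workdir/file2" > "$workdir/file2.sorted"

# Match column 2 of file1 against column 1 of file2;
# print column 1 of file2 and column 3 of file1.
join -1 2 -2 1 -o 2.1,1.3 "$workdir/file1.sorted" "$workdir/file2.sorted" > "$workdir/joined.txt"

cat "$workdir/joined.txt"
# md5_aaa taxon_1
# md5_ccc taxon_3
```

                  The row with `md5_bbb` is dropped because it has no partner in file2, which is the behaviour you want when filtering a big map file down to the IDs that occur in your own data.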
                  savetherhino.org



                  • #10
                    I've already figured that out. Thanks for your help though!



                    • #11
                      Thanks for posting your question here. I have not yet tried to map the md5rna IDs to taxonomic info or other annotations.
                      As I am using MEGAN5 too, I would like to know whether it is a good idea to select the soft masking option with blastn.

                      I will post my comments on the md5rna mapping steps later.

                      __Bach__



                      • #12
                        Originally posted by Bachbioinfo View Post
                        As I am using MEGAN5 too, I would like to know whether it is a good idea to select the soft masking option with blastn.
                        There's this article that gives rather good suggestions for blast in general. They also have an accompanying website updated for blast+. I'm looking forward to their 2013 article.

                        But I wouldn't know how applicable this stuff is to 16S and nucleotide queries in general. In my opinion, blast is the wrong approach to 16S amplicon data to begin with. Both QIIME (MacQIIME for Mac OS X) and mothur are far better suited for 16S stuff, and blast is definitely not the best method for assigning taxonomy to 16S reads.
                        Last edited by rhinoceros; 12-12-2013, 11:10 AM.
                        savetherhino.org



                        • #13
                          I totally agree with what you are suggesting about mothur and QIIME. However, I have metagenomes rather than amplicons. In this case, what is the best way to estimate OTU abundance? I do not know whether QIIME would also be the best choice for that. There are a lot of tools and methods, and a lot of literature comparing them, but the most appropriate approach for metagenomes is not always the same. I have to start the data analysis somewhere.

                          best,



                          Originally posted by rhinoceros View Post
                          There's this article that gives rather good suggestions for blast in general. They also have an accompanying website update for blast+. But I wouldn't know how applicable this stuff is to 16S queries in general. In my opinion, blast is the wrong approach to 16S amplicon data to begin with. Both QIIME (MacQIIME for Mac OS X) and mothur are far better suited for 16S stuff, and blast is definitely not the best method for assigning taxonomy to 16S reads.
                          __Bach__



                          • #14
                            Well, you could start by submitting your data to MG-RAST. You can read on their website what the pipeline does. You can download your data after any particular step, e.g. predicted proteins or annotations against a specific db. You can also, for example, export BIOM tables for QIIME. It's not perfect, but it's a good start and gives you initial results very fast. I noticed that the way they assign KEGG orthologs leaves a lot of real hits out. I'm sure it's the same with a lot of other stuff too.
                            savetherhino.org



                            • #15
                              Hello,
                              I have just noticed the same thing. For example, I cannot find the key 0000029042cc6c69f2b830142508acb1 with m5nr-tools.pl, despite trying all the ribosomal sources described here: "http://api.metagenomics.anl.gov/api.html#annotation". I also have a question, please: what do the last two columns in md5_rna_map correspond to?
                              i.e. 0000029042cc6c69f2b830142508acb1 16 3385 382680

                              Are they the taxon ID and GI, respectively? If so, I cannot find "382680" with a simple search of the NCBI databases.

                              Thank you all
                              __Bach__
