  • The missing tool in bioinformatics

    Hi to all SEQAnswers forum members.
    My name's Robert, and I'm a stem cell researcher and bioinformatics developer. My unit and I work daily with next-gen sequencing technologies (ChIP-Seq, RNA-Seq, RIP-Seq, Bis-Seq, and a couple of techniques we've developed), and what we've observed is that in most cases (except for standardized procedures such as read mapping), writing our own custom tools is better than dealing with third-party software.
    Now we're planning to start an ambitious open-source project. The aim is to develop a tool/framework to enable the ordered and non-redundant integration of genomic/epigenomic data from the billions of informations actually available on internet. First of all, we're planning to create a unique and complete gene annotation by extensive cross-referencing between all annotations currently available (e.g. ENSEMBL, NCBI, VEGA, etc.). Next, we would like to enable integration and handling of most ChIP-Seq data available from ENCODE and GEO Datasets, with quality checks on the data to discard all low-quality datasets (which are actually really abundant).
    Now, before starting the work, we would like to ask you all for suggestions, ideas, and what you will expect from a tool like this.
    If you can spend a couple of minutes to help us (and help yourselves), we would really appreciate it.
    All my best

    Robert

  • #2
    Maybe the better question is: would a tool like this be useful for genome researchers? Are there any other tools or frameworks you need? We want to build something really useful to the scientific community, and you are the scientific community, so let's go with your suggestions and ideas!

    • #3
      we would like to ask you all for suggestions, ideas, and what you will expect from a tool like this.
      A search tool. Given an arbitrary sequence, find all matches to that sequence with some allowance for error.
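
To make the request concrete, here is a minimal sketch of "find all matches with some allowance for error", assuming that "error" means substitutions only (Hamming distance). The function name `find_approximate_matches` and the `max_mismatches` parameter are made up for illustration. It is a naive scan, so it says nothing about how a real tool would cope with a large database.

```python
def find_approximate_matches(query, target, max_mismatches=1):
    """Return (position, mismatches) for every window of `target`
    that differs from `query` by at most `max_mismatches` substitutions."""
    if not query:
        raise ValueError("query must not be empty")
    hits = []
    width = len(query)
    for start in range(len(target) - width + 1):
        mismatches = 0
        for a, b in zip(query, target[start:start + width]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many errors, give up on this window
        else:
            # Loop finished without breaking: the window is a hit.
            hits.append((start, mismatches))
    return hits
```

For example, `find_approximate_matches("ACGT", "TTACGTTTACCTAA", 1)` reports an exact hit at position 2 and a one-mismatch hit at position 8. Insertions and deletions would need an edit-distance approach instead.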

      • #4
        Originally posted by gringer:
        A search tool. Given an arbitrary sequence, find all matches to that sequence with some allowance for error.
        Like BLAST?

        • #5
          Arabidopsis

          Will it include Arabidopsis data?

          • #6
            Originally posted by biznatch:
            Like BLAST?
            Yes, a bit like BLAST, but it will need a substantially altered algorithm to work with the massive amounts of sequence data that would be in the database described by rkneue. BLAST currently works fairly well on many gigabases of sequence data. I don't expect it will have the same success on terabases or petabases of sequence data.
            Last edited by gringer; 01-08-2014, 04:12 PM.

            • #7
              Hi
              so if I understand it correctly: a) a gene annotation pipeline and b) a compendium of well-evaluated data?

              a) I would be careful, because not all annotation is equal (you might want to look closely at the Evidence Code Ontology, ECO), and there are pipelines/tools like this that also partially take ECO codes, or simple GO codes, into account (if this is what you meant). We do something not too dissimilar ourselves for plants (Mercator). And there is the whole field of phylogenomics.
              I'd rather settle for useful information than the information overload you are now exposed to, like "expressed in 50 tissues" (or rather, showing a signal on some chips in these), which is effectively non-information (plant researchers will likely know what I mean). Also, the whole "similar to a protein shown to be similar to..." is not helpful at all times and can be misleading. (Coming from the plant side, neuronal and angiogenesis proteins are always ---- interesting and a good example.)

              b) Not in the ChIP-Seq field (yet?), but Genevestigator collects expression data and is quite nice. BUT it is not open source.

              Cheers
              björn
              PS Hope this helped and was not completely off topic

              • #8
                Hi all, and thank you for your replies.
                gringer: What you'd like to do can be done easily with the BLAT algorithm. BLAT is much faster than BLAST, and allows mismatches and spliced mapping.

                mattanswers: Once the core is properly written, adding new organisms will not be a problem, so ideally, my answer is yes.

                usad: a) Not exactly what I meant. We are not trying to build a gene annotation pipeline, but a comprehensive cross-reference of genes that are already annotated in different databases. For example, a gene X may be annotated as NR_000001 in RefSeq with a single isoform, as ENSG00000000001 in Ensembl with multiple isoforms, not annotated in VEGA, annotated in LNCipedia as XXXXX, etc.
                Providing an automatically updatable cross-referencing database of gene annotations may be really useful, since in most cases finding the correspondence between different databases is a really annoying task.
                b) Yeah, Genevestigator may be an idea... But yes, it's commercial.
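
As a rough illustration of the cross-referencing idea in a), here is a minimal in-memory sketch. The class name `GeneCrossReference`, its methods, and the internal key `"geneX"` are hypothetical, and the identifiers are the placeholder ones from the example above, not real records. A real implementation would need persistent storage, isoform-level mapping, and a way to decide that two records describe the same gene, none of which is shown here.

```python
from collections import defaultdict


class GeneCrossReference:
    """Toy cross-reference table: one internal gene key, many database IDs."""

    def __init__(self):
        # internal gene key -> {database name: identifier}
        self._ids_by_gene = defaultdict(dict)
        # (database name, identifier) -> internal gene key
        self._gene_by_id = {}

    def add(self, gene_key, database, identifier):
        """Record that `identifier` in `database` refers to `gene_key`."""
        self._ids_by_gene[gene_key][database] = identifier
        self._gene_by_id[(database, identifier)] = gene_key

    def translate(self, database, identifier, target_database):
        """Return the matching ID in `target_database`, or None if unknown."""
        gene_key = self._gene_by_id.get((database, identifier))
        if gene_key is None:
            return None
        return self._ids_by_gene[gene_key].get(target_database)


# Placeholder identifiers taken from the example in the post.
xref = GeneCrossReference()
xref.add("geneX", "RefSeq", "NR_000001")
xref.add("geneX", "Ensembl", "ENSG00000000001")
```

With this table, `xref.translate("RefSeq", "NR_000001", "Ensembl")` gives the Ensembl identifier, while asking for `"VEGA"` returns `None`, matching the "not annotated in VEGA" case.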

                • #9
                  gringer: What you'd like to do can be done easily with the BLAT algorithm. BLAT is much faster than BLAST, and allows mismatches and spliced mapping.
                  Not really. BLAT still has the indexing problem at its core: everything that is in the database needs to be indexed (at least for subsequences) at a compression level of around 4X (e.g. 2bit encoding). The speed of the actual search is irrelevant if the database cannot be indexed for the search to be carried out.
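
For readers unfamiliar with the "4X" figure: storing each base in 2 bits instead of one 8-bit character packs four bases into a byte. The sketch below shows only that packing, assuming a plain A/C/G/T alphabet. The function names are made up for illustration, and this is not the actual UCSC .2bit file layout. Even at this density, 10^15 bases (one petabase) still take about 250 TB before any index is built on top.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(sequence):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        # Each base occupies 2 bits; position within the byte is i % 4.
        packed[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed, length):
    """Recover the first `length` bases from a packed byte string."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )
```

The sequence length has to be stored separately because the last byte may be only partly filled, and ambiguous bases such as N need extra bookkeeping that is left out here.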

                  • #10
                    I don't really understand in which cases you cannot index a database. What kind of sequence search are you interested in?

                    • #11
                      Ah OK, I see. Yeah, different names for the same thing are a major bummer. But instead of a data warehouse concept, maybe you could realize the same thing with some AJAX-style data collector that does this on the fly when the user queries the data?
                      Many years ago BioMOBY allowed such aggregating services. Of course, the problem with this approach is that of the weakest link.

                      b

                      • #12
                        Originally posted by gringer:
                        BLAST currently works fairly well on many gigabases of sequence data. I don't expect it will have the same success on terabases or petabases of sequence data.
                        Originally posted by rkneue:
                        I don't really understand in which cases you cannot index a database. What kind of sequence search are you interested in?
                        A sequence database for the "genomic/epigenomic data from the billions of informations actually available on internet". You will need to index a few petabases of sequence data for that to happen, and I don't expect that either BLAST or BLAT will work well for that.

                        • #13
                          We don't expect to work with sequences in that case. Working with genomic coordinates is the best choice, since you can extract sequences on the fly from an indexed multi-fasta in a few ms.
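
A minimal sketch of coordinate-based extraction from an indexed multi-FASTA, assuming every record is wrapped at a fixed line width (the same assumption samtools faidx makes). The functions `build_fasta_index` and `fetch` and the toy file are hypothetical. The index keeps the length, byte offset, bases per line and bytes per line of each record, so a region can be read with a single seek.

```python
import os
import tempfile


def build_fasta_index(path):
    """Scan the file once; return {name: (length, offset, line_bases, line_width)}."""
    index = {}
    name = None
    with open(path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # Offset is the byte position of the first base of the record.
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return {key: tuple(value) for key, value in index.items()}


def fetch(path, index, name, start, end):
    """Return bases [start, end) (0-based) of record `name` using one seek."""
    length, offset, line_bases, line_width = index[name]
    end = min(end, length)
    if start < 0 or start >= end:
        return ""
    first = offset + (start // line_bases) * line_width + start % line_bases
    last = offset + (end // line_bases) * line_width + end % line_bases
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# Toy usage: two short records wrapped at 5 bases per line.
with tempfile.TemporaryDirectory() as workdir:
    fasta_path = os.path.join(workdir, "toy.fa")
    with open(fasta_path, "w", newline="\n") as out:
        out.write(">chr1 toy record\nACGTA\nCGTAC\nGT\n>chr2\nTTTTG\nGG\n")
    toy_index = build_fasta_index(fasta_path)
    region = fetch(fasta_path, toy_index, "chr1", 3, 11)
```

Only the requested bytes are read, which is why the lookup time does not depend on the size of the file. It still assumes the full sequences are stored somewhere, which is the storage question raised earlier in the thread.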

                          • #14
                            what about compression?

                            I mean, random-access compressed information...
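
One common way to get random access into compressed data is to compress it in independent fixed-size blocks and keep a table of block offsets, which is roughly the idea behind BGZF. The sketch below is only an illustration of that idea with `zlib`. The function names and the block size are assumptions, and it is not the BGZF format itself.

```python
import zlib

BLOCK_SIZE = 1 << 16  # uncompressed bytes per block (assumed value)


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress `data` in independent blocks.

    Returns (blob, offsets), where offsets[i] is the start of block i in
    `blob` and the final entry marks the end of the last block."""
    blob = bytearray()
    offsets = []
    for start in range(0, len(data), block_size):
        offsets.append(len(blob))
        blob += zlib.compress(data[start:start + block_size])
    offsets.append(len(blob))
    return bytes(blob), offsets


def read_range(blob, offsets, start, end, block_size=BLOCK_SIZE):
    """Return data[start:end], decompressing only the overlapping blocks."""
    if start < 0 or start >= end:
        return b""
    first_block = start // block_size
    last_block = min((end - 1) // block_size, len(offsets) - 2)
    pieces = [
        zlib.decompress(blob[offsets[i]:offsets[i + 1]])
        for i in range(first_block, last_block + 1)
    ]
    local_start = start - first_block * block_size
    return b"".join(pieces)[local_start:local_start + (end - start)]
```

Smaller blocks make each lookup cheaper but compress less well, so the block size is a trade-off rather than a fixed rule.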
