SEQanswers

01-07-2014, 03:38 AM   #1
rkneue
Junior Member
 
Location: New York

Join Date: Jan 2014
Posts: 5
The missing tool in bioinformatics

Hi to all SEQanswers forum members.
My name's Robert, and I'm a stem cell researcher and bioinformatics developer. My unit and I deal daily with next-gen sequencing technologies (ChIP-Seq, RNA-Seq, RIP-Seq, Bis-Seq, and a couple of techniques we've developed), and what we've observed is that in most cases (except for standardized procedures such as read mapping), writing our own custom tools works better than dealing with third-party software.
Now we're planning to start an ambitious open-source project. The aim is to develop a tool/framework to enable the ordered and non-redundant integration of genomic/epigenomic data from the billions of pieces of information currently available on the internet. First of all, we're planning to create a unique and complete gene annotation by extensive cross-referencing between all the annotations currently available (e.g. ENSEMBL, NCBI, VEGA, etc.). Next, we would like to enable integration and handling of most of the ChIP-Seq data available from ENCODE and GEO Datasets, with quality checks on the data to discard all low-quality datasets (which are really abundant).
Now, before starting the work, we would like to ask you all for suggestions, ideas, and what you would expect from a tool like this.
If you can spend a couple of minutes to help us (helping you), we will really appreciate it.
All my best

Robert
01-08-2014, 12:13 AM   #2
rkneue
Junior Member
 
Location: New York

Join Date: Jan 2014
Posts: 5

Maybe the better question is: would a tool like this be useful for genome researchers? Are there any other tools or frameworks you need? We want to build something really useful to the scientific community, and you all are the scientific community, so let's hear your suggestions and ideas!
01-08-2014, 11:15 AM   #3
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

Quote:
we would like to ask you all for suggestions, ideas, and what you would expect from a tool like this.
A search tool. Given an arbitrary sequence, find all matches to that sequence with some allowance for error.
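A minimal sketch of what "matches with some allowance for error" could mean, assuming the simplest error model (substitutions only, counted as Hamming distance) and a naive scan of every window. The function name `find_approx` and the example sequences are made up for illustration; a real tool would need an index and would probably have to handle insertions and deletions as well.

```python
def find_approx(query, target, max_mismatches=1):
    """Return (position, mismatches) for each window of target that
    differs from query in at most max_mismatches positions."""
    hits = []
    q = len(query)
    for start in range(len(target) - q + 1):
        mismatches = 0
        for a, b in zip(query, target[start:start + q]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many errors, give up on this window
        else:
            # Loop finished without break: the window is a hit.
            hits.append((start, mismatches))
    return hits


# Hypothetical toy example: one exact match and one match with a single mismatch.
print(find_approx("ACGT", "TTACGTTTACCTAA", max_mismatches=1))
```

The scan costs time proportional to query length times target length, which is fine for a toy example and hopeless for a large database.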
01-08-2014, 12:57 PM   #4
biznatch
Senior Member
 
Location: Canada

Join Date: Nov 2010
Posts: 124

Quote:
Originally Posted by gringer View Post
A search tool. Given an arbitrary sequence, find all matches to that sequence with some allowance for error.
Like BLAST?
01-08-2014, 02:56 PM   #5
mattanswers
Member
 
Location: Boston

Join Date: Oct 2009
Posts: 65
Arabidopsis

Will it include Arabidopsis data?
01-08-2014, 03:10 PM   #6
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

Quote:
Originally Posted by biznatch View Post
Like BLAST?
Yes, a bit like BLAST, but it will need a substantially altered algorithm to work with the massive amounts of sequence data that would be in the database described by rkneue. BLAST currently works fairly well on many gigabases of sequence data. I don't expect it will have the same success on terabases or petabases of sequence data.

Last edited by gringer; 01-08-2014 at 03:12 PM.
01-08-2014, 07:01 PM   #7
usad
Member
 
Location: aachen

Join Date: Sep 2009
Posts: 53

Hi,
so if I understand it correctly: a) a gene annotation pipeline and b) a compendium of well-evaluated data?

a) I would be careful, because annotation is not just annotation (you might want to dig deep into the Evidence Code Ontology, ECO), and there are pipelines/tools like this that also partially take ECO codes or simple GO codes into account (if this is what you meant). We do something not too dissimilar ourselves for plants (Mercator). And there is the whole field of phylogenomics.
I'd rather settle for useful information than the information overload you are exposed to now, like "expressed in 50 tissues" (or rather, showing a signal on some chips in those tissues), which is effectively non-information (plant researchers will likely know what I mean). Also, the whole "similar to a protein shown to be similar to..." chain is not always helpful and can be misleading. (Coming from the plant side, neuronal and angiogenesis proteins are always ---- interesting and a good example.)

b) Not in the ChIP-seq field (yet?), but Genevestigator collects expression data and is quite nice. BUT it is not open source.

Cheers
björn
PS Hope this helped and was not completely off topic
01-09-2014, 01:27 AM   #8
rkneue
Junior Member
 
Location: New York

Join Date: Jan 2014
Posts: 5

Hi all, and thank you for your replies.
gringer: What you'd like to do can be done easily with the BLAT algorithm. BLAT is much faster than BLAST, and allows mismatches and spliced mapping.

mattanswers: Once the core is properly written, adding new organisms will not be a problem, so ideally, my answer is yes.

usad: a) Not exactly what I meant. We are not trying to build a gene annotation pipeline, but a comprehensive annotation of "already annotated" genes across different databases. For example, a gene X may be annotated as NR_000001 in RefSeq with a single isoform, as ENSG00000000001 in Ensembl with multiple isoforms, not annotated in VEGA, annotated in LNCipedia as XXXXX, etc.
Providing an automatically updatable cross-referencing database of gene annotations could be really useful, since in most cases finding the correspondence between different databases is a really annoying task.
b) Yeah, Genevestigator may be an idea... but yes, it's commercial.
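A rough sketch of the cross-referencing idea in a), assuming the input is a list of pairwise links between (database, identifier) pairs, for example parsed from each database's own mapping files. The helper names `build_xref` and `lookup`, the `lncRNA_db` label, and all identifiers are placeholders for illustration. Linked identifiers are merged transitively with a small union-find.

```python
from collections import defaultdict


def build_xref(links):
    """Group (database, identifier) pairs that are linked directly or
    transitively. Returns a list of {database: set_of_identifiers} dicts."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(dict)
    for node in list(parent):
        db, ident = node
        groups[find(node)].setdefault(db, set()).add(ident)
    return list(groups.values())


def lookup(groups, db, ident):
    """Return the cross-reference record containing ident, or None."""
    for group in groups:
        if ident in group.get(db, ()):
            return group
    return None


# Placeholder links; identifiers are illustrative, not real records.
links = [
    (("RefSeq", "NR_000001"), ("Ensembl", "ENSG00000000001")),
    (("Ensembl", "ENSG00000000001"), ("lncRNA_db", "XXXXX")),
    (("RefSeq", "NR_000002"), ("Ensembl", "ENSG00000000002")),
]
groups = build_xref(links)
record = lookup(groups, "RefSeq", "NR_000001")
print(record)
```

A database that does not annotate the gene (VEGA in the example above) is simply absent from the record. Conflicting or many-to-many mappings would need a policy that this sketch does not attempt.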
01-09-2014, 03:07 AM   #9
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

Quote:
gringer: What you'd like to do can be done easily with the BLAT algorithm. BLAT is much faster than BLAST, and allows mismatches and spliced mapping.
Not really. BLAT still has the indexing problem at its core: everything that is in the database needs to be indexed (at least for subsequences) at a compression level of around 4X (e.g. 2bit encoding). The speed of the actual search is irrelevant if the database cannot be indexed for the search to be carried out.
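To illustrate where the "around 4X" figure comes from: with a four-letter alphabet each base fits in 2 bits, so four bases pack into one byte instead of four. The sketch below uses an arbitrary A/C/G/T to 0/1/2/3 mapping chosen for illustration; it is not the layout of the actual .2bit file format, and it ignores N and other ambiguity codes.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary illustrative mapping
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # The first base of each group of four goes in the highest two bits.
        out[i // 4] |= CODE[base] << (2 * (3 - i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (3 - i % 4))) & 3] for i in range(length)
    )


seq = "ACGTACGTAC"
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
print(unpack(packed, len(seq)) == seq)
```

The sequence length has to be stored separately, because the last byte may be only partly filled.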
01-09-2014, 06:23 AM   #10
rkneue
Junior Member
 
Location: New York

Join Date: Jan 2014
Posts: 5

I don't really understand in which cases you cannot index a database. What kind of sequence search are you interested in?
01-09-2014, 08:54 AM   #11
usad
Member
 
Location: aachen

Join Date: Sep 2009
Posts: 53

Ah ok, I see. Yeah, different names for the same thing are a major bummer. But instead of a data warehouse concept, maybe you could realize the same thing with some AJAXian data collector that does this on the fly when the user queries the data?
Many years ago BioMoby allowed such aggregating services. Of course, the problem with this approach is that of the weakest link.

b
01-09-2014, 07:28 PM   #12
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

Quote:
Originally Posted by gringer View Post
BLAST currently works fairly well on many gigabases of sequence data. I don't expect it will have the same success on terabases or petabases of sequence data.
Quote:
Originally Posted by rkneue View Post
I don't really understand in which cases you cannot index a database. What kind of sequence search are you interested in?
A sequence database for the "genomic/epigenomic data from the billions of pieces of information currently available on the internet". You will need to index a few petabases of sequence data for that to happen, and I don't expect that either BLAST or BLAT will work well for that.
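A back-of-envelope estimate of the scale involved, assuming 2 bits per base for the raw sequence alone and decimal prefixes. Any search index would come on top of this. The function name is made up for illustration.

```python
def packed_size_bytes(n_bases, bits_per_base=2):
    """Bytes needed to store n_bases at a fixed number of bits per base."""
    return n_bases * bits_per_base // 8


for label, n_bases in [
    ("1 gigabase", 10**9),
    ("1 terabase", 10**12),
    ("1 petabase", 10**15),
]:
    print(f"{label}: {packed_size_bytes(n_bases):,} bytes")
```

Under these assumptions one gigabase is about 250 MB, one terabase about 250 GB, and one petabase about 250 TB before any indexing.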
01-10-2014, 06:57 AM   #13
rkneue
Junior Member
 
Location: New York

Join Date: Jan 2014
Posts: 5

We don't expect to work with sequences in that case. Working with genomic coordinates is the best choice, since you can extract sequences on the fly from an indexed multi-fasta in a few ms.
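A minimal sketch of coordinate-based extraction from an indexed multi-FASTA, assuming every sequence uses a fixed line width (the same assumption a samtools-style `.fai` index relies on). The functions `build_index` and `fetch` and the demo file contents are made up for illustration; coordinates here are 0-based with an exclusive end.

```python
import os
import tempfile


def build_index(path):
    """Scan the FASTA once. Returns
    {name: (length, offset, bases_per_line, bytes_per_line)}."""
    index = {}
    name = None
    with open(path, "rb") as fh:
        pos = 0
        for line in fh:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # offset = byte position of the first base of this record
                index[name] = [0, pos + len(line), 0, 0]
            elif name is not None:
                bases = len(line.rstrip(b"\r\n"))
                entry = index[name]
                if entry[2] == 0:
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
            pos += len(line)
    return {key: tuple(value) for key, value in index.items()}


def fetch(path, index, name, start, end):
    """Return bases [start, end) of sequence `name` with a single seek."""
    length, offset, line_bases, line_bytes = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    first = offset + (start // line_bases) * line_bytes + start % line_bases
    last = offset + ((end - 1) // line_bases) * line_bytes + (end - 1) % line_bases
    with open(path, "rb") as fh:
        fh.seek(first)
        chunk = fh.read(last - first + 1)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# Demo on a tiny made-up multi-FASTA written to a temporary file.
demo = b">chr1 test\nACGTACGTAC\nGTACGTACGT\nACGT\n>chr2\nTTTTGGGGCC\nAA\n"
with tempfile.NamedTemporaryFile(suffix=".fa", delete=False) as tmp:
    tmp.write(demo)
index = build_index(tmp.name)
region = fetch(tmp.name, index, "chr1", 8, 14)  # spans a line break
tail = fetch(tmp.name, index, "chr2", 9, 12)
os.remove(tmp.name)
print(region, tail)
```

Only the requested bytes are read, so lookup time does not depend on the size of the file, only on the length of the region.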
04-28-2014, 04:40 AM   #14
kredens
Junior Member
 
Location: BRAZIL

Join Date: Apr 2014
Posts: 1

What about compression?

I mean, random-access compressed information...
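One common way to get random access into compressed data is to compress it in independent fixed-size blocks and keep a small table of block offsets, which is the same general idea behind block-compressed formats such as BGZF. The sketch below assumes everything fits in memory and uses Python's standard `zlib`; the block size and function names are arbitrary choices for illustration.

```python
import zlib

BLOCK = 64 * 1024  # uncompressed bytes per block (arbitrary choice)


def compress_blocks(data, block_size=BLOCK):
    """Compress data in independent blocks. Returns (blob, offsets), where
    offsets[i] is (start_in_blob, compressed_length) of block i."""
    blob = bytearray()
    offsets = []
    for i in range(0, len(data), block_size):
        comp = zlib.compress(data[i:i + block_size])
        offsets.append((len(blob), len(comp)))
        blob += comp
    return bytes(blob), offsets


def read_range(blob, offsets, start, end, block_size=BLOCK):
    """Return uncompressed bytes [start, end), decompressing only the
    blocks that overlap the requested range."""
    if start >= end:
        return b""
    first = start // block_size
    last = min((end - 1) // block_size, len(offsets) - 1)
    parts = []
    for block in range(first, last + 1):
        off, size = offsets[block]
        parts.append(zlib.decompress(blob[off:off + size]))
    joined = b"".join(parts)
    skip = start - first * block_size
    return joined[skip:skip + (end - start)]


data = b"ACGT" * 100000  # made-up, highly repetitive test data
blob, offsets = compress_blocks(data)
piece = read_range(blob, offsets, 65530, 65550)  # crosses a block boundary
print(len(data), "->", len(blob), "bytes;", piece)
```

Smaller blocks mean less wasted decompression per lookup but a worse compression ratio, so the block size is a trade-off.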
All times are GMT -8.

