  • Goby

    Hello,

    We've recently released a new open-source framework to facilitate the development of efficient next-gen data analyses. The framework is called Goby (we know it's just a small fish in the ocean of next-gen tools being developed lately) and described at



    Goby is a framework to develop new tools, but comes with a few programs ready to use.

    Released applications include:

    1. Generation of wiggle plots.
    2. Generation of counts over arbitrarily defined genomic annotations.
    3. Group comparisons, with Fisher exact test, Chi Square test, T-test and suitable FDR corrections.

    We designed Goby for speed, scalability and cross-platform support. The framework is pure Java, easy on memory and disk, and works on hardware ranging from laptops to Linux clusters. We've tested it with Illumina, SOLiD, Helicos and Roche 454 reads. Did I mention that we optimized algorithms and data structures for speed?

    As usual, feedback, ideas, suggestions are most welcome.

    Fabien

    Fabien Campagne, PhD -- http://campagnelab.org

    Assistant Professor, Dept. of Physiology and Biophysics
    Bioinformatics Officer, Institute for Computational Biomedicine
    Director, Computational Genomics Core Facility
    Associate Director, Biomedical Informatics Core,
    Clinical Translational Science Center

    Weill Medical College of Cornell University
    1305 York Avenue, Box 140
    New York, NY 10021
    phone: (646)-962-5613
    fax: (646)-962-0383
    Last edited by Fabien Campagne; 04-23-2010, 07:02 AM. Reason: Changed project URL to current, added a more descriptive title.

  • #2
    Hi, thanks!
    Does "cross-platform" mean that it works for Windows, Mac & Linux?



    • #3
      I meant cross-platform more as working across Illumina, SOLID, Helicos and the Roche 454 platforms, but Goby also works on multiple computer platforms (Windows, Mac, Linux and any computer with a Java virtual machine).
      Last edited by Fabien Campagne; 02-03-2010, 06:54 AM.



      • #4
        Originally posted by Fabien Campagne View Post
        I meant cross-platform more as working across Illumina, SOLID, Helicos and the Roche 454 platforms, but BDVal also works on multiple computer platforms (Windows, Mac, Linux and any computer with a Java virtual machine).
        Err... I beg your pardon, but does that mean that *Goby* works on these computer platforms? (I am sorry to insist, but I don't even know what "BDVal" is.)
        I would just like to know whether I can run it on my Windows XP laptop before downloading, trying to install, etc. I had a quick look at your web site but did not see this info.
        cheers,
        s.



        • #5
          Sorry for the typo (now corrected in my previous post). The answer to your question is yes: Goby is implemented in Java and works on Windows XP without recompilation, if you have Java installed.
          Secondary analyses can be done on laptops, but you will probably need a server or two to run alignments (we use bwa or last in the background for these steps).



          • #6
            Great! thanks a lot. for sure i'll play with this..
            cheers,
            s.



            • #7
              Could you detail why you decided to develop your own formats?



              • #8
                Good question. It was not an easy decision. We first tried really hard to work with the file formats that other groups had developed. For instance, earlier versions of our framework worked with the binary MAQ format. We had several problems with existing formats:

                1. Most formats are not chunkable. We have experimented with Hadoop/MapReduce to perform massive alignments in parallel. Hadoop works best when an input (e.g., a read file) can be split into chunks arbitrarily. You cannot do that with gzipped FASTA/FASTQ. The formats we developed are chunkable and compressed. We don't use Hadoop anymore, but we still rely on chunkability to split large read files and align the pieces in parallel on a cluster of machines.

                2. Many formats use space even when an element of information is not present in the input file (that is, they pad with empty values, the way the C language stores structures in memory). We leverage Google Protocol Buffers to store only the elements of data we need for a specific application. Developers get to decide how much to store, not the framework.

                3. Some formats are text-based. That's a no-no if you design for performance. Again, Protocol Buffers help here: parsing is much faster than with the best XML parsers, and the format is more compact (see this good article about Protocol Buffers if you are curious: http://google-opensource.blogspot.co...gles-data.html).

                4. A specific problem we encountered with MAQ (which I believe still exists in SAM) was that reads were identified with strings. Most applications do not need read identifiers. A read index (integer) is sufficient to know that two alignments are from the same read. If you look closely at programs built for information retrieval, which deal with gigabytes of text documents, a notable design consideration is to avoid strings and to encode words/terms as integers as early as possible. We do this in the compact read format. This simple design decision saves quite a bit of CPU and space for many applications.
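
                The integer-coding idea in point 4 can be illustrated with a minimal sketch. This is not Goby's code: the `ReadIndexer` class and the read identifiers below are hypothetical, chosen only to show how two alignments of the same read end up sharing one small integer.

```python
class ReadIndexer:
    """Assigns each distinct read identifier an integer, in order of first use."""

    def __init__(self):
        self._index_of = {}

    def index(self, read_id):
        # setdefault stores the next free index only when read_id is new;
        # otherwise it returns the index assigned earlier.
        return self._index_of.setdefault(read_id, len(self._index_of))

    def __len__(self):
        return len(self._index_of)


indexer = ReadIndexer()

# Alignments as (read identifier, target, position). Identifiers are made up.
alignments = [
    ("HWI-EAS:1:1:10:200", "chr1", 1000),
    ("HWI-EAS:1:1:11:305", "chr1", 5000),
    ("HWI-EAS:1:1:10:200", "chr2", 7000),
]

# After coding, the string is no longer needed to tell reads apart.
coded = [(indexer.index(read_id), target, pos) for read_id, target, pos in alignments]
```

                Comparing two integers is cheaper than comparing two strings, and a fixed-width or variable-length integer usually takes less space than the identifier it replaces.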

                These are just a few elements to answer your question. I hope they explain some of the reasons why we thought we needed new "next-gen" formats. We'll write more about this in a forthcoming manuscript if there is enough interest.
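
                The chunkability argument in point 1 can also be sketched in a few lines. The layout below (a 4-byte length prefix followed by an independently gzipped payload) is an assumption made for illustration and is not Goby's actual on-disk format. The point is only that each chunk can be located and decompressed without reading the others, which a single gzip stream does not allow.

```python
import gzip
import struct


def write_chunks(records, chunk_size):
    """Pack records into independently compressed, length-prefixed chunks."""
    out = bytearray()
    for start in range(0, len(records), chunk_size):
        text = "\n".join(records[start:start + chunk_size])
        payload = gzip.compress(text.encode())
        out += struct.pack(">I", len(payload)) + payload
    return bytes(out)


def chunk_offsets(blob):
    """Walk the length prefixes only, returning (payload offset, size) pairs."""
    offsets, pos = [], 0
    while pos < len(blob):
        (size,) = struct.unpack_from(">I", blob, pos)
        offsets.append((pos + 4, size))
        pos += 4 + size
    return offsets


def read_chunk(blob, offset, size):
    """Decompress one chunk without touching any other chunk."""
    return gzip.decompress(blob[offset:offset + size]).decode().split("\n")


reads = [f"read{i}\tACGT" for i in range(10)]
blob = write_chunks(reads, chunk_size=4)
offsets = chunk_offsets(blob)

# A worker assigned the third chunk reads just that slice of the file.
third = read_chunk(blob, *offsets[2])
```

                Each entry of `offsets` could be handed to a different cluster node, which is the kind of splitting the post describes.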



                • #9
                  Given all the issues you have covered, I would strongly urge you to write this up for publication. Feel free to suggest me as a referee :-)



                  • #10
                    Failure in the annotation count

                    Goby works perfectly up to the alignment step, but it fails in the alignment-to-annotation-counts mode: it generates .tsv files that contain only a header and no content. Here is my command line:
                    java -Xmx3g -jar goby.jar --mode alignment-to-annotation-counts sample.entries --annotation biomart_human_exon_ensembl_GRCh37.txt --include-annotation-types gene
                    Any idea? Thanks.



                    • #11
                      Hello Alex,

                      Thanks for the feedback, sorry you are experiencing this problem. A few things could be going wrong. I would check:

                      1. Make sure you are using the latest Goby version. Latest released version is 1.4. Since version 1.3, you can check the version number of your distribution with
                      java -jar goby.jar -m version

                      Please let us know the result of this command, it will help us replicate the problem you are experiencing with the exact same code.

                      2. We provide a few annotation files in the data directory. The annotation file you list does not seem to be part of the distribution (unless you renamed it). If you built the file yourself, could you please provide a sample so that we can help you check the format?

                      The expected format is a tab delimited file with the fields:
                      Chromosome Name   Strand   Ensembl Gene ID   Ensembl Exon ID   Exon Chr Start (bp)   Exon Chr End (bp)
                      7                 1        ENSG00000208234   ENSE00001500505   157956020             157956126
                      17                -1       ENSG00000199674   ENSE00001437567   15981807              15981911
                      9                 1        ENSG00000221622   ENSE00001565330   134884755             134884842

                      (see http://icb.med.cornell.edu/wiki/index.php/Goby/DE for a formatted version.)
                      The file can be built from biomart, but we also provide a few reference annotation files in the data directory of the distribution:

                      16M Jan 11 14:56 data/biomart_human_exon_esmbl52genes_NCBI36.txt
                      25M Jan 11 14:56 data/biomart-mouse-exons-ensembl55-genes-NCBIM37.txt

                      Make sure the file you generated follows this format.

                      3. The mode alignment-to-annotation-counts assumes that you have aligned to a genome and that the sequence names in this genome encode chromosome names that match the annotation file. This is necessary to map gene and exons to the correct genomic locations.
                      Beyond making sure the genome has sequence identifiers that match chromosome names, you should make sure that

                      3a. you included sequence identifiers in the compact file you generated from the reference/genome file
                      You can do this with the -x or --include-identifiers option, as shown in the demo on the home page.
                      java -Xmx3g -jar goby.jar --mode fasta-to-compact --include-identifiers data/reference/mm9/chr1.fa.gz

                      You can check if the compact reference file you generated includes these identifiers with the command:

                      java -Xmx3g -jar goby.jar --mode compact-file-stat <your-ref.compact-reads>
                      3b. The reference identifiers must be carried over to the alignment files (.entries, .header). This should happen transparently when you use the mode "align", but if you create SAM files separately and convert them to compact format (sam-to-compact mode), some options need to be present for the transfer to work.

                      You can verify if the alignment includes identifiers with the command:

                      java -Xmx3g -jar goby.jar --mode compact-file-stat sample.entries
                      The output should look something like this (I am using a development version of Goby, the future 1.5, so there will be some differences):
                      INFO GobyDriver - edu.cornell.med.icb.goby.modes.GobyDriver Implementation-Version: development (20100314115136)
                      Compact Alignment basename = goby-sample
                      Info from header:
                      Number of query sequences = 1,000,000
                      Number of target sequences = 1
                      Has query identifiers = true
                      Has target identifiers = true
                      num query indices= 999,956
                      num target indices= 1
                      Number of alignment entries = 75,118
                      Percent matched = 7.5%
                      Avg query alignment length = 43
                      Avg score alignment = 43.719162
                      Avg number of variations per query sequence = 0.02

                      If none of this works, could you please submit the log of the commands you entered up to the point where the error occurs, the annotation file, and if it is small, the reference sequence. We use Goby extensively in-house and we'll be happy to help you troubleshoot this problem further.
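
                      Checks 2 and 3 above can be automated before running the counts mode. The sketch below is an illustration rather than part of Goby: the function names are hypothetical, the strand and coordinate rules are assumptions inferred from the sample rows, and the reference identifiers are made up to show a mismatch.

```python
EXPECTED_FIELDS = 6  # chromosome, strand, gene ID, exon ID, exon start, exon end


def check_annotation(lines):
    """Return (chromosome names, problems) for a header line plus data lines."""
    chromosomes, problems = set(), []
    for number, line in enumerate(lines[1:], start=2):  # line 1 is the header
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != EXPECTED_FIELDS:
            problems.append(f"line {number}: {len(fields)} fields, expected {EXPECTED_FIELDS}")
            continue
        chromosome, strand, _gene_id, _exon_id, start, end = fields
        if strand not in ("1", "-1"):  # assumption: only these two values are valid
            problems.append(f"line {number}: unexpected strand {strand!r}")
        if not (start.isdigit() and end.isdigit() and int(start) <= int(end)):
            problems.append(f"line {number}: bad coordinates {start!r}, {end!r}")
        chromosomes.add(chromosome)
    return chromosomes, problems


def unmatched_chromosomes(annotation_chromosomes, reference_ids):
    """Chromosome names used by the annotation but absent from the reference."""
    return sorted(set(annotation_chromosomes) - set(reference_ids))


annotation = [
    "Chromosome Name\tStrand\tEnsembl Gene ID\tEnsembl Exon ID\tExon Chr Start (bp)\tExon Chr End (bp)",
    "7\t1\tENSG00000208234\tENSE00001500505\t157956020\t157956126",
    "17\t-1\tENSG00000199674\tENSE00001437567\t15981807\t15981911",
    "9\t1\tENSG00000221622\tENSE00001565330\t134884755\t134884842",
]
chromosomes, problems = check_annotation(annotation)

# Hypothetical reference identifiers, as a UCSC-style FASTA might name them.
unmatched = unmatched_chromosomes(chromosomes, ["chr7", "chr9", "chr17"])
```

                      A non-empty `unmatched` list means the annotation and the reference name their sequences differently, which is the situation point 3 warns about.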



                      • #12
                        Goby troubleshooting

                        1. Here is the version info.
                        INFO GobyDriver - edu.cornell.med.icb.goby.modes.GobyDriver Implementation-Version: release (goby_1.4)
                        2. I downloaded the newest human genome annotation file from BioMart. It looks like this:
                        Chromosome Name   Strand   Ensembl Gene ID   Ensembl Exon ID   Exon Chr Start (bp)   Exon Chr End (bp)
                        GL000239.1        -1       ENSG00000241154   ENSE00001869420   9385                  9733
                        GL000239.1        -1       ENSG00000241154   ENSE00001913487   8170                  8195
                        GL000214.1        -1       ENSG00000215525   ENSE00001647296   71373                 71720
                        GL000214.1        -1       ENSG00000215525   ENSE00001806433   71272                 71370
                        GL000214.1        -1       ENSG00000215525   ENSE00001746930   69685                 69834
                        GL000214.1        -1       ENSG00000215525   ENSE00001676160   53527                 53808
                        3. Reference compact info
                        has identifiers = true (93)
                        has descriptions = false (0)
                        has sequences = true (93)
                        Number of entries = 93
                        Min read length = 4,262
                        Max read length = 249,250,621
                        Avg read length = 33,732,916
                        Read length quantiles = [ 4,262.000000 ]
                        4. Reads compact info
                        Number of query sequences = 18532085
                        Number of target sequences = 93
                        has query identifiers = true
                        has target identifiers = true
                        num query indices= 18532085
                        num target indices= 91
                        Number of alignment entries = 12284703
                        Percent matched = 66%
                        Avg query alignment length = 34
                        Avg score alignment = 34.701973
                        5. Commands log
                        java -Xmx3g -jar goby.jar --mode fasta-to-compact *.fq &
                        java -Xmx20g -jar goby.jar --mode fasta-to-compact --sequence-per-chunk 1 --include-identifiers hg19.fa &
                        java -Xmx20g -jar goby.jar --mode align --aligner bwa --index --database-name hg19-index --reference hg19.compact-reads --database-directory reference --options t=16 &
                        java -Xmx20g -jar goby.jar --mode align --aligner bwa --search --database-name hg19-index --reference hg19.compact-reads --database-directory reference --reads A.compact-reads --basename A --options t=16 &
                        java -Xmx20g -jar goby.jar --mode alignment-to-annotation-counts *.entries --annotation *GRCh37.txt --include-annotation-types gene &
                        (I removed all the paths above to keep it short.)
                        It failed at the last command. I also tried the test files (mm-chr1 & annot & read) in the Goby package, and they failed at the same step too.

                        Thanks for your support.



                        • #13
                          Thanks for the detailed log. I was able to reproduce the problem with Goby version 1.4 and the files we distribute as examples. The problem is caused by an issue we fixed after 1.4.

                          You can work around this issue by inserting the string "chr" in front of the chromosome id in the annotation file. The following awk script does the trick:

                          awk '{print "chr"$0} ' data/biomart-mouse-exons-ensembl55-genes-NCBIM37.txt >data/biomart-mouse-exons-ensembl55-genes-NCBIM37-chr-fix.txt

                          java -Xmx3g -jar goby.jar --mode alignment-to-annotation-counts goby-sample.entries --annotation data/biomart-mouse-exons-ensembl55-genes-NCBIM37-chr-fix.txt --include-annotation-types gene


                          This command should then result in a file such as:

                          head goby-sample.ann-counts.tsv
                          basename main-id secondary-id type chro strand length start end in-count over-count RPKM log2(RPKM+1) expression num-exons
                          goby-sample ENSMUSG00000073741 gene chr1 -1 681 6204693 6205373 2 2 39.0966 5.32541 2 1
                          goby-sample ENSMUSG00000047021 gene chr1 -1 33520 74948654 74982173 3 3 5.50402 2.70133 1 41
                          goby-sample ENSMUSG00000050625 gene chr1 -1 390 183440545 183440934 0 0 0.00000 0.00000 0 1
                          goby-sample ENSMUSG00000064612 gene chr1 1 78 63225251 63225328 0 0 0.00000 0.00000 0 1
                          goby-sample ENSMUSG00000049690 gene chr1 -1 916996 127810214 128727209 33 33 30.0156 4.95492 4 34
                          goby-sample ENSMUSG00000047053 gene chr1 1 1267 155738922 155740188 0 0 0.00000 0.00000 0 1
                          goby-sample ENSMUSG00000047067 gene chr1 1 1440 94803566 94805005 5 5 51.2409 5.70711 5 2
                          goby-sample ENSMUSG00000047539 gene chr1 -1 28505 184243233 184271737 47 47 127.352 7.00397 39 5
                          goby-sample ENSMUSG00000025774 gene chr1 -1 30712 18105272 18135983 0 0 0.00000 0.00000 0 32
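
                          As a sanity check on the table above, the RPKM and log2(RPKM+1) values of the first row can be reproduced with the standard reads-per-kilobase-per-million formula. This is a sketch under one assumption: that the normalizing total is the 75,118 alignment entries reported for goby-sample by compact-file-stat earlier in the thread. The `rpkm` function is written here for illustration and is not taken from Goby.

```python
import math


def rpkm(read_count, length_bp, total_aligned):
    """Reads per kilobase of feature per million aligned reads."""
    return read_count * 1e9 / (length_bp * total_aligned)


# First row of the table: over-count 2, length 681 bp, a single exon.
value = rpkm(read_count=2, length_bp=681, total_aligned=75_118)
log_value = math.log2(value + 1)

print(f"{value:.4f} {log_value:.5f}")  # prints "39.0966 5.32541"
```

                          The same arithmetic does not reproduce the multi-exon rows from the length column alone, so a different length (perhaps the summed exon length) is presumably used for those genes.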


                          Please let us know if this work-around does not work with GRCh37 (I tested only NCBIM37). Goby 1.5 will work directly with annotation files as described previously. Sorry for the inconvenience.
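
                          For readers without awk (on Windows XP, for example), the same work-around can be written in Python. This sketch mirrors the awk one-liner above, so it prefixes every line, header included. The function names and the file paths in the usage note are placeholders.

```python
def add_chr_prefix(lines):
    """Prefix every line with 'chr', as awk '{print "chr"$0}' does."""
    return ["chr" + line for line in lines]


def fix_annotation(source_path, fixed_path):
    """Write a copy of the annotation file with 'chr' in front of each line."""
    with open(source_path) as source, open(fixed_path, "w") as fixed:
        fixed.writelines(add_chr_prefix(source))
```

                          Calling `fix_annotation("annotation.txt", "annotation-chr-fix.txt")` would then produce the file to pass to --annotation.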



                          • #14
                            Thanks. Yes, it solves the problem. But I ran into another problem in the next analysis, the statistical tests. Here are my command and the error messages:
                            java -Xmx20g -jar goby.jar --mode alignment-to-annotation-counts *.entries --annotation GRCh37.txt --include-annotation-types gene --compare A/B --groups A=A1,A2,A3/B=B1,B2,B3 --stats stats.tsv

                            ERROR ChiSquareTestCalculator - elementId:ENSG00000196262
                            ERROR ChiSquareTestCalculator - expected:[10896.216554066976, 10472.783445933024]
                            ERROR ChiSquareTestCalculator - observed:[7516, 13853]
                            ERROR ChiSquareTestCalculator - org.apache.commons.math.MaxIterationsExceededException: Maximal number of iterations (2,147,483,647) exceeded
                            java.lang.ArrayIndexOutOfBoundsException: -1
                            at it.unimi.dsi.fastutil.doubles.DoubleArrayList.getDouble(DoubleArrayList.java:231)
                            at it.unimi.dsi.fastutil.doubles.AbstractDoubleList.get(AbstractDoubleList.java:403)
                            at edu.cornell.med.icb.goby.stats.FDRAdjustment.getListSize(FDRAdjustment.java:41)
                            at edu.cornell.med.icb.goby.stats.BonferroniAdjustment.adjust(BonferroniAdjustment.java:41)
                            at edu.cornell.med.icb.goby.stats.FDRAdjustment.adjust(FDRAdjustment.java:32)
                            at edu.cornell.med.icb.goby.modes.CompactAlignmentToAnnotationCountsMode.execute(CompactAlignmentToAnnotationCountsMode.java:320)
                            at edu.cornell.med.icb.goby.modes.GenericToolsDriver.execute(GenericToolsDriver.java:151)
                            at edu.cornell.med.icb.goby.modes.GobyDriver.main(GobyDriver.java:53)
                            INFO GobyRengine - Shutdown hook is terminating R

                            Also, the result file stats.tsv includes only a header and no contents.



                            • #15
                              The ERROR log from ChiSquareTestCalculator is not likely to be the problem. We use Apache Commons Math, and this is a known issue with the version distributed in Goby 1.4. We have observed this error as well, and it results in some chi-square p-values being set to NaN in the output. We are testing a new version of the Apache Commons Math jar that has a fix for this (see http://issues.apache.org/jira/browse/MATH-301 for details).
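
                              For scale, the chi-square statistic implied by the logged observed and expected counts can be recomputed by hand. The sketch below uses only the Python standard library and assumes one degree of freedom (two categories), which is an assumption about how the test is set up and not something the log states.

```python
import math


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_one_df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Values copied from the ERROR lines for ENSG00000196262.
observed = [7516, 13853]
expected = [10896.216554066976, 10472.783445933024]

statistic = chi_square_statistic(observed, expected)
p_value = p_value_one_df(statistic)
```

                              The statistic comes out above 2,000, and the corresponding p-value is too small to represent as a nonzero double. This element is therefore an extreme case for any numerical routine that evaluates the tail probability.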

                              The second exception, ArrayIndexOutOfBoundsException, is what stops processing. Version 1.4 is not very good at checking the command line for errors. For instance, we found an issue (fixed in the development version) where 1.4 will not complain if you name a basename in the --groups argument that you did not provide on the command line as an input basename.

                              What this means is that if you type:
                              java -jar goby.jar --mode alignment-to-annotation-counts D.entries E.entries --compare A/B --groups A=A,D/B=B,E
                              Goby 1.4 will try to process and fail when it tries to find either of the basenames A or B (because input basenames include D.entries and E.entries, but not A.entries or B.entries).

                              From the command line you provided, I cannot tell if *.entries will match A1.entries, A2.entries, A3.entries, B1.entries, B2.entries, B3.entries. All these inputs are required to exist if you provide --compare A/B with --groups A=A1,A2,A3/B=B1,B2,B3
                              If you do not have three files in each group, try adjusting the --groups directive to include only the alignments you have for each group.
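
                              Since 1.4 does not validate this itself, the check can be done before launching Goby. The sketch below is an illustration with hypothetical function names. It assumes a basename is the input file name with its .entries suffix removed, which matches the examples in this thread but is not confirmed as Goby's exact rule.

```python
import os


def parse_groups(groups_arg):
    """Parse 'A=A1,A2/B=B1,B2' into {'A': ['A1', 'A2'], 'B': ['B1', 'B2']}."""
    groups = {}
    for definition in groups_arg.split("/"):
        name, _, members = definition.partition("=")
        groups[name] = members.split(",")
    return groups


def basename_of(path):
    """Assumed rule: file name without directory and without '.entries'."""
    name = os.path.basename(path)
    suffix = ".entries"
    return name[:-len(suffix)] if name.endswith(suffix) else name


def missing_basenames(groups_arg, input_files):
    """List group members that have no matching input basename."""
    available = {basename_of(path) for path in input_files}
    return [
        member
        for members in parse_groups(groups_arg).values()
        for member in members
        if member not in available
    ]


# The failing example from the post: A and B are named but never supplied.
missing = missing_basenames("A=A,D/B=B,E", ["D.entries", "E.entries"])
```

                              An empty list means every basename named in --groups was also given as an input. In a real run, `input_files` could come from `glob.glob("*.entries")` to see what the shell wildcard actually expands to.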
