
**steven** · 02-03-2010, 03:44 AM

Hi, thanks!
Does "cross-platform" mean that it works for Windows, Mac & Linux?

**Fabien Campagne** · 02-03-2010, 05:35 AM

I meant cross-platform more as working across Illumina, SOLID, Helicos and the Roche 454 platforms, but Goby also works on multiple computer platforms (Windows, Mac, Linux and any computer with a Java virtual machine).

**steven** · 02-03-2010, 06:44 AM

Originally posted by Fabien Campagne:

I meant cross-platform more as working across Illumina, SOLID, Helicos and the Roche 454 platforms, but BDVal also works on multiple computer platforms (Windows, Mac, Linux and any computer with a Java virtual machine).

err.. i beg your pardon, but does that mean that *Goby* does work on these computer platforms (i am sorry to insist but i don't even know what "BDVal" is..)?
I just would like to know if i can make it run on the laptop i have under windows xp before downloading, trying to install, etc. i had a quick look at your web site but i did not see this info.
cheers,
s.

**Fabien Campagne** · 02-03-2010, 06:58 AM

Sorry for the typo (now corrected in my previous post). The answer to your question is yes: Goby is implemented in Java and works on Windows XP without recompilation, if you have Java installed.
Secondary analyses can be done on laptops, but you will probably need a server or two to run alignments (we use bwa or last in the background for these steps).

**steven** · 02-03-2010, 07:12 AM

Great! thanks a lot. for sure i'll play with this..
cheers,
s.

**krobison** · 02-03-2010, 11:55 AM

Could you detail why you decided to develop your own formats?

**Fabien Campagne** · 02-03-2010, 01:51 PM

Good question. It was not an easy decision. We first tried really hard to work with the file formats that other groups had developed. For instance, earlier versions of our framework worked with the binary MAQ format. We had several problems with existing formats:

1. Most formats are not chunkable. We have experimented with Hadoop/MapReduce to perform massive alignments in parallel. Hadoop works best when an input (e.g., a read file) can be split into chunks at arbitrary points. You cannot do that with gzipped FASTA/FASTQ. The formats we developed are both chunkable and compressed. We don't use Hadoop anymore, but we still rely on chunkability to split large read files so they can be aligned in parallel on a cluster of machines.

2. Many formats use space even when an element of information is not present in the input file (i.e., they pad with empty values, the way the C language stores structures in memory). We leverage Google Protocol Buffers to store only the elements of data needed for a specific application. Developers get to decide how much to store, not the framework.

3. Some formats are text-based. That's a no-no if you design for performance. Again, Protocol Buffers help here: parsing is much faster than with the best XML parsers, and the format is more compact (see this good article about Protocol Buffers if you are curious: http://google-opensource.blogspot.co...gles-data.html).

4. A specific problem we encountered with MAQ (which I believe still exists in SAM) was that reads were identified with strings. Most applications do not need read identifiers. A read index (an integer) is sufficient to know that two alignments are from the same read. If you look closely at programs built for information retrieval, which deal with gigabytes of text documents, a notable design consideration is to avoid strings and to encode words/terms as integers as early as possible. We do this in the compact read format. This simple design decision saves quite a bit of CPU and space for many applications.
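
A minimal sketch of point 4, assuming nothing about Goby's actual implementation: each distinct read name is replaced by a small integer on first sight, and alignments then carry only that integer. The function name, read names, and alignment tuples below are made up for illustration.

```python
def index_reads(read_names):
    """Map each distinct read name to an integer, in order of first appearance."""
    index_of = {}
    for name in read_names:
        # setdefault only stores a new index when the name has not been seen yet
        index_of.setdefault(name, len(index_of))
    return index_of


# Hypothetical alignments: (read name, target sequence, position)
alignments = [
    ("read_a", "chr1", 100),
    ("read_b", "chr1", 250),
    ("read_a", "chr2", 975),
]

index_of = index_reads(name for name, _, _ in alignments)

# Alignments now carry an integer read index instead of a string identifier
compact = [(index_of[name], target, pos) for name, target, pos in alignments]

# Two alignments come from the same read exactly when their indices are equal
same_read = compact[0][0] == compact[2][0]
```

Comparing two integers is enough to tell that the first and third alignments belong to the same read; the strings are only needed if an application wants to report the original names.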

These are just a few elements to answer your question. I hope they explain some of the reasons why we thought we needed new "next-gen" formats. We'll write more about this in a forthcoming manuscript if there is enough interest.
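
Point 1 can be illustrated with a small sketch. This is not Goby's compact format (which, per the post, builds on Protocol Buffers); it only assumes that each chunk is compressed independently and that an index of byte offsets is kept, so a worker can decode any chunk without reading the ones before it. The function names and the tiny in-memory reads are made up for illustration.

```python
import gzip
import io


def write_chunked(records, chunk_size):
    """Compress records as independent gzip members; return (blob, index)."""
    blob = io.BytesIO()
    index = []  # one (byte offset, byte length, record count) entry per chunk
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        payload = gzip.compress("\n".join(chunk).encode("ascii"))
        index.append((blob.tell(), len(payload), len(chunk)))
        blob.write(payload)
    return blob.getvalue(), index


def read_chunk(blob, index, i):
    """Decode chunk i alone, using only its offset and length from the index."""
    offset, length, _ = index[i]
    data = gzip.decompress(blob[offset:offset + length])
    return data.decode("ascii").split("\n")


reads = ["ACGT", "TTGA", "GGCC", "ATAT", "CCGG"]
blob, index = write_chunked(reads, chunk_size=2)

# A worker assigned chunk 1 never touches chunk 0
middle = read_chunk(blob, index, 1)
```

With a single gzip stream over the whole file, by contrast, a reader generally has to decompress from the beginning to reach a given record, which is why gzipped FASTA/FASTQ does not split well.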

**krobison** · 02-03-2010, 05:15 PM

Given all the issues you have covered, I would strongly urge you to write this up for publication. Feel free to suggest me as a referee :-)

**alex72** · 03-19-2010, 07:30 PM

Failure in the annotation count

Goby works perfectly up to the alignment step, but it fails in the alignment-to-annotation-counts mode. It generates .tsv files that contain only a header and no contents. Here is my command line:
java -Xmx3g -jar goby.jar --mode alignment-to-annotation-counts sample.entries --annotation biomart_human_exon_ensembl_GRCh37.txt --include-annotation-types gene
Any idea? Thanks.

**Fabien Campagne** · 03-20-2010, 07:04 AM

Hello Alex,

Thanks for the feedback, and sorry you are experiencing this problem. A few things could be going on. I would check the following:

1. Make sure you are using the latest Goby version. The latest released version is 1.4. Since version 1.3, you can check the version number of your distribution with:
java -jar goby.jar -m version

Please let us know the result of this command; it will help us replicate the problem you are experiencing with the exact same code.

2. We provide a few annotation files in the data directory. The annotation file you list does not seem to be part of the distribution (unless you renamed it). If you created the file yourself, could you please provide a sample so that we can help you check the format?

The expected format is a tab delimited file with the fields:

| Chromosome Name | Strand | Ensembl Gene ID | Ensembl Exon ID | Exon Chr Start (bp) | Exon Chr End (bp) |
|---|---|---|---|---|---|
| 7 | 1 | ENSG00000208234 | ENSE00001500505 | 157956020 | 157956126 |
| 17 | -1 | ENSG00000199674 | ENSE00001437567 | 15981807 | 15981911 |
| 9 | 1 | ENSG00000221622 | ENSE00001565330 | 134884755 | 134884842 |

(see http://icb.med.cornell.edu/wiki/index.php/Goby/DE for a formatted version.)
The file can be built from BioMart, but we also provide a few reference annotation files in the data directory of the distribution:

16M Jan 11 14:56 data/biomart_human_exon_esmbl52genes_NCBI36.txt
25M Jan 11 14:56 data/biomart-mouse-exons-ensembl55-genes-NCBIM37.txt

Make sure the file you generated follows this format.

3. The mode alignment-to-annotation-counts assumes that you have aligned to a genome and that the sequence names in this genome encode chromosome names that match the annotation file. This is necessary to map gene and exons to the correct genomic locations.
Beyond making sure the genome has sequence identifiers that match chromosome names, you should make sure that

3a. you included sequence identifiers in the compact file you generated from the reference/genome file
You can do this with the -x or --include-identifiers option, as shown in the demo on the home page.
java -Xmx3g -jar goby.jar --mode fasta-to-compact --include-identifiers data/reference/mm9/chr1.fa.gz

You can check if the compact reference file you generated includes these identifiers with the command:

java -Xmx3g -jar goby.jar --mode compact-file-stat <your-ref.compact-reads>

3b. The reference identifiers must be carried over to the alignment files (.entries, .header). This should happen transparently when you use the mode "align", but if you create SAM files separately and convert them to compact format (sam-to-compact mode), some options need to be present for the transfer to work.

You can verify if the alignment includes identifiers with the command:

java -Xmx3g -jar goby.jar --mode compact-file-stat sample.entries

The output should look something like this (I am using a development version of Goby, the future 1.5, so there will be some differences):
INFO GobyDriver - edu.cornell.med.icb.goby.modes.GobyDriver Implementation-Version: development (20100314115136)
Compact Alignment basename = goby-sample
Info from header:
Number of query sequences = 1,000,000
Number of target sequences = 1
Has query identifiers = true
Has target identifiers = true
num query indices= 999,956
num target indices= 1
Number of alignment entries = 75,118
Percent matched = 7.5%
Avg query alignment length = 43
Avg score alignment = 43.719162
Avg number of variations per query sequence = 0.02
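
Checks 2 and 3 can also be approximated outside Goby. The sketch below is not part of Goby: it assumes the six tab-delimited columns listed above, a single header line, strand values of 1 or -1, and start <= end (all inferred from the sample rows). The reference identifiers `chr7` and `chr17` are hypothetical; in practice you would supply the sequence names of your own reference.

```python
import csv
import io

EXPECTED_COLUMNS = 6  # assumption: the six fields listed in the post


def check_annotation(handle, reference_ids):
    """Return (format problems, chromosome names absent from the reference)."""
    problems = []
    chromosomes = set()
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    for line_number, fields in enumerate(reader, start=2):
        if len(fields) != EXPECTED_COLUMNS:
            problems.append(f"line {line_number}: found {len(fields)} fields")
            continue
        chromosome, strand, _gene, _exon, start, end = fields
        if strand not in ("1", "-1"):
            problems.append(f"line {line_number}: unexpected strand {strand!r}")
        if not (start.isdigit() and end.isdigit() and int(start) <= int(end)):
            problems.append(f"line {line_number}: bad coordinates")
        chromosomes.add(chromosome)
    return problems, sorted(chromosomes - set(reference_ids))


sample = io.StringIO(
    "Chromosome Name\tStrand\tEnsembl Gene ID\tEnsembl Exon ID\t"
    "Exon Chr Start (bp)\tExon Chr End (bp)\n"
    "7\t1\tENSG00000208234\tENSE00001500505\t157956020\t157956126\n"
    "17\t-1\tENSG00000199674\tENSE00001437567\t15981807\t15981911\n"
)

# Hypothetical reference whose sequence names carry a "chr" prefix
problems, missing = check_annotation(sample, ["chr7", "chr17"])
```

Here the file is well formed (`problems` is empty), but `missing` lists both chromosome names, because `7` and `17` do not match `chr7` and `chr17`. A non-empty `missing` list is the kind of mismatch check 3 is meant to catch.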

If none of this works, could you please submit the log of the commands you entered up to the point where the error occurs, the annotation file, and, if it is small, the reference sequence? We use Goby extensively in-house and we'll be happy to help you troubleshoot this problem further.

**alex72** · 03-20-2010, 01:28 PM

Goby trouble shooting

1. Here is the version info.
INFO GobyDriver - edu.cornell.med.icb.goby.modes.GobyDriver Implementation-Version: release (goby_1.4)
2. I downloaded the newest human genome annotation file from BioMart. It looks like this:

| Chromosome Name | Strand | Ensembl Gene ID | Ensembl Exon ID | Exon Chr Start (bp) | Exon Chr End (bp) |
|---|---|---|---|---|---|
| GL000239.1 | -1 | ENSG00000241154 | ENSE00001869420 | 9385 | 9733 |
| GL000239.1 | -1 | ENSG00000241154 | ENSE00001913487 | 8170 | 8195 |
| GL000214.1 | -1 | ENSG00000215525 | ENSE00001647296 | 71373 | 71720 |
| GL000214.1 | -1 | ENSG00000215525 | ENSE00001806433 | 71272 | 71370 |
| GL000214.1 | -1 | ENSG00000215525 | ENSE00001746930 | 69685 | 69834 |
| GL000214.1 | -1 | ENSG00000215525 | ENSE00001676160 | 53527 | 53808 |

3. Reference compact info
has identifiers = true (93)
has descriptions = false (0)
has sequences = true (93)
Number of entries = 93
Min read length = 4,262
Max read length = 249,250,621
Avg read length = 33,732,916
Read length quantiles = [ 4,262.000000 ]
4. Reads compact info
Number of query sequences = 18532085
Number of target sequences = 93
has query identifiers = true
has target identifiers = true
num query indices= 18532085
num target indices= 91
Number of alignment entries = 12284703
Percent matched = 66%
Avg query alignment length = 34
Avg score alignment = 34.701973
5. Commands log
java -Xmx3g -jar goby.jar --mode fasta-to-compact *.fq &
java -Xmx20g -jar goby.jar --mode fasta-to-compact --sequence-per-chunk 1 --include-identifiers hg19.fa &
java -Xmx20g -jar goby.jar --mode align --aligner bwa --index --database-name hg19-index --reference hg19.compact-reads --database-directory reference --options t=16 &
java -Xmx20g -jar goby.jar --mode align --aligner bwa --search --database-name hg19-index --reference hg19.compact-reads --database-directory reference --reads A.compact-reads --basename A --options t=16 &
java -Xmx20g -jar goby.jar --mode alignment-to-annotation-counts *.entries --annotation *GRCh37.txt --include-annotation-types gene &
(I removed all the paths above to keep it short.)
It failed at the last command. I tried the test files (mm-chr1 & annot & read) in the goby package, but they failed too at the same step.

Thanks for your support.

**Fabien Campagne** · 03-20-2010, 08:32 PM

Thanks for the detailed log. I was able to reproduce the problem with Goby version 1.4 and the files we distribute as examples. The problem is caused by an issue we fixed after 1.4.

You can work around this issue by inserting the string "chr" in front of the chromosome id in the annotation file. The following awk script does the trick:

awk '{print "chr"$0} ' data/biomart-mouse-exons-ensembl55-genes-NCBIM37.txt >data/biomart-mouse-exons-ensembl55-genes-NCBIM37-chr-fix.txt
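
Note that the awk one-liner prefixes every line, including the header line. The thread does not say whether Goby 1.4 cares about that; the awk command is the tested work-around. As an untested variant, here is a sketch that leaves the first line unchanged and skips lines that already start with "chr". The function name is made up for illustration.

```python
def add_chr_prefix(lines):
    """Prefix data lines with 'chr'; pass the header line through unchanged."""
    for number, line in enumerate(lines):
        if number == 0 or line.startswith("chr"):
            # header line, or a line that already carries the prefix
            yield line
        else:
            yield "chr" + line


example = [
    "Chromosome Name\tStrand\n",
    "7\t1\n",
    "chr9\t1\n",
]
fixed = list(add_chr_prefix(example))
```

To apply it to a file, open the source and destination and call `destination.writelines(add_chr_prefix(source))`.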

java -Xmx3g -jar goby.jar --mode alignment-to-annotation-counts goby-sample.entries --annotation data/biomart-mouse-exons-ensembl55-genes-NCBIM37-chr-fix.txt --include-annotation-types gene

This command should then result in a file such as:

head goby-sample.ann-counts.tsv

| basename | main-id | secondary-id | type | chro | strand | length | start | end | in-count | over-count | RPKM | log2(RPKM+1) | expression | num-exons |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| goby-sample | ENSMUSG00000073741 | | gene | chr1 | -1 | 681 | 6204693 | 6205373 | 2 | 2 | 39.0966 | 5.32541 | 2 | 1 |
| goby-sample | ENSMUSG00000047021 | | gene | chr1 | -1 | 33520 | 74948654 | 74982173 | 3 | 3 | 5.50402 | 2.70133 | 1 | 41 |
| goby-sample | ENSMUSG00000050625 | | gene | chr1 | -1 | 390 | 183440545 | 183440934 | 0 | 0 | 0.00000 | 0.00000 | 0 | 1 |
| goby-sample | ENSMUSG00000064612 | | gene | chr1 | 1 | 78 | 63225251 | 63225328 | 0 | 0 | 0.00000 | 0.00000 | 0 | 1 |
| goby-sample | ENSMUSG00000049690 | | gene | chr1 | -1 | 916996 | 127810214 | 128727209 | 33 | 33 | 30.0156 | 4.95492 | 4 | 34 |
| goby-sample | ENSMUSG00000047053 | | gene | chr1 | 1 | 1267 | 155738922 | 155740188 | 0 | 0 | 0.00000 | 0.00000 | 0 | 1 |
| goby-sample | ENSMUSG00000047067 | | gene | chr1 | 1 | 1440 | 94803566 | 94805005 | 5 | 5 | 51.2409 | 5.70711 | 5 | 2 |
| goby-sample | ENSMUSG00000047539 | | gene | chr1 | -1 | 28505 | 184243233 | 184271737 | 47 | 47 | 127.352 | 7.00397 | 39 | 5 |
| goby-sample | ENSMUSG00000025774 | | gene | chr1 | -1 | 30712 | 18105272 | 18135983 | 0 | 0 | 0.00000 | 0.00000 | 0 | 32 |
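
For readers unfamiliar with the RPKM columns, here is a small worked check. It uses the usual definition (reads per kilobase per million aligned reads) and assumes the total is the 75,118 alignment entries that compact-file-stat reported for goby-sample earlier in the thread. That this is the denominator Goby uses is an inference from the numbers agreeing, not something stated in the thread.

```python
import math


def rpkm(read_count, length_bp, total_aligned):
    """Reads per kilobase of feature per million aligned reads."""
    return read_count * 1e9 / (length_bp * total_aligned)


# First row of the output: 2 reads over a single-exon gene of length 681
value = rpkm(read_count=2, length_bp=681, total_aligned=75118)
log_value = math.log2(value + 1)
```

This reproduces the first row (RPKM 39.0966, log2(RPKM+1) 5.32541). The rows for genes with several exons do not come out the same way from the length column, presumably because a different effective length is used there, so treat this as an illustration of the formula only.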

Please let us know if this work-around does not work with GRCh37 (I tested only NCBIM37). Goby 1.5 will work directly with annotation files as described previously. Sorry for the inconvenience.

**alex72** · 03-21-2010, 11:57 AM

Thanks. Yes, it solves the problem. But I ran into another problem in the next analysis, the statistical tests. Here are my command and the error messages.
java -Xmx20g -jar goby.jar --mode alignment-to-annotation-counts *.entries --annotation GRCh37.txt --include-annotation-types gene --compare A/B --groups A=A1,A2,A3/B=B1,B2,B3 --stats stats.tsv

ERROR ChiSquareTestCalculator - elementId:ENSG00000196262
ERROR ChiSquareTestCalculator - expected:[10896.216554066976, 10472.783445933024]
ERROR ChiSquareTestCalculator - observed:[7516, 13853]
ERROR ChiSquareTestCalculator - org.apache.commons.math.MaxIterationsExceededException: Maximal number of iterations (2,147,483,647) exceeded
java.lang.ArrayIndexOutOfBoundsException: -1
at it.unimi.dsi.fastutil.doubles.DoubleArrayList.getDouble(DoubleArrayList.java:231)
at it.unimi.dsi.fastutil.doubles.AbstractDoubleList.get(AbstractDoubleList.java:403)
at edu.cornell.med.icb.goby.stats.FDRAdjustment.getListSize(FDRAdjustment.java:41)
at edu.cornell.med.icb.goby.stats.BonferroniAdjustment.adjust(BonferroniAdjustment.java:41)
at edu.cornell.med.icb.goby.stats.FDRAdjustment.adjust(FDRAdjustment.java:32)
at edu.cornell.med.icb.goby.modes.CompactAlignmentToAnnotationCountsMode.execute(CompactAlignmentToAnnotationCountsMode.java:320)
at edu.cornell.med.icb.goby.modes.GenericToolsDriver.execute(GenericToolsDriver.java:151)
at edu.cornell.med.icb.goby.modes.GobyDriver.main(GobyDriver.java:53)
INFO GobyRengine - Shutdown hook is terminating R

Also, the result file stats.tsv includes only a header and no contents.

**Fabien Campagne** · 03-21-2010, 03:16 PM

The ERROR log from ChiSquareTestCalculator is not likely to be the problem. We use Apache Commons Math and this is a known issue with the version distributed in Goby 1.4. We have observed this error as well and it will result in some chi-square p-values being set to NaN in the output. We are testing a new version of the Apache Commons Math jar that has a fix for this (see http://issues.apache.org/jira/browse/MATH-301 for details).
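
To put the logged numbers in perspective, the chi-square statistic for that element can be recomputed by hand. This sketch assumes a plain goodness-of-fit test over the two logged categories, hence one degree of freedom; it does not use Commons Math and makes no claim about what triggers the iteration limit there.

```python
import math


def chi_square_statistic(expected, observed):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for e, o in zip(expected, observed))


# Values copied from the ERROR lines for ENSG00000196262
expected = [10896.216554066976, 10472.783445933024]
observed = [7516, 13853]

stat = chi_square_statistic(expected, observed)

# With one degree of freedom, the upper tail of the chi-square distribution
# is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(stat / 2))
```

The statistic comes out at roughly 2,140, so the corresponding p-value is far smaller than anything a double can represent. The difference between the two groups is extreme, whatever value ends up in the output.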

The second exception, ArrayIndexOutOfBoundsException, is what stops processing. Version 1.4 is not very good at checking the command line for errors. For instance, we found an issue (fixed in the development version) where 1.4 will not complain if you name a basename in the --groups argument that you did not provide on the command line as an input basename.

What this means is that if you type:
java -jar goby.jar --mode alignment-to-annotation-counts D.entries E.entries --compare A/B --groups A=A,D/B=B,E
Goby 1.4 will try to process and fail when it tries to find either of the basenames A or B (because input basenames include D.entries and E.entries, but not A.entries or B.entries).

From the command line you provided, I cannot tell if *.entries will match A1.entries, A2.entries, A3.entries, B1.entries, B2.entries, B3.entries. All these inputs are required to exist if you provide --compare A/B with --groups A=A1,A2,A3/B=B1,B2,B3
If you do not have three files in each group, try adjusting the --groups directive to include only the alignments you have for each group.
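
Since 1.4 does not report this itself, the check can be done by hand before launching the run. A minimal sketch, assuming the --groups syntax is exactly what the examples in this thread show (groups separated by "/", members by ",") and that a basename is the input file name without the .entries suffix. The function names are made up and this is not Goby code.

```python
def parse_groups(spec):
    """Parse 'A=A1,A2/B=B1,B2' into {'A': ['A1', 'A2'], 'B': ['B1', 'B2']}."""
    groups = {}
    for part in spec.split("/"):
        name, _, members = part.partition("=")
        groups[name] = members.split(",")
    return groups


def missing_basenames(groups_spec, input_files):
    """List basenames named in --groups that are not among the input files."""
    suffix = ".entries"
    basenames = {
        f[:-len(suffix)] if f.endswith(suffix) else f for f in input_files
    }
    return sorted(
        member
        for members in parse_groups(groups_spec).values()
        for member in members
        if member not in basenames
    )


# The failing example from the post: A and B are named but never provided
missing = missing_basenames("A=A,D/B=B,E", ["D.entries", "E.entries"])
```

For the example above, `missing` contains `A` and `B`. An empty list means every basename named in --groups was also given as an input.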
