SEQanswers

Isaac2 genome index creation (http://seqanswers.com/forums/showthread.php?t=61539)

GenoMax 07-21-2015 09:28 AM

Isaac2 genome index creation
 
@semyon: Creating indexes for human genome for isaac2 is not trivial (as I am discovering). Process has been running for 24+ hours on a server and still going.

Making isaac2 indexes available as a download may also be a useful thing for human/mouse genomes.

EDIT: This thread was created by moving several posts that were originally in (http://seqanswers.com/forums/showthread.php?t=61208) but were solely focused on Isaac2 genome index creation issues.

skruglyak 07-21-2015 03:20 PM

Quote:

Originally Posted by GenoMax (Post 177812)
@semyon: Creating indexes for human genome for isaac2 is not trivial (as I am discovering). Process has been running for 24+ hours on a server and still going.

Making isaac2 indexes available as a download may also be a useful thing for human/mouse genomes.

Sorry that it has been a pain. It generally takes us about half a day to generate, but that is hardware dependent. I need to figure out a good place to host the index files. Will try to get human posted in the next few days.

GenoMax 07-21-2015 05:20 PM

Quote:

Originally Posted by skruglyak (Post 177831)
Sorry that it has been a pain. It generally takes us about half a day to generate, but that is hardware dependent. I need to figure out a good place to host the index files. Will try to get human posted in the next few days.

I am not complaining. Just trying to make you aware of how many of us access shared compute resources and how isaac1/2 may come across as a surprise to new users.

Isaac1/2 are written in a way to take over entire hardware from a server/node in question (it is documented, as I remember). Unfortunately this is not something that works well in shared compute environments. It is also hard to make Isaac1/2 behave with job schedulers since the original thread spawns multiple sub-jobs over time.

sklages 07-21-2015 11:47 PM

Quote:

Originally Posted by GenoMax (Post 177835)
Isaac1/2 are written in a way to take over entire hardware from a server/node in question (it is documented, as I remember). Unfortunately this is not something that works well in shared compute environments.

I thought this would be handled by -j in both isaac1/2.
Code:

2015-07-22 08:40:21    [7f15a156c7c0]    Version: iSAAC-02.15.07.16
  -j [ --jobs ] arg (=4)  Maximum number of compute threads to run in parallel


GenoMax 07-22-2015 04:36 AM

Quote:

Originally Posted by sklages (Post 177843)
I thought this would be handled by -j in both isaac1/2.
Code:

2015-07-22 08:40:21    [7f15a156c7c0]    Version: iSAAC-02.15.07.16
  -j [ --jobs ] arg (=4)  Maximum number of compute threads to run in parallel


I will have to verify it again (am just trying to get the index built without overstepping limits set by my job scheduler) but even when I specified -j 1 in an earlier try, the job still spawned as many threads as there were cores on the node.

Have you successfully used the -j option in that manner?

sklages 07-22-2015 04:48 AM

I thought I did for isaac1 .. though I will "re-test" this as well.
I haven't tried isaac2 yet.

Did the format of the index files change from isaac1 to isaac2?
I still have my isaac1 index files ... :-)

GenoMax 07-22-2015 05:05 AM

Quote:

Originally Posted by sklages (Post 177866)
Did the format of the index files change from isaac1 to isaac2?
I still have my isaac1 index files ... :-)

Even if the format hasn't changed I did not build the index last time (used Illumina's). With the assertion that isaac2 can be used for any genome I thought it would be a good idea to get that experience under my belt. Indexing job is still running (2 days, 32 cores, TB RAM). UCSC hg19 from iGenomes.

Illumina must be using all flash storage/newest multi-core xeons etc to get their index built in under a half day.

sklages 07-22-2015 05:22 AM

I started to build a new index for hg19 on 32 cores on a fast (local) RAID.

Using "--jobs 32" on a 48 core machine results in 48 threads. So you are right ...
But this is a bug. Otherwise "--jobs N" does not make any sense.
So let's see how long it takes to build hg19 index ..

UPDATE:
For now I have cancelled building the index.
The average load on that machine rose beyond 52 with peaks over 128 ... I need to investigate first :-)
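
One way to check an observation like this on Linux is to read the `Threads:` field from `/proc/<pid>/status`. The following is only a minimal sketch: the function name `thread_count` and its optional second argument (an alternative proc root, used for testing) are hypothetical and not part of iSAAC. The number it reports is the total thread count of one process, which may include helper threads beyond the compute threads that `--jobs` refers to.

```shell
# thread_count PID [PROC_ROOT]
# Print the number of threads of process PID, as reported by the kernel
# in the "Threads:" line of <PROC_ROOT>/<PID>/status (default root: /proc).
thread_count() {
    awk '/^Threads:/ { print $2 }' "${2:-/proc}/$1/status"
}

# Example (hypothetical PID):
#   thread_count 12345
```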

GenoMax 07-22-2015 05:47 AM

Quote:

Originally Posted by sklages (Post 177873)
Using "--jobs 32" on a 48 core machine results in 48 threads. So you are right ...
But this is a bug. Otherwise "--jobs N" does not make any sense.
So let's see how long it takes to build hg19 index ..

HiSeq Analysis Software (HAS), which isaac was a part of, always did this. It seems to pay no attention to the -j directive (as I said yesterday, HAS documentation does say that it will take over the node).

I suggest watching I/O on your RAID (especially if it is shared with some other users/nodes). HAS/Isaac can do some interesting things to storage too.

EDIT: Just saw your update. That kind of load is "normal". It is only periodic (and partly related to storage). Isaac will also not use all the cores all the time so that part is "normal" too. :D

sklages 07-22-2015 05:57 AM

Quote:

Originally Posted by GenoMax (Post 177875)
HiSeq Analysis Software (HAS), which isaac was a part of, always did this. It seems to pay no attention to the -j directive (as I said yesterday, HAS documentation does say that it will take over the node).

I suggest watching I/O on your RAID (especially if it is shared with some other users/nodes). HAS/Isaac can do some interesting things to storage too.

EDIT: Just saw your update. That kind of load is "normal". It is only periodic (and partly related to storage). Isaac will also not use all the cores all the time so that part is "normal" too. :D

Ha, .. but as you said .. that makes it completely unusable for cluster environments, if you have no control over cpu usage / machine load. Bad people may call that "error by design". ;-)

I have written a ticket on github as I consider this a bug.

If this behaviour will not be changed I will never be able to test the aligner itself :-)

In the past I never used HAS as it was shipped with a very old version of the aligner ..

GenoMax 07-22-2015 06:10 AM

Quote:

Originally Posted by sklages (Post 177876)
Ha, .. but as you said .. that makes it completely unusable for cluster environments, if you have no control over cpu usage / machine load. Bad people may call that "error by design". ;-)

isaac is not meant to be used across a cluster (just on a single node in the cluster).

Quote:

If this behaviour will not be changed I will never be able to test the aligner itself :-)
You have to be creative. Request exclusive access to a node in your scheduler/limit the I/O. It will also involve conversations with your cluster admins so they don't have a heart attack on seeing those kinds of loads on a single server in the cluster.
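
Where exclusive node access is not available, one generic Linux option is to pin the job to a subset of cores and lower its CPU and I/O priority with `taskset`, `nice` and `ionice`. This is a sketch under assumptions and has not been tested with Isaac in this thread. CPU affinity is inherited by child processes and threads, so the job may still create as many threads as before; they would just share the listed cores. The helper name `confined_cmd` is hypothetical, and it only prints the wrapped command line for review rather than running it.

```shell
# confined_cmd CPULIST CMD [ARGS...]
# Print CMD wrapped so that it is pinned to the cores in CPULIST (taskset),
# runs at the lowest CPU priority (nice 19) and at the lowest best-effort
# I/O priority (ionice class 2, level 7). Display only: arguments
# containing spaces are not re-quoted.
confined_cmd() {
    cpus=$1
    shift
    printf 'taskset -c %s nice -n 19 ionice -c 2 -n 7 %s\n' "$cpus" "$*"
}

# Example: show how an index build could be confined to cores 0-7.
confined_cmd 0-7 isaac-sort-reference --genome-file genome.fa --jobs 1 --output-directory iSAACindex
```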

sklages 07-22-2015 06:38 AM

Quote:

Originally Posted by GenoMax (Post 177877)
isaac is not meant to be used across a cluster (just on a single node in the cluster).

Sure, .. the job will be run on a single node. Nevertheless I need to know roughly what resources my job will use and I should be able to restrict resources as well, even if it runs on a non-cluster server.

Quote:

You have to be creative. Request exclusive access to a node in your scheduler/limit the I/O. It will also involve conversations with your cluster admins so they don't have a heart attack on seeing those kinds of loads on a single server in the cluster.
These are only workarounds ... I do see the problem with the software being designed that way.
IMHO there is no reason to leave the user without control over the resources a piece of software uses ... there is always the argument "speed" and "efficiency" .. maybe. But sometimes it is not only speed that is important ..

Roman mentioned on github that at least the aligner may be restricted to a certain number of CPUs, but is not recommended for the sake of "efficiency of processing". Again, "efficiency" does not always mean "speed of single job". But that's just my 2p ;-)

GenoMax 07-22-2015 06:48 AM

I am with you all the way. Core infrastructure providers are left to fend for this sort of thing, which the end-users don't appreciate/care about.

This was one of the reasons I started this conversation so @semyon can take the real world observations back for internal discussion/improvements, especially if they want more users to use their software.

craczy 07-22-2015 11:06 AM

Quote:

Originally Posted by sklages (Post 177878)
Sure, .. the job will be run on a single node. Nevertheless I need to know roughly what resources my job will use and I should be able to restrict resources as well, even if it runs on a non-cluster server.

These are only workarounds ... I do see the problem with the software being designed that way.
IMHO there is no reason to leave the user without control over the resources a piece of software uses ... there is always the argument "speed" and "efficiency" .. maybe. But sometimes it is not only speed that is important ..

Roman mentioned on github that at least the aligner may be restricted to a certain number of CPUs, but is not recommended for the sake of "efficiency of processing". Again, "efficiency" does not always mean "speed of single job". But that's just my 2p ;-)

First of all sorry for the confusion around the meaning of the "-j" option across the different tools, and about the inconvenience that you experienced. To clarify:

- isaac-align: this is a single node and single process application and the "-j" option controls the maximum number of compute threads, which would effectively enable the user to control the CPU load on the node (the recommendation is to let the application figure out and use the available resources). As the user can also control the amount of memory used by the process, this should work fairly well in a cluster environment. If it causes trouble with your job scheduler, we would really like to better understand the issue so that we can effectively resolve it (it is a really important feature!).

- isaac-sort-reference: this is a multiprocess application. It can be distributed on multiple nodes but that requires explicit specification of the qrsh (or other) command line. The option "-j" is for the number of parallel operations (processes as opposed to threads). The recommendation is to execute it on a single node and to use "-j 1". At the moment, this application does not provide any control to the user for CPU and memory usage. Hopefully this inconvenience is mitigated by the fact that it needs to run at most once per reference. If there really is a need to restrict resource usage, doing it with modern solutions like virtualization might be a good option.

Regarding the time and resources required to run "isaac-sort-reference", a server with 150GB memory is required. A dual CPU (mid-range or better) is recommended. It is also useful to have a reasonably good file system as the operation does quite a bit of I/O. With a mid-range server it should take about half a day. Again the "-j 1" option is important on a single node, otherwise the processes will compete for CPU, memory, swap, etc. If it takes much longer than that, it might be worth checking that the node is not stuck on I/O waits or busy swapping.
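
The memory requirement above can be checked before launching a long run. This is a minimal sketch assuming a Linux node with a `/proc/meminfo`-style file. The helper names `mem_total_gib` and `enough_memory` are hypothetical. The 150 figure is the one quoted above, and it is compared against whole GiB as an approximation.

```shell
# mem_total_gib [FILE]
# Print MemTotal from a /proc/meminfo-style file, truncated to whole GiB.
# (The kernel reports the value in kB, i.e. units of 1024 bytes.)
mem_total_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# enough_memory REQUIRED_GIB [FILE]
# Succeed only if the node has at least REQUIRED_GIB of physical memory.
enough_memory() {
    total=$(mem_total_gib "${2:-/proc/meminfo}")
    [ "${total:-0}" -ge "$1" ]
}

# Example: only start the index build on a node that is large enough.
#   enough_memory 150 && isaac-sort-reference --genome-file genome.fa --jobs 1 --output-directory iSAACindex
```

While the job runs, a tool such as `vmstat 5` shows swap activity (the `si`/`so` columns) and I/O wait (`wa`), which are the two symptoms mentioned above.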

Thanks a lot for your feedback on github!

Come

sklages 07-22-2015 12:42 PM

Thanks for the clarification ... good news on isaac-align ;-)

And as for isaac-sort-reference I still do not think that this is the right way; but you are probably right, as we only run it once for each reference, it might not be too much of a problem for an experienced user for the moment. Nevertheless you should consider changing or extending this behavior so that a user is able to restrict resources on a single node.
But format of index files has not changed from version 1 to 2?

craczy 07-22-2015 01:12 PM

Quote:

Originally Posted by sklages (Post 177917)
But format of index files has not changed from version 1 to 2?

Unfortunately it has. The index contains extra information about the reference and with isaac2 that information has changed. Specifically, in the isaac2 index we are keeping track for each position in the reference genome if there are similar sequences elsewhere in the reference.

GenoMax 07-23-2015 07:23 AM

I did not specify a value for seed-length so the process is creating all possible combinations [--annotation-seed-lengths arg (=16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76 80)]. It looks like the end may be in sight today for the process I am running since the files for 80 are being made now.

@sven: Expect a multi-day turnaround.

sklages 07-23-2015 07:30 AM

I haven't either .. should use 32.
But .. I am optimistic :-)

GenoMax 07-23-2015 05:18 PM

@Semyon/Come: Can one of you confirm if the following files represent the correct isaac2 index for hg19 genome? My isaac-sort-reference job appeared to have finished (no errors) but these are the only files I see in the top level directory (Temp directory is still there with files within)
Code:

1.1G 2uniqueness.16bpb.gz
 47G kmer-positions-32-0.dat
 50K sorted-reference.xml


sklages 07-23-2015 10:52 PM

Quote:

Originally Posted by sklages (Post 177973)
OK .. index creation is running for hg19 ... I'll report back tomorrow.

Well, .. for now .. the server crashed overnight, just three hours ago ..
We now have to investigate what event caused this crash. Maybe it is just "Murphy's Law" .. we'll see.

sklages 07-24-2015 03:14 AM

Quote:

Originally Posted by sklages (Post 178003)
Well, .. for now .. the server crashed overnight, just three hours ago ..
We now have to investigate what event caused this crash. Maybe it is just "Murphy's Law" .. we'll see.

Well, .. it was indeed Murphy's law :-)
We had a failure on a network interface .. that sent at least one process into a frenzy and pushed the load beyond 1000...

So I'll restart indexing today.

craczy 07-24-2015 10:02 AM

Quote:

Originally Posted by GenoMax (Post 177997)
@Semyon/Come: Can one of you confirm if the following files represent the correct isaac2 index for hg19 genome? My isaac-sort-reference job appeared to have finished (no errors) but these are the only files I see in the top level directory (Temp directory is still there with files within)
Code:

1.1G 2uniqueness.16bpb.gz
 47G kmer-positions-32-0.dat
 50K sorted-reference.xml


This looks correct, but surprising. Did you specify something like "-w 1" on the command line by any chance?

All the kmers are indexed in one single data file (kmer-positions-32-0.dat), which is not a very good thing as it prevents parallelisation when searching for mapping candidates.

You can use the "isaac-pack-reference" and then "isaac-unpack-reference -w 6" to split the index into smaller files without having to redo the reference sorting.

GenoMax 07-24-2015 10:29 AM

Quote:

Originally Posted by craczy (Post 178035)
This looks correct, but surprising. Did you specify something like "-w 1" on the command line by any chance?

Thanks for confirming that. I had only done this

Code:

$ isaac-sort-reference -g /path_to/HG19_UCSC/Sequence/WholeGenomeFasta/genome.fa -o .
Is there a better command-line for future reference?

Quote:

Originally Posted by craczy (Post 178035)
You can use the "isaac-pack-reference" and then "isaac-unpack-reference -w 6" to split the index into smaller files without having to redo the reference sorting.

I did the isaac-pack-reference thinking that it would "compress" the index but nothing appeared to change except the date stamps.

Update: I think I need to move the "Temp" directory out of the way (just realized that and trying it now) for "pack-reference" to work.

sklages 07-26-2015 11:13 PM

Well, I can confirm that.

It took ~64h on a 48 core "Opteron 6176 SE" (fast local storage, RAID) to build a hg19 index.

Code:

isaac-sort-reference --genome-file fa_hg19/genome.fa --jobs 1 --output-directory iSAAC2Index.32 --quiet
The result is:
Code:

938M 2015.07.27 06:21:35 2uniqueness.16bpb.gz
 42G 2015.07.27 06:54:45 kmer-positions-32-0.dat
 15K 2015.07.27 06:54:51 sorted-reference.xml
8.0K 2015.07.27 06:54:51 Temp

with 'Temp' being 1.1TiB (!) in size ... (btw, why don't you clean Temp automatically after successfully finishing a job?).
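
Given a Temp directory of this size, it may be worth checking free space on the output file system before starting. The following is a sketch only: `free_gib` and `enough_disk` are hypothetical helper names, and the 1200 GiB in the usage comment is just a round figure above the 1.1 TiB observed here for hg19, not a documented requirement.

```shell
# free_gib DIR
# Print the space available on the file system holding DIR, in whole GiB.
# Uses the POSIX output format of df (-P) in 1024-byte blocks (-k).
free_gib() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# enough_disk DIR REQUIRED_GIB
# Succeed only if at least REQUIRED_GIB are available under DIR.
enough_disk() {
    [ "$(free_gib "$1")" -ge "$2" ]
}

# Example: refuse to start unless roughly 1.2 TiB are free in the output directory.
#   enough_disk iSAAC2Index.32 1200 || echo "not enough free space for Temp"
```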

GenoMax 07-27-2015 04:56 AM

@come:

I tried the "isaac-unpack-reference" (relevant part of the command line below)

Code:

$ isaac-unpack-reference -j 8 -w 6 -i .
Resulted in this error

Code:

tar: .: Cannot read: Is a directory
tar: At beginning of tape, quitting now
tar: Error is not recoverable: exiting now
make: *** [Temp/sorted-reference.xml] Error 2

@sven: Can you see if it works for you?

BTW: "Temp" directory is required for the unpack-reference.

sklages 07-27-2015 05:45 AM

Just tried,
Code:

isaac-unpack-reference -j 1 -w 6 -i . --dry-run
This (basically) results in this error:
Code:

warning: failed to load external entity "Temp/sorted-reference.xml"
unable to parse Temp/sorted-reference.xml
warning: failed to load external entity "Temp/sorted-reference.xml"
unable to parse Temp/sorted-reference.xml

Without dry-run:
Code:

isaac-unpack-reference -j 1 -w 6 -i .
tar fails:
Code:

tar -C Temp --touch -xvf .
tar: .: Cannot read: Is a directory
tar: At beginning of tape, quitting now
tar: Error is not recoverable: exiting now
make: *** [Temp/sorted-reference.xml] Error 2

Even when I copy sorted-reference.xml to Temp, I get an error:

Code:

make[1]: Entering directory `/path/to/iSAACindexBuildDir/iSAAC2Index.32'
make[1]: *** No rule to make target `Temp/genome.fa', needed by `/path/to/iSAACindexBuildDir/iSAAC2Index.32/genome.fa'.  Stop.
make[1]: Leaving directory `/path/to/iSAACindexBuildDir/iSAAC2Index.32'
make: *** [all] Error 2


sklages 07-27-2015 11:02 AM

Quote:

Originally Posted by GenoMax (Post 178105)
BTW: "Temp" directory is required for the unpack-reference.

That's funny though .. under normal circumstances I'd remove this folder as it occupies quite a lot of disk space ..

GenoMax 07-27-2015 05:33 PM

@sven: A new thread has been created for posts related to isaac2 genome index creation.

craczy 07-28-2015 07:07 AM

The input file should be the 'sorted-reference.xml', not the current directory:

This should work:

Code:

isaac-unpack-reference -j 1 -w 6 -i sorted-reference.xml
Remember to remove the already existing Temp directory, if any

Come

GenoMax 07-28-2015 11:58 AM

Quote:

Originally Posted by craczy (Post 178173)
The input file should be the 'sorted-reference.xml', not the current directory:

This should work:

Code:

isaac-unpack-reference -j 1 -w 6 -i sorted-reference.xml
Remember to remove the already existing Temp directory, if any

Come

This is not working for me:

Code:

tar: This does not look like a tar archive
tar: Skipping to next header
tar: Read 4461 bytes from ./sorted-reference.xml
tar: Error exit delayed from previous errors
make: *** [Temp/sorted-reference.xml] Error 2


craczy 07-28-2015 01:18 PM

Quote:

Originally Posted by GenoMax (Post 178199)
This is not working for me:

Code:

tar: This does not look like a tar archive
tar: Skipping to next header
tar: Read 4461 bytes from ./sorted-reference.xml
tar: Error exit delayed from previous errors
make: *** [Temp/sorted-reference.xml] Error 2


My mistake. Apologies. It is not the sorted-reference.xml but the tarball created by 'isaac-pack-reference':

Code:

rm -rf Temp
isaac-unpack-reference -j 1 -w 6 -i packed-reference.tar.gz
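
The tar errors earlier in the thread came from passing a directory or the XML file where a tarball was expected. A small pre-check can catch that before calling isaac-unpack-reference. This is a sketch, assuming the packed reference is a gzip-compressed tar archive containing `sorted-reference.xml` (as the commands and errors in this thread suggest). The function name `is_packed_reference` is hypothetical.

```shell
# is_packed_reference FILE
# Succeed only if FILE can be listed as a gzip-compressed tar archive and
# contains an entry named sorted-reference.xml (at any directory depth).
is_packed_reference() {
    tar -tzf "$1" 2>/dev/null | grep -Eq '(^|/)sorted-reference\.xml$'
}

# Example:
#   is_packed_reference packed-reference.tar.gz \
#       && isaac-unpack-reference -j 1 -w 6 -i packed-reference.tar.gz
```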


GenoMax 07-29-2015 04:03 AM

Commands used for the final steps in a nutshell.

Code:

$ isaac-pack-reference -j 1 -r ./sorted-reference.xml -o ./packed-reference.tar.gz

$ isaac-unpack-reference -j 1 -w 6 -i ./packed-reference.tar.gz

The end result was a set of 64 files, kmer-positions-32-00.dat through kmer-positions-32-63.dat, and one 2uniqueness.16bpb.gz file.

I have started a new isaac2 genome creation job for the MM9 genome with -w 6 option upfront.
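
A quick sanity check on the unpacked index is to count the kmer-positions files. The sketch below assumes the number of files is 2^(mask width), which matches the 64 files reported for `-w 6` but is an inference from this thread, not something the developers state here. The helper names `expected_kmer_files` and `check_index`, and the default seed length of 32, are likewise assumptions.

```shell
# expected_kmer_files WIDTH
# Print 2^WIDTH, the assumed number of kmer-positions files for a mask width.
expected_kmer_files() {
    echo $((1 << $1))
}

# check_index DIR WIDTH [SEED]
# Count DIR/kmer-positions-SEED-*.dat and compare with the expected number.
check_index() {
    dir=$1
    width=$2
    seed=${3:-32}
    expected=$(expected_kmer_files "$width")
    found=0
    for f in "$dir"/kmer-positions-"$seed"-*.dat; do
        # An unmatched glob stays literal, so only count real files.
        if [ -e "$f" ]; then
            found=$((found + 1))
        fi
    done
    if [ "$found" -ne "$expected" ]; then
        echo "found $found of $expected kmer-positions files"
        return 1
    fi
    echo "ok: $found kmer-positions files"
}

# Example:
#   check_index . 6
```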

sklages 07-29-2015 04:29 AM

Got the same just 5 minutes ago :-)

So the default for isaac-sort-reference should be changed or, alternatively, it should always be called with '--mask-width 6'.
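
Putting the options mentioned in this thread together, a full invocation could look like the output of the sketch below. It only prints the command line and does not run it. The helper name `sort_reference_cmd` is hypothetical, and the long option names are the ones used in the posts above.

```shell
# sort_reference_cmd GENOME_FASTA OUTPUT_DIR [MASK_WIDTH]
# Print an isaac-sort-reference command line that uses a single job and an
# explicit mask width (default 6, as suggested above).
sort_reference_cmd() {
    genome=$1
    outdir=$2
    width=${3:-6}
    printf 'isaac-sort-reference --genome-file %s --jobs 1 --mask-width %s --output-directory %s\n' \
        "$genome" "$width" "$outdir"
}

# Example: print the command for an hg19 build.
sort_reference_cmd fa_hg19/genome.fa iSAAC2Index.32
```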

GenoMax 08-03-2015 12:41 PM

I had started an isaac2 index creation job for mm9 genome (with -w 6). It has been running for a week and still making files in Temp directory.

craczy 08-20-2015 03:17 PM

In an attempt to make it easier to use Isaac2, we will make the packed reference index for commonly used genomes available on BaseSpace. At the moment, the only 2 genomes available are hg19 and mm9. Feel free to request other genomes.

Also, the issues and recommendations around indexing genomes are summarized on the isaac2 github wiki page "Reference Indexes".

The link to the already indexed genomes in BaseSpace might change in the future; please refer to the wiki page on github for updates.

Hopefully, this will help.

Come

sklages 08-29-2016 01:50 AM

Hello again ;-)

we are now with Isaac3. Cool .. ;-)

Creating indices for grch38 and grcm38 leaves some open questions:

I have run index creation as follows (mask-width 0 is the default, I just put it there as a "reminder" for future index creation runs):

Code:

isaac-sort-reference \
  --output-directory iSAACindex \
  --jobs 1 \
  --mask-width 0 \
  --genome-file genome.fa

That left me with exactly 3 files and a 1.1TiB Temp folder:

Code:

-rw-rw-r-- 1 klages klages 618M 2016.08.26 01:05:08 2repeatness.8bpb.gz
-rw-rw-r-- 1 klages klages 678M 2016.08.25 22:19:13 2uniqueness.8bpb.gz
-rw-rw-r-- 1 klages klages 108K 2016.08.26 01:05:09 sorted-reference.xml
drwxrwxr-x 2 klages klages 8.0K 2016.08.26 01:05:09 Temp

make reported
Code:

[all]    INFO: All done!
At least it is "packable" by isaac-pack-reference.

hg19-packed-reference.tar.gz from BaseSpace (btw, would be fine to have some grch38/grcm38 though) shows:

Code:

-rwxr-x--- rpetrovski/aladdin 644685308 2014-11-19 21:38 2uniqueness.16bpb.gz
-rw-r--r-- rpetrovski/aladdin 386961748 2014-11-20 13:03 neighbors-1or2-16.1bpb
-rw-r--r-- rpetrovski/aladdin 386961748 2014-11-20 13:06 neighbors-1or2-32.1bpb
-rwxr-xr-- rpetrovski/aladdin 3157608038 2014-11-20 12:53 genome.fa
-rw-r--r-- rpetrovski/aladdin      48044 2014-11-20 12:54 sorted-reference.xml

* Is that a complete and valid index??
* Do I still need Temp for any task after index creation?
* What are the differences compared to isaac2 indices?

best,
Sven

fznajar 10-23-2018 08:14 AM

Dear all,
Can iSAAC work on mac os platform?


All times are GMT -8.
