8/1/2006

Interaction with qsub is a feature that's currently being developed in cipres. These are some notes for myself and other cipres internal developers about the current state of the feature and the issues that need to be resolved.

1. Copied the drmaa.jar and .so files to the cipres distribution lib dir.
2. Added the jar to my third-party classpath in build-macros.xml.
3. With our qsub installation on the cipres cluster, you have to compile with a 64-bit compiler. I don't remember exactly why, but it's a dynamic loading issue. Maybe the drmaa.jar is looking for a 64-bit .so? I used the Java 1.5 compiler.
4. Modified my properties file to set "cipres.registry.launcher=org.cipres.registry.DrmaaLauncher". Otherwise the regular Runtime.exec() method of launching is used.
5. Make sure the working dir you set for qsub in DrmaaLauncher.java exists, otherwise jobs will fail. I have this coded to be a "qsub" subdir of the user's tmp directory, but I didn't modify Config.java to create this subdir.
6. Make sure the registry URL isn't 127.0.0.1, because remote qsub jobs need to connect to it. Don't set OIAddr.
7. Jobs may be queued for a long time before running, so we need something besides retryCount and retryInterval to tell the registry how long to wait. While we're waiting, we want to make sure the job is in fact still queued to run (we don't want to keep waiting if the job has already run and failed, or if the command is badly formed and unrunnable). There is also a cancel problem: getIOR() may wait a long time before returning an object reference, and the parallel version of recidcm3 obtains object refs at many different times (not just during execute()). So it's possible that a user will cancel recidcm3 after a job has been queued but before getIOR() returns its IOR to recidcm3, in which case recidcm3 has no way to kill the job.
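A rough sketch of the waiting logic described in item 7, to make the idea concrete. Everything here is hypothetical: the JobState enum, the two suppliers (one standing in for a batch-system status query, one for the registry's IOR lookup) and the waitForIor name are illustration only, not existing cipres or DRMAA code. The point is just that the loop gives up as soon as the job reaches a terminal state without registering, rather than relying on a retry count alone.

```java
import java.util.function.Supplier;

public class QueuedJobWaiter {
    /** Hypothetical job states; a real launcher would map the batch system's status codes onto these. */
    public enum JobState { QUEUED, RUNNING, DONE, FAILED }

    /**
     * Polls until the job's object reference (IOR) shows up, the job reaches a
     * terminal state without ever registering, or the overall deadline passes.
     * Returns the IOR, or null if there is no point in waiting any longer.
     */
    public static String waitForIor(Supplier<JobState> jobState,
                                    Supplier<String> iorLookup,
                                    long pollMillis,
                                    long maxWaitMillis) {
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        while (true) {
            // Check for the IOR first, so a job that registered and then
            // finished quickly is still reported as found.
            String ior = iorLookup.get();
            if (ior != null) {
                return ior;
            }
            // Job already ran (or failed) without registering: stop waiting.
            JobState state = jobState.get();
            if (state == JobState.DONE || state == JobState.FAILED) {
                return null;
            }
            if (System.currentTimeMillis() >= deadline) {
                return null;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Treat interruption as a request to stop waiting.
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }
}
```

Interrupting the waiting thread is one possible hook for a client-side cancel, though it does not by itself kill the queued job.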
To handle the cancel problem, we could follow the non-parallel, non-qsub version, where recidcm3.remove() and execute() are serialized so that cancel blocks until execute() finishes obtaining object refs, except that we'd change it so that remove() and whatever function gets new object refs are serialized. This could block cancel for a long time, however. It would be nice if the registry could tell that the client had gone away and kill the job in that case, or if there were a way for the client to cancel its request to the registry.

How should we package optional stuff like this qsub support? Can it be a runtime option, or must it be decided at build time and handled via autoconf?
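One possible shape for the runtime option, sketched under assumptions. The "cipres.registry.launcher" property is the one from item 4; the Launcher interface, ExecLauncher class and select() method are hypothetical stand-ins, since the real launcher types aren't shown in these notes. The idea is to load the configured class by name and fall back to the default launcher if it can't be loaded, so the core build would not need the drmaa.jar on its compile-time classpath.

```java
import java.util.Properties;

public class LauncherSelector {
    /** Hypothetical stand-in for whatever interface the registry's launchers share. */
    public interface Launcher {
        String name();
    }

    /** Hypothetical stand-in for the default Runtime.exec()-based launcher. */
    public static class ExecLauncher implements Launcher {
        public String name() {
            return "exec";
        }
    }

    /**
     * Picks a launcher at runtime from the "cipres.registry.launcher" property.
     * Falls back to the default launcher when the property is unset or the
     * named class cannot be loaded and constructed.
     */
    public static Launcher select(Properties props) {
        String className = props.getProperty("cipres.registry.launcher");
        if (className == null) {
            return new ExecLauncher();
        }
        try {
            Object instance = Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
            return (Launcher) instance;
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            // Class missing, not constructible, not a Launcher, or a
            // dependent class/native library failed to link.
            return new ExecLauncher();
        }
    }
}
```

This only covers failures that surface while the class is loaded and constructed. A native library that is loaded later, on first use, would still fail at that point, so it doesn't fully settle the build-time versus runtime question.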