Tips: What to do when the job runtime exceeds max queue time
Option 1) Get the answers faster:
A) Use the fastest library routines.
The nodes have fast dense linear algebra routines. For example, if your
code solves systems of linear equations, linking with the vendor-supplied
routines may give a large increase in speed.
Link with -lacml (the AMD Core Math Library) rather than non-optimized libraries.
B) Change to a more efficient algorithm.
This is the best option when one exists, since you get your answers sooner.
AIT's HPC group can help you with numerical aspects and some algorithm
choices, but you would need to supply the modeling knowledge.
C) Go parallel.
The code can be parallelized using
MPI: Rewrite the program to use MPI. This often takes a long time but
usually gives the best performance.
OpenMP: Add OpenMP directives to the program so that portions of it run
in parallel, and compile it to use all four processors in a
single node. This limits the speedup to at most 4x, though, and
if done poorly it can even slow the program down.
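The 4x limit on a single node is only reached if the whole program runs in parallel. As a rough guide, the sketch below applies Amdahl's law, which is not named in the text above and is brought in here only as an illustration. The function name and the sample fractions are hypothetical, and real codes also pay overhead that this formula ignores.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Ideal speedup when only part of the runtime can run in parallel.

    parallel_fraction: share of the serial runtime that parallelizes (0 to 1).
    n_processors: number of processors used for the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# Sample fractions are made up; measure your own code to get real ones.
for fraction in (0.5, 0.9, 1.0):
    speedup = amdahl_speedup(fraction, 4)
    print(f"{fraction:.0%} parallel on 4 processors: {speedup:.2f}x")
```

With these sample values, a code that is half parallel gains well under 2x on four processors, so it is worth timing the program first to see how much of it the directives would actually cover.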
Option 2) Use checkpointing.
Most major production codes are checkpointed.
In checkpointing, you periodically save the state of the program in
a restart file. Each time you run the program, it reads the restart file and
picks up from the last checkpoint. The advantage is that there is no limit
to the total amount of time you can use. Barring disk crashes or total loss
of the machine, your total runtime is unlimited: you just keep submitting
the same job and it starts from where it left off. The cost is the overhead
of writing each checkpoint, plus the loss of any work done after the last
checkpoint whenever the job is stopped. You may want to do this whatever else
you do, since as soon as a research code gets faster, the next step is to run
a larger problem, and then you are back up against the queue time restriction.
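The restart-file pattern described above can be sketched as follows. This is a minimal illustration and not any particular production code's scheme: the file name restart.json, the checkpoint interval, the toy "work" in the loop, and the max_steps_this_job argument (a stand-in for the queue time limit) are all assumptions made for the example.

```python
import json
import os

RESTART_FILE = "restart.json"  # hypothetical restart file name
CHECKPOINT_EVERY = 1000        # hypothetical interval, in iterations


def load_state(path=RESTART_FILE):
    """Return the last saved state, or a fresh state if no restart file exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}


def save_state(state, path=RESTART_FILE):
    """Write the state to a temporary file, then rename it into place.

    The rename replaces the old file in one step, so a job killed while
    writing still leaves the previous checkpoint readable.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(n_steps, max_steps_this_job=None, path=RESTART_FILE,
        every=CHECKPOINT_EVERY):
    """Advance a toy computation to n_steps, resuming from the restart file.

    max_steps_this_job imitates the queue limit: when it is reached the
    function returns without saving, so work since the last checkpoint is lost.
    """
    state = load_state(path)
    done_this_job = 0
    while state["step"] < n_steps:
        if max_steps_this_job is not None and done_this_job >= max_steps_this_job:
            return state  # simulated kill by the scheduler
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        done_this_job += 1
        if state["step"] % every == 0:
            save_state(state, path)
    save_state(state, path)  # final checkpoint when the run completes
    return state
```

Submitting the same job repeatedly then amounts to calling run with the same arguments each time. A shorter interval loses less work when the job is stopped but spends more time writing checkpoints, which is the trade-off noted above.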