Resuming prematurely interrupted computations

General

The QCG-PilotJob Manager service implements a mechanism for resuming prematurely interrupted computations. All incoming job submission requests, as well as finished job iterations, are recorded so that job execution can be resumed. The current state is stored in track.* files in the auxiliary directory (.qcgpjm-service-* in the working directory). Note that job iterations that were started but not finished will be started again, so unless they implement their own checkpointing, they will restart from the beginning.
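Since an unfinished job iteration is started again after a resume, the job itself decides whether it repeats work it has already done. The following is a minimal sketch of application-level checkpointing, not part of QCG-PilotJob: the function name run_with_checkpoint, the JSON checkpoint layout and the toy computation (summing integers) are all assumptions made for illustration.

```python
import json
import os


def run_with_checkpoint(total_steps, checkpoint_path):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    Hypothetical example job: a real application would store its own state.
    """
    state = {"next_step": 0, "partial_sum": 0}
    # Continue from the saved state if an earlier run left a checkpoint behind.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for step in range(state["next_step"], total_steps):
        state["partial_sum"] += step
        state["next_step"] = step + 1
        # Write to a temporary file and rename it over the old checkpoint,
        # so an interruption never leaves a half-written checkpoint file.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["partial_sum"]
```

A job written this way picks up at the first unfinished step when it is started again, instead of repeating the whole iteration.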

Invocation

To resume the QCG-PilotJob Manager with previous jobs, use the --resume command line option with a path either directly to the auxiliary QCG-PilotJob Manager directory or to the working directory that contains the auxiliary directory. If the working directory contains several auxiliary directories, the most recently modified one is selected automatically.
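To check in advance which auxiliary directory would be picked up from a working directory, the selection rule can be approximated outside the service. This is only a sketch under the assumption that "last modified" refers to the directory's modification time; it is not the service's own implementation, and the function name find_latest_aux_dir is hypothetical.

```python
import glob
import os


def find_latest_aux_dir(work_dir):
    """Return the most recently modified .qcgpjm-service-* directory, or None.

    Assumption: the modification time of the directory itself is compared.
    """
    # Escape the working directory so special characters in it are not
    # interpreted as glob wildcards.
    pattern = os.path.join(glob.escape(work_dir), ".qcgpjm-service-*")
    candidates = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)
```

Passing the returned path explicitly to the resume option removes any ambiguity when several auxiliary directories exist.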

Note

Currently, none of the previously used command line options are re-used during the resume operation. For example, if a working directory was specified when the QCG-PilotJob Manager was originally started, the same working directory should be specified again when resuming.

Example invocation:

qcg-pm-service --wd prev_work_dir --resume prev_work_dir/.qcgpjm-service-LAPTOP-CNT0BD0F.5091
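When the resume is triggered from a script, building the command in one place makes it harder to forget the working directory option. The sketch below uses only the --wd and --resume options shown in the example invocation; the helper name build_resume_command is hypothetical, and any other options used in the original start would have to be added by hand.

```python
import shlex


def build_resume_command(work_dir, aux_dir):
    """Return the argument list for resuming with the same working directory."""
    # Only the options from the example invocation are used here.
    return ["qcg-pm-service", "--wd", work_dir, "--resume", aux_dir]


command = build_resume_command(
    "prev_work_dir",
    "prev_work_dir/.qcgpjm-service-LAPTOP-CNT0BD0F.5091",
)
# Print the command in shell-quoted form for inspection before running it.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run once the printed command looks right.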

Operation

After resuming, the QCG-PilotJob Manager re-uses the indicated auxiliary directory, so all log files, the current tracking status and job reports are appended to the existing files. As a result, computations that have already been resumed can be resumed again without problems.

Issues

Currently the resume mechanism is not supported in resource partitioning mode.