Processes statistics¶
QCG-PilotJob Manager support gathering metrics about launched processes. This feature enables analysis of application behavior such as:
- process tree inspection, from the lauching by QCG-PilotJob to the final application process (with all intermediate steps,
such as
srun
,mpirun
,orted
etc. with time line (delays between following processes in tree) - process localization, such as: node name and cpu affinity
- coarse process metrics: cpu and memory utilization
Note
Please note that this is initial version of gathering process metrics, and due to the implementation obtained data might be precise.
How it works¶
When --enable-proc-stats
has been used with QCG-PilotJob (either as a command line argument or as argument to the server_args
parameter of LocalManager
class), the launcher agent started on each of the Slurm allocation’s node stars thread that periodically query about processes started in the local system. Because collecting statistics about all processes in the system would take too much time, and thus reduce the frequency of queries, launcher agent only checks the descendants of the slurmstepd
process. This process is responsible for starting every user process in Slurm, including launcher agent. Therefore we register all process started by launcher agent, and also processes started by MPI that is configured with Slurm (in such situation, Slurm asks slurmstepd
daemon to launch instancesof MPI application). Every descendant process of the slurmstepd
is registered with it’s identifier (pid
) and basic statistics, such as:
pid
- process identifier- process name (in most cases name of the executable file)
- command line arguments
- parent process name
- parent process identifier
- cpu affinity - list of available cores
- accumulated process times in seconds (the detailed description of the format is available at https://psutil.readthedocs.io/en/latest/index.html?highlight=cpu_times#psutil.cpu_times)
- memory information (the detailed description of the format is available at https://psutil.readthedocs.io/en/latest/index.html?highlight=cpu_times#psutil.Process.memory_info)
Currently the data from each query is stored in in-memory database and saved to the file at launcher agent finish. The destination file, created in QCG-PilotJob working directory, will have name of following pattern: ptrace_{node_name}_{current_date}_{random_number}.log
. In the future releases we are planing to send those information to external service that will allow to run-time monitoring of gathered statistics.
It is worth to mention about some shortcoming of such approach:
- because the processes are queried with some frequency (currently every 1 second), there is a chance that very short living process will not be registered,
- there is a possibility that after finish of some process, another one with the same identifier will be created later
How to use¶
First of all the --enable-proc-stats
arument of the QCG-PilotJob service must be used either as command line argument, or as one of the server_args
element in LocalManager
class. When all QCG-PilotJob workflow finish, the working directory should contain one or many (for each of allocation nodes there should be instance of this file) files named ptrace_{node_name}_{current_date}_{random_number}.log
. Those files contains information about processes statistics in JSON format. To analyze this data, QCG-PilotJob provides qcg-pm-processes
command line tool. Documentation about this tool is available with:
In this example we submitted 6 instances of mpi_iter
job, where each instance is an MPI application started on 8 cores.
To get process tree of the first instance of this job:
$ qcg-pm-processes tree out-api-mpi-iter mpi_iter:0
job mpi_iter:0, job process id 28521, application name openmpi_3.1_gcc_6.2_app
--28521:bash (bash -c source /etc/profile; module purge; module load openmpi/3.1.4_gcc620; exe) node(e0025) created 2021-03-25 17:18:34.350000
--29537:openmpi_3.1_gcc_6.2_app (/tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app) node(e0025) after 3.83 secs
--29542:openmpi_3.1_gcc_6.2_app (/tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app) node(e0025) after 3.86 secs
--29638:openmpi_3.1_gcc_6.2_app (/tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app) node(e0025) after 4.01 secs
--29608:openmpi_3.1_gcc_6.2_app (/tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app) node(e0025) after 3.98 secs
--29547:openmpi_3.1_gcc_6.2_app (/tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app) node(e0025) after 3.88 secs
--29579:openmpi_3.1_gcc_6.2_app (/tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app) node(e0025) after 3.95 secs
--29554:openmpi_3.1_gcc_6.2_app (/tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app) node(e0025) after 3.91 secs
--29567:openmpi_3.1_gcc_6.2_app (/tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app) node(e0025) after 3.94 secs
To get detail process info:
$ qcg-pm-processes apps out-api-mpi-iter mpi_iter:0
found 8 target processes
29537:openmpi_3.1_gcc_6.2_app
created: 2021-03-25 17:18:38.180000
cmdline: /tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app
parent: 28521:mpirun
cpu affinity: [0]
cpu times: [0.04, 0.03, 0.0, 0.0, 0.0]
cpu memory info: [25219072, 525488128, 12701696, 8192, 0, 153374720, 0]
cpu memory percent: 0.018710530527870046
29542:openmpi_3.1_gcc_6.2_app
created: 2021-03-25 17:18:38.210000
cmdline: /tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app
parent: 28521:mpirun
cpu affinity: [1]
cpu times: [0.06, 0.03, 0.0, 0.0, 0.0]
cpu memory info: [25206784, 391258112, 12693504, 8192, 0, 153370624, 0]
cpu memory percent: 0.01870141381655226
29638:openmpi_3.1_gcc_6.2_app
created: 2021-03-25 17:18:38.360000
cmdline: /tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app
parent: 28521:mpirun
cpu affinity: [7]
cpu times: [0.05, 0.03, 0.0, 0.0, 0.0]
cpu memory info: [25202688, 391258112, 12689408, 8192, 0, 153370624, 0]
cpu memory percent: 0.01869837491277966
29608:openmpi_3.1_gcc_6.2_app
created: 2021-03-25 17:18:38.330000
cmdline: /tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app
parent: 28521:mpirun
cpu affinity: [6]
cpu times: [0.04, 0.04, 0.0, 0.0, 0.0]
cpu memory info: [25206784, 391258112, 12693504, 8192, 0, 153370624, 0]
cpu memory percent: 0.01870141381655226
29547:openmpi_3.1_gcc_6.2_app
created: 2021-03-25 17:18:38.230000
cmdline: /tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app
parent: 28521:mpirun
cpu affinity: [2]
cpu times: [0.06, 0.03, 0.0, 0.0, 0.0]
cpu memory info: [25206784, 391258112, 12693504, 8192, 0, 153370624, 0]
cpu memory percent: 0.01870141381655226
29579:openmpi_3.1_gcc_6.2_app
created: 2021-03-25 17:18:38.300000
cmdline: /tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app
parent: 28521:mpirun
cpu affinity: [5]
cpu times: [0.05, 0.03, 0.0, 0.0, 0.0]
cpu memory info: [25206784, 391258112, 12693504, 8192, 0, 153370624, 0]
cpu memory percent: 0.01870141381655226
29554:openmpi_3.1_gcc_6.2_app
created: 2021-03-25 17:18:38.260000
cmdline: /tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app
parent: 28521:mpirun
cpu affinity: [3]
cpu times: [0.05, 0.04, 0.0, 0.0, 0.0]
cpu memory info: [25202688, 391258112, 12689408, 8192, 0, 153370624, 0]
cpu memory percent: 0.01869837491277966
29567:openmpi_3.1_gcc_6.2_app
created: 2021-03-25 17:18:38.290000
cmdline: /tmp/lustre_shared/plgkopta/qcgpjm-altair/examples/openmpi_3.1_gcc_6.2_app
parent: 28521:mpirun
cpu affinity: [4]
cpu times: [0.06, 0.03, 0.0, 0.0, 0.0]
cpu memory info: [25206784, 391258112, 12693504, 8192, 0, 153370624, 0]
cpu memory percent: 0.01870141381655226
It is worth to mention, that analysis with the qcg-pm-processes
tool can be done at any time outside the Slurm allocation. The only input data is the working directory.