Dear Users,
During the last M100 maintenance, we configured the SLURM resource manager to collect GPU usage and accounting statistics for each job. The service is based on the NVIDIA Data Center GPU Manager (DCGM) and produces one report per node, covering all the requested GPUs, at the end of each job. The reports are saved in the job submission directory, in files named "dcgmi_stats_<nodename>_<jobid>.out".
The report contains statistics on GPU usage (power, memory usage, etc.) for your run, along with an assessment of the overall health state of the GPUs.
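If a job ran on several nodes, the submission directory will contain one report per node. As a convenience, the following is a minimal Python sketch for listing them; it is not part of the service itself. It relies only on the file naming scheme given above, and it assumes that the job ID is purely numeric. The function name find_reports is our own choice for illustration.

```python
import re
from pathlib import Path

# Naming scheme from this announcement: dcgmi_stats_<nodename>_<jobid>.out
# Assumption: the job ID consists of digits only.
REPORT_RE = re.compile(r"^dcgmi_stats_(?P<node>.+)_(?P<jobid>\d+)\.out$")


def find_reports(submit_dir, jobid=None):
    """Return sorted (jobid, node, path) tuples for the reports in submit_dir.

    If jobid is given, only reports belonging to that job are returned.
    """
    found = []
    for path in Path(submit_dir).glob("dcgmi_stats_*.out"):
        match = REPORT_RE.match(path.name)
        if match is None:
            # Name starts like a report but does not follow the full scheme.
            continue
        if jobid is not None and match.group("jobid") != str(jobid):
            continue
        found.append((int(match.group("jobid")), match.group("node"), path))
    return sorted(found)


if __name__ == "__main__":
    # List every report found in the current directory.
    for job, node, path in find_reports("."):
        print(f"job {job} on node {node}: {path}")
```

Run from the job submission directory, the script prints one line per report; passing a job ID to find_reports restricts the listing to that job.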
Best regards,
HPC User Support @ CINECA