Monitoring System and Job Status

The four actions you are likely to take most often are checking system status and starting, monitoring, and stopping jobs. Since scheduling jobs is a longer topic, see this page for an in-depth description of how to start your job. Here we describe how to check the system for available resources, monitor a currently running job, and stop a running job.

Each of these tasks is done through the scheduler, which is Slurm on the MIT SuperCloud system. On this page and the job submission page we describe some of the basic options for submitting, monitoring, and stopping jobs. More advanced options are described in Slurm's documentation, and this handy two-page guide gives a brief description of the commands and their options.
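If you want to look up the options for any of the underlying Slurm commands while on the system, the standard man pages and --help flags should also be available (this is a generic Slurm convention, not specific to the SuperCloud wrapper commands):

man squeue        # full manual page for a standard Slurm command
squeue --help     # brief summary of that command's options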

Checking System Status

Our wrapper command, LLGrid_status, produces a nicely formatted, easy-to-read summary of system status:

[StudentX@login-0 ~]$ LLGrid_status
LLGrid: txe1 (running slurm 16.05.8)
============================================ 
Online Intel xeon-e5 nodes: 36
Unclaimed nodes: 24
Claimed slots: 172
Claimed slots for exclusive jobs: 80
-------------------------------------------- 
Available slots: 404

In the output, you can see the name of the system you are on (txe1 here), the scheduler that is being used (Slurm), the number of unclaimed nodes, and the number of available slots.
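If you prefer the scheduler's native view, the standard Slurm sinfo command reports similar information on most Slurm systems; the exact columns and partitions you see will depend on the site's configuration:

sinfo                   # summary of partitions and node states
sinfo -o "%P %D %C"     # partition, node count, and CPUs as allocated/idle/other/total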

Monitoring Jobs

To list all of your running jobs you can use the LLstat command. For more information about how your jobs are utilizing the resources, you can use the LLload command.

LLstat

You can monitor your jobs using the LLstat command:

[StudentX@login-0 ~]$ LLstat
LLGrid: txe1 (running slurm 16.05.8)
JOBID     ARRAY_J    NAME        USER    START_TIME          PARTITION  CPUS  FEATURES  MIN_MEMORY  ST  NODELIST(REASON)   
40986     40986      myJob      Student  2017-10-19T15:35:46 normal     1     xeon-e5   5G          R   gpu-2  
40980_100 40980      myArrayJob Student  2017-10-19T15:35:37 normal     1     xeon-e5   5G          R   gpu-2  
40980_101 40980      myArrayJob Student  2017-10-19T15:35:37 normal     1     xeon-e5   5G          R   gpu-2  
40980_102 40980      myArrayJob Student  2017-10-19T15:35:37 normal     1     xeon-e5   5G          R   gpu-2

The output of the LLstat command lists the job ID of each running job, its name, start time, number of CPUs per task, status, and the node it is running on. If a job is in an error state, that is listed as well.
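The underlying Slurm squeue command gives a comparable listing on any Slurm system; the format string below is just one illustrative choice of fields:

squeue -u $USER                        # all of your queued and running jobs
squeue -u $USER -o "%i %j %T %M %R"    # job ID, name, state, elapsed time, and node or pending reason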

LLload

The LLload command gives more detail about how your jobs are utilizing the resources on the nodes where they are running.

The nodes in the main partitions on SuperCloud are exclusive by user, meaning nodes will not be shared between different users, but multiple jobs from the same user can run together on the same node. This makes it easy to examine how well your jobs are using the resources on a node, without jobs from other users influencing those numbers.

To view these statistics, we've introduced the LLload command that you can use to evaluate the efficiency of your jobs. If you run:

LLload

LLGrid: SuperCloud(TXE1)
Username: studentx, Nodes used: 4
HOSTNAME CORES -  USED =  FREE    LOAD MEMORY -   USED =   FREE
d-4-13-1    48 -    48 =     0   27.09  192GB -   82GB =  110GB
d-6-3-1     48 -    48 =     0    7.43  192GB -   20GB =  172GB
c-16-13-4   48 -    48 =     0    9.23  192GB -   17GB =  175GB
c-17-13-3   48 -    48 =     0    7.68  192GB -   67GB =  125GB

This command lists all of the nodes that you have jobs running on, how many of the cores on those nodes you have allocated, and some statistics about how the resources on those nodes are being used:

  • CPU Load: how much of the CPUs are used (5 minute average)
    • Target: 50-150% of the number of CPUs (24.0-72.0 for the Xeon-P8 CPU nodes and 20.0-60.0 for the Xeon-G6 GPU nodes)
  • Memory Utilization: how much memory you are using, plus memory used for caching

A good target for CPU load is 50-150% of the number of CPUs: for example, 24-72 for the Xeon-P8 CPU nodes and 20-60 for the Xeon-G6 GPU nodes (see the Systems and Software page for current core counts). If this number is lower, you could likely take advantage of more resources on the node. If the load numbers are very high, you risk a slowdown or even overwhelming the node. You have a few knobs to turn to adjust your CPU utilization; most often this means changing the number of threads used by your application or running more jobs or processes per node. Adjusting these is very easy if you are submitting your job with Triples Mode, which we highly encourage for those jobs that support it.
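As a concrete illustration of the thread-count knob: many applications built on OpenMP or common numerical libraries respect the OMP_NUM_THREADS environment variable (a general convention, not a SuperCloud-specific setting), so one option is to set it in your submission script before launching your program:

export OMP_NUM_THREADS=8    # limit an OpenMP-based application to 8 threads
./my_program                # hypothetical executable name; replace with your own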

Since the memory utilization includes some additional values, you will need to ssh to the node and run htop to see your true memory utilization, or use the sacct command after the job has completed to get the peak memory utilization (see this page).
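For example, sacct can report the peak memory of a completed job like this (the job ID is illustrative, and the fields available depend on the site's accounting configuration):

sacct -j 40986 --format=JobID,JobName,MaxRSS,Elapsed,State    # MaxRSS is the peak resident memory used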

LLload -g

LLGrid: SuperCloud(TXE1)
Username: studentx, Nodes used: 2
HOSTNAME CORES -  USED =  FREE    LOAD MEMORY -   USED =   FREE  GPUS -  USED = FREE LOAD GPUMEM -  USED =  FREE
d-10-10-1    40 -    40 =     0    2.71  384GB -   48GB =  336GB     2 -     2 =    0 0.40   64GB -   4GB =  60GB
d-13-12-1    40 -    40 =     0    0.22  384GB -   49GB =  335GB     2 -     2 =    0 0.40   64GB -   4GB =  60GB

In addition to the CPU load and memory utilization statistics described above, this command reports for each node you have jobs running on:

  • GPU Utilization: how much of both GPUs are being used, 2.0 is 100% of both GPUs (snapshot)
    • Target: 50-100% of the GPUs allocated (0.5 or higher for 1 GPU, 1.0 or higher for 2 GPUs)
  • GPU Memory: how much of the GPU memory is being used

The GPU load is normalized so that 100% utilization on both GPUs gives a value of 2; if both GPUs are well utilized you'll see a value close to 2. Note that this is derived from an instantaneous reading rather than an average over a period of time, so you may have to run the command a few times to get a good idea of your GPU utilization. If you find your GPU utilization is low, check out our page on Optimizing your GPU Usage. You can see these numbers broken down by individual GPU, along with additional information, by adding the --detail flag:

LLload -g --detail
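For a point-in-time view on the node itself (for example, alongside htop), the standard NVIDIA nvidia-smi tool reports per-GPU utilization and memory; this assumes you can ssh to a node where your job is running:

nvidia-smi        # per-GPU utilization, memory use, and the processes on each GPU
nvidia-smi -l 5   # repeat the report every 5 seconds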

Stopping Jobs

Jobs can be stopped using the LLkill command. You specify a comma-separated list of the job IDs you would like to stop, for example:

LLkill 40986,40980

This stops the jobs with job IDs 40986 and 40980. You can also use the LLkill command to stop all of your currently running jobs:

LLkill -u USERNAME
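On a generic Slurm system, the equivalent native command is scancel; for reference (the job IDs below are illustrative):

scancel 40986 40980      # cancel specific jobs by ID
scancel -u USERNAME      # cancel all of your jobs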