galaxyproject.slurm

Slurm

Overview

This role installs and configures a Slurm cluster on RHEL/CentOS or Debian/Ubuntu servers.

Role Variables

All variables are optional. If you set none, the role installs the Slurm client and the munge authentication service, and creates a basic slurm.conf with a single localhost node and a debug partition. See the role defaults and the example playbooks for details.

A Slurm node can play several roles. Assign them either by placing the host in the matching inventory group or by adding values to the slurm_roles list:

For controller nodes, use group slurmservers or set slurm_roles: ['controller']
For execution nodes, use group slurmexechosts or set slurm_roles: ['exec']
For database nodes, use group slurmdbdservers or set slurm_roles: ['dbd']
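
As a rough illustration of the group-based approach, a YAML inventory might look like the sketch below. The group names come from the list above; the host names are hypothetical placeholders.

```yaml
# Hypothetical inventory: host names are placeholders, group names are
# the ones the role checks for.
all:
  children:
    slurmservers:        # controller nodes
      hosts:
        ctl01.example.org:
    slurmexechosts:      # execution nodes
      hosts:
        node01.example.org:
        node02.example.org:
    slurmdbdservers:     # database nodes
      hosts:
        db01.example.org:
```

Setting slurm_roles directly (for example in host_vars) is the alternative when the inventory groups cannot be named this way.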

General slurm.conf options go in slurm_config, a hash whose keys are Slurm configuration option names and whose values are the option values.
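
A minimal sketch of slurm_config, using option names that also appear in the detailed example later in this document. The values are illustrative only.

```yaml
slurm_config:
  ClusterName: cluster               # illustrative value
  SlurmctldHost: "slurmctl"          # illustrative controller host name
  SchedulerType: "sched/backfill"
  SelectType: "select/cons_res"
```

Each key is expected to end up as an Option=value line in slurm.conf, so the option names should be spelled as Slurm documents them.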

You can define partitions and nodes with slurm_partitions and slurm_nodes, which are lists of hashes. The only required key in each hash is name, which sets the PartitionName or NodeName. Any other keys are added as settings for that partition or node.
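
The following sketch shows the shape of both lists. The node range, CPU count, and memory size are hypothetical values chosen for illustration.

```yaml
slurm_nodes:
  - name: "node[01-04]"    # hypothetical range; sets NodeName
    CPUs: 8
    RealMemory: 16000      # in megabytes
slurm_partitions:
  - name: debug            # sets PartitionName
    Nodes: "node[01-04]"
    Default: "YES"         # quoted so YAML does not read it as a boolean
    MaxTime: 60            # in minutes
```

Under these assumptions the node entry would be expected to produce a slurm.conf line along the lines of NodeName=node[01-04] CPUs=8 RealMemory=16000, with the partition entry handled the same way.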

For additional configurations, you can specify settings for the files acct_gather.conf, cgroup.conf, and gres.conf in slurm_acct_gather_config, slurm_cgroup_config (both hashes), and slurm_gres_config (a list of hashes).
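
To make the difference in shape concrete, here is a hedged sketch of the three variables side by side. The option names are standard Slurm options for the respective files; the path, frequency, and device values are hypothetical.

```yaml
# acct_gather.conf: a single hash of options
slurm_acct_gather_config:
  EnergyIPMIFrequency: 10
  ProfileHDF5Dir: "/var/spool/slurm/profile"   # hypothetical path
  ProfileHDF5Default: "None"

# cgroup.conf: a single hash of options
slurm_cgroup_config:
  ConstrainCores: "yes"
  ConstrainRAMSpace: "yes"

# gres.conf: a list of hashes, one per GRES definition
slurm_gres_config:
  - NodeName: "gpu[01-02]"                     # hypothetical node range
    Name: gpu
    File: "/dev/nvidia[0-1]"                   # hypothetical device files
```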

Set slurm_upgrade to true to upgrade Slurm packages.

Use slurm_create_user (a boolean) together with slurm_user (a hash) to pre-create the Slurm user so that its user and group IDs match across hosts.

Since this role requires root access, ensure you enable become globally in your playbook or for this role specifically as shown in the examples below.
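
For reference, enabling become for the whole play rather than for the role alone could look like this sketch. The play name is arbitrary.

```yaml
- name: Slurm with privilege escalation for the whole play
  hosts: all
  become: true          # applies to every role and task in this play
  roles:
    - role: galaxyproject.slurm
```

Play-level become is convenient when everything in the play needs root; the role-level form limits escalation to this role.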

Dependencies

None.

Example Playbooks

Here is a basic setup with all services on a single node:

- name: Slurm all in One
  hosts: all
  vars:
    slurm_roles: ['controller', 'exec', 'dbd']
  roles:
    - role: galaxyproject.slurm
      become: True

A more detailed example:

- name: Slurm execution hosts
  hosts: all
  roles:
    - role: galaxyproject.slurm
      become: True
  vars:
    slurm_cgroup_config:
      CgroupMountpoint: "/sys/fs/cgroup"
      CgroupAutomount: yes
      ConstrainCores: yes
      TaskAffinity: no
      ConstrainRAMSpace: yes
      ConstrainSwapSpace: no
      ConstrainDevices: no
      AllowedRamSpace: 100
      AllowedSwapSpace: 0
      MaxRAMPercent: 100
      MaxSwapPercent: 100
      MinRAMSpace: 30
    slurm_config:
      AccountingStorageType: "accounting_storage/none"
      ClusterName: cluster
      GresTypes: gpu
      JobAcctGatherType: "jobacct_gather/none"
      MpiDefault: none
      ProctrackType: "proctrack/cgroup"
      ReturnToService: 1
      SchedulerType: "sched/backfill"
      SelectType: "select/cons_res"
      SelectTypeParameters: "CR_Core"
      SlurmctldHost: "slurmctl"
      SlurmctldLogFile: "/var/log/slurm/slurmctld.log"
      SlurmctldPidFile: "/var/run/slurmctld.pid"
      SlurmdLogFile: "/var/log/slurm/slurmd.log"
      SlurmdPidFile: "/var/run/slurmd.pid"
      SlurmdSpoolDir: "/var/spool/slurmd"
      StateSaveLocation: "/var/spool/slurmctld"
      SwitchType: "switch/none"
      TaskPlugin: "task/affinity,task/cgroup"
      TaskPluginParam: Sched
    slurm_create_user: yes
    slurm_gres_config:
      - File: /dev/nvidia[0-3]
        Name: gpu
        NodeName: gpu[01-10]
        Type: tesla
    slurm_munge_key: "../../../munge.key"
    slurm_nodes:
      - name: "gpu[01-10]"
        CoresPerSocket: 18
        Gres: "gpu:tesla:4"
        Sockets: 2
        ThreadsPerCore: 2
    slurm_partitions:
      - name: gpu
        Default: YES
        MaxTime: UNLIMITED
        Nodes: "gpu[01-10]"
    slurm_roles: ['exec']
    slurm_user:
      comment: "Slurm Workload Manager"
      gid: 888
      group: slurm
      home: "/var/lib/slurm"
      name: slurm
      shell: "/usr/sbin/nologin"
      uid: 888