scicore.slurm

Set Up a SLURM Cluster

This role sets up:

  • SLURM accounting daemon
  • SLURM master daemon
  • SLURM worker nodes
  • SLURM submit hosts

SLURM users are automatically added to the SLURM accounting database upon their first job submission using a Lua job submission plugin.
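
The plugin is rendered from templates/job_submit.lua.j2. The sketch below only illustrates the general shape of such a plugin; the exact checks and sacctmgr calls of the shipped template differ, and the fallback account name used here is purely hypothetical.

-- job_submit.lua (simplified sketch, not the role's actual template)
-- slurmctld calls slurm_job_submit() for every job submission.
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- resolve the submitting UID to a username
    local handle = io.popen("id -nu " .. submit_uid)
    local user = handle:read("*l")
    handle:close()
    -- if the user is not yet known to the accounting DB, add it
    local check = io.popen("sacctmgr --noheader --parsable2 show user " .. user)
    local known = check:read("*a")
    check:close()
    if known == nil or known == "" then
        -- "-i" skips the interactive confirmation; the account name is hypothetical
        os.execute("sacctmgr -i add user " .. user .. " account=default")
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end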


Sample Inventory

master ansible_host=192.168.56.100 ansible_user=vagrant ansible_password=vagrant
submit ansible_host=192.168.56.101 ansible_user=vagrant ansible_password=vagrant
compute ansible_host=192.168.56.102 ansible_user=vagrant ansible_password=vagrant

[slurm_submit_hosts]
submit

[slurm_workers]
compute

Remember to set the variable slurm_master_host to the hostname of your master host.
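
A minimal playbook applying the role to every machine in the cluster could look like this (file name and cluster name are only examples; group membership in the inventory plus slurm_master_host decide which host becomes master, submit host or worker):

# site.yml (example)
- hosts: all
  become: true
  vars:
    slurm_master_host: master
    slurm_cluster_name: my-cluster
  roles:
    - scicore.slurm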


Role Variables

# Add all SLURM hosts to /etc/hosts
slurm_update_etc_hosts_file: true

# Point to a Git repository if you store your SLURM config in Git
# slurm_config_git_repo: ""

# By default, it will deploy a Lua submit plugin that adds users to the SLURM accounting DB
# Check templates/job_submit.lua.j2 for details
slurm_config_deploy_lua_submit_plugin: true

# Enable configless SLURM if you use Slurm 20.02 or higher
slurm_configless: false

# Deploy necessary scripts for cloud scheduling with OpenStack
slurm_openstack_cloud_scheduling: false
slurm_openstack_venv_path: /opt/venv_slurm
slurm_openstack_auth_url: https://my-openstack-cloud.com:5000/v3
slurm_openstack_application_credential_id: "4eeabeabcabdwe19451e1d892d1f7"
slurm_openstack_application_credential_secret: "supersecret1234"
slurm_openstack_region_name: "RegionOne"
slurm_openstack_interface: "public"
slurm_openstack_identity_api_version: 3
slurm_openstack_auth_type: "v3applicationcredential"

# SLURM cluster name
slurm_cluster_name: slurm-cluster

# Set master host variable
slurm_master_host: slurm-master.cluster.com
# Set database host variable, defaults to master host
slurm_dbd_host: "{{ slurm_master_host }}"

# Group definitions
slurm_workers_group: slurm_workers
slurm_submit_group: slurm_submit_hosts

# SLURM configuration paths
slurm_slurmctld_spool_path: /var/spool/slurmctld
slurm_slurmd_spool_path: /var/spool/slurmd

# SLURM database settings
slurm_slurmdbd_mysql_db_name: slurm
slurm_slurmdbd_mysql_user: slurm
slurm_slurmdbd_mysql_password: aadAD432saAdfaoiu

# SLURM user and group settings for daemons
slurm_user:
  RedHat: "root"
  Debian: "slurm"

slurm_group:
  RedHat: "root"
  Debian: "slurm"

# EPEL is needed for SLURM packages on CentOS/RedHat
slurm_add_epel_repo: true

# Enable OpenHPC repositories on CentOS (optional)
slurm_add_openhpc_repo: false
slurm_ohpc_repos_url:
  rhel7: "https://github.com/openhpc/ohpc/releases/download/v1.3.GA/ohpc-release-1.3-1.el7.x86_64.rpm"
  rhel8: "http://repos.openhpc.community/OpenHPC/2/CentOS_8/x86_64/ohpc-release-2-1.el8.x86_64.rpm"

# Packages installed on each cluster member
slurm_packages_common:
  RedHat:
    - slurm
    - slurm-doc
    - slurm-contribs
  Debian:
    - slurm-client

# Packages installed on the master node
slurm_packages_master:
  RedHat:
    - slurm-slurmctld
  Debian:
    - slurmctld

# Packages installed on SLURMDBD node
slurm_packages_slurmdbd:
  RedHat:
    - slurm-slurmdbd
    - mariadb-server
  Debian:
    - slurmdbd
    - mariadb-server

# Packages installed on worker nodes
slurm_packages_worker:
  RedHat:
    - slurm-slurmd
    - vte-profile  # Prevents error message on SLURM interactive shells
  Debian:
    - slurmd

Setting Up SLURM for OpenStack Cloud Scheduling

This role allows your SLURM cluster to work with cloud scheduling on OpenStack.

Before configuring, read the Slurm Cloud Scheduling Guide and Configless Docs.

Ensure your OpenStack cloud has internal DNS resolution enabled for hostname resolution upon node boot.

Also refer to the example config file slurm.conf.j2.cloud.example. Adapt it to your needs and point slurm_conf_custom_template to your adapted copy.
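
For example, after copying and adapting the template next to your playbook (the path below is just an example):

slurm_conf_custom_template: "files/slurm.conf.j2.cloud"   # your adapted copy of slurm.conf.j2.cloud.example
slurm_configless: true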

Overview of Cloud Scheduling Config

  • When a job is submitted, SLURM executes the "ResumeProgram" defined in slurm.conf to boot a cloud compute node.
  • The "ResumeProgram" script uses the OpenStack API to create the instance.
  • Once a compute node has been idle for the configured time, SLURM runs the "SuspendProgram" to shut it down.
  • The OpenStack options for dynamic nodes must be set as node Features in slurm.conf.
  • Both "ResumeProgram" and "SuspendProgram" need an OpenStack config file located at "/etc/openstack/clouds.yaml" (see the example after this list).
  • Logs from "ResumeProgram" and "SuspendProgram" are written to "/var/log/messages".
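
A clouds.yaml using the example credential values from the role defaults above would look roughly like this (the cloud name "openstack" and whether you write this file yourself or let the role deploy it are assumptions; the layout is the standard OpenStack clouds.yaml format):

# /etc/openstack/clouds.yaml (sketch)
clouds:
  openstack:
    auth_type: "v3applicationcredential"
    auth:
      auth_url: https://my-openstack-cloud.com:5000/v3
      application_credential_id: "4eeabeabcabdwe19451e1d892d1f7"
      application_credential_secret: "supersecret1234"
    region_name: "RegionOne"
    interface: "public"
    identity_api_version: 3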

  1. Ensure you have Slurm 20.02 or higher for configless mode.
  2. Boot at least three machines:
    • SLURM master
    • SLURM submit (login node)
    • SLURM worker (to create OpenStack image)
  3. Update your inventory and assign machines to appropriate groups.
  4. Set slurm_master_host to the master hostname, and ensure all machines can resolve it.
  5. Copy and adjust slurm.conf.j2.cloud.example for your configuration.
  6. Set slurm_configless: true for configless mode.
  7. Execute the role to configure all machines, creating a functional SLURM cluster.
  8. Customize the SLURM worker as needed.
  9. Create an OpenStack image that includes your configurations.
  10. Update your slurm.conf with the correct node features and redeploy (see the sketch after this list).
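
The excerpt below illustrates the pieces such a config ties together. It is based on the general Slurm cloud scheduling options, not on slurm.conf.j2.cloud.example itself; all paths, node names, timeouts and features are placeholders:

# Illustrative cloud-scheduling excerpt - adapt slurm.conf.j2.cloud.example instead of copying this
SlurmctldParameters=cloud_dns,idle_on_node_suspend
ResumeProgram=/path/to/resume_program          # script that boots an OpenStack instance for the node
SuspendProgram=/path/to/suspend_program        # script that deletes the instance when the node is idle
ResumeTimeout=900
SuspendTime=600

# Dynamic nodes are declared in state CLOUD; the OpenStack options the resume script needs
# (flavor, image, network, ...) are encoded as node Features as defined by the example template
NodeName=cloud-node[001-010] State=CLOUD Features=<encoded OpenStack options>
PartitionName=cloud Nodes=cloud-node[001-010] MaxTime=24:00:00 State=UP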

Finally, check your SLURM cluster by running sinfo -Nel to see cloud partitions. Try submitting a job to one of these partitions and monitor the log files for any issues.
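
For example (the partition name "cloud" is only a placeholder for whatever your adapted config defines):

sinfo -Nel                                   # cloud nodes should show up powered down (state suffix "~")
sbatch --partition=cloud --wrap="hostname"   # triggers ResumeProgram to boot a cloud node
tail -f /var/log/messages                    # follow the ResumeProgram/SuspendProgram logs on the master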

Installation

ansible-galaxy install scicore.slurm
Author

The center for scientific computing @ University of Basel