stackhpc.openhpc

This Ansible role helps you set up an OpenHPC v2.x Slurm cluster by installing necessary packages and configuring the system.

To use this role, you need to include it in an Ansible playbook. Below is a simple example of how to do this. The role is designed to be flexible and doesn't assume anything about your network or cluster features, aside from some guidelines on hostnames. You can add any required functionality, like file systems, using other Ansible roles.

The base image assumed for nodes is a Rocky Linux 8 GenericCloud image.

Role Variables

  • openhpc_extra_repos: Optional list for extra Yum repository definitions. Required fields include name and file. Others like description, baseurl, metalink, etc., are optional.

  • openhpc_slurm_service_enabled: Boolean controlling whether the appropriate Slurm service (slurmd or slurmctld) is enabled.

  • openhpc_slurm_service_started: Optional boolean to start slurm services. If set to false, services will stop. Defaults to the value of openhpc_slurm_service_enabled.

  • openhpc_slurm_control_host: Required string. The Ansible inventory hostname of the controller.

  • openhpc_slurm_control_host_address: Optional. An alternative IP or name for the control host.

  • openhpc_packages: Additional OpenHPC packages to install.

  • openhpc_enable:

    • control: Enable control host
    • database: Enable slurmdbd
    • batch: Enable compute nodes
    • runtime: Enable OpenHPC runtime
  • openhpc_slurmdbd_host: Optional. Define where slurmdbd should be deployed. Defaults to the control host.

  • openhpc_slurm_configless: Optional, default false. If true, Slurm's "configless" mode is used.

  • openhpc_munge_key: Optional. Define a munge key. If not set, one will be generated.

  • openhpc_login_only_nodes: Optional. Specify a group of nodes that are login-only.

  • openhpc_module_system_install: Optional, default true. Decide whether to install an environment module system (like lmod).
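As an illustration of openhpc_extra_repos, the following sketch defines one extra repository (the repository shown, EPEL 8, and its URL are examples only; any Yum repository definition with at least name and file works):

```yaml
openhpc_extra_repos:
  - name: epel            # required
    file: epel            # required: repo file name under /etc/yum.repos.d/
    description: Extra Packages for Enterprise Linux 8
    baseurl: https://download.fedoraproject.org/pub/epel/8/Everything/$basearch
    gpgcheck: false
```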

slurm.conf

  • openhpc_slurm_partitions: Optional. A list of slurm partitions, default is empty. Each partition can include various settings like name, ram_mb, and gres.

  • openhpc_job_maxtime: Sets the maximum job time limit, with a default of 60 days.

  • openhpc_cluster_name: Name of the cluster.

  • openhpc_config: Optional mapping for additional parameters to override defaults in slurm.conf.

  • openhpc_ram_multiplier: Optional, default is 0.95. Used to calculate memory for slurm partitions.

  • openhpc_state_save_location: Optional. Absolute path for Slurm controller state.
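Putting some of these slurm.conf variables together, a minimal sketch might look like the following (the cluster name, partition names, and values are hypothetical):

```yaml
openhpc_cluster_name: mycluster
openhpc_job_maxtime: 3-0           # 3 days, in Slurm time format
openhpc_slurm_partitions:
  - name: compute                  # maps to an inventory group of compute nodes
  - name: highmem
    ram_mb: 1048576                # override the calculated memory per node, in MiB
openhpc_config:
  SlurmctldDebug: debug            # extra slurm.conf parameters overriding role defaults
```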

Accounting

By default, no accounting storage is configured. To set it up, follow these steps:

  1. Configure a MariaDB or MySQL server.
  2. Set openhpc_enable.database to true.
  3. Use openhpc_slurm_accounting_storage_type to specify your storage.

Configure these variables as needed for your accounting storage.
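For example, assuming a MariaDB server is already available, accounting might be enabled with variables along these lines (accounting_storage/slurmdbd is the standard Slurm plugin for slurmdbd-backed storage):

```yaml
openhpc_enable:
  database: true
openhpc_slurm_accounting_storage_type: accounting_storage/slurmdbd
```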

Job Accounting

If you choose to use basic job accounting, set the following:

  • openhpc_slurm_job_comp_type: Specify how to log job accounting.
  • openhpc_slurm_job_acct_gather_type: Define how to collect accounting data.
  • openhpc_slurm_job_acct_gather_frequency: Set the sampling period.
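A sketch of basic job accounting settings, using standard Slurm plugin names (the values shown are common choices, not requirements):

```yaml
openhpc_slurm_job_comp_type: jobcomp/filetxt              # log completed jobs to a text file
openhpc_slurm_job_acct_gather_type: jobacct_gather/linux  # gather usage data from /proc
openhpc_slurm_job_acct_gather_frequency: 30               # sampling period in seconds
```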

slurmdbd.conf

If you've enabled the database, configure these options in slurmdbd.conf:

  • openhpc_slurmdbd_port: Port for slurmdbd to listen on, defaults to 6819.
  • openhpc_slurmdbd_mysql_host: Where your MariaDB is running.
  • openhpc_slurmdbd_mysql_database: Database to use.
  • openhpc_slurmdbd_mysql_password: Password for the database (required).
  • openhpc_slurmdbd_mysql_username: Username for the database, defaults to slurm.
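Taken together, the slurmdbd-related variables might be set as follows (the database name and the vaulted password variable are hypothetical):

```yaml
openhpc_slurmdbd_port: 6819
openhpc_slurmdbd_mysql_host: "{{ openhpc_slurm_control_host }}"
openhpc_slurmdbd_mysql_database: slurm_acct_db
openhpc_slurmdbd_mysql_username: slurm
openhpc_slurmdbd_mysql_password: "{{ vault_slurmdbd_password }}"  # hypothetical vaulted secret
```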

Example Inventory

An example Ansible inventory might look like this:

[openhpc_login]
openhpc-login-0 ansible_host=10.60.253.40 ansible_user=centos

[openhpc_compute]
openhpc-compute-0 ansible_host=10.60.253.31 ansible_user=centos
openhpc-compute-1 ansible_host=10.60.253.32 ansible_user=centos

[cluster_login:children]
openhpc_login

[cluster_control:children]
openhpc_login

[cluster_batch:children]
openhpc_compute

Example Playbooks

To deploy your setup, your playbook might look like this:

---
- hosts:
  - cluster_login
  - cluster_control
  - cluster_batch
  become: yes
  roles:
    - role: stackhpc.openhpc
      openhpc_enable:
        control: "{{ inventory_hostname in groups['cluster_control'] }}"
        batch: "{{ inventory_hostname in groups['cluster_batch'] }}"
        runtime: true
      openhpc_slurm_service_enabled: true
      openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
      openhpc_slurm_partitions:
        - name: "compute"
      openhpc_cluster_name: openhpc
      openhpc_packages: []
...


Project Information

This role provisions an existing cluster to support OpenHPC control and batch compute functions and installs necessary runtime packages.

Installation

ansible-galaxy install stackhpc.openhpc

License

Apache-2.0