datarsense.dataikudss

Ansible role DSS

An Ansible role automating Dataiku DSS deployment.

Requirements

The role is compatible with Debian 10 and AlmaLinux 8. Debian 11 is not supported because it is not a supported OS for DSS 11.x and 12.x. CentOS 7 (EOL June 30th, 2024) and CentOS 8 (EOL December 31st, 2021) are no longer compatible with this role.

Ansible 5.8 or newer is required on the host running the Ansible playbook. The account used for running the playbook must have sudo privileges on the remote environment and must be allowed to become:

  • root for the pre-install stage (installing packages, creating the DSS service user)
  • the DSS service user for the DSS install, as DSS is not run as root.

If ansible-playbook is executed with a non-root user on the remote environment, the following configuration is added by this role in /etc/sudoers.d to allow this non-root user to act on behalf of the dataiku service account:

non-root-user ALL = (dataiku) NOPASSWD: ALL
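
For illustration, a minimal sketch of an inventory targeting such a remote environment is shown below; the host name dss01.example.com and the deploy account are placeholders to adapt to your environment.

all:
  hosts:
    dss01.example.com:
      ansible_user: deploy   # non-root account with sudo privileges on the DSS host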

Role Variables

| Variable | Default value | Usage |
|----------|---------------|-------|
| dss_base_repository_url | https://cdn.downloads.dataiku.com/public/studio | Base URL of the Dataiku CDN. The base URL is variabilized to make the role compatible with offline deployment |
| dataiku_python_api_package | "git+https://github.com/dataiku/dataiku-api-client-python@release/5.1#egg=dataiku-api-client" | Source repository of the dataiku-api-client-python module |
| dss_version | "12.1.0" | The DSS version to deploy |
| dss_api_version | "12.1.0" | Version of the DSS Python API client used by the Ansible role to configure DSS |
| dss_service_user | dataiku | Name of the DSS service user created by this playbook |
| dss_service_user_shell | "/bin/bash" | Shell of the dss_service_user. Keep /bin/bash |
| dss_service_user_home_basedir | /home | Home directory of the DSS instance. In some rare deployment scenarios, it can differ from /home |
| dss_install_dir_location | /opt/dataiku | Directory in which Dataiku binaries are downloaded and installed |
| dss_node_poll_fqdn | true | If true, use ansible_fqdn, else use ansible_host |
| dss_license_file | license.json | The DSS license file to deploy on the DSS host |
| dss_node_type | design | DSS node type. The only supported value in this release is design |
| dss_datadir | dss_data | Name of the DSS data directory |
| dss_network_port | 10000 | DSS network port |
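
As an illustration, a minimal sketch of a playbook task overriding a few of these variables is shown below; the version and data directory values are only examples, and a complete sample playbook is provided at the end of this README.

- hosts: all
  gather_facts: true
  tasks:
    - name: "Deploy a DSS design node"
      ansible.builtin.include_role:
        name: "datarsense.dataikudss"
      vars:
        dss_version: "12.1.0"            # DSS version to deploy
        dss_license_file: license.json   # DSS license file deployed on the DSS host
        dss_datadir: dss_data            # name of the DSS data directory
        dss_network_port: 10000          # DSS network port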

Optional variables for deploying JDBC drivers

DSS requires a third party JDBC connector JAR library provided by Oracle to be able to connect to MySQL databases.

The following Ansible variables enable MySQL support in DSS:

| Variable | Sample value | Usage |
|----------|--------------|-------|
| configure_mysql | false | Controls whether the MySQL JDBC driver is deployed. Default is false |
| mysql_jdbc_connector_url | https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.0.31/mysql-connector-j-8.0.31.jar | URL of the MySQL JDBC connector JAR library |
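
For example, a sketch of the variables enabling MySQL support is shown below; the connector version is the sample value from the table above and should be adjusted to your needs.

configure_mysql: true
mysql_jdbc_connector_url: "https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.0.31/mysql-connector-j-8.0.31.jar"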

Optional variables for tuning memory settings

As described in https://doc.dataiku.com/dss/latest/operations/memory.html, the memory allocation of DSS components can be tuned:

  • The backend is a Java process with a fixed memory allocation set by the dss_backend_xmx parameter. Backend memory requirements scale with the number of users, projects, datasets, recipes, … For large production instances, Dataiku recommends allocating 12 to 20 GB of memory for the backend.
  • Each job in DSS runs in a separate process called a JEK. If 10 jobs are running at a given time, there are 10 running JEKs. The default Xmx of the JEK is 2g, which is enough for a large majority of jobs. However, some jobs with a large number of partitions or a large number of files to process may require more. This is configured by the dss_jek_xmx parameter.
  • From time to time, the DSS backend delegates part of its work to worker processes called FEKs. This is done mostly for work that may consume huge amounts of memory: if a memory overrun happens, the FEK gets killed but the backend is unaffected. The default Xmx of each FEK is 2g, which is enough for a large majority of tasks. In some rare cases you may need to allocate more memory (generally at the direction of Dataiku Support). This is configured by the dss_fek_xmx parameter.
| Variable | Sample value |
|----------|--------------|
| dss_backend_xmx | 8g |
| dss_jek_xmx | 2g |
| dss_fek_xmx | 2g |

The DSS installer applies its default values for any memory parameter not configured as a variable in the Ansible playbook using this role.
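
As a sketch, the sample values from the table above translate into the following role variables; the figures are illustrative and should be sized to your instance.

dss_backend_xmx: 8g   # backend heap, 12 to 20 GB recommended for large production instances
dss_jek_xmx: 2g       # heap of each JEK (job execution) process
dss_fek_xmx: 2g       # heap of each FEK (backend worker) process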

Optional variables for enabling containerized execution on kubernetes

| Variable | Default value |
|----------|---------------|
| configure_k8s | false |
| k8s_executionconfigs | [] |
| download_dss_docker_images | false |
| download_dss_docker_images_url_tmp_directory | |
| download_dss_docker_images_url | {{ dss_base_repository_url }}/{{ dss_version }}/container-images/dataiku-dss-ALL-base_dss-{{ dss_version }}-r-py3.6.tar.gz |

k8s_executionconfigs is an array which can contain multiple containerized execution configurations to match different business scenarios: different Kubernetes quotas can be allowed depending on user permissions, access to CUDA resources can be limited to the data-scientist group, several base images can be offered to match business needs, ...

DSS Docker base images can be automatically downloaded as an archive from a web URL by configuring download_dss_docker_images: true. Use download_dss_docker_images_url_tmp_directory: /local/tmp to configure a custom Ansible temporary directory if the /tmp partition of the server is too small to download the 4.5G Docker images archive. The download_dss_docker_images_url download URL points to the Dataiku public CDN by default, but can be changed if needed. The DSS version must be present in the Docker archive file name so that this role can check consistency between the DSS version and the DSS Docker images version.
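
For example, a sketch of a download configuration using a custom temporary directory is shown below; the /local/tmp path is a placeholder, and the URL shown is the default public CDN location, which can be replaced by an internal mirror.

download_dss_docker_images: true
download_dss_docker_images_url_tmp_directory: /local/tmp   # used if /tmp is too small for the ~4.5G archive
download_dss_docker_images_url: "{{ dss_base_repository_url }}/{{ dss_version }}/container-images/dataiku-dss-ALL-base_dss-{{ dss_version }}-r-py3.6.tar.gz"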

A configuration example is provided below with two Kubernetes execution configs. Make sure to replace the sample repositoryURL and baseImage values, and to set a valid Kubernetes namespace when using this sample.

 - name: "Deploy DSS with containerized execution support"
      ansible.builtin.include_role:
        name: "datarsense.dataikudss"
      vars:
        dss_version: "11.1.1"
        download_dss_docker_images: true
        [...]
        configure_k8s: true
        k8s_executionconfigs:
          - name: test1
            type: KUBERNETES
            properties: []
            usableBy: ALLOWED
            allowedGroups:
              - administrators
            dockerNetwork: host
            dockerResources: []
            kubernetesNamespace: testnamespace
            kubernetesResources:
              memRequestMB: 2048
              memLimitMB: 2048
              cpuRequest: 2.0
              cpuLimit: 2.0
              customLimits: []
              customRequests: []
            hostPathVolumes: []
            isFinal: false
            ensureNamespaceCompliance: false
            createNamespace: false
            baseImageType: EXEC
            baseImage: dss_container_exec_base:latest
            repositoryURL: docker.io
            prePushMode: NONE
            dockerTLSVerify: false
          - name: test2
            type: KUBERNETES
            properties: []
            usableBy: ALLOWED
            allowedGroups:
              - data-scientists
            dockerNetwork: host
            dockerResources: []
            kubernetesNamespace: testnamespace
            kubernetesResources:
              memRequestMB: 8192
              memLimitMB: 8192
              cpuRequest: 16.0
              cpuLimit: 16.0
              customLimits: []
              customRequests: []
            hostPathVolumes: []
            isFinal: false
            ensureNamespaceCompliance: false
            createNamespace: false
            baseImageType: EXEC
            baseImage: dss_container_exec_cuda_base:latest
            repositoryURL: docker.io
            prePushMode: NONE
            dockerTLSVerify: false

Optional variables for enabling Spark support

| Variable | Default value |
|----------|---------------|
| configure_spark | true |
| dss_hadoop_package | "dataiku-dss-hadoop-standalone-libs-generic-hadoop3-12.1.0.tar.gz" |
| dss_spark_package | "dataiku-dss-spark-standalone-12.1.0-3.3.1-generic-hadoop3.tar.gz" |
| spark_executionconfigs | see below |

spark_executionconfigs is an array which can contain multiple Spark execution configurations to match different business scenarios, including Spark on Kubernetes. By default, this variable mirrors the default Spark configuration of a DSS instance:

spark_executionconfigs:
  - name: default
    description: |- 
      This default configuration sets a few parameters that are suitable for a wide range of use cases.
      Importantly in order to work in all circumstances it does not set the spark master configuration.
      It will thus use the master defined by your default Spark configuration.
      This may lead Spark jobs to execute locally without using your cluster. You may need for example to add spark.master=yarn-client
    conf:
      - key: spark.executor.memory
        value: 2400m
        isFinal: false
        secret: false
      - key: spark.sql.shuffle.partitions
        value: 40
        isFinal: false
        secret: false
      - key: spark.yarn.executor.memoryOverhead
        value: 600
        isFinal: false
        secret: false
      - key: spark.port.maxRetries
        value: 200
        isFinal: false
        secret: false
  - name: sample-yarn-config
    description: |- 
      This sample configuration shows a possible set of parameters for running DSS Spark jobs on YARN.
      These settings are suitable for a small cluster.
      You will need to tune spark.executor.instances spark.executor.cores and memory settings based on the size of your YARN cluster.
    conf:
      - key: spark.master
        value: yarn-client
        isFinal: false
        secret: false
      - key: spark.executor.memory
        value: 4g
        isFinal: false
        secret: false
      - key: spark.executor.instances
        value: 4
        isFinal: false
        secret: false
      - key: spark.executor.cores
        value: 2
        isFinal: false
        secret: false
      - key: spark.sql.shuffle.partitions
        value: 40
        isFinal: false
        secret: false
      - key: spark.yarn.executor.memoryOverhead
        value: 1200
        isFinal: false
        secret: false
      - key: spark.port.maxRetries
        value: 200
        isFinal: false
        secret: false      
  - name: sample-local-config
    description: |-
      This sample configuration shows a possible set of parameters for running DSS Spark jobs locally (non distributed).
      This can be useful for testing on small jobs as local Spark jobs start faster than YARN ones but is not suitable for production usage.
    conf:
      - key: spark.master
        value: local[4]
        isFinal: false
        secret: false  
      - key: spark.driver.memory
        value: 3g
        isFinal: false
        secret: false
      - key: spark.sql.shuffle.partitions
        value: 40
        isFinal: false
        secret: false
      - key: spark.port.maxRetries
        value: 200
        isFinal: false
        secret: false  

This variable can be used to configure Spark settings for Spark on Kubernetes when both configure_spark and configure_k8s are true.

A configuration example is provided below. Make sure to replace the sample repositoryURL: docker.io and baseImage: dss_spark_base:latest values, and to set a valid Kubernetes namespace when using this sample. The authenticationMode can be either BUILTIN or DYNAMIC_SERVICE_ACCOUNT: set this variable according to your user isolation needs. Read more in the Dataiku documentation on Workload isolation on Kubernetes.

The Spark executor CPU limit is set by spark.kubernetes.executor.limit.cores, and the CPU request by spark.executor.cores. The memory request and limit are both set to the sum of spark.executor.memory and spark.executor.memoryOverhead; for example, with spark.executor.memory: 4g and spark.executor.memoryOverhead: 8g as below, each executor pod requests and is limited to 12g of memory.

spark_executionconfigs:
  - name: SparkOnKubernetes
    description: Execute Spark jobs in a Kubernetes cluster.
    conf:
      - key: spark.master
        value: k8s://https://IP_OF_YOUR_K8S_CLUSTER
        isFinal: false
        secret: false
      - key: spark.executor.memory
        value: 4g
        isFinal: false
        secret: false
      - key: spark.executor.memoryOverhead
        value: 8g
        isFinal: false
        secret: false
      - key: spark.executor.instances
        value: 4
        isFinal: false
        secret: false
      - key: spark.executor.cores
        value: 2
        isFinal: false
        secret: false
      - key: spark.kubernetes.executor.limit.cores
        value: 4
        isFinal: false
        secret: false
      - key: spark.sql.shuffle.partitions
        value: 40
        isFinal: false
        secret: false
      - key: spark.port.maxRetries
        value: 200
        isFinal: false
        secret: false
    kubernetesSettings:
      managedKubernetes: true
      managedNamespace: testnamespace
      authenticationMode: BUILTIN
      ensureNamespaceCompliance: false
      createNamespace: false
      baseImageType: SPARK
      baseImage: dss_spark_base:latest
      repositoryURL: docker.io
      prePushMode: NONE
      dockerTLSVerify: false

Optional variables for enabling LDAP authentication

| Variable | Default value |
|----------|---------------|
| configure_ldap_settings | false |

| Variable | Sample value |
|----------|--------------|
| ldap_url | "ldap://ldap.internal.example.com/dc=example,dc=com" |
| ldap_binddn | "uid=readonly,ou=users,dc=example,dc=com" |
| ldap_bindpassword | "" |
| ldap_usetls | true |
| ldap_autoimportusers | true |
| ldap_userfilter | "(&(objectClass=posixAccount)(uid={USERNAME}))" |
| ldap_defaultuserprofile | "READER" |
| ldap_displaynameattribute | "cn" |
| ldap_emailattribute | "mail" |
| ldap_enablegroups | true |
| ldap_groupfilter | "(&(objectClass=posixGroup)(memberUid={USERDN}))" |
| ldap_groupnameattribute | "cn" |
| ldap_groupprofiles | [] |
| ldap_authorizedgroups | "dss-users" |
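
A minimal enablement sketch reusing the sample values above is shown below; the LDAP server, bind DN and authorized group are placeholders, and the full set of LDAP variables is shown in the sample playbook at the end of this README.

configure_ldap_settings: true
ldap_url: "ldap://ldap.internal.example.com/dc=example,dc=com"
ldap_binddn: "uid=readonly,ou=users,dc=example,dc=com"
ldap_bindpassword: ""
ldap_usetls: true
ldap_authorizedgroups: "dss-users"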

Optional variables for enabling OpenID Connect (OIDC) SSO authentication

Set configure_oidc_sso: true to enable OpenID Connect SSO.

| Variable | Default value |
|----------|---------------|
| configure_oidc_sso | false |

The following table shows an example of an OIDC SSO configuration with Google as the IDP. Change the sample values to match your IDP configuration.

Remapping rules are not supported by this role.

| Variable | Sample value |
|----------|--------------|
| oidc_clientid | test |
| oidc_clientsecret | test |
| oidc_scope | 'openid profile email' |
| oidc_issuer | https://accounts.google.com |
| oidc_authorizationendpoint | https://accounts.google.com/o/oauth2/v2/auth |
| oidc_tokenendpoint | https://oauth2.googleapis.com/token |
| oidc_jwksuri | https://www.googleapis.com/oauth2/v3/certs |
| oidc_claimkeyidentifier | email_verified |
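
As a sketch, the Google sample above maps to the following role variables; the client ID and secret are placeholders to replace with the credentials issued by your IDP.

configure_oidc_sso: true
oidc_clientid: test
oidc_clientsecret: test
oidc_scope: 'openid profile email'
oidc_issuer: https://accounts.google.com
oidc_authorizationendpoint: https://accounts.google.com/o/oauth2/v2/auth
oidc_tokenendpoint: https://oauth2.googleapis.com/token
oidc_jwksuri: https://www.googleapis.com/oauth2/v3/certs
oidc_claimkeyidentifier: email_verified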

Optional variables for enabling User Isolation Framework (UIF)

| Variable | Default value |
|----------|---------------|
| configure_uif | false |
| uif_users | {} |
| uif_userrules | [] |
| uif_grouprules | [] |

uif_users is a dict which contains the local UNIX users and groups for which UIF impersonation is allowed. Fill it with the list of users which have to be created by Ansible and the UNIX groups to which they belong. Only users belonging to these groups will be allowed to use the local code impersonation mechanism (Python, R, visual ML, Spark). More on https://knowledge.dataiku.com/latest/kb/governance/Which-activities-in-DSS-require-that-a-user-be-added-to-the.html

uif_userrules and uif_grouprules are arrays which can contain multiple UIF mapping rules, each configuring the mapping between a DSS user or group and the UNIX or Hadoop user effectively running the DSS job when user isolation is enabled. Read more on https://doc.dataiku.com/dss/latest/user-isolation/initial-setup.html

UIF rule types can be:

  • IDENTITY: maps each DSS user to a UNIX user of the same name.
  • SINGLE_MAPPING: maps a given DSS user or group configured in the dssUser or dssGroup variable to a given UNIX user defined in the targetUnix (or targetHadoop) variable.
  • REGEXP_RULE: maps DSS users matching a given regular expression configured in the ruleFrom variable to a given UNIX user defined in the targetUnix (or targetHadoop) variable.

A configuration example is shown below:

configure_uif: true
uif_users:
  userA:
    group: groupA
  userB:
    group: groupB
uif_userrules:
  - name: rule1
    scope: GLOBAL
    type: SINGLE_MAPPING
    dssUser: userA
    targetUnix: unix-userA
    targetHadoop: hadoop-userA
  - name: rule2
    scope: GLOBAL
    type: REGEXP_RULE
    ruleFrom: .*
    targetUnix: unix-userB
    targetHadoop: hadoop-userB
uif_grouprules:
  - name: ruleGroupA
    scope: GLOBAL
    type: SINGLE_MAPPING
    dssGroup: groupA
    targetUnix: unix-userA
    targetHadoop: hadoop-userA
  - name: ruleGroupB
    scope: GLOBAL
    type: REGEXP_RULE
    ruleFrom: .*
    targetUnix: unix-userB
    targetHadoop: hadoop-userB

Dependencies

Python 3, Ansible >= 5.8 and jmespath are required by this role. A requirements.txt file including these dependencies is provided with the role.

The following modules provided by Dataiku are required for DSS configuration automation.

Create a requirements.yml file in your playbook directory. The requirements.yml file has to include the following content to install the role and its dependencies:

---
- src: git+https://github.com/dataiku/dataiku-api-client-python
  name: dataiku-api-client-python
  version: release/8.0

- src: git+https://github.com/dataiku/dataiku-ansible-modules
  name: dataiku.dataiku-ansible-modules
  version: master

- src: git+https://github.com/datarsense/ansible-role-dataikudss.git
  name: datarsense.dataikudss
  version: main

Then, install the role and its dependencies with the following command:

ansible-galaxy install -r requirements.yml

Sample DSS deployment playbook

---

- hosts: all
  gather_facts: true
  tasks:
    - name: "Deploy DSS with containerized execution support"
      ansible.builtin.include_role:
        name: "datarsense.dataikudss"
      vars:
        dss_version: "11.1.1"
        dss_hadoop_package: "dataiku-dss-hadoop-standalone-libs-generic-hadoop3-11.1.1.tar.gz"
        dss_spark_package: "dataiku-dss-spark-standalone-11.1.1-3.2.1-generic-hadoop3.tar.gz"

        configure_ldap_settings: true
        ldap_url: "ldap://ldap.internal.example.com/dc=example,dc=com"
        ldap_binddn: "uid=readonly,ou=users,dc=example,dc=com"
        ldap_bindpassword: ""
        ldap_usetls: true
        ldap_autoimportusers: true
        ldap_userfilter: "(&(objectClass=posixAccount)(uid={USERNAME}))"
        ldap_defaultuserprofile: "READER"
        ldap_displaynameattribute: "cn"
        ldap_emailattribute: "mail"
        ldap_enablegroups: true
        ldap_groupfilter: "(&(objectClass=posixGroup)(memberUid={USERDN}))"
        ldap_groupnameattribute: "cn"
        ldap_groupprofiles: []
        ldap_authorizedgroups: "dss-users"
        
        configure_oidc_sso: true
        oidc_clientid: test
        oidc_clientsecret: test
        oidc_scope: 'openid profile email'
        oidc_issuer: https://accounts.google.com
        oidc_authorizationendpoint: https://accounts.google.com/o/oauth2/v2/auth
        oidc_tokenendpoint: https://oauth2.googleapis.com/token
        oidc_jwksuri: https://www.googleapis.com/oauth2/v3/certs
        oidc_claimkeyidentifier: email_verified

        configure_uif: true
        uif_users:
          userA:
            group: groupA
          userB:
            group: groupB
        uif_userrules:
          - name: rule1
            scope: GLOBAL
            type: SINGLE_MAPPING
            dssUser: userA
            targetUnix: unix-userA
            targetHadoop: hadoop-userA
          - name: rule2
            scope: GLOBAL
            type: REGEXP_RULE
            ruleFrom: .*
            targetUnix: unix-userB
            targetHadoop: hadoop-userB
        uif_grouprules:
          - name: ruleGroupA
            scope: GLOBAL
            type: SINGLE_MAPPING
            dssGroup: groupA
            targetUnix: unix-userA
            targetHadoop: hadoop-userA
          - name: ruleGroupB
            scope: GLOBAL
            type: REGEXP_RULE
            ruleFrom: .*
            targetUnix: unix-userB
            targetHadoop: hadoop-userB

        configure_spark: true
        spark_executionconfigs:
          - name": SparkOnKubernetes
            kubernetesSettings:
              managedKubernetes: true
              managedNamespace: testnamespace
              authenticationMode: BUILTIN
              ensureNamespaceCompliance: false
              createNamespace: false
              baseImageType: SPARK
              baseImage: dss_spark_base:latest
              repositoryURL: docker.io
              prePushMode: NONE
              dockerTLSVerify: false
        
        configure_k8s: true
        download_dss_docker_images: true
        k8s_executionconfigs:
          - name: test1
            type: KUBERNETES
            properties: []
            usableBy: ALLOWED
            allowedGroups:
              - administrators
            dockerNetwork: host
            dockerResources: []
            kubernetesNamespace: testnamespace
            kubernetesResources:
              memRequestMB: 2048
              memLimitMB: 2048
              cpuRequest: 2.0
              cpuLimit: 2.0
              customLimits: []
              customRequests: []
            hostPathVolumes: []
            isFinal: false
            ensureNamespaceCompliance: false
            createNamespace: false
            baseImageType: EXEC
            baseImage: dss_container_exec_base:latest
            repositoryURL: docker.io
            prePushMode: NONE
            dockerTLSVerify: false
          - name: test2
            type: KUBERNETES
            properties: []
            usableBy: ALLOWED
            allowedGroups:
              - data-scientists
            dockerNetwork: host
            dockerResources: []
            kubernetesNamespace: testnamespace
            kubernetesResources:
              memRequestMB: 8192
              memLimitMB: 8192
              cpuRequest: 16.0
              cpuLimit: 16.0
              customLimits: []
              customRequests: []
            hostPathVolumes: []
            isFinal: false
            ensureNamespaceCompliance: false
            createNamespace: false
            baseImageType: EXEC
            baseImage: dss_container_exec_cuda_base:latest
            repositoryURL: docker.io
            prePushMode: NONE
            dockerTLSVerify: false

License

BSD

Author Information
