Administrator Guide

This section provides guidance for administrators managing the BETIF-DIFAET system. It covers the installation, configuration, and maintenance tasks needed to keep the system running smoothly. The scripts shown below are also available in the helper-scripts repository.

Kernel-based Virtual Machine (KVM) Installation

The BETIF-DIFAET machine runs AlmaLinux 9, which, like most Linux distributions, supports the Kernel-based Virtual Machine (KVM) for virtualizing bare-metal resources. KVM is a full virtualization solution for Linux on x86 hardware with virtualization extensions (Intel VT or AMD-V). It makes it possible to run multiple virtual machines with unmodified Linux or Windows images. Each virtual machine has private virtualized hardware: a network card, disk, graphics adapter, etc.
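Before installing anything, it can be useful to confirm that the host CPU advertises the virtualization extensions mentioned above. The sketch below assumes the usual Linux /proc/cpuinfo layout, in which Intel VT-x appears as the vmx flag and AMD-V as the svm flag. The function name has_virt_extensions and its optional file argument are illustrative and not part of the helper scripts.

```shell
#!/usr/bin/env bash
# Succeeds if a "flags" line of the cpuinfo file lists vmx (Intel VT-x)
# or svm (AMD-V). The optional argument defaults to /proc/cpuinfo and
# exists only so the check can be tried on a sample file.
has_virt_extensions() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    grep -E '^flags' "$cpuinfo" | grep -Eqw 'vmx|svm'
}
```

A possible use is has_virt_extensions && echo "CPU virtualization extensions found". If the check fails on hardware that should support it, the extensions may be disabled in the firmware.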

To test the steps for creating the computing platform, 4 VMs were created through the virt-install interface, each running AlmaLinux 9 with 16 vCPUs, 32 GB of RAM, and 80 GB of disk space.

Listing 1 can be used to create a virtual machine acting as a master node for the BETIF-DIFAET system.

Listing 1 : Script to create a master node VM for BETIF-DIFAET
#!/usr/bin/env bash
vm_name='alma9-test-master'
vm_memory='32768'
vm_cpus='16'
vm_base_disk='/var/lib/libvirt/images/AlmaLinux-9-GenericCloud-latest.x86_64.qcow2'
vm_disk='/var/lib/libvirt/images/rke2-master-AlmaLinux-9-test-master.qcow2'

ci_user_data='user-data'
ci_network_config='network-configv3'

# Create an 80 GB copy-on-write overlay backed by the AlmaLinux cloud image.
qemu-img create -f qcow2 -b "$vm_base_disk" -F qcow2 "$vm_disk" 80G

virt-install \
    --connect qemu:///system \
    --name "$vm_name" \
    --memory "$vm_memory" \
    --machine q35 \
    --vcpus "$vm_cpus" \
    --cpu host-passthrough \
    --import \
    --cloud-init user-data="$ci_user_data" \
    --osinfo name=almalinux9 \
    --disk "$vm_disk" \
    --virt-type kvm \
    --network network=private-net \
    --network network=default \
    --noautoconsole

Listing 2 can be used to create a virtual machine acting as a worker node for the BETIF-DIFAET system.

Listing 2 : Script to create a worker node VM for BETIF-DIFAET
#!/usr/bin/env bash

if [ -z "$1" ]; then
    echo "Usage: $0 <worker-number>"
    exit 1
fi

N="$1"
vm_name="alma9-test-worker-$N"
vm_memory='32768'
vm_cpus='16'
vm_base_disk='/var/lib/libvirt/images/AlmaLinux-9-GenericCloud-latest.x86_64.qcow2'

ci_user_data='user-data'
ci_network_config='network-configv3'

vm_disk="/var/lib/libvirt/images/rke2-master-AlmaLinux-9-test-worker-$N.qcow2"

qemu-img create -f qcow2 -b "$vm_base_disk" -F qcow2 "$vm_disk" 80G

virt-install \
    --connect qemu:///system \
    --name "$vm_name" \
    --memory "$vm_memory" \
    --machine q35 \
    --vcpus "$vm_cpus" \
    --cpu host-passthrough \
    --import \
    --cloud-init user-data="$ci_user_data" \
    --osinfo name=almalinux9 \
    --disk "$vm_disk" \
    --virt-type kvm \
    --network network=private-net \
    --noautoconsole
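When several workers are needed, it helps to preview the VM names and disk paths that Listing 2 would produce before creating anything. The following is a minimal sketch that only mirrors the naming scheme of Listing 2; the function name worker_vm_plan is hypothetical, and the function does not call virt-install or qemu-img itself.

```shell
#!/usr/bin/env bash
# Print "<vm name> <disk path>" for workers 1..COUNT, following the
# naming scheme used in Listing 2. Nothing is created on disk.
worker_vm_plan() {
    local count="$1" n
    # Reject anything that is not a positive integer.
    case "$count" in
        ''|*[!0-9]*|0)
            echo "worker count must be a positive integer" >&2
            return 1
            ;;
    esac
    count=$((10#$count))   # force base 10 so "08" is not read as octal
    for ((n = 1; n <= count; n++)); do
        printf '%s %s\n' "alma9-test-worker-$n" \
            "/var/lib/libvirt/images/rke2-master-AlmaLinux-9-test-worker-$n.qcow2"
    done
}
```

Running worker_vm_plan 3 lists the three workers of the test setup; the script in Listing 2 can then be called once per worker number.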

A private network interface (Listing 3) was also created to enable direct connections between the VMs. For debugging and testing purposes, it is currently left open to allow direct access to the worker nodes. In the actual deployment, this network will block access to the worker VMs, leaving only the master accessible via SSH.

Listing 3 : Private network interface
<network>
  <name>private-net</name>
  <forward mode='nat'/>
  <bridge name="virbr1"/>
  <ip address="10.10.142.1" netmask="255.255.255.0">
    <dhcp>
      <range start="10.10.142.100" end="10.10.142.200"/>
    </dhcp>
  </ip>
</network>
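The network definition in Listing 3 has to be registered with libvirt before virt-install can refer to network=private-net. A minimal sketch, assuming the XML is saved as private-net.xml (a file name chosen here for illustration), uses the virsh net-define, net-start, and net-autostart subcommands. The run helper and the DRY_RUN variable are conveniences introduced for this example only.

```shell
#!/usr/bin/env bash
# Run a command, or only print it when DRY_RUN=1 (illustrative helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Register, start, and autostart the libvirt network described in an XML file.
# $1: path to the XML file (private-net.xml is an assumed name)
# $2: network name as written in the <name> element
setup_private_net() {
    local xml="${1:-private-net.xml}" name="${2:-private-net}"
    run virsh --connect qemu:///system net-define "$xml" &&
    run virsh --connect qemu:///system net-start "$name" &&
    run virsh --connect qemu:///system net-autostart "$name"
}
```

Calling DRY_RUN=1 setup_private_net prints the three virsh commands without executing them, which allows a review before touching the host configuration.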

Turn on GPU Virtualization

Up until now, the creation of VMs has relied on CPU virtualization technologies (e.g., VT-x for Intel CPUs), which do not expose hardware connected to the host machine via a PCIe interface. To enable the passthrough of PCIe expansion devices, such as GPUs or FPGA accelerator cards, the IOMMU (VT-d on Intel, AMD-Vi on AMD) must be activated in the BIOS setup menu.

Once enabled in the firmware, the procedure continues in the host operating system: PCIe passthrough must also be allowed in the kernel.

Run:

find /sys/kernel/iommu_groups/ -type l

If no output is returned, the kernel boot options must be updated.

On AlmaLinux 9, with an Intel CPU and chipset, this can be done with:

grubby --update-kernel=ALL --args="intel_iommu=on iommu=pt"

After a reboot, the iommu_groups folder should be populated with all devices that can be passed through to VMs.
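The raw find output is hard to read when many devices are present. The sketch below prints one line per device with its IOMMU group number, relying only on the sysfs layout /sys/kernel/iommu_groups/<group>/devices/<address>. The function name list_iommu_groups and its optional root argument are illustrative.

```shell
#!/usr/bin/env bash
# Print one "group <n>: <pci address>" line per device found under the
# iommu_groups tree. The root argument defaults to the real sysfs path
# and can be overridden to try the function on a test directory.
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}" dev group
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue        # glob did not match: no groups
        group="${dev%/devices/*}"        # strip "/devices/<address>"
        group="${group##*/}"             # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}
```

The output can be piped through sort -V to order the groups numerically. An empty result corresponds to the case described above, in which the kernel boot options still need to be updated.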

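To use a device inside a VM, it must not be in use by the host system, i.e., the host's default driver must not be bound to it. For example, with two identical GPUs, where the device cannot be singled out by vendor and model, a rule matching its PCI address must be added to /etc/udev/rules.d/99-vfio.rules so that it is applied at boot time: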

ACTION=="add", SUBSYSTEM=="pci", KERNEL=="0000:<PCI-ID of device>", DRIVER=="", ATTR{driver_override}="vfio-pci"

The PCI address of the device (bus:device.function) can be retrieved using lspci.
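To avoid typing errors in the rule, it can be generated from the device address. The following is a sketch under the assumption that the full address, including the 0000 domain prefix, is passed in the form printed by lspci -D. The function name vfio_udev_rule and the example address 0000:01:00.0 are illustrative.

```shell
#!/usr/bin/env bash
# Print the vfio-pci override rule for one PCI address (e.g. 0000:01:00.0).
vfio_udev_rule() {
    local addr="$1"
    # Expect domain:bus:device.function, written in hexadecimal.
    if [[ ! "$addr" =~ ^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$ ]]; then
        echo "invalid PCI address: $addr" >&2
        return 1
    fi
    printf 'ACTION=="add", SUBSYSTEM=="pci", KERNEL=="%s", DRIVER=="", ATTR{driver_override}="vfio-pci"\n' "$addr"
}
```

For instance, vfio_udev_rule 0000:01:00.0 >> /etc/udev/rules.d/99-vfio.rules appends the rule for that device.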

Next, load the vfio-pci driver, which binds to the device on the host so that it can be handed over to a VM:

modprobe vfio-pci

Important

Due to issues with the order in which rules are applied during boot, vfio-pci is not yet loaded automatically. This command must therefore be run manually after every reboot. The driver will automatically attach to the configured hardware.

Finally, regenerate the initramfs with:

dracut -f

At this point, a VM with access to a GPU can be created with the following command:

virt-install \
   --connect qemu:///system \
   --name "$vm_name" \
   --memory "$vm_memory" \
   --machine q35 \
   --vcpus "$vm_cpus" \
   --cpu host-passthrough \
   --import \
   --cloud-init user-data="$ci_user_data" \
   --osinfo name=almalinux9 \
   --disk "$vm_disk" \
   --virt-type kvm \
   --network network=private-net \
   --network network=default \
   --noautoconsole \
   --hostdev <PCI-ID of device>
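Besides the lspci-style address, virt-install also accepts the libvirt node device name for --hostdev, as listed by virsh nodedev-list --cap pci. The sketch below converts one form into the other; the function name pci_to_nodedev and the example address are illustrative.

```shell
#!/usr/bin/env bash
# Convert a PCI address such as 0000:01:00.0 into the libvirt node device
# name (pci_0000_01_00_0) by replacing every ":" and "." with "_".
pci_to_nodedev() {
    local addr="$1"
    printf 'pci_%s\n' "${addr//[:.]/_}"
}
```

With this helper, the last option of the command above could be written as --hostdev "$(pci_to_nodedev 0000:01:00.0)".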

Setting up the GPU Worker Node

To make the GPU deployable in the k8s cluster, NVIDIA's drivers must be installed on the worker node hosting the virtualized GPU.

Update the system first, then install the utilities needed for the following steps:

dnf check-update --security
dnf upgrade --security
dnf install pciutils
dnf install epel-release
dnf install dkms gcc

Then the NVIDIA toolkit and driver must be installed, following the instructions from [GPU1]:

dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
dnf clean all
dnf install cuda-toolkit-12-4
dnf module install nvidia-driver:580-dkms

To build the NVIDIA driver module and add it to the kernel, run dkms status to find the installed driver version, and then:

dkms install nvidia/<version>
reboot

Once the node is running again after the reboot, the driver can be checked by running nvidia-smi.

Important

The open-source nouveau driver may be loaded instead of the proprietary NVIDIA one. If that is the case, it can be disabled as follows:

sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<EOF
blacklist nouveau
options nouveau modeset=0
EOF
dracut --force
reboot
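Whether nouveau or the NVIDIA module is actually loaded can be read from the lsmod output. The following is a minimal sketch that takes lsmod-style text on standard input and looks only at the module name in the first column; the function name loaded_gpu_module is illustrative.

```shell
#!/usr/bin/env bash
# Read lsmod-style output on stdin and print which GPU kernel module is
# loaded: "nvidia", "nouveau", "both", or "none".
loaded_gpu_module() {
    awk '
        $1 == "nvidia"  { nvidia = 1 }
        $1 == "nouveau" { nouveau = 1 }
        END {
            if (nvidia && nouveau) print "both"
            else if (nvidia)       print "nvidia"
            else if (nouveau)      print "nouveau"
            else                   print "none"
        }'
}
```

On the worker node this would be used as lsmod | loaded_gpu_module. A result of nouveau or both suggests that the blacklist above has not taken effect yet.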

NFS server

All pods spawned by JupyterHub will contain a persistent-storage folder hosted on an NFS server.

To provision this storage, start from a fresh VM and run:

dnf install nfs-utils
systemctl enable --now nfs-server

Then create the folder to export with sudo mkdir -p /srv/nfs/k8s and change its owner and read/write permissions:

chown -R nobody:nobody /srv/nfs/k8s/
chmod -R 0777 /srv/nfs/k8s

Finally, the export rule is added to /etc/exports and the filesystem is made visible:

echo '/srv/nfs/k8s *(rw,sync,no_subtree_check,no_root_squash)' >> /etc/exports
exportfs -ra
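Because the export rule lives in /etc/exports, repeating the provisioning steps can leave duplicate lines in that file. The sketch below appends the rule only when the exact line is not already present. The function name add_nfs_export is hypothetical, and the file path is a parameter so that the function can be tried on a scratch file first.

```shell
#!/usr/bin/env bash
# Append an export rule to an exports file unless the exact line exists.
# $1: exports file (normally /etc/exports)
# $2: exported directory
# $3: client specification with options
add_nfs_export() {
    local file="$1" dir="$2" spec="$3"
    local line="$dir $spec"
    touch "$file"
    # -x: match whole lines, -F: treat the rule as a fixed string.
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

For the setup described here, the call would be add_nfs_export /etc/exports /srv/nfs/k8s '*(rw,sync,no_subtree_check,no_root_squash)', followed by exportfs -ra.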

Turning the VMs into a K8s cluster

With KVM, it was possible to set up 4 VMs with a chosen flavour of virtualized hardware and software. The next step towards a computing platform is to set up a Kubernetes cluster that hosts all the services needed for the research environment, from authentication to end-user tools for launching code.

To streamline the process of setting up the cluster, RKE2, also known as Rancher Kubernetes Engine 2 [RKE2], was used. It is a Kubernetes distribution developed by Rancher (now part of SUSE) that emphasizes security, stability, and ease of deployment. It’s designed to be fully Kubernetes-conformant, meaning it behaves in accordance with the official Kubernetes standards and APIs, making it compatible with standard Kubernetes tooling and workloads.

RKE2 is packaged as a single binary, which simplifies installation and maintenance. This binary includes everything needed to run a Kubernetes node, including the container runtime (which is containerd, rather than Docker), as well as the control plane and networking components. This design eliminates many of the dependencies and complexities found in traditional Kubernetes setups.

Listing 4 can be used to install RKE2 on the master node:

Listing 4 : Script to install RKE2 on the master node
mkdir -p /etc/rancher/rke2/
# Write the server configuration: extra addresses for the API server TLS certificate.
cat > /etc/rancher/rke2/config.yaml <<'EOF'
tls-san:
  - 192.168.122.59
  - 10.10.142.115
EOF
curl -sfL https://get.rke2.io | sh -
systemctl enable rke2-server.service
systemctl start rke2-server.service
sudo cp /etc/rancher/rke2/rke2.yaml /home/clouduser/
sudo chown clouduser /home/clouduser/rke2.yaml
export KUBECONFIG=/home/clouduser/rke2.yaml

Listing 5 can be used to install RKE2 on the worker node:

Listing 5 : Script to install RKE2 on the worker node
#!/bin/bash
mkdir -p /etc/rancher/rke2/
# Write the agent configuration: server address, join token, and node name.
cat > /etc/rancher/rke2/config.yaml <<'EOF'
server: https://10.10.142.115:9345
token: K10e67c7985e7db4f9ed9b0353ae10f53c179a51eb4ed8443ca8596873a3327188d::server:c5cb82b52a5650b010f9e3e5f6e76b52
node-name: worker-1
EOF
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
systemctl enable rke2-agent.service
systemctl start rke2-agent.service

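Here, node-name is the name of the worker node, which can be set to worker-1, worker-2, etc. depending on the number of worker nodes in the cluster. The token is the cluster join token, which the RKE2 server writes to /var/lib/rancher/rke2/server/node-token on the master node.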
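Since only the node name changes from one worker to the next, the configuration file can be generated rather than edited by hand. The following is a minimal sketch that reproduces the three keys of Listing 5; the function name rke2_agent_config and its argument order are illustrative.

```shell
#!/usr/bin/env bash
# Print an RKE2 agent config.yaml for one worker node.
# $1: IP address of the master node
# $2: cluster join token
# $3: worker number, used to build the node name
rke2_agent_config() {
    local server_ip="$1" token="$2" n="$3"
    cat <<EOF
server: https://${server_ip}:9345
token: ${token}
node-name: worker-${n}
EOF
}
```

On worker 2, for example, the output of rke2_agent_config 10.10.142.115 "$TOKEN" 2 could be redirected to /etc/rancher/rke2/config.yaml, with TOKEN holding the join token of the cluster.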

Using RKE2, the computing architecture shown in Fig. 8 was built:

  • The bare-metal server serves as the foundation for all virtualization layers above;

  • The Kubernetes cluster is made up of three Virtual Machines created using KVM, which can communicate through a private network;

  • The Master node is accessible from the host.

Fig. 8 Schematic of the BETIF-DIFAET architecture.

Deploying the BETIF-DIFAET jhub platform

Once the Kubernetes cluster is set up with RKE2, on the master node the kube-config file is available at /home/clouduser/rke2.yaml. This file can be used to interact with the Kubernetes cluster using kubectl, the command-line tool for Kubernetes.
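A quick way to confirm that the kube-config file works is to list the nodes and check that all of them report Ready. The sketch below filters the output of kubectl get nodes --no-headers, whose second column holds the node status; the function name not_ready_nodes is illustrative.

```shell
#!/usr/bin/env bash
# Read "kubectl get nodes --no-headers" output on stdin and print the
# names of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
    awk '$2 !~ /^Ready/ { print $1 }'
}
```

A typical call is kubectl --kubeconfig /home/clouduser/rke2.yaml get nodes --no-headers | not_ready_nodes. An empty result means that every node has joined and is ready.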

Danger

The kube-config file contains sensitive information, such as the token used to authenticate with the cluster. It should be kept secure and not shared publicly.

Important

Currently, the BETIF-DIFAET platform does not have a DNS-resolvable domain name, so the IP address of the master node is used to access the platform. To use a more user-friendly name, add the following line to the /etc/hosts file:

<MASTER_NODE_IP> betif-difaet.jhub

where <MASTER_NODE_IP> is the IP address of the master node.

The BETIF-DIFAET platform is deployed using Helm charts [HELM], which are packages of pre-configured Kubernetes resources. The recipe for deploying the platform is available in the charts repository.

The steps to deploy the platform are as follows:

  1. Install Helm: Ensure that Helm is installed on the same machine from which you connect to and control the Kubernetes cluster. An example of how to install Helm is shown here.

  2. Add the following requirements:

  • Cert-Manager:

    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.2/cert-manager.yaml
    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.2/cert-manager.crds.yaml

  • Label nodes with no GPU, to perform node selection during the deployment of the JupyterHub platform:

    kubectl label node worker-N nvidia.com/gpu.present=false

    This is not needed for the node with the GPU, which will be automatically detected by the NVIDIA device plugin (see later).

  3. NFS external provisioner:

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update
kubectl create namespace kube-storage
helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --namespace kube-storage \
  --set nfs.server=<NFS_SERVER_IP> \
  --set nfs.path=/srv/nfs/k8s

where <NFS_SERVER_IP> is the IP address of the NFS VM created earlier.

  4. Deploy the BETIF-DIFAET platform: Use the Helm chart to deploy the platform.

git clone git@github.com:BETIF-DIFAET/charts.git
cd charts/stable/jhubaas
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm dependency build
kubectl create namespace jhub
helm upgrade --install --cleanup-on-fail --namespace jhub jhub ./

The last command deploys the JupyterHub platform in the jhub namespace of the Kubernetes cluster. The deployment will take a few minutes to complete, and you can monitor the status of the pods using:

kubectl get pods -n jhub

Once the deployment is complete, you can access the JupyterHub platform using the IP address of the master node. If you have set up a domain name in your /etc/hosts file, you can access it using that domain name as well (in this case betif-difaet.jhub).

Customizing the jhub

To customize the JupyterHub configuration, you can modify the values.yaml file in the Helm chart directory. This file contains various configuration options for JupyterHub, including authentication methods (currently the Einstein Telescope IAM instance), resource limits, and more.

Once you have made your changes to the values.yaml file, you can apply them by running:

helm upgrade --install --cleanup-on-fail --namespace jhub jhub ./

Adding CVMFS to the cluster

To add CVMFS support to the Kubernetes cluster, a dedicated Helm chart is available in the charts repository.

The installation can be done by following these steps:

  1. Clone the charts repository (if not already done):

git clone https://github.com/BETIF-DIFAET/charts.git

  2. Deploy the CVMFS service:

git clone -b release-2.0 https://github.com/BETIF-DIFAET/cvmfs-csi.git
helm install cvmfs ./cvmfs-csi/deployments/helm/cvmfs-csi -f ./charts/stable/cvmfs/config.yaml -n jhub
kubectl create -f ./charts/stable/cvmfs/volume-storageclass-pvc.yaml
kubectl create -f ./charts/stable/cvmfs/cvmfs-idler-daemonset.yaml

Customizing cvmfs

To customize the CVMFS configuration, you can modify the config.yaml file in the charts/stable/cvmfs/ directory. This file contains various configuration options for CVMFS.

Uninstall CVMFS

Warning

Before uninstalling the Helm package, first delete the cvmfs-idler-daemonset resource. Otherwise, the cvmfs-idler pods are terminated with SIGKILL and end up in the Error state.

To uninstall the CVMFS Helm package, run:

kubectl delete daemonset cvmfs-idler-daemonset -n jhub
helm uninstall cvmfs -n jhub

Adding GPU support to the cluster

To add GPU support to the Kubernetes cluster, the steps above to set up the GPU worker node must be followed. Once the node is ready, the NVIDIA GPU Operator is used to manage the GPU resources in the cluster. The dedicated Helm chart installation is available in the charts repository.

The installation can be done by following these steps:

  1. Clone the charts repository (if not already done):

git clone https://github.com/BETIF-DIFAET/charts.git

  2. Deploy the NVIDIA GPU operator:

kubectl create namespace gpu-operator
kubectl create -f ./charts/stable/gpu-operator/time-slicing-config.yaml
kubectl create -f ./charts/stable/gpu-operator/gpu-operator.yaml

The GPU operator will automatically detect the GPU on the worker node and manage its resources. The time-slicing-config.yaml file is used to configure the GPU time slicing feature, which allows multiple pods to share the same GPU.

Note

Currently, the GPU time slicing is set for 10 replicas, which means that up to 10 pods can share the same GPU. This setting can be adjusted in the time-slicing-config.yaml file.
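As a worked example of the replica arithmetic: assuming the time-slicing configuration applies to every GPU on the node, each physical GPU is advertised to the scheduler as that many nvidia.com/gpu resources. The function name advertised_gpus is illustrative.

```shell
#!/usr/bin/env bash
# Number of nvidia.com/gpu resources a node advertises when every
# physical GPU is time-sliced into the same number of replicas.
# $1: physical GPUs on the node, $2: replicas per GPU
advertised_gpus() {
    local physical="$1" replicas="$2"
    echo $(( physical * replicas ))
}
```

For the current setting, advertised_gpus 1 10 gives 10, which can be compared with the nvidia.com/gpu entry reported by kubectl describe node for the GPU worker.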

Uninstall GPU operator

To uninstall the GPU operator Helm package, run:

kubectl delete helmchart gpu-operator -n kube-system
kubectl delete namespace gpu-operator

References