===================
Administrator Guide
===================

This section provides guidance for administrators managing the BETIF-DIFAET system. It covers installation, configuration, and maintenance tasks to ensure the system runs smoothly.

The scripts shown below are also available in the `helper-scripts repository`_.

.. _helper-scripts repository: https://github.com/BETIF-DIFAET/helper-scripts

-----------------------------------------------
Kernel-based Virtual Machine (KVM) Installation
-----------------------------------------------

The BETIF-DIFAET machine runs AlmaLinux 9 as its operating system and, like many Linux distributions, supports Kernel-based Virtual Machines (KVM) for virtualizing bare-metal resources. KVM is a full virtualization solution for Linux on x86 hardware containing virtualization extensions (Intel VT or AMD-V). It allows running multiple virtual machines with unmodified Linux or Windows images. Each virtual machine has private virtualized hardware: a network card, disk, graphics adapter, etc.

To test the steps for creating the computing platform, four VMs were created through the ``virt-install`` interface, each running AlmaLinux 9 with 16 vCPUs, 32 GB of RAM, and 80 GB of disk space.

:numref:`create-master-node-vm` can be used to create a virtual machine acting as a master node for the BETIF-DIFAET system.

.. code-block:: bash
   :caption: : Script to create a master node VM for BETIF-DIFAET
   :name: create-master-node-vm

   #!/usr/bin/env bash

   vm_name='alma9-test-master'
   vm_memory='32768'
   vm_cpus='16'
   vm_base_disk='/var/lib/libvirt/images/AlmaLinux-9-GenericCloud-latest.x86_64.qcow2'
   vm_disk='/var/lib/libvirt/images/rke2-master-AlmaLinux-9-test-master.qcow2'
   ci_user_data='user-data'
   ci_network_config='network-configv3'

   # Create an 80 GB copy-on-write disk backed by the AlmaLinux cloud image
   qemu-img create -f qcow2 -b "$vm_base_disk" -F qcow2 "$vm_disk" 80G

   virt-install \
     --connect qemu:///system \
     --name "$vm_name" \
     --memory "$vm_memory" \
     --machine q35 \
     --vcpus "$vm_cpus" \
     --cpu host-passthrough \
     --import \
     --cloud-init user-data="$ci_user_data" \
     --osinfo name=almalinux9 \
     --disk "$vm_disk" \
     --virt-type kvm \
     --network network=private-net \
     --network network=default \
     --noautoconsole

:numref:`create-worker-node-vm` can be used to create a virtual machine acting as a worker node for the BETIF-DIFAET system.

.. code-block:: bash
   :caption: : Script to create a worker node VM for BETIF-DIFAET
   :name: create-worker-node-vm

   #!/usr/bin/env bash

   # The worker index is required as the first argument
   if [ -z "$1" ]; then
     echo "Usage: $0 "
     exit 1
   fi

   N="$1"
   vm_name="alma9-test-worker-$N"
   vm_memory='32768'
   vm_cpus='16'
   vm_base_disk='/var/lib/libvirt/images/AlmaLinux-9-GenericCloud-latest.x86_64.qcow2'
   ci_user_data='user-data'
   ci_network_config='network-configv3'
   vm_disk="/var/lib/libvirt/images/rke2-master-AlmaLinux-9-test-worker-$N.qcow2"

   # Create an 80 GB copy-on-write disk backed by the AlmaLinux cloud image
   qemu-img create -f qcow2 -b "$vm_base_disk" -F qcow2 "$vm_disk" 80G

   virt-install \
     --connect qemu:///system \
     --name "$vm_name" \
     --memory "$vm_memory" \
     --machine q35 \
     --vcpus "$vm_cpus" \
     --cpu host-passthrough \
     --import \
     --cloud-init user-data="$ci_user_data" \
     --osinfo name=almalinux9 \
     --disk "$vm_disk" \
     --virt-type kvm \
     --network network=private-net \
     --noautoconsole

A :ref:`private-network-interface` was also created to enable direct connections between the VMs. For debugging and testing purposes it is still left open, allowing direct access to the worker nodes. In the actual deployment this network will block access to the worker VMs, leaving only the master accessible via SSH.

.. code-block:: xml
   :caption: : Private network interface
   :name: private-network-interface

   private-net

^^^^^^^^^^^^^^^^^^^^^^^^^^
Turn on GPU Virtualization
^^^^^^^^^^^^^^^^^^^^^^^^^^

Up until now, the creation of VMs has relied on virtualization technologies (e.g., VT-x for Intel CPUs), which do not expose hardware connected to the host machine via a PCIe interface. To enable the passthrough of PCIe expansion devices, such as GPUs or FPGA accelerator cards, ``VT-d`` (Intel) or ``AMD-Vi`` (AMD) must be activated in the BIOS setup menu.

Once enabled in the firmware, the procedure continues in the host operating system: PCIe passthrough must also be allowed in the kernel. Run:

.. code-block:: bash

   find /sys/kernel/iommu_groups/ -type l

If no output is returned, the kernel boot options must be updated. On AlmaLinux 9, with an Intel CPU and chipset, this can be done with:

.. code-block:: bash

   grubby --update-kernel=ALL --args="intel_iommu=on iommu=pt"

After a reboot, the ``iommu_groups`` folder should be populated with all devices that can be passed through to VMs.

To use a device inside a VM, it must not be in use by the host system, i.e., the default driver must not be loaded. For example, with two identical GPUs, a rule must be added at boot time in ``/etc/udev/rules.d/99-vfio.rules``:

.. code-block::

   ACTION=="add", SUBSYSTEM=="pci", KERNEL=="0000:", DRIVER=="", ATTR{driver_override}="vfio-pci"

The ``PCI-ID`` (the device address that follows ``0000:`` in the ``KERNEL`` match) can be retrieved using ``lspci``.

Next, load the ``vfio-pci`` driver, which is responsible for virtualization handling:

.. code-block:: bash

   modprobe vfio-pci

.. IMPORTANT::
   Due to issues with the order in which rules are applied during boot, ``vfio-pci`` is not yet loaded automatically. This command must therefore be run manually after every reboot.

The driver will automatically attach to the configured hardware. Finally, update the operating system configuration (``initramfs``) with:

.. code-block:: bash

   dracut -f

At this point, a VM with access to a GPU can be created with the following command:

.. code-block:: bash

   virt-install \
     --connect qemu:///system \
     --name "$vm_name" \
     --memory "$vm_memory" \
     --machine q35 \
     --vcpus "$vm_cpus" \
     --cpu host-passthrough \
     --import \
     --cloud-init user-data="$ci_user_data" \
     --osinfo name=almalinux9 \
     --disk "$vm_disk" \
     --virt-type kvm \
     --network network=private-net \
     --network network=default \
     --noautoconsole \
     --hostdev

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Setting up the GPU Worker Node
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To make the GPU deployable in the Kubernetes cluster, the NVIDIA drivers must be installed on the worker node that hosts the GPU. After updating the kernel, install the utilities needed for the next steps:

.. code-block:: bash

   dnf check-update --security
   dnf upgrade --security
   dnf install pciutils
   dnf install epel-release
   dnf install dkms gcc

Then install the NVIDIA CUDA Toolkit and driver, following the instructions from [GPU1]_:

.. code-block:: bash

   dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
   dnf clean all
   dnf install cuda-toolkit-12-4
   dnf module install nvidia-driver:580-dkms

To build and install the NVIDIA kernel module, run ``dkms status`` (which lists the registered modules and their versions) and then:

.. code-block:: bash

   dkms install nvidia/
   reboot

Once the node is back up after the reboot, check that the driver is working by running ``nvidia-smi``.

.. IMPORTANT::
   The open-source ``nouveau`` driver may be loaded instead of the proprietary NVIDIA one. If that is the case, it can be fixed with the following:

   .. code-block:: bash

      sudo tee /etc/modprobe.d/blacklist-nouveau.conf <

.. code-block:: bash

   /etc/rancher/rke2/config.yaml
   curl -sfL https://get.rke2.io | sh -
   systemctl enable rke2-server.service
   systemctl start rke2-server.service
   sudo cp /etc/rancher/rke2/rke2.yaml /home/clouduser/
   sudo chown clouduser /home/clouduser/rke2.yaml
   export KUBECONFIG=/home/clouduser/rke2.yaml

:numref:`install-rke-worker` can be used to install RKE2 on the worker node:

.. code-block:: bash
   :caption: : Script to install RKE2 on the worker node
   :name: install-rke-worker

   #!/bin/bash
   mkdir -p /etc/rancher/rke2/
   echo """
   server: https://10.10.142.115:9345
   token: K10e67c7985e7db4f9ed9b0353ae10f53c179a51eb4ed8443ca8596873a3327188d::server:c5cb82b52a5650b010f9e3e5f6e76b52
   node-name: worker-1
   """ > /etc/rancher/rke2/config.yaml
   curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
   systemctl enable rke2-agent.service
   systemctl start rke2-agent.service

Here ``node-name`` is the name of the worker node, which can be set to ``worker-1``, ``worker-2``, etc., depending on the number of worker nodes in the cluster.

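
The per-worker configuration described above can also be generated instead of edited by hand for each node. The following is a minimal sketch and is not part of the helper-scripts repository: the function name ``render_agent_config`` and its three arguments are hypothetical, while port ``9345`` and the ``worker-N`` naming are taken from the worker script above.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper: print an RKE2 agent configuration for one worker node.
   # Arguments (all assumptions made for illustration):
   #   $1  IP address of the RKE2 server (master node)
   #   $2  cluster join token
   #   $3  worker index N, used to build the node name worker-N
   render_agent_config() {
       local server_ip="$1" token="$2" index="$3"
       printf 'server: https://%s:9345\n' "$server_ip"
       printf 'token: %s\n' "$token"
       printf 'node-name: worker-%s\n' "$index"
   }

   # Example usage (writes the file read by rke2-agent):
   #   render_agent_config 10.10.142.115 "$RKE2_TOKEN" 2 > /etc/rancher/rke2/config.yaml

Passing the token as an argument, for example from an environment variable, keeps it out of scripts that may end up under version control.
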

Using RKE2, the computing architecture shown in :numref:`betif-arch` was built:

* The bare-metal server is the foundation for all virtualization layers above;
* The Kubernetes cluster is made up of three virtual machines created using KVM, which communicate through a private network;
* The master node is accessible from the host.

.. _betif-arch:

.. figure:: betif_arch.png
   :alt: BETIF-DIFAET architecture

   Schematic of the BETIF-DIFAET architecture.

----------------------------------------
Deploying the BETIF-DIFAET jhub platform
----------------------------------------

Once the Kubernetes cluster is set up with RKE2, the kube-config file is available on the master node at ``/home/clouduser/rke2.yaml``. This file can be used to interact with the Kubernetes cluster using ``kubectl``, the command-line tool for Kubernetes.

.. DANGER::
   The kube-config file contains sensitive information, such as the token used to authenticate with the cluster. **It should be kept secure and not shared publicly.**

.. IMPORTANT::
   Currently, the BETIF-DIFAET platform does not have a DNS-resolved domain name. Therefore, the IP address of the master node is used to access the platform. To create a user-friendly domain name, add the following line to the ``/etc/hosts`` file:

   .. code-block:: bash

      123.456.789.012 betif-difaet.jhub

   where ``123.456.789.012`` is the IP address of the master node.

The BETIF-DIFAET platform is deployed using Helm charts [HELM]_, which are packages of pre-configured Kubernetes resources. The recipe for deploying the platform is available in the `charts repository`_.

.. _charts repository: https://github.com/BETIF-DIFAET/charts

The steps to deploy the platform are as follows:

1. **Install Helm**: Ensure that Helm is installed on the machine from which you connect to and control the Kubernetes cluster. An example of how to install Helm is shown `here`_.

   .. _here: https://github.com/BETIF-DIFAET/helper-scripts/blob/main/helm/install_helm.sh

2. Add the following requirements:

   * Cert-Manager:

     .. code-block:: bash

        kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.2/cert-manager.yaml
        kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.2/cert-manager.crds.yaml

   * Label nodes with no GPU, to perform node selection during the deployment of the JupyterHub platform:

     .. code-block:: bash

        kubectl label node worker-N nvidia.com/gpu.present=false

     This is not needed for the node with the GPU, which will be automatically detected by the NVIDIA device plugin (described later in this guide).

3. NFS external provisioner:

   .. code-block:: bash

      helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
      helm repo update
      kubectl create namespace kube-storage
      helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
        --namespace kube-storage \
        --set nfs.server= \
        --set nfs.path=/srv/nfs/k8s

   where the value of ``nfs.server`` is the IP address of the NFS VM created earlier.

4. **Deploy the BETIF-DIFAET platform**: Use the Helm chart to deploy the platform.

   .. code-block:: bash

      git clone git@github.com:BETIF-DIFAET/charts.git
      cd charts/stable/jhubaas
      helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
      helm dependency build
      kubectl create namespace jhub
      helm upgrade --install --cleanup-on-fail --namespace jhub jhub ./

   The last command deploys the JupyterHub platform in the ``jhub`` namespace of the Kubernetes cluster. The deployment takes a few minutes to complete, and you can monitor the status of the pods using:

   .. code-block:: bash

      kubectl get pods -n jhub

Once the deployment is complete, you can access the JupyterHub platform using the IP address of the master node. If you have set up a domain name in your ``/etc/hosts`` file, you can also access it using that domain name (in this case ``betif-difaet.jhub``).

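
The ``/etc/hosts`` entry mentioned above can be added without creating duplicates when the setup is repeated. The following is a minimal sketch under stated assumptions: the function name ``add_hosts_entry`` and the optional third argument (a hosts file path, defaulting to ``/etc/hosts``) are hypothetical and not part of the BETIF-DIFAET repositories.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper: map a host name to an IP address in a hosts file,
   # doing nothing if the name is already mapped.
   #   $1  IP address (for example, that of the master node)
   #   $2  host name (for example, betif-difaet.jhub)
   #   $3  hosts file, defaults to /etc/hosts
   add_hosts_entry() {
       local ip="$1" name="$2" file="${3:-/etc/hosts}"
       # Search non-comment lines for the name in any field after the address.
       if awk -v n="$name" '
               /^[[:space:]]*#/ { next }
               { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
               END { exit !found }
           ' "$file" 2>/dev/null; then
           return 0
       fi
       printf '%s %s\n' "$ip" "$name" >> "$file"
   }

   # Example usage (as root, with MASTER_IP set to the master node address):
   #   add_hosts_entry "$MASTER_IP" betif-difaet.jhub

The name is compared field by field rather than with a regular expression, so the dots in ``betif-difaet.jhub`` are matched literally.
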

^^^^^^^^^^^^^^^^^^^^
Customizing the jhub
^^^^^^^^^^^^^^^^^^^^

To customize the JupyterHub configuration, modify the ``values.yaml`` file in the Helm chart directory. This file contains the configuration options for JupyterHub, including authentication methods (currently the Einstein Telescope IAM instance), resource limits, and more.

Once you have made your changes to the ``values.yaml`` file, apply them by running:

.. code-block:: bash

   helm upgrade --install --cleanup-on-fail --namespace jhub jhub ./

---------------------------
Adding CVMFS to the cluster
---------------------------

To add CVMFS support to the Kubernetes cluster, a dedicated Helm chart is available in the `charts repository`_.

.. _charts repository: https://github.com/BETIF-DIFAET/charts

The installation can be done by following these steps:

1. Clone the charts repository (if not already done):

   .. code-block:: bash

      git clone https://github.com/BETIF-DIFAET/charts.git

2. Deploy the CVMFS service:

   .. code-block:: bash

      git clone -b release-2.0 https://github.com/BETIF-DIFAET/cvmfs-csi.git
      helm install cvmfs ./cvmfs-csi/deployments/helm/cvmfs-csi -f ./charts/stable/cvmfs/config.yaml -n jhub
      kubectl create -f ./charts/stable/cvmfs/volume-storageclass-pvc.yaml
      kubectl create -f ./charts/stable/cvmfs/cvmfs-idler-daemonset.yaml

^^^^^^^^^^^^^^^^^
Customizing cvmfs
^^^^^^^^^^^^^^^^^

To customize the CVMFS configuration, modify the ``config.yaml`` file in the ``charts/stable/cvmfs/`` directory. This file contains the configuration options for CVMFS.

^^^^^^^^^^^^^^^
Uninstall CVMFS
^^^^^^^^^^^^^^^

.. WARNING::
   Before uninstalling the Helm package, **first delete** the ``cvmfs-idler-daemonset`` resource to avoid corrupting the ``cvmfs-idler`` pods (which would otherwise end up in the Error state after receiving SIGKILL).

To uninstall the CVMFS Helm package, run:

.. code-block:: bash

   kubectl delete daemonset cvmfs-idler-daemonset -n jhub
   helm uninstall cvmfs -n jhub

---------------------------------
Adding GPU support to the cluster
---------------------------------

To add GPU support to the Kubernetes cluster, first complete the steps above to set up the GPU worker node. Once the node is ready, the `NVIDIA GPU Operator`_ is used to manage the GPU resources in the cluster. The dedicated Helm chart installation is available in the `charts repository`_.

.. _NVIDIA GPU Operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
.. _charts repository: https://github.com/BETIF-DIFAET/charts

The installation can be done by following these steps:

1. Clone the charts repository (if not already done):

   .. code-block:: bash

      git clone https://github.com/BETIF-DIFAET/charts.git

2. Deploy the NVIDIA GPU operator:

   .. code-block:: bash

      kubectl create namespace gpu-operator
      kubectl create -f ./charts/stable/gpu-operator/time-slicing-config.yaml
      kubectl create -f ./charts/stable/gpu-operator/gpu-operator.yaml

The GPU operator automatically detects the GPU on the worker node and manages its resources. The ``time-slicing-config.yaml`` file configures the GPU time-slicing feature, which allows multiple pods to share the same GPU.

.. NOTE::
   Currently, GPU time slicing is set to **10 replicas**, which means that up to 10 pods can share the same GPU. This setting can be adjusted in the ``time-slicing-config.yaml`` file.

^^^^^^^^^^^^^^^^^^^^^^
Uninstall GPU operator
^^^^^^^^^^^^^^^^^^^^^^

To uninstall the GPU operator Helm package, run:

.. code-block:: bash

   kubectl delete helmchart gpu-operator -n kube-system
   kubectl delete namespace gpu-operator

----------
References
----------

.. [RKE2] https://docs.rke2.io/
.. [HELM] https://helm.sh/
.. [GPU1] https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=RHEL&target_version=9&target_type=rpm_network
.. [GPU2] https://docs.rke2.io/advanced