Non-root Containers And Devices

Author: Mikko Ylinen (Intel)

The user/group ID related security settings in Pod's securityContext trigger a problem when users want to deploy containers that use accelerator devices (via Kubernetes Device Plugins) on Linux. In this blog post I talk about the problem and describe the work done so far to address it. It's not meant to be a long story about getting the k/k issue fixed.

Instead, this post aims to raise awareness of the issue and to highlight important device use-cases too. This is needed as Kubernetes works on new related features such as support for user namespaces.

Why non-root containers can't use devices and why it matters

One of the key security principles for running containers in Kubernetes is the principle of least privilege. The Pod/container securityContext specifies the config options to set, e.g., Linux capabilities, MAC policies, and user/group ID values to achieve this.

Furthermore, cluster admins are supported with tools like PodSecurityPolicy (deprecated) or Pod Security Admission (alpha) to enforce the desired security settings for pods that are being deployed in the cluster. These settings could, for instance, require that containers run with runAsNonRoot or forbid them from running with root's group ID in runAsGroup or supplementalGroups.

In Kubernetes, the kubelet builds the list of Device resources to be made available to a container (based on inputs from the Device Plugins) and the list is included in the CreateContainer CRI message sent to the CRI container runtime. Each Device contains little information: host/container device paths and the desired devices cgroups permissions.

The OCI Runtime Spec for Linux Container Configuration expects more detailed information about each device in addition to the devices cgroup fields:

        "type": "<string>",
        "path": "<string>",
        "major": <int64>,
        "minor": <int64>,
        "fileMode": <uint32>,
        "uid": <uint32>,
        "gid": <uint32>

The CRI container runtimes (containerd, CRI-O) are responsible for obtaining this information from the host for each Device. By default, the runtimes copy the host device's user and group IDs:

  • uid (uint32, OPTIONAL) - id of device owner in the container namespace.
  • gid (uint32, OPTIONAL) - id of device group in the container namespace.

Similarly, the runtimes prepare other mandatory config.json sections based on the CRI fields, including the ones defined in securityContext: runAsUser/runAsGroup, which become part of the POSIX-platform user structure via:

  • uid (int, REQUIRED) specifies the user ID in the container namespace.
  • gid (int, REQUIRED) specifies the group ID in the container namespace.
  • additionalGids (array of ints, OPTIONAL) specifies additional group IDs in the container namespace to be added to the process.

However, the resulting config.json triggers a problem when trying to run containers with both devices added and with non-root uid/gid set via runAsUser/runAsGroup: the container user process has no permission to use the device even when its group id (gid, copied from host) was permissive to non-root groups. This is because the container user does not belong to that host group (e.g., via additionalGids).
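The failure mode can be illustrated with a minimal sketch of the classic Unix owner/group/other permission check. It is a simplified model only: it ignores capabilities, ACLs, and the devices cgroup, and the gid 44 used for the device group is a hypothetical value, not taken from any particular node.

```python
def can_access(mode, owner_uid, owner_gid, uid, gid, additional_gids=(), want=0o6):
    """Simplified Unix permission check for a non-root process.

    mode is the permission part of the file mode (e.g. 0o660) and
    want is the requested access (0o6 means read + write).
    """
    if uid == owner_uid:
        granted = (mode >> 6) & 0o7   # owner bits apply
    elif owner_gid == gid or owner_gid in additional_gids:
        granted = (mode >> 3) & 0o7   # group bits apply
    else:
        granted = mode & 0o7          # "other" bits apply
    return (granted & want) == want


# Device node copied from the host: crw-rw---- root:<gid 44> (hypothetical gid).
# Container process: runAsUser=1000, runAsGroup=2000, no additionalGids.
print(can_access(0o660, 0, 44, uid=1000, gid=2000))                         # False
# Same process, but with the host device gid listed in additionalGids.
print(can_access(0o660, 0, 44, uid=1000, gid=2000, additional_gids=(44,)))  # True
```

The group bits of the device only help when the process actually carries the device's gid, which is exactly what is missing in the non-root case.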

Being able to run applications that use devices as a non-root user is normal and expected to work so that the security principles can be met. Therefore, several alternatives were considered to fill the gap with what PodSec/CRI/OCI supports today.

What was done to solve the issue?

You might have noticed from the problem definition that it would at least be possible to work around the problem by manually adding the device gid(s) to supplementalGroups, or, in the case of just one device, by setting runAsGroup to the device's group id. However, this is problematic because the device gid(s) may have different values depending on the distro/version of the nodes in the cluster. For example, with GPUs the following commands for different distros and versions return different gids:

Fedora 33:

$ ls -l /dev/dri/
total 0
drwxr-xr-x. 2 root root         80 19.10. 10:21 by-path
crw-rw----+ 1 root video  226,   0 19.10. 10:42 card0
crw-rw-rw-. 1 root render 226, 128 19.10. 10:21 renderD128
$ grep -e video -e render /etc/group

Ubuntu 20.04:

$ ls -l /dev/dri/
total 0
drwxr-xr-x 2 root root         80 19.10. 17:36 by-path
crw-rw---- 1 root video  226,   0 19.10. 17:36 card0
crw-rw---- 1 root render 226, 128 19.10. 17:36 renderD128
$ grep -e video -e render /etc/group

Which number to choose in your securityContext? Also, what if the runAsGroup/runAsUser values cannot be hard-coded because they are automatically assigned during pod admission time via external security policies?
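To see the mismatch on a given node, the group names can be resolved to numeric gids locally. The following is a minimal sketch using Python's standard grp module; the function name lookup_gids is made up for this illustration, and the result depends entirely on the node it runs on.

```python
import grp


def lookup_gids(group_names):
    """Map each group name to its numeric gid on this host (None if absent)."""
    gids = {}
    for name in group_names:
        try:
            gids[name] = grp.getgrnam(name).gr_gid
        except KeyError:
            # The group is not defined on this host.
            gids[name] = None
    return gids


# The values differ between distros and versions, which is why
# hard-coding them into a securityContext is fragile.
print(lookup_gids(["video", "render"]))
```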

Unlike volumes with fsGroup, the devices have no official notion of deviceGroup/deviceUser that the CRI runtimes (or kubelet) would be able to use. We considered using container annotations set by the device plugins (e.g., io.kubernetes.cri.hostDeviceSupplementalGroup/) to get custom OCI config.json uid/gid values. This would have required changes to all existing device plugins, which was not ideal.

Instead, a solution that is seamless to end-users without getting the device plugin vendors involved was preferred. The selected approach was to re-use runAsUser and runAsGroup values in config.json for devices:

        "type": "c",
        "path": "/dev/foo",
        "major": 123,
        "minor": 4,
        "fileMode": 438,
        "uid": <runAsUser>,
        "gid": <runAsGroup>

With the runc OCI runtime (in non-rootless mode), the device is created (mknod(2)) in the container namespace and the ownership is changed to runAsUser/runAsGroup using chown(2).

Having the ownership updated in the container namespace is justified as the user process is the only one accessing the device. Only runAsUser/runAsGroup are taken into account, and, e.g., the USER setting in the container is currently ignored.

While it is likely that the "faulty" deployments (i.e., non-root securityContext + devices) do not exist, an opt-in config entry was added to both containerd and CRI-O to be absolutely sure that no deployments break. The entry

device_ownership_from_security_context (bool)

defaults to false and must be enabled to use the feature.
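The selection logic can be summarized with a short sketch. It is an illustration of the behavior described above under simplifying assumptions, not the actual containerd or CRI-O code; the function name oci_device, its parameters, and the host gid 44 are all made up for the example.

```python
import json


def oci_device(host_device, run_as_user=None, run_as_group=None,
               ownership_from_security_context=False):
    """Return the OCI device entry for a container.

    host_device holds the values read from the host device node.
    By default the host uid/gid are copied as-is; with the opt-in
    enabled, runAsUser/runAsGroup (when set) replace them.
    """
    device = dict(host_device)  # copy, leave the input untouched
    if ownership_from_security_context:
        if run_as_user is not None:
            device["uid"] = run_as_user
        if run_as_group is not None:
            device["gid"] = run_as_group
    return device


# Hypothetical host device: owned by root, group gid 44, mode 0666 (438 decimal).
host = {"type": "c", "path": "/dev/foo", "major": 123, "minor": 4,
        "fileMode": 0o666, "uid": 0, "gid": 44}

print(json.dumps(oci_device(host, 1000, 2000), indent=2))        # host ids kept
print(json.dumps(oci_device(host, 1000, 2000, True), indent=2))  # 1000/2000 used
```

Note that fileMode is written in decimal in config.json, so 438 corresponds to the octal mode 0666.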

See non-root containers using devices after the fix

To demonstrate the new behavior, let's use a Data Plane Development Kit (DPDK) application using hardware accelerators, Kubernetes CPU manager, and HugePages as an example. The cluster runs containerd with:

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    device_ownership_from_security_context = true

or CRI-O with:

[crio.runtime]
device_ownership_from_security_context = true

and the Guaranteed QoS Class Pod that runs DPDK's crypto-perf test utility with this YAML:

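metadata:
  name: qat-dpdk
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 2000
    fsGroup: 3000
  containers:
  - name: crypto-perf
    image: intel/crypto-perf:devel
    resources:
      requests:
        cpu: "3"
        memory: "128Mi"
        <device resource name>: '4'
        hugepages-2Mi: "128Mi"
      limits:
        cpu: "3"
        memory: "128Mi"
        <device resource name>: '4'
        hugepages-2Mi: "128Mi"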
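The requests and limits are repeated with identical values because that is what places the Pod in the Guaranteed QoS class. A simplified sketch of that rule follows; it compares quantities as plain strings (so "3" and "3000m" would be treated as different), and the function name is_guaranteed is made up for this illustration.

```python
def is_guaranteed(containers):
    """Simplified check: every container sets cpu and memory limits,
    and any requests given for them equal the limits."""
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        for name in ("cpu", "memory"):
            if name not in limits:
                return False
            # A missing request defaults to the limit.
            if requests.get(name, limits[name]) != limits[name]:
                return False
    return True


# cpu and memory values taken from the Pod spec above.
crypto_perf = {
    "name": "crypto-perf",
    "resources": {
        "requests": {"cpu": "3", "memory": "128Mi", "hugepages-2Mi": "128Mi"},
        "limits": {"cpu": "3", "memory": "128Mi", "hugepages-2Mi": "128Mi"},
    },
}
print(is_guaranteed([crypto_perf]))  # True
```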

To verify the results, check the user and group ID that the container runs as:

$ kubectl exec -it qat-dpdk -c crypto-perf -- id

They are set to non-zero values as expected:

uid=1000 gid=2000 groups=2000,3000

Next, check that the device nodes (the requested device resource exposes /dev/vfio/ devices) are accessible to runAsUser/runAsGroup:

$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/vfio
total 0
drwxr-xr-x 2 root root      140 Sep  7 10:55 .
drwxr-xr-x 7 root root      380 Sep  7 10:55 ..
crw------- 1 1000 2000 241,   0 Sep  7 10:55 58
crw------- 1 1000 2000 241,   2 Sep  7 10:55 60
crw------- 1 1000 2000 241,  10 Sep  7 10:55 68
crw------- 1 1000 2000 241,  11 Sep  7 10:55 69
crw-rw-rw- 1 1000 2000  10, 196 Sep  7 10:55 vfio

Finally, check that the non-root container is also allowed to create HugePages:

$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/hugepages/

fsGroup gives runAsUser a writable HugePages emptyDir mountpoint:

total 0
drwxrwsr-x 2 root 3000   0 Sep  7 10:55 .
drwxr-xr-x 7 root root 380 Sep  7 10:55 ..

Help us test it and provide feedback!

The functionality described here is expected to help with cluster security and the configurability of device permissions. Allowing non-root containers to use devices requires cluster admins to opt in to the functionality by setting device_ownership_from_security_context = true. To make it a default setting, please test it and provide your feedback (via SIG-Node meetings or issues)! The flag is available in the CRI-O v1.22 release and queued for containerd v1.6.

More work is needed to get it properly supported. It is known to work with runc, but it also needs to be made to function with other OCI runtimes, where applicable. For instance, Kata Containers supports device passthrough, which allows it to make devices available to containers in VM sandboxes too.

Moreover, an additional challenge comes with supporting user namespaces together with devices. This problem is still open and requires more brainstorming.

Finally, it needs to be understood whether runAsUser/runAsGroup are enough or if device-specific settings similar to fsGroup are needed in PodSpec/CRI v2.


My thanks go to Mike Brown (IBM, containerd), Peter Hunt (Red Hat, CRI-O), and Alexander Kanevskiy (Intel) for providing all the feedback and good conversations.