- Seccomp in Docker
- Seccomp in Kubernetes
- Using Custom Seccomp Profile
- Reject all syscalls made by the container
Seccomp (Secure Computing Mode) is a Linux kernel feature that allows you to restrict the system calls available to a process. It provides a way to create a secure execution environment by limiting the set of allowed system calls, reducing the attack surface of a program.
Simply put, we can limit the syscalls that programs can use.
To check whether the kernel supports Seccomp, look for the CONFIG_SECCOMP option in the boot config file.
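The check can be scripted. The following is a minimal sketch, assuming the config file lives at /boot/config-<kernel release> (some distributions expose /proc/config.gz instead); the helper name `seccomp_options` is illustrative only.

```python
import platform
from pathlib import Path


def seccomp_options(config_text: str) -> dict:
    """Return the CONFIG_SECCOMP* options that are set in kernel config text."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_SECCOMP is not set" start with "#" and are skipped.
        if line.startswith("CONFIG_SECCOMP") and "=" in line:
            key, _, value = line.partition("=")
            options[key] = value
    return options


if __name__ == "__main__":
    # Assumed location of the boot config file for the running kernel.
    config_path = Path(f"/boot/config-{platform.release()}")
    try:
        print(seccomp_options(config_path.read_text()))
    except OSError as err:
        print(f"cannot read {config_path}: {err}")
```

A kernel with Seccomp support is expected to report CONFIG_SECCOMP=y, usually together with CONFIG_SECCOMP_FILTER=y.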
Docker containers have a built-in Seccomp filter that is applied when a container is created, provided that the host kernel has Seccomp enabled.
Below is a simplified snippet of Docker's default Seccomp profile. The default profile denies any syscall that is not explicitly listed, and the complete profile includes a far more extensive list of allowed syscalls.
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64"
  ],
  "syscalls": [
    {
      "name": "accept4",
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "name": "access",
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "name": "adjtimex",
      "action": "SCMP_ACT_ALLOW"
    },
    // ... additional syscalls ...
  ]
}
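The kernel reports the Seccomp state of a process (the Seccomp field in /proc/<pid>/status) as one of these values:
- "0" - Disabled
- "1" - Strict Mode (only read, write, _exit and sigreturn are permitted)
- "2" - Filter Mode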
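To see which value applies to a running process, the Seccomp field of /proc/<pid>/status can be read directly. The sketch below assumes a Linux host and uses the kernel's documented values for that field (0 disabled, 1 strict, 2 filter); the function name `seccomp_mode` is illustrative.

```python
from pathlib import Path

# Values of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str):
    """Return the seccomp mode name found in /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" is a different field and does not match this prefix.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return None


if __name__ == "__main__":
    try:
        print(seccomp_mode(Path("/proc/self/status").read_text()))
    except OSError as err:  # for example, when not running on Linux
        print(f"cannot read /proc/self/status: {err}")
```

Run inside a container, this gives a quick way to confirm whether a filter is actually in place.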
In filter mode, Seccomp allows or denies system calls based on a predefined filter set by the process.
- The filter is typically a set of rules specifying which system calls are permitted and which are denied.
- When a process makes a system call, the kernel applies the action the filter returns for it, such as allowing the call, failing it with an error, logging it, or killing the process.
Example of a Seccomp Filter in JSON format:
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [ "SCMP_ARCH_X86_64" ],
  "syscalls": [
    { "name": "read", "action": "SCMP_ACT_ALLOW" },
    { "name": "write", "action": "SCMP_ACT_ALLOW" },
    { "name": "exit", "action": "SCMP_ACT_ALLOW" }
  ]
}
User notification is an action available in filter mode rather than a separate mode: a filter can return SECCOMP_RET_USER_NOTIF to hand a system call to a supervisor process instead of deciding on it in the kernel.
- The supervisor receives the notification over a file descriptor (not a signal) and decides how to respond, for example by letting the system call proceed or by failing it with an error.
- This is useful for processes that want to monitor or control certain system calls.
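Example of installing a filter that requests user notifications, in C (a minimal sketch; it requires Linux 5.0 or later and omits the architecture check a real filter needs):
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
int main(void) {
    /* Forward getpid() to a supervisor; allow every other syscall. */
    struct sock_filter insns[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = { .len = 4, .filter = insns };
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); /* needed without CAP_SYS_ADMIN */
    int fd = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
    if (fd < 0) { perror("seccomp"); return 1; }
    /* ... pass fd to a supervisor, which reads notifications from it; then run your application code ... */
    return 0;
}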
Seccomp profiles define the set of system calls allowed for a process.
- Profiles can be strict, allowing only a predefined set of syscalls.
- They can also use a filter expression.
- They can be written by hand or adapted from an existing profile, such as Docker's default profile.
A Seccomp profile consists of three parts:
- Default Action - the action applied to any syscall that is not listed in the syscalls array.
- Architectures - the CPU architectures the profile applies to.
- Syscalls Array - a list of syscall names and the action associated with each.
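How the three parts interact can be modelled in a few lines. This is only a simplified sketch of the lookup (the kernel evaluates a compiled BPF program, not JSON): a syscall gets the action of the rule that names it, otherwise the default action. The function `action_for` is hypothetical, and argument conditions ("args") are ignored.

```python
def action_for(profile: dict, syscall: str) -> str:
    """Return the action a profile assigns to a syscall name.

    Accepts both the singular "name" key and the "names" list form.
    Argument conditions ("args") are ignored in this simplified model.
    """
    for rule in profile.get("syscalls", []):
        names = list(rule.get("names", []))
        if "name" in rule:
            names.append(rule["name"])
        if syscall in names:
            return rule["action"]
    # Anything not listed falls through to the default action.
    return profile["defaultAction"]


profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [
        {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"},
        {"name": "exit", "action": "SCMP_ACT_ALLOW"},
    ],
}

print(action_for(profile, "read"))   # SCMP_ACT_ALLOW
print(action_for(profile, "mount"))  # SCMP_ACT_ERRNO
```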
Here's an example of a simple seccomp profile in JSON format that allows only a few basic syscalls:
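{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [ "SCMP_ARCH_X86_64" ],
  "syscalls": [
    { "name": "read", "action": "SCMP_ACT_ALLOW" },
    { "name": "write", "action": "SCMP_ACT_ALLOW" },
    { "name": "exit", "action": "SCMP_ACT_ALLOW" },
    { "name": "exit_group", "action": "SCMP_ACT_ALLOW" }
  ]
}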
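Note that there are two types of profiles:
- Whitelist - Allows the defined syscalls and denies the rest (the default action is a deny action such as SCMP_ACT_ERRNO).
- Blacklist - Rejects the defined syscalls and allows the rest (the default action is SCMP_ACT_ALLOW).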
Below is an example:
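The difference between the two types comes down to which action is the default and which is attached to the listed syscalls. The following sketch generates one profile of each kind; the helper names (`build_profile`, `whitelist`, `blacklist`) and the chosen syscalls are illustrative assumptions, not a recommended policy.

```python
import json


def build_profile(default_action, rule_action, names):
    """Build a profile dict with a single rule covering `names`."""
    return {
        "defaultAction": default_action,
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": sorted(names), "action": rule_action}],
    }


def whitelist(names):
    """Allow only `names`; every other syscall fails with an error."""
    return build_profile("SCMP_ACT_ERRNO", "SCMP_ACT_ALLOW", names)


def blacklist(names):
    """Reject `names`; every other syscall is allowed."""
    return build_profile("SCMP_ACT_ALLOW", "SCMP_ACT_ERRNO", names)


if __name__ == "__main__":
    print(json.dumps(whitelist(["read", "write", "exit_group"]), indent=2))
    print(json.dumps(blacklist(["mount", "reboot"]), indent=2))
```

A whitelist is generally the safer choice, because a syscall that was forgotten is denied rather than silently allowed.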
We can create a custom Seccomp profile and use it when running containers. The list below is only a starting point: add more syscalls as needed, since a real container needs many more (such as execve) to start. Keep the file free of comments, because JSON does not allow them.
## custom.json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    { "name": "read", "action": "SCMP_ACT_ALLOW" },
    { "name": "write", "action": "SCMP_ACT_ALLOW" },
    { "name": "exit", "action": "SCMP_ACT_ALLOW" },
    { "name": "exit_group", "action": "SCMP_ACT_ALLOW" },
    { "name": "open", "action": "SCMP_ACT_ALLOW" },
    { "name": "close", "action": "SCMP_ACT_ALLOW" },
    { "name": "fstat", "action": "SCMP_ACT_ALLOW" },
    { "name": "arch_prctl", "action": "SCMP_ACT_ALLOW" },
    { "name": "brk", "action": "SCMP_ACT_ALLOW" },
    { "name": "munmap", "action": "SCMP_ACT_ALLOW" },
    { "name": "mmap", "action": "SCMP_ACT_ALLOW" }
  ]
}
To use the Seccomp profile, pass it when running the container.
docker run \
--security-opt seccomp=/path/to/custom.json \
-it ubuntu:latest
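A malformed profile makes the container fail to start, so it can help to sanity-check the file first. The sketch below is a hypothetical pre-flight check, not part of Docker: it verifies that the text parses as JSON and that every action is one of the action names defined by the OCI runtime specification.

```python
import json

# Action names defined by the OCI runtime specification.
KNOWN_ACTIONS = {
    "SCMP_ACT_ALLOW", "SCMP_ACT_ERRNO", "SCMP_ACT_KILL", "SCMP_ACT_KILL_PROCESS",
    "SCMP_ACT_KILL_THREAD", "SCMP_ACT_LOG", "SCMP_ACT_NOTIFY", "SCMP_ACT_TRACE",
    "SCMP_ACT_TRAP",
}


def check_profile(text: str) -> list:
    """Return a list of problems found in seccomp profile JSON text."""
    try:
        profile = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg} (line {err.lineno})"]
    if not isinstance(profile, dict):
        return ["top-level value must be a JSON object"]

    # Collect every action together with a label saying where it was found.
    actions = [("defaultAction", profile.get("defaultAction"))]
    for index, rule in enumerate(profile.get("syscalls", [])):
        actions.append((f"syscalls[{index}].action", rule.get("action")))

    return [
        f"{where}: unknown action {action!r}"
        for where, action in actions
        if action not in KNOWN_ACTIONS
    ]


if __name__ == "__main__":
    print(check_profile('{"defaultAction": "SCMP_ACT_ERRNO", "syscalls": []}'))
```

An empty list means no problems were found by these two checks; it does not guarantee that the container runtime will accept the profile.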
We can also tell Docker to run the container without any Seccomp profile at all:
docker run \
--security-opt seccomp=unconfined \
-it ubuntu:latest
With this option, Seccomp no longer filters any of the syscalls made from within the container.
This is NOT RECOMMENDED.
Below is an example of a Docker container. This container displays the runtime used and the list of blocked syscalls.
Now, if we try to run a pod using the image, we'll see a different output.
From the pod logs above, we see that there are fewer blocked syscalls and that Seccomp is reported as disabled. This is because Kubernetes doesn't apply a Seccomp profile by default.
To implement Seccomp in the Pod, specify it as a Security Context in the Pod definition file.
### pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: my-container
      image: nginx:latest
      securityContext:
        allowPrivilegeEscalation: false
  # Add more containers or configurations if needed
If we want to use a custom Seccomp profile, we can also specify it in the Pod definition file.
But first, ensure that the /var/lib/kubelet/seccomp/profiles/ directory exists on the node. Inside this directory, create the audit.json file.
## audit.json
{
"defaultAction": "SCMP_ACT_LOG"
}
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/audit.json
  containers:
    - name: my-container
      image: nginx:latest
      securityContext:
        allowPrivilegeEscalation: false
  # Add more containers or configurations if needed
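The localhostProfile value is a path relative to the kubelet's Seccomp directory, which is why the manifest says profiles/audit.json rather than the full path. The sketch below shows the mapping, assuming the kubelet uses its default root directory of /var/lib/kubelet; the function name `profile_path_on_node` is illustrative.

```python
from pathlib import PurePosixPath

# Assumes the kubelet's default root directory (/var/lib/kubelet).
KUBELET_SECCOMP_ROOT = PurePosixPath("/var/lib/kubelet/seccomp")


def profile_path_on_node(localhost_profile: str) -> PurePosixPath:
    """Map a localhostProfile value to the file the node is expected to hold."""
    relative = PurePosixPath(localhost_profile)
    # The value must stay inside the kubelet's seccomp directory.
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"localhostProfile must be a relative path: {localhost_profile}")
    return KUBELET_SECCOMP_ROOT / relative


print(profile_path_on_node("profiles/audit.json"))
# /var/lib/kubelet/seccomp/profiles/audit.json
```

The file has to exist at that location on every node where the pod may be scheduled.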
Once the pod is created, syscalls made by the container in the pod are logged in the /var/log/syslog file.
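The logged entries can be summarised with a short script. The sketch below assumes each relevant log line contains a field of the form syscall=<number>, as kernel audit records for Seccomp typically do; the sample lines are made up for illustration and are not real output.

```python
import re
from collections import Counter

# Matches the numeric syscall field of an audit record, e.g. "syscall=257".
SYSCALL_RE = re.compile(r"\bsyscall=(\d+)\b")


def count_syscalls(log_lines) -> Counter:
    """Count how often each syscall number appears in the given log lines."""
    counts = Counter()
    for line in log_lines:
        match = SYSCALL_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Illustrative lines only; real entries carry more fields.
sample_lines = [
    'audit: type=1326 pid=1234 comm="nginx" arch=c000003e syscall=257 compat=0',
    'audit: type=1326 pid=1234 comm="nginx" arch=c000003e syscall=0 compat=0',
    'audit: type=1326 pid=1234 comm="nginx" arch=c000003e syscall=257 compat=0',
    "unrelated log line",
]

print(count_syscalls(sample_lines).most_common())
```

To run it against a real node, pass the lines of /var/log/syslog (or the output of a grep for the audit records) instead of `sample_lines`.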
From the syslog output above, we can see the syscall numbers used by the container in the pod. Note that these numbers map to specific syscall names, which we can look up in /usr/include/asm/unistd_64.h.
Below are just some of the syscall numbers and their corresponding syscall names.
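The mapping can also be built programmatically from the header. This is a minimal sketch that assumes the header uses lines of the form `#define __NR_<name> <number>`; the path is the one mentioned above and may differ by distribution (for example /usr/include/x86_64-linux-gnu/asm/unistd_64.h on Debian-based systems).

```python
import re
from pathlib import Path

# Matches lines such as "#define __NR_read 0".
DEFINE_RE = re.compile(r"^#define[ \t]+__NR_(\w+)[ \t]+(\d+)[ \t]*$", re.MULTILINE)


def syscall_table(header_text: str) -> dict:
    """Return a {number: name} mapping parsed from unistd_64.h text."""
    return {int(number): name for name, number in DEFINE_RE.findall(header_text)}


if __name__ == "__main__":
    header = Path("/usr/include/asm/unistd_64.h")  # assumed location
    try:
        table = syscall_table(header.read_text())
    except OSError as err:
        print(f"cannot read {header}: {err}")
    else:
        for number in (0, 1, 257):
            print(number, table.get(number, "unknown"))
```

On x86_64, numbers 0, 1 and 257 correspond to read, write and openat.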
We can create another custom profile which will reject all syscalls made by the container. Create the violation.json.
## /var/lib/kubelet/seccomp/profiles/violation.json
{
  "defaultAction": "SCMP_ACT_ERRNO"
}
We can then specify this profile in the Pod definition file.
apiVersion: v1
kind: Pod
metadata:
  name: test-violation
spec:
  restartPolicy: Never
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/violation.json
  containers:
    - name: my-container
      image: nginx:latest
      securityContext:
        allowPrivilegeEscalation: false
Once we apply the manifest and check the pod, we'll see that the pod has a "ContainerCannotRun" status.