Why add extra security to containers?
Containers are an awesome way to package and design complex applications with many dependencies and interconnections. They are also lightweight in contrast to Virtual Machines because they use the kernel of the host instead of their own. The drawback is security: the closer you are to the host, the less isolation you have. Docker makes use of namespaces and cgroups to isolate each container from the others and, more importantly, from the host. That’s an additional layer compared to “raw” applications installed directly on the host, and this layer can be hardened to make container escape, or privilege escalation inside the container, harder. Let’s see some practical best practices, using the wordpress:apache image as an example.
Running as an unprivileged user
You should never run anything as root inside a container, especially the services that are exposed on the network. If an attacker exploits a vulnerability in a process running as root, they directly gain root privileges. Do we have to give them that pleasure? Of course not.
Let’s fire up a wordpress:apache container and have a look at the processes:
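A minimal way to do this (the container name and port mapping are illustrative choices):

```shell
# Start a wordpress:apache container in the background
docker run -d --name wp -p 8080:80 wordpress:apache

# List the processes running inside it, with their users
docker top wp
```

Typically the apache2 master process (PID 1 inside the container) shows up as root while the worker processes show up as www-data.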
We can see that the first process (PID 1) is running as root while the workers are running as the unprivileged www-data user. Since the apache2 workers are directly connected to clients (and potentially attackers), it’s a good practice to drop root privileges before running them. It’s good but not perfect, and it might not match your case (remember we use wordpress:apache as an example): the general best practice is to run the first process as an unprivileged user too. The problem is that we cannot choose just any UID/GID, because such a user won’t be able to impersonate the www-data user: we must run the first process as www-data itself.
We can extract the UID/GID of www-data, adjust the rights of our volume and fire up another container with the --user uid:gid parameter:
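A sketch of those steps, assuming a Debian-based image (where www-data is usually UID/GID 33) and a hypothetical host directory /srv/wordpress used as the persistent volume:

```shell
# Extract the UID/GID of www-data from the image
docker run --rm wordpress:apache id www-data

# Give www-data ownership of the persistent volume on the host
sudo chown -R 33:33 /srv/wordpress

# Run the first process directly as www-data with --user uid:gid
docker run -d --name wp-user --user 33:33 \
  -v /srv/wordpress:/var/www/html wordpress:apache

# Verify: every process should now run as www-data
docker exec wp-user ps aux
```

Verify the actual UID/GID with the `id` command above rather than trusting 33 blindly.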
As we can see, the master process and even the ps command are now running as www-data. Please note that you can also use the --user parameter with the docker exec command (e.g.: docker exec --user 0:0 …).
Keeping only necessary capabilities
Linux capabilities are groups of privileged kernel features that a process can be granted. We can choose the list of capabilities a process will be allowed to use. Here are some capabilities as examples:
- CAP_CHOWN: UID/GID changes on files
- CAP_NET_BIND_SERVICE: bind to TCP/UDP ports lower than 1024
- CAP_SYS_PTRACE: debug a remote process using ptrace
By default, Docker runs containers with a subset of the capabilities, which is a good thing but can be better. The general best practice here is to drop all the default capabilities and then allow only the ones that are essential (a whitelisting approach).
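One quick way to inspect the capability sets a containerized process actually holds (a check, not a hardening step):

```shell
# Print the capability bitmasks of a process inside the container
docker run --rm wordpress:apache grep Cap /proc/self/status

# Decode the CapBnd bitmask on the host with capsh, e.g.:
# capsh --decode=<hex value from the CapBnd line>
```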
The --cap-drop=all parameter lets you drop all the default capabilities. Let’s try it:
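For example (the container may fail to start, which is exactly what the next paragraph addresses):

```shell
# Drop every default capability
docker run -d --name wp-nocaps --cap-drop=all wordpress:apache

# Inspect the logs for capability-related errors
docker logs wp-nocaps
```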
If you encounter errors, you may need to manually edit some configuration files or the Dockerfile to build a new image that doesn’t need extra capabilities. Another option is to grant specific capabilities using the --cap-add parameter. Here is a dummy example:
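A dummy whitelist for wordpress:apache could look like this (the exact list is an assumption; trim it down based on the errors you observe):

```shell
docker run -d --name wp-caps \
  --cap-drop=all \
  --cap-add=CHOWN \
  --cap-add=SETUID \
  --cap-add=SETGID \
  --cap-add=NET_BIND_SERVICE \
  wordpress:apache
```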
User namespace remapping
User namespace remapping lets you map the UID/GID used inside the container to another UID/GID on the host. This is particularly useful when there is no solution other than running a container as root. Please note that you can even map UID/GID to non-existent users/groups on the host.
First of all, check that the /etc/subuid and /etc/subgid files exist and look like this (replace user with the username of an existing user on your host):
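Each entry maps a range of subordinate IDs to a user; a typical pair of entries looks like this:

```
# /etc/subuid
user:100000:65536

# /etc/subgid
user:100000:65536
```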
With this example, all UID/GID inside the container will be mapped to UID/GID starting at 100000 on the host. For example, UID/GID 33 inside the container will be mapped to 100033 (100000 + 33) on the host.
Next, you need to edit the /etc/docker/daemon.json file (replace user with the same username used in /etc/subuid and /etc/subgid) and restart Docker to activate the change:
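A minimal /etc/docker/daemon.json for this (the user value must match the name used in /etc/subuid and /etc/subgid):

```json
{
  "userns-remap": "user"
}
```

Then restart the daemon, e.g. with `sudo systemctl restart docker`. Note that existing containers and images are not visible under the remapped namespace, since Docker switches to a separate data directory.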
Let’s adjust the permissions of the persistent data volume, run a new container and check that the UID/GID are remapped:
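Continuing the earlier example (UID 33 inside the container becomes 100000 + 33 = 100033 on the host; /srv/wordpress is an illustrative path):

```shell
# Adjust the volume ownership to the remapped UID/GID
sudo chown -R 100033:100033 /srv/wordpress

# Run a new container; inside it, www-data is still UID 33
docker run -d --name wp-remap \
  -v /srv/wordpress:/var/www/html wordpress:apache

# On the host, the same processes appear under the remapped UIDs
ps -eo user,pid,cmd | grep apache2
```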
Read-only root filesystem
When we use containers, most of the time we can distinguish two types of data: persistent (important) and temporary. The persistent data contains the useful things (e.g.: database, website folder, …) that need to be reused when starting a new container. On the other hand, the temporary data doesn’t contain anything “useful” and can be removed when we start a new container. We know that attackers like to (re)write things to the filesystem (e.g.: exploits, backdoors, …). What about making the whole container read-only (or at least as much as possible) to make that harder for them?
This is the whole point of the --read-only parameter when using the docker run command:
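A sketch for wordpress:apache (the tmpfs list is an assumption; apache2 needs at least a writable /run and /tmp on Debian-based images):

```shell
docker run -d --name wp-ro \
  --read-only \
  --tmpfs /run \
  --tmpfs /tmp \
  -v /srv/wordpress:/var/www/html \
  wordpress:apache

# tmpfs mounts are created with the noexec and nosuid flags by default
docker exec wp-ro mount | grep tmpfs
```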
The additional --tmpfs parameters let you define directories where you need write access for temporary files. The right list highly depends on the image used, and some research might be needed to find it. As you can see, by default the noexec flag is present on these writable directories, which means an attacker won’t be able to run malicious executable files from there.
Here is a list of things you should avoid when running containers in production, especially if they are in front of clients:
- Running as privileged (--privileged)
- Mounting the Docker socket (-v /var/run/docker.sock)
- Mounting the host filesystem (-v /)
- Using the host networking devices (--network host)