Surprising subtleties of Docker permissions

by Ash Zahlen on May 14, 2019

The fundamental building block of our analysis platform is an analyzer. Since the static analysis world works in many different languages and can require many different libraries, each analyzer is its own Docker image, and the Dockerfile is provided by the analysis author. We provide the analyzer's inputs in the /analysis/inputs folder (where the list of inputs is determined by a manifest file), and once the image has finished running, we look for its output in /analysis/output. Usually, we do this by bind-mounting a directory on the host to the /analysis folder; when we run in our CI environment on Circle, we have to fall back to docker cp since Circle's docker-in-docker solution uses a remote docker daemon, meaning that the image isn't necessarily running on the same machine as the code that launched it.

This seems like it'd work, and for a while, it did. But when we started running our client on Linux hosts, we ran into weird issues related to filesystem permissions.

A digression into POSIX filesystem permissions

Before getting into detail, let's explore the typical POSIX filesystem access control model. This model is shared by macOS, BSD, Linux, and other similar operating systems (notably, not Windows). If you're familiar with how they work, including what write and execute permissions mean on a directory, you can skip to the next section.

Each file has an owner, stored as a number known as a user ID (UID), and a group, similarly stored as a group ID (GID). The permissions entry for a file controls who can read, write, or execute the file, and this can be controlled separately for the owner, for users in the file's group, and for all other users (AKA 'other'). This is typically represented in a form like rwxr-x---, where the first triple corresponds to the owner, the second corresponds to the group, and the last triple corresponds to all other users. So for this example file, the owner can read, write, and execute it (rwx); users in the file's group can read and execute, but not write (r-x); and other users can't read, write, or execute it (---).
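To make this concrete, here's a small standard-library Python sketch that sets the example rwxr-x--- mode on a scratch file and reads it back:

```python
import os
import stat
import tempfile

# Create a scratch file and give it the example permissions: rwx r-x ---
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o750)  # octal 750 == rwxr-x---

mode = os.stat(path).st_mode
print(stat.filemode(mode))  # -rwxr-x---  (the leading '-' means 'regular file')

os.remove(path)
```

The octal digits map directly onto the triples: 7 (rwx) for the owner, 5 (r-x) for the group, 0 (---) for other.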

This all makes sense for files, but what about directories? It turns out that the answer there is a bit subtle. For directories:

  • Read permissions allow a user to list the names of all the files in a directory.
  • Write permissions allow the user to add, delete, and rename files in the directory, but only if the execute bit is also set.
  • Execute permissions allow traversing the directory: cd-ing into it and accessing the files inside it by name.

The exact semantics of the various combinations are complicated, but the important thing is that in order to delete a file, you need write (and execute) permission on the directory containing it; owning the directory also suffices, since an owner can always chmod it to grant themselves write permission.
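One consequence worth seeing in action: deleting a file is an operation on its directory, not on the file itself, so even a file with no permissions at all can be removed as long as the directory allows it. A quick standard-library Python demonstration:

```python
import os
import tempfile

d = tempfile.mkdtemp()               # a writable scratch directory
victim = os.path.join(d, "victim")
open(victim, "w").close()
os.chmod(victim, 0o000)              # strip every permission from the file itself

os.remove(victim)                    # still succeeds: unlink only checks the directory
print(os.path.exists(victim))        # False
os.rmdir(d)
```

This is exactly why the ownership of the *directories* an analyzer writes, not just the files, turns out to matter below.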

The problem

Our Docker images all use some flavor of Linux as a base, and when running a Linux image on a Linux host with a bind mount, files have the same owner, group, and permissions within the image and on the host machine. And since we don't control what analyzer images do, an analyzer could very well be running as root inside its image. Which means that on the host, the output files will be owned by root. And so, if the output contains a nonempty directory, we won't be able to delete it afterwards.

To see why, suppose that the mounted path on the host is /tmp/data, which is owned by the user running our CLI (who we'll call Alice). Then, suppose when Alice runs the analyzer, it outputs a file located at /analysis/output/foo/bar inside the image, and that this file and its containing directory are both owned by root.

Then on the host, we'll have a directory /tmp/data/foo that's owned by root, and a file inside it named bar that's also owned by root.

Now, in order to delete /tmp/data, we'll first have to delete /tmp/data/foo, and in turn that requires deleting /tmp/data/foo/bar, since you can't delete a nonempty directory. But /tmp/data/foo, the directory containing bar, is owned by the root user, not Alice, and we can't rely on it granting write permission to others, so we can't delete the file inside it!

This didn't show up as a problem earlier for us for two reasons:

  • Most of our developers use macOS as their daily workstations. Docker on macOS doesn't map filesystem permissions/ownership in the same way; instead, the way it all shakes out is that everything will be automatically owned by the user that ran our CLI, so we don't have any problems to begin with.
  • Even in cases where we were running on Linux, most of our analyzers at the time would only output a single output.json file, which gets mapped onto /tmp/data/output.json on the host. And that doesn't cause problems, since the CLI user owns /tmp/data and you can delete files in directories that you own.

What doesn't work

Running as the CLI user

Docker has options for running the entrypoint/command as a given user, so one might think that we could just get the user running the r2c CLI and run the entrypoint as that user. But an analysis image might install various software as a user; for example, many of our own analyzers create an analysis user and install software from NPM as that user. And we don't want to require that analysis authors make sure everything they set up internally is world-readable.

This also has another problem: while the UID and GID are shared between the Docker container and the host, user names aren't. This is because the mapping between UIDs and usernames is stored in the /etc/passwd file, which isn't shared between the host and the Docker container. So if software tries to look up the name of the current user, it'll fail, which can have surprising and/or amusing effects:

$ me=$(id -u) # get the ID of the host's current user
$ echo $me
501
$ docker run --user $me debian:latest whoami
whoami: cannot find name for user ID 501
$ docker run -it --user $me debian:latest /bin/bash
I have no name!@4c4fb54624c4:/$

Since running without a valid username is a rarely exercised path in most software, we expected this might trigger interesting edge cases in external software. In combination with the burden it'd place on analysis authors, this led us to reject the approach.
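If you do ship software that might run as a UID with no passwd entry, the defensive pattern is straightforward. Here's a hedged Python sketch (the fallback of returning the numeric UID is our own choice, not any convention):

```python
import os
import pwd

def current_username() -> str:
    """Return the current user's name, falling back to the numeric UID
    when /etc/passwd has no matching entry (as can happen inside a
    container started with --user <host-uid>)."""
    uid = os.getuid()
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)  # no passwd entry for this UID

print(current_username())
```

Tools like whoami fail outright here because they have no such fallback.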

Pass the UID/GID into the docker image

Docker has support for passing build arguments into an image at build time (via ARG in the Dockerfile and --build-arg on the command line). So we could pass the UID, GID, and username of the user running the CLI into the image and require that analysis authors use that UID, GID, and username when setting up their image. However, this would mean that the end user of the analysis image would have to rebuild it the first time they run it, and that's a bad user experience. It also puts additional burden on analysis authors, and bugs related to users failing to do this would be hard to track down.
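For illustration, the rejected approach would have looked something like this in an analyzer's Dockerfile (the argument and user names here are hypothetical, and this assumes a base image with groupadd/useradd available):

```dockerfile
# Hypothetical sketch of the rejected approach: bake the host user's
# identity into the image at build time via build arguments.
ARG HOST_UID=1000
ARG HOST_GID=1000
ARG HOST_USER=analysis

RUN groupadd -g "${HOST_GID}" "${HOST_USER}" \
 && useradd -m -u "${HOST_UID}" -g "${HOST_GID}" "${HOST_USER}"
USER ${HOST_USER}
```

The end user would then have to run something like docker build --build-arg HOST_UID=$(id -u) themselves, which is exactly the rebuild-before-first-run experience described above.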

Always use docker cp

Instead of using bind mounts, we could just use volumes and then use docker cp to copy data out of the filesystem. We're already doing this in the event of a remote Docker daemon, such as if we're running in CI; in that case, you can't use a bind mount since the Docker daemon won't even be on the same physical machine! But volume mounts are less performant than bind mounts, and some of our analyzers can output hundreds of megabytes, or even gigabytes, of data.

What does work

Eventually, we realized something: you can use Docker with bind mounts to change permissions! Specifically, a command like

docker run --rm --mount type=bind,source=/host/path,target=/vol alpine:3.9 chmod ugo+rwx /vol/rand

lets us change the permissions of a file on the host by mounting it inside a Docker image and then chmodding it. (Here, ugo+rwx means 'add read/write/execute permissions for the user, group, and other'.) So we can just run that on the analyzer image's input before it starts and on the image's output after it finishes.

... except in the case where we're using a remote Docker daemon. In that case, bind mounts won't work, so this command doesn't work. Not only that, we have a different problem: docker cp makes the files inside the image owned by root by default and preserves their permissions, and since in many cases the output of the previous analyzer won't be world-writable, the analysis author won't be able to write to their input. And some analyses want to be able to do things like npm install in their input, which requires the ability to write to it.

Fortunately, in this case, we can do something slightly different:

  1. Create a volume and copy the data from the host into the volume.
  2. Run the chmod ugo+rwx command, but mounting the volume we just created instead of trying to bind-mount.
  3. Run the analyzer with that volume mounted in the usual place.
  4. Fix up the permissions again as usual.
  5. Copy the files out, and delete the temporary volume we used for all of this.

And since both the local Docker and remote Docker cases follow the same pattern, we can abstract all of this in a DockerFileManager interface, instantiate one based on whether we're running local Docker or remote Docker, and just call various lifecycle methods on it. There's no need for branching control flow.
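As a rough illustration (this is not r2c's actual code; the class and method names are invented), the shape of that abstraction might look like the following, with a dry_run flag so the docker invocations can be inspected without a daemon:

```python
import subprocess
from abc import ABC, abstractmethod

class DockerFileManager(ABC):
    """Lifecycle interface for getting files into and out of an analyzer."""

    def __init__(self, dry_run: bool = False):
        self.dry_run = dry_run
        self.commands = []  # every docker invocation, recorded for inspection

    def _docker(self, *args: str) -> None:
        cmd = ["docker", *args]
        self.commands.append(cmd)
        if not self.dry_run:
            subprocess.run(cmd, check=True)

    @abstractmethod
    def fix_permissions(self, target: str) -> None:
        """Make `target` (a host path or volume name) world-read/writable."""

class BindMountManager(DockerFileManager):
    """Local daemon: chmod the host directory through a bind mount."""
    def fix_permissions(self, target: str) -> None:
        self._docker("run", "--rm",
                     "--mount", f"type=bind,source={target},target=/vol",
                     "alpine:3.9", "chmod", "-R", "ugo+rwx", "/vol")

class VolumeManager(DockerFileManager):
    """Remote daemon: chmod a named volume (data moves via docker cp)."""
    def fix_permissions(self, target: str) -> None:
        self._docker("run", "--rm",
                     "--mount", f"type=volume,source={target},target=/vol",
                     "alpine:3.9", "chmod", "-R", "ugo+rwx", "/vol")

def make_file_manager(remote_daemon: bool, dry_run: bool = False) -> DockerFileManager:
    """Pick the right manager once; callers only see the lifecycle methods."""
    cls = VolumeManager if remote_daemon else BindMountManager
    return cls(dry_run=dry_run)
```

Calling make_file_manager(remote_daemon=True, dry_run=True) and then fix_permissions("analysis-vol") records the docker command that would run, without needing a daemon present.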