Our original protection against CVE-2019-5736 involved making a copy of /proc/self/exe into a sealed memfd and re-executing for every runc create and runc exec.
The memfd approach had its own issues (see #1980, though in retrospect quite a bit of that analysis is wrong), and we eventually ended up with a somewhat weaker protection mechanism that relies on the kernel not allowing writes to pierce through read-only barriers. First it was a temporary read-only mount, which was potentially bypassable with mount privileges; now we use a temporary overlayfs with /usr/bin as a lowerdir.
This is all well and good, as long as there isn't something that lets you bypass those kinds of protections (purely hypothetically, something like a page cache poisoning attack like the ones found in the past few weeks 😉). This was something I mentioned a bit back then (as we had just seen Dirty COW and the risk that it posed to these kinds of protections) but the performance issues were deemed too concerning to justify a protection against a hypothetical attack. It is quite frustrating that a mitigation we implemented back in 2019 would've protected against one class of container escapes (though it turns out that Kubernetes is vulnerable in other ways because they unfortunately share layers between different trust levels).
However, now that we have seen two attacks showing that this is not a hypothetical issue, maybe we should revisit that decision? Even something as simple as an environment variable that switches runc to memfd-based cloning would give more security-conscious administrators a way to opt in. Given that the logic was moved outside of runc init in #3987, we could even make it a per-container annotation (you would almost certainly want to always enable it, but making it an annotation would make it easier to expose in every container runtime). I would prefer it be opt-out (after all, how many people would've enabled this if we'd added it back in 2019?) but even an opt-in would be better than nothing...
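To make the proposal a bit more concrete, here is a rough Python sketch of how the toggle could be resolved. Everything in it is hypothetical: the environment variable name, the annotation key, the accepted values, and the choice to let the per-container annotation override the host-wide environment variable are all assumptions for illustration, not existing runc behaviour.

```python
# Hypothetical names, invented for this sketch only.
ENV_VAR = "EXAMPLE_RUNC_MEMFD_CLONE"
ANNOTATION_KEY = "org.example.runc.memfd-clone"

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def use_memfd_clone(env: dict, annotations: dict, default: bool = True) -> bool:
    """Decide whether to re-exec from a sealed memfd copy.

    Precedence (an assumption): annotation, then environment, then default.
    default=True models the opt-out variant; pass False for opt-in.
    """
    for source, key in ((annotations, ANNOTATION_KEY), (env, ENV_VAR)):
        raw = source.get(key)
        if raw is None:
            continue
        value = raw.strip().lower()
        if value in _TRUE:
            return True
        if value in _FALSE:
            return False
        # Refuse to guess on a typo rather than silently weakening protection.
        raise ValueError(f"invalid value for {key}: {raw!r}")
    return default
```

With `default=True` an administrator who cares more about performance can still switch the protection off, while everyone else gets it without having to know it exists.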