Conversation
Spec covers data flow, components, error handling, threat model, kernel requirements, multi-arch BPF build, and CI strategy. Plan decomposes the work into 16 TDD tasks landing in a single PR.
Centralizes the placeholder format used by both the webhook and the BPF program. Length is fixed (77 bytes) so BPF substitution can write in place without reallocating the user-space stack.
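A minimal sketch of what such a shared helper could look like (the `__VDBI_PH_<hex>___` shape and the 77-byte length are taken from later commits in this log; the function and constant names are assumptions):

```go
package placeholder

import (
	"crypto/rand"
	"encoding/hex"
	"strings"
)

const (
	Len    = 77 // total placeholder length; mirrors the fixed buffer in the BPF program
	prefix = "__VDBI_PH_"
	suffix = "___"
)

// Generate returns a random, printable placeholder of exactly Len bytes,
// so the in-kernel substitution can overwrite it without moving anything.
func Generate() (string, error) {
	body := make([]byte, (Len-len(prefix)-len(suffix)+1)/2)
	if _, err := rand.Read(body); err != nil {
		return "", err
	}
	hexBody := hex.EncodeToString(body)[:Len-len(prefix)-len(suffix)]
	return prefix + hexBody + suffix, nil
}

// IsPlaceholder reports whether s has the exact shape Generate produces.
func IsPlaceholder(s string) bool {
	return len(s) == Len && strings.HasPrefix(s, prefix) && strings.HasSuffix(s, suffix)
}
```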
Thin wrappers around sys/wrapping/{wrap,unwrap} used by the BPF
credential injection mode. The webhook calls WrapValues to produce a
single-use 5-min token that travels through the PodSpec; the node
DaemonSet calls UnwrapValues to retrieve the real payload.
- Add TestUnwrapValues_VaultError covering the error path
- Document that an unwrap error still consumes the token
- Note that defer handles panic reset of SetWrappingLookupFunc
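A sketch of what these wrappers plausibly look like against hashicorp/vault/api (the function names come from this log; the argument shapes, TTL handling, and error text are illustrative assumptions):

```go
package vault

import (
	"time"

	"github.com/hashicorp/vault/api"
)

// WrapValues response-wraps the payload and returns the single-use wrap token.
func WrapValues(client *api.Client, values map[string]interface{}, ttl time.Duration) (string, error) {
	// Request response-wrapping for this call only; the deferred reset also
	// runs if the write panics (the point noted in the review follow-up).
	client.SetWrappingLookupFunc(func(operation, path string) string {
		return ttl.String()
	})
	defer client.SetWrappingLookupFunc(nil)

	secret, err := client.Logical().Write("sys/wrapping/wrap", values)
	if err != nil {
		return "", err
	}
	return secret.WrapInfo.Token, nil
}

// UnwrapValues exchanges the token for the original payload. The token is
// single-use server-side, so an error here still consumes it.
func UnwrapValues(client *api.Client, token string) (map[string]interface{}, error) {
	secret, err := client.Logical().Unwrap(token)
	if err != nil {
		return nil, err
	}
	return secret.Data, nil
}
```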
Adds the configuration knobs for the BPF credential protection layer. ModeBPF is validated as a legal runtime mode; BPFConfig holds the single Enabled switch plus operational tunables (TTL, tmpfs path, max mappings). All defaults applied at load time. No behavior change yet — Run* methods still need wiring. Also adds ANNOTATION_BPF_MAPPING constant used in subsequent tasks to attach (wrap_token, placeholders) data to pod metadata.
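Illustrative shape of the new configuration (only Enabled, a wrap-token TTL, a tmpfs path, and a max-mappings knob are named in this log; the Go field names, yaml tags, and defaults shown here are assumptions):

```go
package config

import "time"

// ModeBPF joins the existing runtime modes accepted by the config validator.
const ModeBPF = "bpf"

// BPFConfig is nested under the top-level config as cfg.BPF.
type BPFConfig struct {
	Enabled            bool          `yaml:"enabled"`
	WrapTokenTTL       time.Duration `yaml:"wrapTokenTTL"`       // lifetime of the single-use wrap token
	TmpfsPath          string        `yaml:"tmpfsPath"`          // e.g. /run/vault-db-injector/bpf
	MaxMappingsPerNode int           `yaml:"maxMappingsPerNode"` // sizes the BPF hash map (default 4096 later in this PR)
}
```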
Reviewer noted that a separate applyBPFDefaults method diverged from the existing pattern (LogLevel, SyncTTLSecond, etc. live in the NewConfig struct literal). Moves the BPF defaults into the literal and deletes the helper. Adds a YAML round-trip test verifying that explicit values from config file are preserved through NewConfig.
Refactors applyEnvToContainers into shape-specific helpers and adds a single boolean gate. When the gate is on, the function wraps the credentials via Vault response-wrapping, generates fixed-length placeholders, and attaches the (wrap_token, placeholders) map as the bpf-mapping annotation. When the gate is off, behavior is byte-identical to today. Multi-DbConfiguration support under BPF is intentionally rejected with an explicit error for the v1 — adding it would require combining mappings into one annotation, tracked separately.
…g type
Reviewer flagged duplication between applyClassic and applyURI (the wrap+annotate block was copy-pasted) and noted that the local bpfMapping struct in k8smutator would force the BPF runner DaemonSet to redefine the same JSON shape independently — diverging sources of truth.
- Extract wrapAndAnnotate(ctx, pod, creds, vw, errContext) helper used by both shapes; carries the TODO(Task 8) for plumbing the wrap TTL.
- Move BPFMapping struct to pkg/k8s/bpf_mapping.go so both the webhook (writer) and the runner (reader, Task 10) reference one exported type.
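The exported annotation payload, as a sketch (the wrap_token and placeholders keys are named in this log; the exact JSON tags and what the placeholder map's values refer to are assumptions):

```go
package k8s

// BPFMapping is written by the webhook into the bpf-mapping annotation and
// read back by the node runner.
type BPFMapping struct {
	// WrapToken is the single-use Vault response-wrapping token covering the real credentials.
	WrapToken string `json:"wrap_token"`
	// Placeholders maps each fixed-length placeholder to the unwrapped field it stands for.
	Placeholders map[string]string `json:"placeholders"`
}
```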
Recognizes the new runtime mode and routes it through Controller.RunBPF. The kernel-coupled body lives in pkg/bpf and is wired in a subsequent task; for now runBPFAgent is a build-tagged stub that returns an explicit 'not yet linked' error on Linux and 'not supported' on other platforms. The binary still compiles on macOS / Windows for local development. main.go gains a 'case config.ModeBPF' branch and includes RunBPF in the ModeAll errgroup when cfg.BPF.Enabled, so cluster operators can either deploy a dedicated DaemonSet (recommended) or run all modes in one process.
Reviewer noted that returning naked nil obscured the cancellation cause when RunBPF runs inside the ModeAll errgroup. Now returns ctx.Err() so context.Canceled bubbles up through Wait, while main.go continues to swallow it as a clean shutdown. Also tightens the test assertion (was 'nil OR context.Canceled', now strictly context.Canceled) and adds an Errorf log line in main.go so a non-canceled runErr is structured-logged before the 'stopped' line — without this, errors from the upcoming runBPFAgent stub would only surface via stderr.
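A self-contained sketch of the shutdown semantics described here (names simplified; the real code lives in main.go and the Controller):

```go
package main

import (
	"context"
	"errors"
	"log"

	"golang.org/x/sync/errgroup"
)

// runBPF stands in for Controller.RunBPF: it blocks until the context is done,
// then returns ctx.Err() so cancellation is visible through errgroup.Wait.
func runBPF(ctx context.Context) error {
	<-ctx.Done() // the real implementation runs the BPF agent here
	return ctx.Err()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return runBPF(ctx) })

	cancel() // simulate a shutdown signal

	// context.Canceled is treated as a clean shutdown; anything else is
	// structured-logged before the final "stopped" line.
	if runErr := g.Wait(); runErr != nil && !errors.Is(runErr, context.Canceled) {
		log.Printf("run error: %v", runErr)
	}
}
```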
Cgroup v2 inode lookup matching what bpf_get_current_cgroup_id() returns inside the LSM hook. Searches the three standard QoS slices (guaranteed, burstable, besteffort) and the three common container runtime prefixes (containerd, cri-o, docker). Linux-only: the BPF LSM hook this coordinates with is a Linux kernel feature.
Reviewer questioned the Guaranteed QoS path. After verifying against kubelet upstream source (pkg/kubelet/cm/qos_container_manager_linux.go lines 97-138, pod_container_manager_linux.go), the implementation is correct: kubelet's systemd cgroup driver only creates kubepods-burstable.slice and kubepods-besteffort.slice intermediate slices; Guaranteed pods land directly under kubepods.slice. Add a comment to prevent the same mistake from being made again, and a second comment explaining the hyphen-to-underscore systemd escaping. Also improves the not-found error message to include the list of tried paths and adds a test for the bare-container-ID input form.
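A sketch of the lookup itself (Linux-only; the candidate path layout and the hyphen-to-underscore escaping come from this log, the helper name is an assumption). bpf_get_current_cgroup_id() returns the cgroup directory's inode number, so user space just stats the matching cgroupfs directory:

```go
//go:build linux

package cgroup

import (
	"fmt"
	"os"
	"syscall"
)

// cgroupIDFromPath returns the inode of a cgroup v2 directory, which is the
// same value the BPF LSM hook sees from bpf_get_current_cgroup_id().
func cgroupIDFromPath(path string) (uint64, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, fmt.Errorf("unexpected stat type for %s", path)
	}
	return st.Ino, nil
}

// Example candidate for a burstable pod under containerd with the systemd
// driver (Guaranteed pods sit directly under kubepods.slice with no QoS slice,
// and the pod UID's hyphens are escaped to underscores):
//
//   /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/
//     kubepods-burstable-pod<uid_with_underscores>.slice/cri-containerd-<containerID>.scope
```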
One JSON file per pod under /run/vault-db-injector/bpf, atomic via write-then-rename. Memory-backed by Helm emptyDir so contents die at node reboot — consistent with credential lifecycle, since rebooted nodes lose their pods and re-admit through the webhook. Save/Load/LoadAll/Delete operations cover the runner's needs: LoadAll on startup to repopulate the in-memory cache, Save on each new pod, Delete on pod-deleted events (idempotent: missing file is not an error).
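A sketch of the write-then-rename pattern (directory and operation names from this log; the struct fields are assumptions). A rename within the same tmpfs directory is atomic, so a crash mid-write never leaves a torn JSON file:

```go
package persist

import (
	"encoding/json"
	"errors"
	"os"
	"path/filepath"
)

// PersistedMapping is one pod's worth of state; the real field set is defined
// by the runner (a later commit extends it to a list of cgroup IDs).
type PersistedMapping struct {
	CgroupIDs    []uint64          `json:"cgroup_ids"`
	Placeholders map[string]string `json:"placeholders"`
}

// Save writes atomically: write a sibling temp file, then rename over the target.
func Save(dir, podUID string, m PersistedMapping) error {
	data, err := json.Marshal(m)
	if err != nil {
		return err
	}
	tmp := filepath.Join(dir, podUID+".json.tmp")
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, podUID+".json"))
}

// Delete is idempotent: a missing file is not an error.
func Delete(dir, podUID string) error {
	err := os.Remove(filepath.Join(dir, podUID+".json"))
	if errors.Is(err, os.ErrNotExist) {
		return nil
	}
	return err
}
```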
Hooks lsm/bprm_check_security to scan envp at exec time. For each mapping registered against the current cgroup, replaces matching placeholder runs with the real value (NUL-padded to placeholder length). The envp scan is a bounded stride loop (256 × 64 bytes = 16 KB coverage) that the BPF verifier accepts. Mappings cap at 8 per cgroup and 4096 cgroups per node — sized for typical workload density. BPF C sources and vmlinux.h live in pkg/bpf/c/ (isolated from the Go toolchain to avoid "C source files not allowed" errors). Compiled artifacts (substitute.amd64.bpf.o, substitute.arm64.bpf.o) sit in pkg/bpf/ where the Task 9 loader will embed them via go:embed. arm64 .bpf.o is currently a copy of amd64 and will be replaced by a real cross-compile when the Dockerfile clang stage lands (Task 12). vmlinux.h is bpftool-generated and committed for offline builds; .gitattributes marks it as linguist-generated to suppress GitHub diffing.
Loads the embedded substitute.{amd64,arm64}.bpf.o (selected by runtime
GOARCH), verifies BPF LSM is enabled in the kernel cmdline, attaches
the LSM hook, and exposes PutMapping / DeleteMapping / Close for the
runner to populate the cgroup→mappings hash map.
PlaceholderLen, ValueMax, and MaxMappingsPerCgroup constants mirror
the BPF C source. PutMapping enforces them at the Go boundary so a
malformed call to the runner never reaches the kernel.
Test coverage is structural (constants + kernel-support shape) since
real attach requires CAP_BPF; the integration test in a privileged
environment lives behind //go:build integration_bpf and is wired up
in the CI task.
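A sketch of the loader's boundary (the embedded object names, the 77-byte placeholder, the per-cgroup cap, and the 73-byte value cap that appears later in this log are from the PR; the Loader struct, map encoding, and error text are assumed):

```go
package bpf

import (
	_ "embed"
	"fmt"
	"runtime"
)

//go:embed substitute.amd64.bpf.o
var objAMD64 []byte

//go:embed substitute.arm64.bpf.o
var objARM64 []byte

// objectForArch picks the embedded BPF object matching the running binary.
func objectForArch() ([]byte, error) {
	switch runtime.GOARCH {
	case "amd64":
		return objAMD64, nil
	case "arm64":
		return objARM64, nil
	default:
		return nil, fmt.Errorf("no embedded BPF object for GOARCH %q", runtime.GOARCH)
	}
}

// Constants mirrored from the C source so malformed input is rejected in Go.
const (
	PlaceholderLen       = 77 // PLACEHOLDER_LEN in substitute.bpf.c
	ValueMax             = 73 // usable value bytes (NUL terminator and padding excluded)
	MaxMappingsPerCgroup = 8
)

type Loader struct{ /* cached *ebpf.Map handle, program link, etc. */ }

// PutMapping validates at the Go boundary; a bad call never reaches the kernel map.
func (l *Loader) PutMapping(cgroupID uint64, placeholder, value string) error {
	if len(placeholder) != PlaceholderLen {
		return fmt.Errorf("placeholder must be exactly %d bytes, got %d", PlaceholderLen, len(placeholder))
	}
	if len(value) > ValueMax {
		return fmt.Errorf("value exceeds %d bytes and cannot be substituted in place", ValueMax)
	}
	// ... encode the fixed-size entry and update the cgroup->mappings hash map ...
	return nil
}
```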
cilium/ebpf v0.21.0 (added in commit d4dca7f for the BPF loader) requires go 1.24+. Aligns the CI matrix accordingly.
Reviewer noted that PutMapping and DeleteMapping looked up the map by name on every call, and that the update/delete errors were returned unwrapped (inconsistent with the rest of the file using cockroachdb/errors.Wrap). Caches the *ebpf.Map handle on the Loader struct at Load() time and wraps both update and delete errors with identifying context.
The runner attaches the BPF program, restores the in-memory cache from tmpfs, builds a Vault Connector authenticated with the DS's SA token, then runs an informer filtered by spec.nodeName == NODE_NAME. On pod added/updated: read bpf-mapping annotation → unwrap via Vault → resolve cgroup_id from podUID+containerID → program the BPF map → persist tmpfs. On pod deleted: delete BPF entry + tmpfs file. Idempotency via an in-memory processed-set; tmpfs restore preloads it so a DS restart does not re-consume already-spent wrap tokens. Prometheus metrics expose mappings_loaded gauge and unwrap_errors_total counter. Also extends k8s.KubernetesClient with RawClientset() for the informer factory; existing test fakes return nil since they do not exercise informers.
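A sketch of the node-scoped informer setup (NODE_NAME and the spec.nodeName filter come from this log; the factory wiring is standard client-go and the resync period is an assumption):

```go
package runner

import (
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newNodeLocalPodInformer watches only the pods scheduled on this node, so the
// DaemonSet never processes (or unwraps tokens for) pods it does not own.
func newNodeLocalPodInformer(clientset kubernetes.Interface) cache.SharedIndexInformer {
	nodeName := os.Getenv("NODE_NAME") // injected via the downward API in the DS spec
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset,
		30*time.Second,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", nodeName).String()
		}),
	)
	return factory.Core().V1().Pods().Informer()
}
```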
…ield
Two reviewer-flagged issues:
- DeleteFunc was silently dropping cache.DeletedFinalStateUnknown events that arrive after an informer re-sync, leaking BPF map entries for pods deleted while the watch was disconnected.
- The 'unwrapped data missing field' error path forgot to clear the processed flag, so subsequent informer events would silently skip the pod even though the wrap token had already been burned. Now also increments the unwrap_errors_total metric with reason 'missing_field' for observability.
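Continuing the informer sketch above, the DeleteFunc fix is the standard client-go tombstone dance (processPodDeleted is the runner's handler, passed in here to keep the sketch standalone):

```go
package runner

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// registerDeleteHandler wires only the deletion path; add/update handlers elided.
func registerDeleteHandler(podInformer cache.SharedIndexInformer, processPodDeleted func(*corev1.Pod)) {
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			pod, ok := obj.(*corev1.Pod)
			if !ok {
				// After a watch disconnect the informer may deliver a tombstone
				// (cache.DeletedFinalStateUnknown) instead of the pod object.
				tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					return
				}
				if pod, ok = tombstone.Obj.(*corev1.Pod); !ok {
					return
				}
			}
			processPodDeleted(pod) // remove the BPF map entry and the tmpfs file
		},
	})
}
```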
A single Helm value (bpf.enabled) renders the BPF DaemonSet and tells the injector deployment to set BPF_ENABLED=true. When false, behavior is byte-identical to today: no DS rendered, no flag passed. The DaemonSet uses hostPID for cgroup resolution, mounts /sys/fs/cgroup read-only and /sys/fs/bpf, runs with CAP_BPF + CAP_PERFMON + CAP_SYS_RESOURCE on a read-only root, and uses an in-memory emptyDir at /run/vault-db-injector/bpf for tmpfs persistence (consistent with the credential lifecycle: tmpfs dies at node reboot, when the pods are also gone).
Reviewer noted that the ConfigMap was mounted at /bpf but the binary was never told to read it, leaving Vault address/role unconfigured at startup. Adds --config=/bpf/config.yaml to the DS args, matching the pattern used by the renewer/revoker deployments.
Dockerfile gains a clang+libbpf builder stage that compiles substitute.bpf.c for both linux/amd64 and linux/arm64; the resulting .bpf.o files are copied into the Go build stage and embedded via go:embed, replacing the placeholder stubs committed for offline builds. Multi-arch image build runs through docker buildx in CI. Also fixes the Go image version (1.23.6 → 1.24) to match go.mod. Adds .github/workflows/bpf-integration.yml that probes for BPF LSM in the runner kernel; when present, installs clang and runs the //go:build integration_bpf tests with sudo (CAP_BPF needed for actual program load + attach). When absent (most hosted GitHub runners), emits a warning and skips — the gate is still useful as documentation of how to validate locally on a kernel with BPF LSM. The new pkg/bpf/integration_test.go exercises the full Load → Close cycle against the real kernel.
build-bpf fails without -I $(BPF_LIBBPF_INCLUDE) on hosts where libbpf-dev is not installed (headers only live under linux-headers-$(uname -r)/…). Variable was previously removed speculatively — restore it (Option B). Also rename the CI install step to reflect bpftool is not needed at test time and drop linux-tools-generic from the apt list.
how-it-works/bpf-mode.md walks through architecture, activation, threat model, and failure modes — operator-facing prose with cross links to the design spec for deep details. getting-started/bpf-requirements.md lists kernel configs, capabilities, and tested distros so cluster operators can validate readiness before flipping bpf.enabled. comparison.md gains a row for credential invisibility at the K8s API layer — a unique differentiator vs vault-secrets-operator, external-secrets, and secrets-store-csi-driver. README mentions the feature; CONTRIBUTING explains how to validate BPF mode locally (kernel cmdline, build-bpf, integration test target).
…tructure, and hostPID
Issues fixed:
- Remove --mode=bpf arg from DaemonSet (main.go has no such flag, exits 2)
- Remove hostPID: true (unused; cgroup resolution uses /sys/fs/cgroup only)
- Drop redundant BPF env vars from DaemonSet (ConfigMap already carries them)
- Rename BPF_* env vars to INJECTOR_BPF_* in deployment-injector.yaml to match the
  envconfig.Process("INJECTOR", cfg) prefix; without this BPFConfig.Enabled stays false
  (see the sketch after this list)
- Rewrite BPF ConfigMap entry to nested bpf: block matching BPFConfig yaml tags
(bpfWrapTokenTTL at root was never unmarshalled; nested structure is required)
- Make VaultSecretName/VaultSecretPrefix validation conditional on mode != bpf:
BPF runner only unwraps tokens, never reads KV — requiring these fields hurts UX
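Why the INJECTOR_ prefix rename above matters, as a sketch (assuming github.com/kelseyhightower/envconfig, which the envconfig.Process("INJECTOR", cfg) call implies; struct shape simplified):

```go
package config

import "github.com/kelseyhightower/envconfig"

type BPFConfig struct {
	Enabled bool `yaml:"enabled"`
}

type Config struct {
	BPF BPFConfig `yaml:"bpf"`
}

// LoadFromEnv overlays environment variables onto cfg. With the "INJECTOR"
// prefix, the nested field is read from INJECTOR_BPF_ENABLED; a bare
// BPF_ENABLED is silently ignored, which is exactly the bug the rename fixes.
func LoadFromEnv(cfg *Config) error {
	return envconfig.Process("INJECTOR", cfg)
}
```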
…th at admission
- Add placeholder.MaxValue = 73 (authoritative Go constant, matches BPF C-side buffer)
- wrapAndAnnotate: accept explicit wrapTTL time.Duration, pass to WrapValues instead of 0; removes TODO(Task 8) and ensures cfg.BPF.WrapTokenTTL is honoured
- wrapAndAnnotate: validate len(username) and len(password) <= placeholder.MaxValue before calling Vault; returns admission error if exceeded, matching the doc promise
- Thread wrapTTL through applyEnvToContainersWithBPF -> applyClassic/applyURI signatures
- applyEnvToContainers passes cfg.BPF.WrapTokenTTL to the inner testable variant
- Update all tests to pass the new wrapTTL argument (0 for BPF-disabled paths)
- Add TestApplyEnvToContainers_BPFEnabled_RejectLongCredentials covering both username-too-long and password-too-long cases
… metrics
- EAGAIN (Issue 5): BPF program actually returns 0 (allow) when cgroup not found; document real behavior: placeholders reach the application → connection failure → CrashLoopBackoff → self-resolves when DS catches up
- Placeholder length (Issue 8): fix "74 bytes" → "77 bytes" in two places (architecture overview and limitations section); add clarification that 73-byte max value accounts for NUL terminator and alignment padding
- Phantom metrics (Issue 9): drop vault_injector_bpf_substitutions_total and vault_injector_bpf_map_size from the metrics table — only vault_injector_bpf_mappings_loaded (Gauge) and vault_injector_bpf_unwrap_errors_total (CounterVec) are implemented; also remove the bpf_map_size reference from the failure modes table
Replace manual NewConnector+Login with vault.ConnectAndRenew so the BPF runner benefits from background token renewal, matching the pattern used by renewer/revoker entry points.
Register vault_injector_bpf_map_size to track cgroup entries in the BPF map (distinct from mappings_loaded which counts pods). Increment on PutMapping success, decrement on DeleteMapping success, enabling "BPF map full" alerting against bpf.maxMappingsPerNode.
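A sketch of the new gauge (the metric name and the inc/dec points are from this commit; registration style assumed):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// bpfMapSize tracks cgroup entries currently programmed in the BPF map,
// distinct from mappings_loaded, which counts pods.
var bpfMapSize = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "vault_injector_bpf_map_size",
	Help: "Number of cgroup entries currently present in the BPF substitution map.",
})

func init() {
	prometheus.MustRegister(bpfMapSize)
}

// Called after a successful PutMapping / DeleteMapping respectively:
func onPutMappingSuccess()    { bpfMapSize.Inc() }
func onDeleteMappingSuccess() { bpfMapSize.Dec() }
```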
Refactor processPodAdded to iterate all regular, init, and ephemeral container statuses and call PutMapping for each resolved cgroup_id. Update PersistedMapping to store a list of cgroup_ids (not just one) so that processPodDeleted can clean up ALL BPF entries for the pod. processPodDeleted now reads cgroup_ids from tmpfs instead of re-resolving from a potentially stale pod object. Add multi-container tests and update persist_test.go for new struct shape.
…reflight, and CI invariants
- Issue #1: Load(maxMappings int) sets cgroup_mappings.MaxEntries from cfg.BPF.MaxMappingsPerNode; helm values.yml and configmap expose it (default 4096, matching MAX_CGROUPS in C source).
- Issue #2: processPodAdded rolls back PutMapping calls and removes UID from processed when persister.Save fails; mappingsLoaded is only incremented after both Put+Save succeed. New test TestProcessPodAdded_SaveFailureRollsBackBPFMap covers the path.
- Issue #3: Makefile verify-bpf-object target uses readelf section-diff instead of byte-exact cmp; added to test.yml CI step.
- Issue #4: bpf-integration workflow split into always-running compile-bpf job (enforced) and optional integration job (LSM-gated).
- Issue #5: checkCgroupSetup() / checkCgroupSetupAt(root) added to cgroup.go; Run() calls it before Load(); three unit tests cover OK/no-v2/no-systemd.
- Issue #7: scan_callback adds break after bpf_probe_write_user — two distinct placeholders cannot match at the same byte offset; BPF object recompiled.
- Issue #9 (partial): persistErrors CounterVec vault_injector_bpf_persist_errors_total{op} added and registered; Save and Delete failures increment it.
…efore wrapping
Issue #6: check pod.Annotations[ANNOTATION_BPF_MAPPING] before calling vw.WrapValues so a second DbConfiguration in the same admission does not burn a single-use wrap token before returning the multi-config error.
…ndations
Issue #11: add 'Recommended pod hardening' section to bpf-requirements.md covering RBAC restriction and Pod Security Admission restricted profile. Reference it from the bpf-mode.md Limitations section under the kubectl exec entry.
If PutMapping succeeds for container A and fails for container B, the previously written BPF entry for A is now deleted, processed[uid] is cleared, and mapSize is decremented — preventing a silent leak until DaemonSet restart. Added TestProcessPodAdded_PutMappingFailureRollsBack to assert the rollback: first cgroup deleted, processed map cleared, mappingsLoaded and mapSize both at net-zero after the failure.
Add .text, .rodata, and scan_callback to the readelf grep filter so the structural presence of the scan_callback code section is verified. Also extend the target to cross-compile and verify the arm64 object against a fresh arm64 build (clang -D__TARGET_ARCH_arm64), not against the fresh amd64 binary, which has differing relocation offsets. Update build-bpf to cross-compile arm64 locally via clang -target bpf -D__TARGET_ARCH_arm64, removing the "skipping locally" stub, since cross-compilation works on amd64 hosts without extra vmlinux.h setup.
Recompile substitute.arm64.bpf.o locally via: clang -O2 -g -target bpf -D__TARGET_ARCH_arm64 ... Cross-compile succeeded on amd64 host (no separate vmlinux.h needed; -target bpf bytecode is arch-agnostic, __TARGET_ARCH_arm64 only affects CO-RE relocation context). The binary was last committed in c72d83d before the bpf_loop break-fix in pkg/bpf/c/substitute.bpf.c.
…ompat
Tested end-to-end in a Linux 6.8 KVM with lsm=...,bpf cmdline:
1. Kernel verifier rejected our program with:
"LSM programs must have a GPL compatible license"
The license string was 'Apache-2.0'. Changed to 'Dual BSD/GPL' so
the BPF source is dual-licensed (GPL-2.0 OR Apache-2.0) — required
by the kernel for LSM programs and bpf_probe_write_user. The
project's overall Apache-2.0 license is unchanged; only this
single C source file is dual-licensed (SPDX header updated).
2. integration_test.go was using the old Load() signature; updated
to Load(0) for compile-time defaults since the M-1 fix changed
Load to accept a max-mappings argument.
3. TestProcessPodAdded_SaveFailureRollsBackBPFMap relied on
chmod 0o555 to force a Save error, which root bypasses. The CI
integration runner runs as root (CAP_BPF), so the test now
forces failure via a parent-is-a-file path (MkdirAll always
fails), which root cannot bypass.
Validation: TestIntegration_LoadAttachClose now PASSES on real kernel.
Full suite green with -tags=integration_bpf -race under sudo: 189 tests.
K3D/K3s and many lightweight K8s distributions use the cgroupfs cgroup driver instead of systemd. Our resolver previously only searched systemd paths (kubepods.slice/...). Extends resolveCgroupIDAt to also try cgroupfs paths (kubepods/pod<UID>/<containerID> et al), preferring systemd (more common in production) and falling back to cgroupfs. Also: K3D node containers don't expose /sys/kernel/security by default, so the loader's BPF LSM precondition check fails. Mount /sys/kernel/security read-only in the DaemonSet.
K3D runtime validation revealed that bpf_probe_write_user is not
permitted in LSM programs by the verifier — only KPROBE/TRACEPOINT
types may call it. The original lsm/bprm_check_security design
was therefore unloadable on any kernel.
Switches to tracepoint/syscalls/sys_enter_execve which fires before
the kernel copies envp from user memory to the new process's stack.
We walk the user-space envp[] pointer array, read each env string
window directly from user memory via bpf_probe_read_user (user
pointer + variable offset is accepted; map_value + variable offset
is not on kernels ≤ 6.8), compare against registered placeholders
using __builtin_memcmp, and write the substitution back to user
memory. The kernel's subsequent copy_strings (do_execveat_common)
propagates our writes into the new process's environment.
Additional runtime findings on kernel 6.8:
- bpf_probe_write_user requires CAP_SYS_ADMIN at load time (not
just CAP_BPF). Adds SYS_ADMIN to the DaemonSet capabilities.
- tracefs must be bind-mounted into the pod for the pre-flight
kernel-support check. Adds /sys/kernel/tracing hostPath volume.
- Stack budget: #pragma unroll for PLACEHOLDER_LEN=77 byte loops
combined with per-entry spills blows the 512-byte BPF stack. Fixed
by using __builtin_memcmp (no per-byte spill) and bpf_probe_read_kernel
for value copy, keeping the frame at 88 bytes.
The Go loader is updated to attach as a tracepoint (Tracepoint()
instead of AttachLSM()) and look up program "sys_enter_execve".
The kernel-support check is updated to probe tracefs instead of
/sys/kernel/security/lsm. Substitution semantics are unchanged
from the operator's view; only the kernel hook type changes.
Validated on kernel 6.8.0-111-generic + k3d (BPF LSM still enabled
in kernel cmdline but not required for tracepoint mode).
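The Go-side attach change, as a sketch against cilium/ebpf (the program name sys_enter_execve comes from this log; collection handling and error text assumed):

```go
package bpf

import (
	"fmt"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// attach replaces the previous AttachLSM call: the program is now attached to
// the syscalls:sys_enter_execve tracepoint, which fires before the kernel
// copies envp into the new process image.
func attach(coll *ebpf.Collection) (link.Link, error) {
	prog, ok := coll.Programs["sys_enter_execve"]
	if !ok {
		return nil, fmt.Errorf("program sys_enter_execve not found in collection")
	}
	tp, err := link.Tracepoint("syscalls", "sys_enter_execve", prog, nil)
	if err != nil {
		return nil, fmt.Errorf("attaching sys_enter_execve tracepoint: %w", err)
	}
	return tp, nil // caller keeps the link open and Close()s it on shutdown
}
```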
Two real bugs discovered during full end-to-end validation in a k3d cluster on kernel 6.8.
1. Runner: the container-IDs check now runs BEFORE Unwrap. The wrap token is single-use server-side; unwrapping before kubelet has populated ContainerStatuses caused retry events to burn the token. Now we silently wait for the next informer event when statuses aren't ready, then unwrap once.
2. Loader: the Go struct 'entry' was 159 bytes (binary.Write doesn't pad) while the C struct 'mapping' is 160 bytes (the compiler aligns ValueLen to 4 bytes after the 74-byte Value array). Map.Update rejected the entry with 'doesn't marshal to 1288 bytes'. Added an explicit Pad1 [1]byte to match the C layout exactly.
End-to-end validation now passes:
- kubectl get pod -o yaml shows placeholders (__VDBI_PH_<hex>___)
- kubectl exec -- env shows REAL substituted values
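The layout issue, sketched (the 74-byte value array, the missing pad byte, and the 159-vs-160 mismatch are from this commit; the surrounding fields and their order are assumptions):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// entry mirrors struct mapping in substitute.bpf.c. encoding/binary writes
// fields back to back, so the C compiler's alignment padding has to be spelled
// out explicitly or the marshalled size comes up one byte short (159 vs 160).
type entry struct {
	PlaceholderLen uint32   // assumed leading field
	Placeholder    [77]byte // PLACEHOLDER_LEN
	Value          [74]byte // value buffer on the C side
	Pad1           [1]byte  // explicit padding: C aligns the next u32 to a 4-byte offset
	ValueLen       uint32
}

func main() {
	var buf bytes.Buffer
	_ = binary.Write(&buf, binary.LittleEndian, entry{})
	fmt.Println(buf.Len()) // 160, matching the C struct, so Map.Update accepts it
}
```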
Init containers and fast-restarting (CrashLoopBackoff) containers
exec'd before the DaemonSet had time to discover their per-container
cgroup_id and program the BPF map, so their first execve saw raw
placeholders.
Root cause: kubelet creates the pod-level cgroup at admission time
(before any container starts), then a per-container cgroup at runtime
just before the container's execve. The runner only programmed the
per-container ids, learned via container statuses, which arrive
asynchronously after the kubelet starts the container.
Fix:
- ResolvePodCgroupID(podUID): inode of kubepods{,.slice}/.../pod<UID>
(no container suffix). Exists right after pod admission.
- processPodAdded: program the pod-cgroup in the BPF map FIRST, then
best-effort the per-container ids. Failure to resolve a per-container
id is no longer fatal; the ancestor walk covers it.
- substitute.bpf.c: on a leaf-cgroup miss, walk
bpf_get_current_ancestor_cgroup_id(level) for level 0..7 and look up
each in the map. Single hit suffices to drive substitution.
This eliminates the init-container race AND the CrashLoopBackoff
restart race (same root cause: per-container cgroup_id changes per
restart on some runtimes; the pod cgroup_id is stable for the pod
lifetime).
Validated in k3d (Linux 6.8): init container reads real $DB_USER from
the very first execve; crash-loop pod logs show real values across
3+ restart iterations.
emptyDir{medium: Memory} is per-POD: every DaemonSet pod restart
(rolling update, OOM, eviction) wipes it, so restoreBPFMaps finds
no entries and every running pod on the node loses substitution
until the credentials are re-issued (which can't happen — wrap
tokens are single-use).
Replace with hostPath at /run/vault-db-injector/bpf,
type DirectoryOrCreate. /run is tmpfs on every modern distribution
(systemd-tmpfiles), so:
- Files survive a DS pod restart (restoreBPFMaps re-programs the BPF
map for already-running pods).
- Files DO NOT survive a node reboot, which is the desired lifecycle:
a rebooted node loses all its pods anyway, so the next pod admission
triggers a fresh wrap-token unwrap.
Validated in k3d: rolling-restart of the BPF DS leaves running pods
with their substitution intact, and the new DS logs
"restoreBPFMaps: restored N cgroup(s) for pod <uid>" for each entry.
Pivot rationale: BPF probe_write_user can't do variable-length writes, breaking URI/DSN-mode envs (postgres://user:__PH__@host/db). NRI plugin substitutes env in container spec before runc, fixing the structural bug and removing all kernel/capability/lockdown concerns.
Pivot from eBPF probe_write_user (which couldn't do variable-length writes, breaking URI/DSN-mode envs) to a containerd/CRI-O NRI plugin that mutates container env in CreateContainer before runc.
- pkg/bpf/ deleted (runner, loader, cgroup, BPF C, embedded objects)
- pkg/nri/ added with substitute, plugin, vault unwrap, runner
- Annotation renamed bpf-mapping → nri-mapping
- Mode renamed bpf → nri across config/controller/main
- Webhook drops length cap (URI mode now works for arbitrary lengths)
- placeholder.MaxValue removed; only structural shape matters now
- Metrics: vdbi_nri_substitutions_total, vdbi_nri_unwrap_failures_total
- Makefile BPF targets removed
166 tests pass.
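The core of the NRI-side substitution, sketched (a later fix in this log confirms it is a strings.ReplaceAll over the container's env; the function name and signature are assumptions). Because the env is rewritten in the container spec before runc starts, the real value may be any length, which is what removes the structural limitation of the in-place BPF write:

```go
package nri

import "strings"

// substituteEnv rewrites every "KEY=VALUE" entry of the container's env,
// replacing each registered placeholder with its real value. Unlike the BPF
// in-place write, the replacement may be longer or shorter than the token,
// so URI/DSN-style envs (postgres://user:PLACEHOLDER@host/db) work too.
func substituteEnv(env []string, placeholders map[string]string) []string {
	out := make([]string, len(env))
	for i, kv := range env {
		for token, real := range placeholders {
			kv = strings.ReplaceAll(kv, token, real)
		}
		out[i] = kv
	}
	return out
}
```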
NRI DS runs non-root, drops all capabilities, mounts only /var/run/nri hostPath socket. ConfigMap renamed; values schema renamed bpf.* → nri.*. Webhook deployment env vars renamed accordingly.
- README, CONTRIBUTING, comparison docs use NRI naming
- mkdocs.yml drops bpf-mode and bpf-requirements pages
- test.yml CI drops BPF clang/object verification steps
- Test names and comments use NRI consistently
- 171 tests pass
Without persistence, a wrap token consumed at first CreateContainer was lost from plugin memory if the DS pod restarted, leaving subsequent container retries (CrashLoop, kubelet restart) unable to substitute — the placeholder string leaked into the running app.
Reproduction (now fixed):
1. Apply pod with nri-mapping annotation, exit 1 to force CrashLoop
2. Kill plugin DS pods (cache wiped)
3. Wait for kubelet container retry
4. Before fix: container env shows literal placeholder; after: real value
Implementation:
- pkg/nri/persist.go: atomic write (tmp+rename), 0600 file perms, 0700 dir perms
- pkg/nri/plugin.go: write-through on resolveMapping and RemovePodSandbox
- pkg/nri/runner.go: loadCache before stub.Run with boot logging
- Synchronize callback evicts entries for pods no longer on the node
- pkg/config/config.go: NRIConfig.CachePath (default /run/vault-db-injector/nri/cache.json)
- helm: hostPath /run/vault-db-injector/nri DirectoryOrCreate
Trade-off: unwrapped credentials sit in tmpfs cleartext on the node. Same posture as the prior BPF design and as kubelet's projected SA tokens. A root-on-node attacker can already read /proc/<pid>/environ; the cache adds zero new attack surface vs that baseline.
177 unit tests pass; reproduction scenario verified on k3d.
An empty placeholder key in the mapping caused strings.ReplaceAll(env, "",
val) to insert val between every character, corrupting PATH/HOSTNAME and
every other env var. Same issue would arise from any non-conforming key
(short, wrong shape, etc.).
Validation: Substitute now skips entries where the key fails
placeholder.IsPlaceholder. Webhook never emits malformed keys
(placeholder.Generate is safe by construction), but a manually-crafted
or future-buggy annotation no longer corrupts container env.
Reproduction (now safe):
apply pod with annotation '{"placeholders":{"":"x"}}'
before: every env var doubled with junk inserted between every char
after: pod runs with original env intact
Tests in pkg/nri/substitute_test.go updated to use placeholder.Generate
output (existing tests used short tokens that fail validation).
178 unit tests pass; runtime reproduction confirmed on k3d.
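Extending the substitution sketch above, the guard is a one-line structural check before the replace (placeholder.IsPlaceholder is named in this commit; it is injected as a parameter here to keep the sketch standalone):

```go
package nri

import "strings"

// substituteEnv is the earlier sketch with the guard added; isPlaceholder is
// placeholder.IsPlaceholder in the real code.
func substituteEnv(env []string, placeholders map[string]string, isPlaceholder func(string) bool) []string {
	out := make([]string, len(env))
	for i, kv := range env {
		for token, real := range placeholders {
			// A malformed key (empty, short, wrong shape) must never reach
			// ReplaceAll: replacing "" inserts the value between every
			// character of every env entry.
			if !isPlaceholder(token) {
				continue
			}
			kv = strings.ReplaceAll(kv, token, real)
		}
		out[i] = kv
	}
	return out
}
```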
Closes the placeholder-leak window when the NRI plugin is absent on a
node. Without this gate, a pod scheduled to a node where the plugin DS
is missing (image pull, broken DS, nodeSelector mismatch, post-install
delay) starts with the literal placeholder string in env, leaking into
the running app.
Plugin (pkg/nri/readiness.go):
- Labels its node with vault-db-injector.numberly.io/nri-ready=true once
registered with containerd and cache loaded
- 30s reconcile loop in case label is removed externally
- Removes label on SIGTERM (best-effort, 5s timeout)
Webhook (pkg/k8smutator):
- Merges a nodeAffinity matchExpression requiring the readiness label
into every existing NodeSelectorTerm (preserves OR semantic with
user expressions). Tolerations, nodeSelector, and preferred (soft)
affinities are untouched.
Helm:
- ClusterRole grants get/patch on nodes (cluster-wide; plugin restricts
writes to os.Getenv("NODE_NAME") in code)
Behavior matrix:
- Plugin not yet ready → pods Pending (visible failure mode)
- Plugin running normally → pods schedule + substitute
- Plugin crashes after schedule → cache persistence covers (prior fix)
- Label removed manually → re-applied within 30s
185 unit tests pass; runtime reproduction on k3d shows pod Pending when
plugin disabled, schedules and substitutes when restored.
This reverts commit 92cc690.
After reverting the readiness-label/affinity approach (introduced more complexity than it bought, plus a SIGKILL-stale-label bug), we rely on the simpler 'visible failure' model: pods with credentials that miss substitution start unsubstituted, fail to connect, and CrashLoop. This is detectable via Prometheus.
Adds docs/how-it-works/nri-mode.md covering:
- Architecture summary
- Failure modes and existing metrics
- Sample Prometheus alert for 'plugin missing on node where pods have the annotation'
- Hardening checklist (priorityClass, resource requests)
- Trust posture vs kubelet's projected SA tokens
Force-deleted pods (kubectl delete --grace-period=0 --force, kubelet eviction under pressure, or any path that bypasses graceful shutdown) do not trigger NRI's RemovePodSandbox event. Without the sweep, unwrapped credentials of those pods would sit on tmpfs until the next plugin restart. The sweeper lists pods on the local node via the k8s API every 5 minutes and evicts cache entries whose UIDs are no longer alive. Existing ClusterRole grants the pod get/list/watch verbs already; no new RBAC required. Tested: pod with annotation force-deleted → cache initially still has the entry → after 30s the sweep evicts it (interval lowered for the test).
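A sketch of one sweep iteration (the node-local pod list and the UID-based eviction are from this commit; client wiring, helper name, and how the cache consumes the result are assumptions):

```go
package nri

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// aliveUIDsOnNode returns the UIDs of pods currently scheduled on this node;
// the sweeper drops every cache entry whose pod UID is not in the set. This
// covers force-deletes and evictions that never produce a RemovePodSandbox event.
func aliveUIDsOnNode(ctx context.Context, client kubernetes.Interface, nodeName string) (map[string]bool, error) {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return nil, err
	}
	alive := make(map[string]bool, len(pods.Items))
	for _, p := range pods.Items {
		alive[string(p.UID)] = true
	}
	return alive, nil
}
```

In the real runner this is driven by a ticker (every 5 minutes per the commit, lowered in the test).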
…ction
A reference Kyverno ClusterPolicy at helm/policies/kyverno-restrict-nri-socket.yaml blocks hostPath mounts of /var/run/nri and /opt/nri outside trusted namespaces (default: vault-db-injector, kube-system). The README security section explains the inherent NRI threat (any pod that mounts the socket becomes a plugin and can mutate every container) and ranks mitigations: PSA restricted/baseline first, Kyverno policy second, SELinux third.
Closes Hunter finding #H5 (wrap_token-as-bearer-credential leak):
anyone with pods.get + Vault network access could read the wrap_token
from a pod annotation and unwrap before the legitimate plugin did.
Design (docs/superpowers/specs/2026-05-03-pull-not-push-design.md):
- Webhook stops calling Vault sys/wrapping/wrap. Annotation schema v2
contains only {db_path, db_role, placeholders, request_id,
pod_namespace, pod_service_account} — no Vault credential.
- Plugin authenticates to Vault using its own SA (k8s auth method)
and creates dynamic database credentials at CreateContainer time.
- Plugin re-runs CanIGetRoles for the target pod identity claimed in
the annotation, defending against annotation forgery via pods.update
RBAC.
- Lease tagged with pod UID — existing renewer/revoker pipeline
unchanged.
- Cache flow unchanged: per-pod placeholder→value mapping, write-through
to tmpfs, sweep on Synchronize and via periodic k8s pod-list query.
Webhook split into authorizeDbAccess + fetchDbCredentials. Legacy mode
(NRI disabled) still pre-fetches creds at admission. NRI mode just
authorizes + annotates.
Config validation: VaultSecretName / VaultSecretPrefix now required in
all modes (plugin needs them too for lease metadata KV write).
Schema versioning: NRIMapping.SchemaVersion=2 hardcoded; plugin rejects
schema 1 with a clear error so old wrap_token annotations fail visibly
during upgrade rather than silently leak.
178 unit tests pass. K3D end-to-end validation in follow-up.
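The schema v2 payload, as a sketch (the field list and SchemaVersion=2 come from the summary above; Go names and JSON tags are assumptions). Nothing in it is a Vault credential:

```go
package k8s

// NRIMapping is the v2 annotation written by the webhook; the plugin rejects
// SchemaVersion != 2 so leftover v1 (wrap_token) annotations fail visibly.
type NRIMapping struct {
	SchemaVersion     int               `json:"schema_version"`
	DbPath            string            `json:"db_path"`
	DbRole            string            `json:"db_role"`
	Placeholders      map[string]string `json:"placeholders"`
	RequestID         string            `json:"request_id"`
	PodNamespace      string            `json:"pod_namespace"`
	PodServiceAccount string            `json:"pod_service_account"`
}
```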
… (#H6)
Hunter mode found a critical regression introduced by the pull-not-push
refactor: the plugin trusted pod_namespace and pod_service_account
straight out of the annotation. An attacker with pods.create could:
1. Run a pod with their own (unprivileged) SA (e.g. "evil")
2. Annotate it with pod_service_account="trusted-app"
3. Plugin called CanIGetRoles with the CLAIMED identity → passed
4. Plugin fetched the privileged role's credentials and put them in
the attacker's pod env → privilege escalation
Reproduction (now blocked):
kubectl apply <pod with sa=evil, annotation claiming sa=default>
before: container env contains real privileged-app DB credentials
after: plugin logs 'pod identity mismatch: actual default/evil !=
annotated default/default — refusing to fetch credentials'
container starts with placeholder, app fails visibly
Implementation:
- fetchAndBuildMapping now takes the (namespace, name) reported by NRI
for the actual pod, queries kube-apiserver for the pod, and reads
spec.serviceAccountName directly. The annotation's claimed identity
is only used to verify it MATCHES the actual; if disagreement, refuse.
- CanIGetRoles is called with the K8s-attested identity, never the
annotation's. Same for the credential fetch (Namespace +
ServiceAccount fields in DbCredentialsRequest).
178 unit tests pass. Runtime suite on k3d: 9/9 (A-I) including the new
I_identity_forge_blocked case.
…P-2, IMP-5)
CRIT-1 — Pod identity is now verified by UID, not just (namespace, name). Closes a name-reuse race where an attacker could force-delete a pod between admission and CreateContainer, recreate one with the same name+SA, and have the plugin fetch credentials for the recreated pod whose UID doesn't match the NRI sandbox UID. fetchAndBuildMapping now refuses if NRI sandbox UID != kube-apiserver pod.UID.
CRIT-2 — Removed dead wrap/unwrap path. The pull-not-push refactor (76c2074) replaced the wrap_token-as-bearer-credential pattern, but left behind:
- pkg/vault/vault.go: WrapValues / UnwrapValues (deleted)
- pkg/vault/wrap_test.go (deleted)
- pkg/k8smutator/k8smutator_test.go: stubWrapper helper (deleted)
- pkg/config/config.go: NRIConfig.WrapTokenTTL field (deleted)
- helm/values.yml, configmaps.yaml, deployment-injector.yaml: nri.wrapTokenTTL knob and INJECTOR_NRI_WRAP_TOKEN_TTL env var (deleted)
Operators no longer see a knob that does nothing. Tests no longer exercise dead code paths.
IMP-1 — Rewrote docs/how-it-works/nri-mode.md to describe the schema v2 pull-not-push design. Old text described the wrap/unwrap flow which no longer exists. New version: architecture diagram for v2, the three-layer identity attestation defense (UID + namespace + SA), schema upgrade path, hardening checklist, accurate trust posture.
IMP-2 — Kyverno policy now blocks hostPath mounts of /run/vault-db-injector in addition to /var/run/nri and /opt/nri. The credential cache lives there and was previously not covered by the policy (PSA baseline already covered it, but defense-in-depth).
IMP-5 — Deleted leaked pkg/nri/.tmp file (saveCache test artifact).
174 unit tests pass. K3D edge suite: 9/9 (A_substitution, B_uri_mode, C_multi_container, D_init, E_empty_placeholder, F_malformed_json, G_cache_persistence, H_v1_rejected, I_identity_forge_blocked).
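The attestation path of the two findings above, as a sketch (the mismatch message wording and the UID/namespace/SA checks are from this log; function name, parameters, and client wiring are assumptions):

```go
package nri

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// verifyPodIdentity trusts only what the kube-apiserver reports for the sandbox
// the runtime handed us; the annotation's claimed identity is merely compared.
func verifyPodIdentity(ctx context.Context, client kubernetes.Interface,
	sandboxNamespace, sandboxName, sandboxUID, claimedNamespace, claimedSA string) (*corev1.Pod, error) {

	pod, err := client.CoreV1().Pods(sandboxNamespace).Get(ctx, sandboxName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	// CRIT-1: a recreated same-name pod carries a different UID than the NRI sandbox.
	if string(pod.UID) != sandboxUID {
		return nil, fmt.Errorf("pod UID mismatch: apiserver %s != sandbox %s", pod.UID, sandboxUID)
	}
	// #H6: refuse when the annotation claims an identity the pod does not actually have.
	if pod.Namespace != claimedNamespace || pod.Spec.ServiceAccountName != claimedSA {
		return nil, fmt.Errorf("pod identity mismatch: actual %s/%s != annotated %s/%s; refusing to fetch credentials",
			pod.Namespace, pod.Spec.ServiceAccountName, claimedNamespace, claimedSA)
	}
	return pod, nil
}
```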