
Feat/nri injection mode#56

Open
SoulKyu wants to merge 68 commits into main from feat/ebpf-injection-mode

Conversation


@SoulKyu SoulKyu commented May 4, 2026

No description provided.

SoulKyu added 30 commits May 2, 2026 09:57
Spec covers data flow, components, error handling, threat model,
kernel requirements, multi-arch BPF build, and CI strategy.
Plan decomposes the work into 16 TDD tasks landing on a single PR.
Centralizes the placeholder format used by both the webhook and the
BPF program. Length is fixed (77 bytes) so BPF substitution can write
in place without reallocating the user-space stack.
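A minimal sketch of what such a fixed-length generator could look like. The 77-byte total and the `__VDBI_PH_<hex>___` shape follow the format described in this PR; the exact prefix/suffix widths chosen here are assumptions for illustration.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// PlaceholderLen is the fixed on-the-wire length described above, so
// the BPF side can overwrite a placeholder in place.
const PlaceholderLen = 77

// Generate returns a placeholder of exactly PlaceholderLen bytes:
// a fixed prefix, a random hex payload, and a fixed suffix.
func Generate() (string, error) {
	const prefix = "__VDBI_PH_"
	const suffix = "___"
	payload := PlaceholderLen - len(prefix) - len(suffix) // 64 hex chars
	raw := make([]byte, payload/2)                        // 32 random bytes
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	return prefix + hex.EncodeToString(raw) + suffix, nil
}

func main() {
	p, _ := Generate()
	fmt.Println(len(p)) // 77
}
```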
Thin wrappers around sys/wrapping/{wrap,unwrap} used by the BPF
credential injection mode. The webhook calls WrapValues to produce a
single-use 5-min token that travels through the PodSpec; the node
DaemonSet calls UnwrapValues to retrieve the real payload.
- Add TestUnwrapValues_VaultError covering the error path
- Document that an unwrap error still consumes the token
- Note that defer handles panic reset of SetWrappingLookupFunc
Adds the configuration knobs for the BPF credential protection layer.
ModeBPF is validated as a legal runtime mode; BPFConfig holds the
single Enabled switch plus operational tunables (TTL, tmpfs path,
max mappings). All defaults applied at load time. No behavior change
yet — Run* methods still need wiring.

Also adds ANNOTATION_BPF_MAPPING constant used in subsequent tasks
to attach (wrap_token, placeholders) data to pod metadata.
Reviewer noted that a separate applyBPFDefaults method diverged from
the existing pattern (LogLevel, SyncTTLSecond, etc. live in the
NewConfig struct literal). Moves the BPF defaults into the literal
and deletes the helper. Adds a YAML round-trip test verifying that
explicit values from the config file are preserved through NewConfig.
Refactors applyEnvToContainers into shape-specific helpers and adds a
single boolean gate. When the gate is on, the function wraps the
credentials via Vault response-wrapping, generates fixed-length
placeholders, and attaches the (wrap_token, placeholders) map as the
bpf-mapping annotation. When the gate is off, behavior is
byte-identical to today.

Multi-DbConfiguration support under BPF is intentionally rejected
with an explicit error for v1 — adding it would require combining
mappings into one annotation, tracked separately.
…g type

Reviewer flagged duplication between applyClassic and applyURI (the
wrap+annotate block was copy-pasted) and noted that the local
bpfMapping struct in k8smutator would force the BPF runner DaemonSet
to redefine the same JSON shape independently — diverging sources of
truth.

- Extract wrapAndAnnotate(ctx, pod, creds, vw, errContext) helper used
  by both shapes; carries the TODO(Task 8) for plumbing the wrap TTL.
- Move BPFMapping struct to pkg/k8s/bpf_mapping.go so both the
  webhook (writer) and the runner (reader, Task 10) reference one
  exported type.
Recognizes the new runtime mode and routes it through Controller.RunBPF.
The kernel-coupled body lives in pkg/bpf and is wired in a subsequent
task; for now runBPFAgent is a build-tagged stub that returns an
explicit 'not yet linked' error on Linux and 'not supported' on other
platforms. The binary still compiles on macOS / Windows for local
development.

main.go gains a 'case config.ModeBPF' branch and includes RunBPF in
the ModeAll errgroup when cfg.BPF.Enabled, so cluster operators can
either deploy a dedicated DaemonSet (recommended) or run all modes in
one process.
Reviewer noted that returning naked nil obscured the cancellation
cause when RunBPF runs inside the ModeAll errgroup. Now returns
ctx.Err() so context.Canceled bubbles up through Wait, while main.go
continues to swallow it as a clean shutdown.

Also tightens the test assertion (was 'nil OR context.Canceled', now
strictly context.Canceled) and adds an Errorf log line in main.go
so a non-canceled runErr is structured-logged before the 'stopped'
line — without this, errors from the upcoming runBPFAgent stub
would only surface via stderr.
Cgroup v2 inode lookup matching what bpf_get_current_cgroup_id()
returns inside the LSM hook. Searches the three standard QoS slices
(guaranteed, burstable, besteffort) and the three common container
runtime prefixes (containerd, cri-o, docker).

Linux-only: the BPF LSM hook this coordinates with is a Linux kernel
feature.
Reviewer questioned the Guaranteed QoS path. After verifying against
kubelet upstream source (pkg/kubelet/cm/qos_container_manager_linux.go
lines 97-138, pod_container_manager_linux.go), the implementation is
correct: kubelet's systemd cgroup driver only creates
kubepods-burstable.slice and kubepods-besteffort.slice intermediate
slices; Guaranteed pods land directly under kubepods.slice. Add a
comment to prevent the same mistake from being made again, and a
second comment explaining the hyphen-to-underscore systemd escaping.

Also improves the not-found error message to include the list of
tried paths and adds a test for the bare-container-ID input form.
One JSON file per pod under /run/vault-db-injector/bpf, atomic via
write-then-rename. Memory-backed by Helm emptyDir so contents die at
node reboot — consistent with credential lifecycle, since rebooted
nodes lose their pods and re-admit through the webhook.

Save/Load/LoadAll/Delete operations cover the runner's needs:
LoadAll on startup to repopulate the in-memory cache, Save on each
new pod, Delete on pod-deleted events (idempotent: missing file is
not an error).
Hooks lsm/bprm_check_security to scan envp at exec time. For each
mapping registered against the current cgroup, replaces matching
placeholder runs with the real value (NUL-padded to placeholder
length).

The envp scan is a bounded stride loop (256 × 64 bytes = 16 KB
coverage) that the BPF verifier accepts. Mappings cap at 8 per cgroup
and 4096 cgroups per node — sized for typical workload density.

BPF C sources and vmlinux.h live in pkg/bpf/c/ (isolated from the Go
toolchain to avoid "C source files not allowed" errors). Compiled
artifacts (substitute.amd64.bpf.o, substitute.arm64.bpf.o) sit in
pkg/bpf/ where the Task 9 loader will embed them via go:embed.

arm64 .bpf.o is currently a copy of amd64 and will be replaced by a
real cross-compile when the Dockerfile clang stage lands (Task 12).

vmlinux.h is bpftool-generated and committed for offline builds;
.gitattributes marks it as linguist-generated to suppress GitHub
diffing.
Loads the embedded substitute.{amd64,arm64}.bpf.o (selected by runtime
GOARCH), verifies BPF LSM is enabled in the kernel cmdline, attaches
the LSM hook, and exposes PutMapping / DeleteMapping / Close for the
runner to populate the cgroup→mappings hash map.

PlaceholderLen, ValueMax, and MaxMappingsPerCgroup constants mirror
the BPF C source. PutMapping enforces them at the Go boundary so a
malformed call to the runner never reaches the kernel.
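The Go-side boundary enforcement could look like the sketch below. PlaceholderLen (77), ValueMax (73), and the 8-mappings-per-cgroup cap are the limits named in this PR; the types and function name are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	PlaceholderLen       = 77
	ValueMax             = 73
	MaxMappingsPerCgroup = 8
)

type Mapping struct {
	Placeholder string
	Value       string
}

// validateMappings rejects anything that would violate the C-side
// fixed buffers before it can reach the kernel map.
func validateMappings(ms []Mapping) error {
	if len(ms) > MaxMappingsPerCgroup {
		return errors.New("too many mappings for one cgroup")
	}
	for _, m := range ms {
		if len(m.Placeholder) != PlaceholderLen {
			return fmt.Errorf("placeholder must be exactly %d bytes, got %d", PlaceholderLen, len(m.Placeholder))
		}
		if len(m.Value) > ValueMax {
			return fmt.Errorf("value exceeds %d bytes", ValueMax)
		}
	}
	return nil
}

func main() {
	bad := []Mapping{{Placeholder: "short", Value: "x"}}
	fmt.Println(validateMappings(bad) != nil) // true: rejected at the Go boundary
}
```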

Test coverage is structural (constants + kernel-support shape) since
real attach requires CAP_BPF; the integration test in a privileged
environment lives behind //go:build integration_bpf and is wired up
in the CI task.
cilium/ebpf v0.21.0 (added in commit d4dca7f for the BPF loader)
requires go 1.24+. Aligns the CI matrix accordingly.
Reviewer noted that PutMapping and DeleteMapping looked up the map by
name on every call, and that the update/delete errors were returned
unwrapped (inconsistent with the rest of the file using
cockroachdb/errors.Wrap). Caches the *ebpf.Map handle on the Loader
struct at Load() time and wraps both update and delete errors with
identifying context.
The runner attaches the BPF program, restores the in-memory cache
from tmpfs, builds a Vault Connector authenticated with the DS's SA
token, then runs an informer filtered by spec.nodeName == NODE_NAME.

On pod added/updated:
  read bpf-mapping annotation → unwrap via Vault →
  resolve cgroup_id from podUID+containerID →
  program the BPF map → persist tmpfs.
On pod deleted: delete BPF entry + tmpfs file.

Idempotency via an in-memory processed-set; tmpfs restore preloads it
so a DS restart does not re-consume already-spent wrap tokens.
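The processed-set can be sketched as below: the tmpfs restore preloads UIDs so a restarted DaemonSet never re-unwraps an already-spent single-use token. Names are illustrative.

```go
package main

import "fmt"

type processedSet map[string]struct{}

// restore preloads the set from UIDs recovered off tmpfs at startup.
func restore(persisted []string) processedSet {
	s := make(processedSet, len(persisted))
	for _, uid := range persisted {
		s[uid] = struct{}{}
	}
	return s
}

// shouldProcess reports whether this pod event still needs an unwrap,
// marking the UID so duplicate informer events are idempotent.
func (s processedSet) shouldProcess(uid string) bool {
	if _, ok := s[uid]; ok {
		return false
	}
	s[uid] = struct{}{}
	return true
}

func main() {
	s := restore([]string{"uid-a"})
	fmt.Println(s.shouldProcess("uid-a"), s.shouldProcess("uid-b"), s.shouldProcess("uid-b"))
	// false true false: restored pods and repeats are skipped
}
```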

Prometheus metrics expose mappings_loaded gauge and unwrap_errors_total
counter.

Also extends k8s.KubernetesClient with RawClientset() for the informer
factory; existing test fakes return nil since they do not exercise
informers.
…ield

Two reviewer-flagged issues:
- DeleteFunc was silently dropping cache.DeletedFinalStateUnknown
  events that arrive after an informer re-sync, leaking BPF map
  entries for pods deleted while the watch was disconnected.
- 'unwrapped data missing field' error path forgot to clear the
  processed flag, so subsequent informer events would silently
  skip the pod even though the wrap token had already been burned.
  Now also increments the unwrap_errors_total metric with reason
  'missing_field' for observability.
A single Helm value (bpf.enabled) renders the BPF DaemonSet and tells
the injector deployment to set BPF_ENABLED=true. When false, behavior
is byte-identical to today: no DS rendered, no flag passed.

The DaemonSet uses hostPID for cgroup resolution, mounts /sys/fs/cgroup
read-only and /sys/fs/bpf, runs with CAP_BPF + CAP_PERFMON +
CAP_SYS_RESOURCE on a read-only root, and uses an in-memory emptyDir
at /run/vault-db-injector/bpf for tmpfs persistence (consistent with
the credential lifecycle: tmpfs dies at node reboot, when the pods
are also gone).
Reviewer noted that the ConfigMap was mounted at /bpf but the binary
was never told to read it, leaving Vault address/role unconfigured
at startup. Adds --config=/bpf/config.yaml to the DS args, matching
the pattern used by the renewer/revoker deployments.
Dockerfile gains a clang+libbpf builder stage that compiles
substitute.bpf.c for both linux/amd64 and linux/arm64; the resulting
.bpf.o files are copied into the Go build stage and embedded via
go:embed, replacing the placeholder stubs committed for offline
builds. Multi-arch image build runs through docker buildx in CI.
Also fixes the Go image version (1.23.6 → 1.24) to match go.mod.

Adds .github/workflows/bpf-integration.yml that probes for BPF LSM
in the runner kernel; when present, installs clang and runs the
//go:build integration_bpf tests with sudo (CAP_BPF needed for
actual program load + attach). When absent (most hosted GitHub
runners), emits a warning and skips — the gate is still useful as
documentation of how to validate locally on a kernel with BPF LSM.

The new pkg/bpf/integration_test.go exercises the full Load → Close
cycle against the real kernel.
build-bpf fails without -I $(BPF_LIBBPF_INCLUDE) on hosts where libbpf-dev
is not installed (headers only live under linux-headers-$(uname -r)/…).
Variable was previously removed speculatively — restore it (Option B).
Also rename the CI install step to reflect bpftool is not needed at test
time and drop linux-tools-generic from the apt list.
how-it-works/bpf-mode.md walks through architecture, activation,
threat model, and failure modes — operator-facing prose with cross
links to the design spec for deep details.

getting-started/bpf-requirements.md lists kernel configs, capabilities,
and tested distros so cluster operators can validate readiness before
flipping bpf.enabled.

comparison.md gains a row for credential invisibility at the K8s API
layer — a unique differentiator vs vault-secrets-operator,
external-secrets, and secrets-store-csi-driver.

README mentions the feature; CONTRIBUTING explains how to validate
BPF mode locally (kernel cmdline, build-bpf, integration test
target).
…tructure, and hostPID

Issues fixed:
- Remove --mode=bpf arg from DaemonSet (main.go has no such flag, exits 2)
- Remove hostPID: true (unused; cgroup resolution uses /sys/fs/cgroup only)
- Drop redundant BPF env vars from DaemonSet (ConfigMap already carries them)
- Rename BPF_* env vars to INJECTOR_BPF_* in deployment-injector.yaml to match
  envconfig.Process("INJECTOR", cfg) prefix; without this BPFConfig.Enabled stays false
- Rewrite BPF ConfigMap entry to nested bpf: block matching BPFConfig yaml tags
  (bpfWrapTokenTTL at root was never unmarshalled; nested structure is required)
- Make VaultSecretName/VaultSecretPrefix validation conditional on mode != bpf:
  BPF runner only unwraps tokens, never reads KV — requiring these fields hurts UX
…th at admission

- Add placeholder.MaxValue = 73 (authoritative Go constant, matches BPF C-side buffer)
- wrapAndAnnotate: accept explicit wrapTTL time.Duration, pass to WrapValues instead of 0;
  removes TODO(Task 8) and ensures cfg.BPF.WrapTokenTTL is honoured
- wrapAndAnnotate: validate len(username) and len(password) <= placeholder.MaxValue before
  calling Vault; returns admission error if exceeded, matching the doc promise
- Thread wrapTTL through applyEnvToContainersWithBPF -> applyClassic/applyURI signatures
- applyEnvToContainers passes cfg.BPF.WrapTokenTTL to the inner testable variant
- Update all tests to pass the new wrapTTL argument (0 for BPF-disabled paths)
- Add TestApplyEnvToContainers_BPFEnabled_RejectLongCredentials covering both
  username-too-long and password-too-long cases
… metrics

- EAGAIN (Issue 5): BPF program actually returns 0 (allow) when cgroup not found;
  document real behavior: placeholders reach the application → connection failure
  → CrashLoopBackoff → self-resolves when DS catches up
- Placeholder length (Issue 8): fix "74 bytes" → "77 bytes" in two places
  (architecture overview and limitations section); add clarification that 73-byte
  max value accounts for NUL terminator and alignment padding
- Phantom metrics (Issue 9): drop vault_injector_bpf_substitutions_total and
  vault_injector_bpf_map_size from the metrics table — only
  vault_injector_bpf_mappings_loaded (Gauge) and
  vault_injector_bpf_unwrap_errors_total (CounterVec) are implemented;
  also remove the bpf_map_size reference from the failure modes table
Replace manual NewConnector+Login with vault.ConnectAndRenew so the BPF
runner benefits from background token renewal, matching the pattern used
by renewer/revoker entry points.
Register vault_injector_bpf_map_size to track cgroup entries in the BPF
map (distinct from mappings_loaded which counts pods). Increment on
PutMapping success, decrement on DeleteMapping success, enabling
"BPF map full" alerting against bpf.maxMappingsPerNode.
Refactor processPodAdded to iterate all regular, init, and ephemeral
container statuses and call PutMapping for each resolved cgroup_id.
Update PersistedMapping to store a list of cgroup_ids (not just one)
so that processPodDeleted can clean up ALL BPF entries for the pod.
processPodDeleted now reads cgroup_ids from tmpfs instead of re-resolving
from a potentially stale pod object.

Add multi-container tests and update persist_test.go for new struct shape.
SoulKyu added 27 commits May 2, 2026 14:13
…reflight, and CI invariants

Issue #1: Load(maxMappings int) sets cgroup_mappings.MaxEntries from
  cfg.BPF.MaxMappingsPerNode; helm values.yml and configmap expose it
  (default 4096, matching MAX_CGROUPS in C source).

Issue #2: processPodAdded rolls back PutMapping calls and removes UID
  from processed when persister.Save fails; mappingsLoaded is only
  incremented after both Put+Save succeed.  New test
  TestProcessPodAdded_SaveFailureRollsBackBPFMap covers the path.

Issue #3: Makefile verify-bpf-object target uses readelf section-diff
  instead of byte-exact cmp; added to test.yml CI step.

Issue #4: bpf-integration workflow split into always-running compile-bpf
  job (enforced) and optional integration job (LSM-gated).

Issue #5: checkCgroupSetup() / checkCgroupSetupAt(root) added to cgroup.go;
  Run() calls it before Load(); three unit tests cover OK/no-v2/no-systemd.

Issue #7: scan_callback adds break after bpf_probe_write_user — two
  distinct placeholders cannot match at the same byte offset; BPF
  object recompiled.

Issue #9 (partial): persistErrors CounterVec
  vault_injector_bpf_persist_errors_total{op} added and registered;
  Save and Delete failures increment it.
…efore wrapping

Issue #6: check pod.Annotations[ANNOTATION_BPF_MAPPING] before calling
  vw.WrapValues so a second DbConfiguration in the same admission does
  not burn a single-use wrap token before returning the multi-config error.
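The guard amounts to checking the annotation before any Vault call. In this sketch the annotation key is an assumption modeled on the project's other annotation names; the point is the ordering: look before you wrap.

```go
package main

import "fmt"

// Assumed annotation key; the real constant lives in the webhook package.
const annotationBPFMapping = "vault-db-injector.numberly.io/bpf-mapping"

// alreadyMapped is checked before vw.WrapValues so an error path for a
// second DbConfiguration never consumes a single-use wrap token.
func alreadyMapped(annotations map[string]string) bool {
	_, ok := annotations[annotationBPFMapping]
	return ok
}

func main() {
	pod := map[string]string{annotationBPFMapping: "{...}"}
	fmt.Println(alreadyMapped(pod), alreadyMapped(map[string]string{})) // true false
}
```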
…ndations

Issue #11: add 'Recommended pod hardening' section to bpf-requirements.md
  covering RBAC restriction and Pod Security Admission restricted profile.
  Reference it from the bpf-mode.md Limitations section under the
  kubectl exec entry.
If PutMapping succeeds for container A and fails for container B, the
previously written BPF entry for A is now deleted, processed[uid] is
cleared, and mapSize is decremented — preventing a silent leak until
DaemonSet restart.

Added TestProcessPodAdded_PutMappingFailureRollsBack to assert the
rollback: first cgroup deleted, processed map cleared, mappingsLoaded
and mapSize both at net-zero after the failure.
Add .text, .rodata, and scan_callback to the readelf grep filter so
structural presence of the scan_callback code section is verified.
Also extend the target to cross-compile and verify the arm64 object
against a fresh arm64 build (clang -D__TARGET_ARCH_arm64), not against
the amd64 fresh binary which has differing relocation offsets.

Update build-bpf to cross-compile arm64 locally via
  clang -target bpf -D__TARGET_ARCH_arm64
removing the "skipping locally" stub since cross-compile works on
amd64 hosts without extra vmlinux.h setup.
Recompile substitute.arm64.bpf.o locally via:
  clang -O2 -g -target bpf -D__TARGET_ARCH_arm64 ...

Cross-compile succeeded on amd64 host (no separate vmlinux.h needed;
-target bpf bytecode is arch-agnostic, __TARGET_ARCH_arm64 only
affects CO-RE relocation context). The binary was last committed in
c72d83d before the bpf_loop break-fix in pkg/bpf/c/substitute.bpf.c.
…ompat

Tested end-to-end in a Linux 6.8 KVM with lsm=...,bpf cmdline:

1. Kernel verifier rejected our program with:
     "LSM programs must have a GPL compatible license"
   The license string was 'Apache-2.0'. Changed to 'Dual BSD/GPL' so
   the BPF source is dual-licensed (GPL-2.0 OR Apache-2.0) — required
   by the kernel for LSM programs and bpf_probe_write_user. The
   project's overall Apache-2.0 license is unchanged; only this
   single C source file is dual-licensed (SPDX header updated).

2. integration_test.go was using the old Load() signature; updated
   to Load(0) for compile-time defaults since the M-1 fix changed
   Load to accept a max-mappings argument.

3. TestProcessPodAdded_SaveFailureRollsBackBPFMap relied on
   chmod 0o555 to force a Save error, which root bypasses. The CI
   integration runner runs as root (CAP_BPF), so the test now
   forces failure via a parent-is-a-file path (MkdirAll always
   fails), which root cannot bypass.

Validation: TestIntegration_LoadAttachClose now PASSES on real kernel.
Full suite green with -tags=integration_bpf -race under sudo: 189 tests.
K3D/K3s and many lightweight K8s distributions use the cgroupfs cgroup
driver instead of systemd. Our resolver previously only searched
systemd paths (kubepods.slice/...). Extends resolveCgroupIDAt to also
try cgroupfs paths (kubepods/pod<UID>/<containerID> et al), preferring
systemd (more common in production) and falling back to cgroupfs.

Also: K3D node containers don't expose /sys/kernel/security by
default, so the loader's BPF LSM precondition check fails. Mount
/sys/kernel/security read-only in the DaemonSet.
K3D runtime validation revealed that bpf_probe_write_user is not
permitted in LSM programs by the verifier — only KPROBE/TRACEPOINT
types may call it. The original lsm/bprm_check_security design
was therefore unloadable on any kernel.

Switches to tracepoint/syscalls/sys_enter_execve which fires before
the kernel copies envp from user memory to the new process's stack.
We walk the user-space envp[] pointer array, read each env string
window directly from user memory via bpf_probe_read_user (user
pointer + variable offset is accepted; map_value + variable offset
is not on kernels ≤ 6.8), compare against registered placeholders
using __builtin_memcmp, and write the substitution back to user
memory. The kernel's subsequent copy_strings (do_execveat_common)
propagates our writes into the new process's environment.

Additional runtime findings on kernel 6.8:
  - bpf_probe_write_user requires CAP_SYS_ADMIN at load time (not
    just CAP_BPF). Adds SYS_ADMIN to the DaemonSet capabilities.
  - tracefs must be bind-mounted into the pod for the pre-flight
    kernel-support check. Adds /sys/kernel/tracing hostPath volume.
  - Stack budget: #pragma unroll for PLACEHOLDER_LEN=77 byte loops
    combined with per-entry spills blows the 512-byte BPF stack. Fixed
    by using __builtin_memcmp (no per-byte spill) and bpf_probe_read_kernel
    for value copy, keeping the frame at 88 bytes.

The Go loader is updated to attach as a tracepoint (Tracepoint()
instead of AttachLSM()) and look up program "sys_enter_execve".
The kernel-support check is updated to probe tracefs instead of
/sys/kernel/security/lsm. Substitution semantics are unchanged
from the operator's view; only the kernel hook type changes.

Validated on kernel 6.8.0-111-generic + k3d (BPF LSM still enabled
in kernel cmdline but not required for tracepoint mode).
Two real bugs discovered during full end-to-end validation in a k3d
cluster on kernel 6.8.

1. Runner: container-IDs check now BEFORE Unwrap. The wrap token is
   single-use server-side; unwrapping before kubelet has populated
   ContainerStatuses caused retry events to burn the token. Now we
   silently wait for the next informer event when statuses aren't
   ready, then unwrap once.

2. Loader: Go struct 'entry' was 159 bytes (binary.Write doesn't pad)
   while C struct 'mapping' is 160 bytes (compiler aligns ValueLen to
   4 bytes after the 74-byte Value array). Map.Update rejected with
   'doesn't marshal to 1288 bytes'. Added explicit Pad1[1]byte to
   match C layout exactly.

End-to-end validation now passes:
- kubectl get pod -o yaml shows placeholders (__VDBI_PH_<hex>___)
- kubectl exec -- env shows REAL substituted values
Init containers and fast-restarting (CrashLoopBackoff) containers
exec'd before the DaemonSet had time to discover their per-container
cgroup_id and program the BPF map, so their first execve saw raw
placeholders.

Root cause: kubelet creates the pod-level cgroup at admission time
(before any container starts), then a per-container cgroup at runtime
just before the container's execve. The runner only programmed the
per-container ids, learned via container statuses, which arrive
asynchronously after the kubelet starts the container.

Fix:
- ResolvePodCgroupID(podUID): inode of kubepods{,.slice}/.../pod<UID>
  (no container suffix). Exists right after pod admission.
- processPodAdded: program the pod-cgroup in the BPF map FIRST, then
  best-effort the per-container ids. Failure to resolve a per-container
  id is no longer fatal; the ancestor walk covers it.
- substitute.bpf.c: on a leaf-cgroup miss, walk
  bpf_get_current_ancestor_cgroup_id(level) for level 0..7 and look up
  each in the map. Single hit suffices to drive substitution.

This eliminates the init-container race AND the CrashLoopBackoff
restart race (same root cause: per-container cgroup_id changes per
restart on some runtimes; the pod cgroup_id is stable for the pod
lifetime).
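A Go analogue of the fallback lookup: if the leaf cgroup id misses, walk ancestor ids (the BPF side uses bpf_get_current_ancestor_cgroup_id(level) for level 0..7) and take the first id registered in the map, which includes the stable pod-level cgroup programmed at admission. Names here are illustrative.

```go
package main

import "fmt"

// lookupWithAncestors checks each ancestor cgroup id, root downward,
// against the registered map; a single hit drives substitution.
func lookupWithAncestors(registered map[uint64][]string, ancestors []uint64) ([]string, bool) {
	for level := 0; level < len(ancestors) && level < 8; level++ {
		if m, ok := registered[ancestors[level]]; ok {
			return m, true
		}
	}
	return nil, false
}

func main() {
	// 42 stands in for the pod-level cgroup id; 9001 for the
	// per-container leaf that was not yet programmed.
	registered := map[uint64][]string{42: {"__VDBI_PH_example___"}}
	m, ok := lookupWithAncestors(registered, []uint64{1, 7, 42, 9001})
	fmt.Println(ok, len(m)) // true 1
}
```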

Validated in k3d (Linux 6.8): init container reads real $DB_USER from
the very first execve; crash-loop pod logs show real values across
3+ restart iterations.
emptyDir{medium: Memory} is per-POD: every DaemonSet pod restart
(rolling update, OOM, eviction) wipes it, so restoreBPFMaps finds
no entries and every running pod on the node loses substitution
until the credentials are re-issued (which can't happen — wrap
tokens are single-use).

Replace with hostPath at /run/vault-db-injector/bpf,
type DirectoryOrCreate. /run is tmpfs on every modern distribution
(systemd-tmpfiles), so:
- Files survive a DS pod restart (restoreBPFMaps re-programs the BPF
  map for already-running pods).
- Files DO NOT survive a node reboot, which is the desired lifecycle:
  a rebooted node loses all its pods anyway, so the next pod admission
  triggers a fresh wrap-token unwrap.

Validated in k3d: rolling-restart of the BPF DS leaves running pods
with their substitution intact, and the new DS logs
"restoreBPFMaps: restored N cgroup(s) for pod <uid>" for each entry.
Pivot rationale: BPF probe_write_user can't do variable-length writes,
breaking URI/DSN-mode envs (postgres://user:__PH__@host/db). NRI
plugin substitutes env in container spec before runc, fixing the
structural bug and removing all kernel/capability/lockdown concerns.
Pivot from eBPF probe_write_user (which couldn't do variable-length writes,
breaking URI/DSN-mode envs) to a containerd/CRI-O NRI plugin that mutates
container env in CreateContainer before runc.

- pkg/bpf/ deleted (runner, loader, cgroup, BPF C, embedded objects)
- pkg/nri/ added with substitute, plugin, vault unwrap, runner
- Annotation renamed bpf-mapping → nri-mapping
- Mode renamed bpf → nri across config/controller/main
- Webhook drops length cap (URI mode now works for arbitrary lengths)
- placeholder.MaxValue removed; only structural shape matters now
- Metrics: vdbi_nri_substitutions_total, vdbi_nri_unwrap_failures_total
- Makefile BPF targets removed

166 tests pass.
NRI DS runs non-root, drops all capabilities, mounts only /var/run/nri
hostPath socket. ConfigMap renamed; values schema renamed bpf.* → nri.*.
Webhook deployment env vars renamed accordingly.
- README, CONTRIBUTING, comparison docs use NRI naming
- mkdocs.yml drops bpf-mode and bpf-requirements pages
- test.yml CI drops BPF clang/object verification steps
- Test names and comments use NRI consistently
- 171 tests pass
Without persistence, a wrap token consumed at first CreateContainer was
lost from plugin memory if the DS pod restarted, leaving subsequent
container retries (CrashLoop, kubelet restart) unable to substitute —
the placeholder string leaked into the running app.

Reproduction (now fixed):
1. Apply pod with nri-mapping annotation, exit 1 to force CrashLoop
2. Kill plugin DS pods (cache wiped)
3. Wait for kubelet container retry
4. Before fix: container env shows literal placeholder; after: real value

Implementation:
- pkg/nri/persist.go: atomic write (tmp+rename), 0600 file perms,
  0700 dir perms
- pkg/nri/plugin.go: write-through on resolveMapping and RemovePodSandbox
- pkg/nri/runner.go: loadCache before stub.Run with boot logging
- Synchronize callback evicts entries for pods no longer on the node
- pkg/config/config.go: NRIConfig.CachePath (default /run/vault-db-injector/nri/cache.json)
- helm: hostPath /run/vault-db-injector/nri DirectoryOrCreate

Trade-off: unwrapped credentials sit in tmpfs cleartext on the node.
Same posture as the prior BPF design and as kubelet's projected SA
tokens. A root-on-node attacker can already read /proc/<pid>/environ;
the cache adds zero new attack surface vs that baseline.

177 unit tests pass; reproduction scenario verified on k3d.
An empty placeholder key in the mapping caused strings.ReplaceAll(env, "",
val) to insert val between every character, corrupting PATH/HOSTNAME and
every other env var. Same issue would arise from any non-conforming key
(short, wrong shape, etc.).

Validation: Substitute now skips entries where the key fails
placeholder.IsPlaceholder. Webhook never emits malformed keys
(placeholder.Generate is safe by construction), but a manually-crafted
or future-buggy annotation no longer corrupts container env.

Reproduction (now safe):
  apply pod with annotation '{"placeholders":{"":"x"}}'
  before: every env var doubled with junk inserted between every char
  after:  pod runs with original env intact

Tests in pkg/nri/substitute_test.go updated to use placeholder.Generate
output (existing tests used short tokens that fail validation).

178 unit tests pass; runtime reproduction confirmed on k3d.
Closes the placeholder-leak window when the NRI plugin is absent on a
node. Without this gate, a pod scheduled to a node where the plugin DS
is missing (image pull, broken DS, nodeSelector mismatch, post-install
delay) starts with the literal placeholder string in env, leaking into
the running app.

Plugin (pkg/nri/readiness.go):
- Labels its node with vault-db-injector.numberly.io/nri-ready=true once
  registered with containerd and cache loaded
- 30s reconcile loop in case label is removed externally
- Removes label on SIGTERM (best-effort, 5s timeout)

Webhook (pkg/k8smutator):
- Merges a nodeAffinity matchExpression requiring the readiness label
  into every existing NodeSelectorTerm (preserves OR semantic with
  user expressions). Tolerations, nodeSelector, and preferred (soft)
  affinities are untouched.

Helm:
- ClusterRole grants get/patch on nodes (cluster-wide; plugin restricts
  writes to os.Getenv("NODE_NAME") in code)

Behavior matrix:
- Plugin not yet ready          → pods Pending (visible failure mode)
- Plugin running normally       → pods schedule + substitute
- Plugin crashes after schedule → cache persistence covers (prior fix)
- Label removed manually        → re-applied within 30s

185 unit tests pass; runtime reproduction on k3d shows pod Pending when
plugin disabled, schedules and substitutes when restored.
After reverting the readiness-label/affinity approach (introduced more
complexity than it bought, plus a SIGKILL-stale-label bug), we rely on
the simpler 'visible failure' model: pods with credentials that miss
substitution start unsubstituted, fail to connect, and CrashLoop. This
is detectable via Prometheus.

Adds docs/how-it-works/nri-mode.md covering:
- Architecture summary
- Failure modes and existing metrics
- Sample Prometheus alert for 'plugin missing on node where pods have
  the annotation'
- Hardening checklist (priorityClass, resource requests)
- Trust posture vs kubelet's projected SA tokens
Force-deleted pods (kubectl delete --grace-period=0 --force, kubelet eviction
under pressure, or any path that bypasses graceful shutdown) do not trigger
NRI's RemovePodSandbox event. Without the sweep, unwrapped credentials of
those pods would sit on tmpfs until the next plugin restart.

The sweeper lists pods on the local node via the k8s API every 5 minutes
and evicts cache entries whose UIDs are no longer alive. Existing
ClusterRole grants the pod get/list/watch verbs already; no new RBAC
required.

Tested: pod with annotation force-deleted → cache initially still has the
entry → after 30s the sweep evicts it (interval lowered for the test).
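The eviction step reduces to set difference between cached UIDs and the pods currently alive on the node. A sketch, with the k8s list abstracted into a plain set:

```go
package main

import "fmt"

// sweep evicts cache entries whose pod UID is no longer alive,
// covering force-deletes that never produce a RemovePodSandbox event.
func sweep(cache map[string]struct{}, alive map[string]bool) []string {
	var evicted []string
	for uid := range cache {
		if !alive[uid] {
			delete(cache, uid)
			evicted = append(evicted, uid)
		}
	}
	return evicted
}

func main() {
	cache := map[string]struct{}{"live-uid": {}, "gone-uid": {}}
	evicted := sweep(cache, map[string]bool{"live-uid": true})
	fmt.Println(evicted, len(cache)) // [gone-uid] 1
}
```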
…ction

Reference Kyverno ClusterPolicy at helm/policies/kyverno-restrict-nri-socket.yaml
blocks hostPath mounts of /var/run/nri and /opt/nri outside trusted
namespaces (default: vault-db-injector, kube-system).

README security section explains the inherent NRI threat (any pod that
mounts the socket becomes a plugin and can mutate every container) and
ranks mitigations: PSA restricted/baseline first, Kyverno policy second,
SELinux third.
Closes Hunter finding #H5 (wrap_token-as-bearer-credential leak):
anyone with pods.get + Vault network access could read the wrap_token
from a pod annotation and unwrap before the legitimate plugin did.

Design (docs/superpowers/specs/2026-05-03-pull-not-push-design.md):
- Webhook stops calling Vault sys/wrapping/wrap. Annotation schema v2
  contains only {db_path, db_role, placeholders, request_id,
  pod_namespace, pod_service_account} — no Vault credential.
- Plugin authenticates to Vault using its own SA (k8s auth method)
  and creates dynamic database credentials at CreateContainer time.
- Plugin re-runs CanIGetRoles for the target pod identity claimed in
  the annotation, defending against annotation forgery via pods.update
  RBAC.
- Lease tagged with pod UID — existing renewer/revoker pipeline
  unchanged.
- Cache flow unchanged: per-pod placeholder→value mapping, write-through
  to tmpfs, sweep on Synchronize and via periodic k8s pod-list query.

Webhook split into authorizeDbAccess + fetchDbCredentials. Legacy mode
(NRI disabled) still pre-fetches creds at admission. NRI mode just
authorizes + annotates.

Config validation: VaultSecretName / VaultSecretPrefix now required in
all modes (plugin needs them too for lease metadata KV write).

Schema versioning: NRIMapping.SchemaVersion=2 hardcoded; plugin rejects
schema 1 with a clear error so old wrap_token annotations fail visibly
during upgrade rather than silently leak.

178 unit tests pass. K3D end-to-end validation in follow-up.
… (#H6)

Hunter mode found a critical regression introduced by the pull-not-push
refactor: the plugin trusted pod_namespace and pod_service_account
straight out of the annotation. An attacker with pods.create could:

  1. Run a pod with their own (unprivileged) SA (e.g. "evil")
  2. Annotate it with pod_service_account="trusted-app"
  3. Plugin called CanIGetRoles with the CLAIMED identity → passed
  4. Plugin fetched the privileged role's credentials and put them in
     the attacker's pod env → privilege escalation

Reproduction (now blocked):
  kubectl apply <pod with sa=evil, annotation claiming sa=default>
  before: container env contains real privileged-app DB credentials
  after:  plugin logs 'pod identity mismatch: actual default/evil !=
          annotated default/default — refusing to fetch credentials'
          container starts with placeholder, app fails visibly

Implementation:
- fetchAndBuildMapping now takes the (namespace, name) reported by NRI
  for the actual pod, queries kube-apiserver for the pod, and reads
  spec.serviceAccountName directly. The annotation's claimed identity
  is only used to verify it MATCHES the actual; if disagreement, refuse.
- CanIGetRoles is called with the K8s-attested identity, never the
  annotation's. Same for the credential fetch (Namespace +
  ServiceAccount fields in DbCredentialsRequest).

178 unit tests pass. Runtime suite on k3d: 9/9 (A-I) including the new
I_identity_forge_blocked case.
@SoulKyu SoulKyu changed the title Feat/ebpf injection mode Feat/nri injection mode May 4, 2026
…P-2, IMP-5)

CRIT-1 — Pod identity is now verified by UID, not just (namespace, name).
Closes a name-reuse race where an attacker could force-delete a pod
between admission and CreateContainer, recreate one with the same name+SA,
and have the plugin fetch credentials for the recreated pod whose UID
doesn't match the NRI sandbox UID. fetchAndBuildMapping now refuses if
NRI sandbox UID != kube-apiserver pod.UID.

CRIT-2 — Removed dead wrap/unwrap path. The pull-not-push refactor
(76c2074) replaced the wrap_token-as-bearer-credential pattern, but left
behind:
  - pkg/vault/vault.go: WrapValues / UnwrapValues (deleted)
  - pkg/vault/wrap_test.go (deleted)
  - pkg/k8smutator/k8smutator_test.go: stubWrapper helper (deleted)
  - pkg/config/config.go: NRIConfig.WrapTokenTTL field (deleted)
  - helm/values.yml, configmaps.yaml, deployment-injector.yaml: nri.wrapTokenTTL
    knob and INJECTOR_NRI_WRAP_TOKEN_TTL env var (deleted)
Operators no longer see a knob that does nothing. Tests no longer
exercise dead code paths.

IMP-1 — Rewrote docs/how-it-works/nri-mode.md to describe the schema v2
pull-not-push design. Old text described the wrap/unwrap flow which no
longer exists. New version: architecture diagram for v2, the three-layer
identity attestation defense (UID + namespace + SA), schema upgrade
path, hardening checklist, accurate trust posture.

IMP-2 — Kyverno policy now blocks hostPath mounts of /run/vault-db-injector
in addition to /var/run/nri and /opt/nri. The credential cache lives there
and was previously not covered by the policy; PSA baseline already covers
it, but this adds defense-in-depth.

IMP-5 — Deleted leaked pkg/nri/.tmp file (saveCache test artifact).

174 unit tests pass. K3D edge suite: 9/9 (A_substitution, B_uri_mode,
C_multi_container, D_init, E_empty_placeholder, F_malformed_json,
G_cache_persistence, H_v1_rejected, I_identity_forge_blocked).
@SoulKyu SoulKyu force-pushed the feat/ebpf-injection-mode branch from 5f851cb to b274ef2 Compare May 4, 2026 06:31