
feat: bound the KCL native (off-heap) memory growth (graceful recycle + render cache) #399

Merged
Peefy merged 1 commit into crossplane-contrib:main from callum-stakater:fix/bound-native-kcl-memory on Jun 11, 2026

@callum-stakater

Contributor

Problem

function-kcl recompiles the entire KCL module on every RunFunction call:
RunFunction → krm-kcl kio.Pipeline → kcl-go ExecProgram → native libkcl.
The native KCL runtime (loaded via dlopen/cgo) accumulates off-heap memory per
compile. That growth is process-global and is not governed by Go's GC,
GOGC, or GOMEMLIMIT because it is not on the Go heap. Over a long-lived
pod's lifetime RSS climbs steadily until the kubelet OOMKills the container
(exit 137), which drops the in-flight render and surfaces to Crossplane as
connection reset by peer / DeadlineExceeded.

The ideal fix — compile once, execute many via BuildProgram / ExecArtifact
— is not reachable through the native C-ABI this function uses (those methods
are only routed over the RPC transport, not the cgo/purego path), so the growth
can only be bounded at a process boundary.

This PR adds two complementary, opt-in mechanisms. Both are nil-safe and
default to off, so behaviour is identical to today unless explicitly enabled.

1. Graceful self-recycle (recycle.go)

A watchdog periodically samples process RSS, read from /proc/self/statm (this figure
includes the native off-heap allocations), and checks optional reconcile-count and lifetime limits.
When a limit is crossed it stops accepting new reconciles, drains in-flight ones
(bounded by a timeout), and exits 0 so the orchestrator restarts the pod
between renders rather than OOMKilling it during one. With multiple
replicas this is a rolling, self-healing recycle.
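
The RSS sample described above can be sketched as follows. This is a minimal illustration, not the code in recycle.go: the function names rssFromStatm and currentRSS are hypothetical. It relies only on the documented layout of /proc/self/statm, whose second field is the resident set size in pages.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// rssFromStatm converts the contents of /proc/<pid>/statm into resident
// bytes. The second whitespace-separated field is the resident page count.
// (Hypothetical helper name, for illustration only.)
func rssFromStatm(statm string, pageSize uint64) (uint64, error) {
	fields := strings.Fields(statm)
	if len(fields) < 2 {
		return 0, fmt.Errorf("statm: expected at least 2 fields, got %d", len(fields))
	}
	pages, err := strconv.ParseUint(fields[1], 10, 64)
	if err != nil {
		return 0, fmt.Errorf("statm: parse resident pages: %w", err)
	}
	return pages * pageSize, nil
}

// currentRSS samples this process (Linux only). Because RSS counts every
// resident page, it covers memory allocated by native code as well as the
// Go heap.
func currentRSS() (uint64, error) {
	raw, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		return 0, err
	}
	return rssFromStatm(string(raw), uint64(os.Getpagesize()))
}
```

Keeping the parser separate from the file read makes the parsing logic testable on machines without /proc.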

New reconciles that arrive while draining are rejected with codes.Unavailable
so Crossplane retries them against another replica or after the restart.
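
One way the admission gating and bounded drain could look is sketched below. It is an assumption-laden illustration rather than the PR's implementation: the gate type and its methods are hypothetical, and errDraining stands in for a gRPC status carrying codes.Unavailable so that the sketch needs only the standard library.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// errDraining is a stand-in for a gRPC status with codes.Unavailable.
var errDraining = errors.New("draining: retry against another replica")

// gate tracks in-flight reconciles and refuses new ones once draining.
// (Hypothetical type, for illustration only.)
type gate struct {
	mu       sync.Mutex
	draining bool
	inflight int
	idle     chan struct{} // closed when draining and inflight hits zero
}

func newGate() *gate { return &gate{idle: make(chan struct{})} }

// Enter admits a reconcile unless draining has begun.
func (g *gate) Enter() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.draining {
		return errDraining
	}
	g.inflight++
	return nil
}

// Leave marks a previously admitted reconcile as finished.
func (g *gate) Leave() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.inflight--
	if g.draining && g.inflight == 0 {
		close(g.idle)
	}
}

// Drain stops admission and waits for in-flight work, up to timeout.
// It reports whether all in-flight reconciles finished in time.
func (g *gate) Drain(timeout time.Duration) bool {
	g.mu.Lock()
	g.draining = true
	done := g.inflight == 0
	g.mu.Unlock()
	if done {
		return true
	}
	select {
	case <-g.idle:
		return true
	case <-time.After(timeout):
		return false
	}
}
```

In a handler, a reconcile would call Enter first, return the Unavailable status if it fails, and defer Leave otherwise. The watchdog would call Drain and then exit regardless of the result, since the timeout only bounds how long it waits.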

| Env var | Default | Meaning |
| --- | --- | --- |
| FUNCTION_KCL_MAX_RSS_BYTES | (derived) | Recycle when RSS reaches this. Plain byte count or Ki/Mi/Gi suffix. |
| FUNCTION_KCL_MAX_RSS_RATIO | 0.85 | When MAX_RSS_BYTES is unset, recycle at this fraction of the detected cgroup memory limit. |
| FUNCTION_KCL_MAX_RECONCILES | 0 (off) | Recycle after this many RunFunction calls. |
| FUNCTION_KCL_MAX_LIFETIME | 0 (off) | Recycle after this process uptime (e.g. 6h). |
| FUNCTION_KCL_RECYCLE_CHECK_INTERVAL | 30s | Watchdog sampling interval. |
| FUNCTION_KCL_RECYCLE_DRAIN_TIMEOUT | 15s | Max wait for in-flight reconciles before forcing the restart. |
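
To make the interaction between FUNCTION_KCL_MAX_RSS_BYTES and FUNCTION_KCL_MAX_RSS_RATIO concrete, here is a small sketch of how the threshold might be resolved. The helper names parseBytes and rssThreshold are hypothetical, and the real parser may accept more forms than the plain count and Ki/Mi/Gi suffixes listed in the table.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBytes accepts a plain byte count or a count with a Ki/Mi/Gi suffix.
// (Hypothetical helper, for illustration only.)
func parseBytes(s string) (uint64, error) {
	s = strings.TrimSpace(s)
	mult := uint64(1)
	for suffix, m := range map[string]uint64{"Ki": 1 << 10, "Mi": 1 << 20, "Gi": 1 << 30} {
		if strings.HasSuffix(s, suffix) {
			s, mult = strings.TrimSuffix(s, suffix), m
			break
		}
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid byte quantity: %w", err)
	}
	return n * mult, nil
}

// rssThreshold prefers the explicit limit; otherwise it derives one as
// ratio * cgroupLimit. A zero result means there is no RSS trigger.
func rssThreshold(maxRSS string, ratio float64, cgroupLimit uint64) (uint64, error) {
	if maxRSS != "" {
		return parseBytes(maxRSS)
	}
	if cgroupLimit == 0 {
		return 0, nil // no cgroup limit detected, nothing to derive from
	}
	return uint64(ratio * float64(cgroupLimit)), nil
}
```

With these assumptions, a pod limited to 1Gi and left at the default ratio of 0.85 would recycle at roughly 870 MiB of RSS.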

If no cgroup limit is detected and no trigger is set, the watchdog never starts
— identical to current behaviour, which keeps local runs and tests unaffected.
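
The "never starts unless something is configured" rule and the per-sample trigger check might be expressed like this. The limits type and its field names are hypothetical; only the three trigger kinds (RSS, reconcile count, lifetime) come from the description above.

```go
package main

import "time"

// limits holds the configured triggers; a zero value disables each one.
// (Hypothetical type, for illustration only.)
type limits struct {
	maxRSS        uint64
	maxReconciles uint64
	maxLifetime   time.Duration
}

// enabled reports whether the watchdog has anything to watch. When it
// returns false the watchdog goroutine would simply not be started.
func (l limits) enabled() bool {
	return l.maxRSS > 0 || l.maxReconciles > 0 || l.maxLifetime > 0
}

// crossed returns the name of the first trigger that has fired for the
// given sample, or "" if the process may keep running.
func (l limits) crossed(rss, reconciles uint64, uptime time.Duration) string {
	switch {
	case l.maxRSS > 0 && rss >= l.maxRSS:
		return "rss"
	case l.maxReconciles > 0 && reconciles >= l.maxReconciles:
		return "reconciles"
	case l.maxLifetime > 0 && uptime >= l.maxLifetime:
		return "lifetime"
	}
	return ""
}
```

Returning the trigger name rather than a bare boolean lets the recycle be logged with its cause before the process exits.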

2. Render cache (rendercache.go)

A composition function is deterministic over its input. In steady state most
reconciles are no-op re-syncs with byte-identical input. The render cache
memoises the KCL pipeline output keyed on the exact serialized input (source +
dependencies + all params + config), so an identical reconcile returns the
cached output without invoking the native runtime — skipping the recompile,
its CPU cost, and a memory-growth increment. It does not help when inputs
genuinely change every reconcile (e.g. a composite actively churning during a
rollout); the recycler is the backstop for that.
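
A cache key over "the exact serialized input" could be derived as sketched below. The use of SHA-256 and the renderKey name are assumptions made for illustration; the PR does not state how rendercache.go builds its key. Each part is length-prefixed so that different splits of the same bytes produce different keys.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
)

// renderKey hashes every input part (for example source, dependencies,
// params, config) into one hex digest. The 8-byte length prefix prevents
// ("ab", "c") and ("a", "bc") from hashing to the same key.
// (Hypothetical helper, for illustration only.)
func renderKey(parts ...[]byte) string {
	h := sha256.New()
	var n [8]byte
	for _, p := range parts {
		binary.BigEndian.PutUint64(n[:], uint64(len(p)))
		h.Write(n[:])
		h.Write(p)
	}
	return hex.EncodeToString(h.Sum(nil))
}
```

Hashing keeps the map key small and fixed-size even when the serialized input is large, at the cost of one pass over the input per reconcile.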

| Env var | Default | Meaning |
| --- | --- | --- |
| FUNCTION_KCL_RENDER_CACHE_SIZE | 0 (off) | Max cached entries (bounded LRU). Set > 0 to enable. |
| FUNCTION_KCL_RENDER_CACHE_TTL | 0 (none) | Optional per-entry TTL (e.g. 10m). |
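
A bounded LRU with an optional per-entry TTL, nil-safe when disabled, can be sketched with container/list. The renderCache type, its methods, and the injectable clock are hypothetical and simplified. In particular this sketch is not goroutine-safe, so a concurrent gRPC server would need to guard it with a mutex.

```go
package main

import (
	"container/list"
	"time"
)

type entry struct {
	key     string
	value   []byte
	expires int64 // unix nanoseconds; 0 means the entry never expires
}

// renderCache is a size-bounded LRU. A nil *renderCache is a valid,
// disabled cache. (Hypothetical type, for illustration only.)
type renderCache struct {
	size  int
	ttl   time.Duration
	now   func() int64 // injectable clock, unix nanoseconds
	order *list.List   // front = most recently used
	items map[string]*list.Element
}

func newRenderCache(size int, ttl time.Duration) *renderCache {
	if size <= 0 {
		return nil // size 0 means off
	}
	return &renderCache{
		size:  size,
		ttl:   ttl,
		now:   func() int64 { return time.Now().UnixNano() },
		order: list.New(),
		items: map[string]*list.Element{},
	}
}

// Get returns the cached value and refreshes its recency, or reports a
// miss if the key is absent or its TTL has elapsed.
func (c *renderCache) Get(key string) ([]byte, bool) {
	if c == nil {
		return nil, false
	}
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	e := el.Value.(*entry)
	if e.expires != 0 && c.now() >= e.expires {
		c.order.Remove(el)
		delete(c.items, key)
		return nil, false
	}
	c.order.MoveToFront(el)
	return e.value, true
}

// Put stores a value and evicts the least recently used entry if the
// cache would otherwise exceed its size.
func (c *renderCache) Put(key string, value []byte) {
	if c == nil {
		return
	}
	var exp int64
	if c.ttl > 0 {
		exp = c.now() + int64(c.ttl)
	}
	if el, ok := c.items[key]; ok {
		e := el.Value.(*entry)
		e.value, e.expires = value, exp
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value, expires: exp})
	if c.order.Len() > c.size {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}
```

Returning nil from the constructor when the size is zero keeps call sites free of "is the cache enabled" checks, which matches the nil-safe, default-off behaviour described for both mechanisms.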

Validation

Running with the recycler enabled in a live deployment, RSS plateaued below the
recycle threshold over a multi-hour window and the recurring OOMKills stopped;
turning on the render cache additionally removed the per-reconcile recompile for
steady-state no-op re-syncs.

Notes

  • grpc is promoted from an indirect to a direct dependency (used for
    codes/status on the drain-reject path).
  • Unit tests added for both mechanisms (recycle_test.go,
    rendercache_test.go) covering trigger evaluation, drain/admission gating,
    env parsing, LRU eviction, and TTL expiry.
  • No behavioural change when both features are disabled (the default).

Closes #273. Relates to #211, #147, #108.

function-kcl recompiles the whole KCL module on every reconcile. The native
KCL runtime (loaded via dlopen/cgo) accumulates off-heap memory per compile
that Go's GC, GOGC and GOMEMLIMIT cannot bound, so long-lived pods climb to
their memory limit and get OOMKilled mid-render, dropping reconciles
(connection reset / DeadlineExceeded). A true compile cache
(BuildProgram/ExecArtifact) is not reachable through the native C-ABI this
function uses, so this bounds the growth at a process boundary instead, with
two complementary, opt-in mechanisms (both nil-safe and default-off):

* recycle.go — a watchdog samples process RSS (incl. native off-heap memory)
  plus optional reconcile-count / lifetime limits; on threshold it stops
  accepting new reconciles, drains in-flight ones, and exits 0 so the
  orchestrator restarts the pod cleanly between renders instead of OOMKilling
  it during one. Defaults to 85% of the detected cgroup limit.

* rendercache.go — memoises the KCL pipeline output keyed on the exact
  serialized input. A composition function is deterministic over its input, so
  byte-identical reconciles return the cached output without invoking the KCL
  runtime, skipping the recompile and a leak increment. Opt-in via
  FUNCTION_KCL_RENDER_CACHE_SIZE; bounded LRU with optional TTL.

grpc is promoted to a direct dependency for codes/status.

Signed-off-by: Callum MacDonald <callum@stakater.com>

callum-stakater force-pushed the fix/bound-native-kcl-memory branch from 23486ec to ede4b05 on June 10, 2026 at 16:46
@Peefy

Peefy commented Jun 11, 2026

Collaborator

LGTM

Peefy merged commit ca690de into crossplane-contrib:main on Jun 11, 2026
7 checks passed

Development

Successfully merging this pull request may close these issues.

Project generated KCL-embedded-functions use lots of memory

3 participants