feat: bound the KCL native (off-heap) memory growth (graceful recycle + render cache)#399
Merged
Peefy merged 1 commit intoJun 11, 2026
function-kcl recompiles the whole KCL module on every reconcile. The native KCL runtime (loaded via dlopen/cgo) accumulates off-heap memory per compile that Go's GC, GOGC and GOMEMLIMIT cannot bound, so long-lived pods climb to their memory limit and get OOMKilled mid-render, dropping reconciles (connection reset / DeadlineExceeded).

A true compile cache (BuildProgram/ExecArtifact) is not reachable through the native C-ABI this function uses, so this bounds the growth at a process boundary instead, with two complementary, opt-in mechanisms (both nil-safe and default-off):

* recycle.go: a watchdog samples process RSS (incl. native off-heap memory) plus optional reconcile-count / lifetime limits; on threshold it stops accepting new reconciles, drains in-flight ones, and exits 0 so the orchestrator restarts the pod cleanly between renders instead of OOMKilling it during one. Defaults to 85% of the detected cgroup limit.
* rendercache.go: memoises the KCL pipeline output keyed on the exact serialized input. A composition function is deterministic over its input, so byte-identical reconciles return the cached output without invoking the KCL runtime, skipping the recompile and a leak increment. Opt-in via FUNCTION_KCL_RENDER_CACHE_SIZE; bounded LRU with optional TTL.

grpc is promoted to a direct dependency for codes/status.

Signed-off-by: Callum MacDonald <callum@stakater.com>
rasheedamir
approved these changes
Jun 10, 2026
Collaborator
LGTM
Problem
function-kcl recompiles the entire KCL module on every `RunFunction` call:
`RunFunction` → krm-kcl `kio.Pipeline` → kcl-go `ExecProgram` → native `libkcl`.

The native KCL runtime (loaded via dlopen/cgo) accumulates off-heap memory per
compile. That growth is process-global and is not governed by Go's GC,
`GOGC`, or `GOMEMLIMIT` because it is not on the Go heap. Over a long-lived
pod's lifetime RSS climbs steadily until the kubelet OOMKills the container
(exit 137), which drops the in-flight render and surfaces to Crossplane as
`connection reset by peer` / `DeadlineExceeded`.

The ideal fix, compile once and execute many via `BuildProgram`/`ExecArtifact`,
is not reachable through the native C-ABI this function uses (those methods
are only routed over the RPC transport, not the cgo/purego path), so the growth
can only be bounded at a process boundary.
This PR adds two complementary, opt-in mechanisms. Both are nil-safe and
default to off, so behaviour is identical to today unless explicitly enabled.
1. Graceful self-recycle (`recycle.go`)

A watchdog samples process RSS (which includes the native off-heap allocations,
read from `/proc/self/statm`) plus optional reconcile-count and lifetime limits.
When a limit is crossed it stops accepting new reconciles, drains in-flight ones
(bounded by a timeout), and exits 0 so the orchestrator restarts the pod
between renders rather than OOMKilling it during one. With multiple
replicas this is a rolling, self-healing recycle.

New reconciles that arrive while draining are rejected with
`codes.Unavailable` so Crossplane retries them against another replica or after the restart.
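
The stop-accepting, drain, then exit sequence described above could look roughly like the following. This is a minimal sketch and not the code in `recycle.go`: the names `Gate`, `Enter`, `Leave`, `Drain` and `ErrDraining` are hypothetical, and `ErrDraining` stands in for the gRPC `codes.Unavailable` status so that the example needs only the standard library.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// ErrDraining is a hypothetical stand-in for the codes.Unavailable status
// the PR returns to callers while the process is draining.
var ErrDraining = errors.New("draining: retry on another replica")

// Gate counts in-flight reconciles and refuses new ones once draining starts.
type Gate struct {
	mu       sync.Mutex
	draining bool
	inflight int
	idle     chan struct{} // closed once draining and nothing is in flight
}

func NewGate() *Gate { return &Gate{idle: make(chan struct{})} }

// Enter admits one reconcile, or returns ErrDraining if admission is closed.
func (g *Gate) Enter() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.draining {
		return ErrDraining
	}
	g.inflight++
	return nil
}

// Leave marks one previously admitted reconcile as finished.
func (g *Gate) Leave() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.inflight--
	if g.draining && g.inflight == 0 {
		close(g.idle)
	}
}

// Drain closes admission and waits for in-flight reconciles, up to timeout.
// It reports whether all of them finished before the deadline.
func (g *Gate) Drain(timeout time.Duration) bool {
	g.mu.Lock()
	if !g.draining {
		g.draining = true
		if g.inflight == 0 {
			close(g.idle)
		}
	}
	g.mu.Unlock()

	// Check for the already-idle case first so the result is deterministic.
	select {
	case <-g.idle:
		return true
	default:
	}
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-g.idle:
		return true
	case <-timer.C:
		return false
	}
}
```

In this sketch a request handler would call `Enter` before rendering and `Leave` (typically via `defer`) afterwards, while the watchdog would call `Drain` and then exit with status 0 whether or not the drain completed in time.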
| Variable | Default | Description |
| --- | --- | --- |
| `FUNCTION_KCL_MAX_RSS_BYTES` | […] | […] `Ki`/`Mi`/`Gi` suffix. |
| `FUNCTION_KCL_MAX_RSS_RATIO` | `0.85` | […] `MAX_RSS_BYTES` is unset, recycle at this fraction of the detected cgroup memory limit. |
| `FUNCTION_KCL_MAX_RECONCILES` | `0` (off) | […] `RunFunction` calls. |
| `FUNCTION_KCL_MAX_LIFETIME` | `0` (off) | […] `6h`). |
| `FUNCTION_KCL_RECYCLE_CHECK_INTERVAL` | `30s` | […] |
| `FUNCTION_KCL_RECYCLE_DRAIN_TIMEOUT` | `15s` | […] |

If no cgroup limit is detected and no trigger is set, the watchdog never starts,
which is identical to current behaviour and keeps local runs and tests unaffected.
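
To make the RSS trigger concrete, the sketch below shows one way the threshold could be resolved and the current RSS read. It is an illustration under assumptions, not the PR's implementation: the helper names (`parseBytes`, `rssThreshold`, `rssFromStatm`, `currentRSS`) are hypothetical, and the accepted size syntax (a plain byte count with an optional `Ki`/`Mi`/`Gi` suffix) is inferred from the settings listed above. The second field of `/proc/self/statm` is the resident set size in pages.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseBytes parses a size such as "512Mi", "2Gi", or a plain byte count.
// The accepted syntax is an assumption made for this sketch.
func parseBytes(s string) (int64, error) {
	s = strings.TrimSpace(s)
	mult := int64(1)
	switch {
	case strings.HasSuffix(s, "Ki"):
		s, mult = strings.TrimSuffix(s, "Ki"), 1<<10
	case strings.HasSuffix(s, "Mi"):
		s, mult = strings.TrimSuffix(s, "Mi"), 1<<20
	case strings.HasSuffix(s, "Gi"):
		s, mult = strings.TrimSuffix(s, "Gi"), 1<<30
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid byte size %q", s)
	}
	return n * mult, nil
}

// rssThreshold returns the recycle threshold in bytes: an explicit byte limit
// wins, otherwise ratio times the cgroup limit. Zero means the RSS trigger is off.
func rssThreshold(maxBytes int64, ratio float64, cgroupLimit int64) int64 {
	if maxBytes > 0 {
		return maxBytes
	}
	if cgroupLimit > 0 && ratio > 0 {
		return int64(float64(cgroupLimit) * ratio)
	}
	return 0
}

// rssFromStatm converts the content of /proc/self/statm into RSS bytes.
func rssFromStatm(statm string, pageSize int64) (int64, error) {
	fields := strings.Fields(statm)
	if len(fields) < 2 {
		return 0, fmt.Errorf("unexpected statm content %q", statm)
	}
	pages, err := strconv.ParseInt(fields[1], 10, 64)
	if err != nil {
		return 0, err
	}
	return pages * pageSize, nil
}

// currentRSS reads the resident set size of this process (Linux only).
func currentRSS() (int64, error) {
	b, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		return 0, err
	}
	return rssFromStatm(string(b), int64(os.Getpagesize()))
}
```

A watchdog loop built on these helpers would compare `currentRSS()` against `rssThreshold(...)` once per check interval and start the drain when the threshold is non-zero and exceeded.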
2. Render cache (`rendercache.go`)

A composition function is deterministic over its input. In steady state most
reconciles are no-op re-syncs with byte-identical input. The render cache
memoises the KCL pipeline output keyed on the exact serialized input (source +
dependencies + all params + config), so an identical reconcile returns the
cached output without invoking the native runtime — skipping the recompile,
its CPU cost, and a memory-growth increment. It does not help when inputs
genuinely change every reconcile (e.g. a composite actively churning during a
rollout); the recycler is the backstop for that.
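
The memoisation described above amounts to a bounded LRU map with an optional expiry per entry. The following is a minimal sketch, assuming the serialized input is hashed with SHA-256 to form the key (the PR only says the key is the exact serialized input, so the hashing step is an assumption). The type and method names are hypothetical. A nil cache is a valid, always-missing cache, which mirrors the nil-safe, default-off behaviour the PR describes.

```go
package main

import (
	"container/list"
	"crypto/sha256"
	"sync"
	"time"
)

type cacheKey [sha256.Size]byte

type entry struct {
	key     cacheKey
	output  []byte
	expires time.Time // zero value means the entry never expires
}

// RenderCache is a bounded LRU of rendered outputs with an optional TTL.
type RenderCache struct {
	mu    sync.Mutex
	size  int
	ttl   time.Duration
	order *list.List // front is the most recently used entry
	items map[cacheKey]*list.Element
}

// NewRenderCache returns nil (cache disabled) when size is not positive.
func NewRenderCache(size int, ttl time.Duration) *RenderCache {
	if size <= 0 {
		return nil
	}
	return &RenderCache{size: size, ttl: ttl, order: list.New(), items: make(map[cacheKey]*list.Element)}
}

// Get returns the cached output for input, if present and not expired.
func (c *RenderCache) Get(input []byte) ([]byte, bool) {
	if c == nil {
		return nil, false
	}
	key := cacheKey(sha256.Sum256(input))
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	e := el.Value.(*entry)
	if !e.expires.IsZero() && time.Now().After(e.expires) {
		c.order.Remove(el)
		delete(c.items, key)
		return nil, false
	}
	c.order.MoveToFront(el)
	return e.output, true
}

// Put stores output for input and evicts the least recently used entry
// when the cache grows past its size bound.
func (c *RenderCache) Put(input, output []byte) {
	if c == nil {
		return
	}
	key := cacheKey(sha256.Sum256(input))
	var expires time.Time
	if c.ttl > 0 {
		expires = time.Now().Add(c.ttl)
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		e := el.Value.(*entry)
		e.output, e.expires = output, expires
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, output: output, expires: expires})
	if c.order.Len() > c.size {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}
```

A caller would try `Get` with the serialized request first and only run the KCL pipeline, followed by `Put`, on a miss. Hashing keeps the keys small at the cost of relying on SHA-256 collision resistance instead of a literal byte comparison.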
| Variable | Default | Description |
| --- | --- | --- |
| `FUNCTION_KCL_RENDER_CACHE_SIZE` | `0` (off) | […] |
| `FUNCTION_KCL_RENDER_CACHE_TTL` | `0` (none) | […] `10m`). |

Validation
Running with the recycler enabled in a live deployment, RSS plateaued below the
recycle threshold over a multi-hour window and the recurring OOMKills stopped;
turning on the render cache additionally removed the per-reconcile recompile for
steady-state no-op re-syncs.
Notes
* `grpc` is promoted from an indirect to a direct dependency (used for `codes`/`status` on the drain-reject path).
* Tests (`recycle_test.go`, `rendercache_test.go`) covering trigger evaluation, drain/admission gating, env parsing, LRU eviction, and TTL expiry.
Closes #273. Relates to #211, #147, #108.