vz: add adaptive memory ballooning for macOS Virtualization Framework #4828

Open
jwehrlich wants to merge 4 commits into lima-vm:master from jwehrlich:optimized-macos-memory

Conversation

@jwehrlich jwehrlich commented Apr 10, 2026

Summary

Add adaptive memory ballooning for macOS Virtualization Framework (VZ) VMs, allowing Lima to dynamically adjust guest memory based on actual pressure rather than requiring peak allocation upfront.

Closes #4220

Commits

This PR contains 4 logically grouped commits that build on each other:

  1. limatype,limayaml: add MemoryBalloon config and validation — Adds the MemoryBalloon struct with 4 user-facing fields (enabled, min, idleTarget, cooldown), validation, and documentation.

  2. guestagent: add memory metrics collection — Adds GetMemoryMetrics gRPC endpoint collecting /proc/meminfo, /proc/pressure/memory, /proc/vmstat, and container cgroup stats. Stateful collector tracks deltas for per-second rates.

  3. hostagent: add balloon controller state machine — Six-state machine (Bootstrap → LearningDescend → Steady → OOMRecovery → CircuitBreaker → AgentFailure) with PSI-driven grow/shrink decisions, shrink guards, learned floor persistence, and host pressure integration. ~1300 lines of tests.

  4. vz: wire balloon controller to VZ driver — Implements Ballooner interface in VZ driver, wires controller into hostagent polling loop, fills sensible defaults (min=25%, idleTarget=33%, cooldown=30s).
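The defaults named in commit 4 (min at 25% of memory, idleTarget at 33%, cooldown of 30s) can be sketched roughly as below. This is only an illustration of the arithmetic: the `balloonDefaults` type and the `fillBalloonDefaults` function are hypothetical names, not the PR's actual `FillConfig` code, and integer rounding in the real implementation may differ.

```go
package main

import (
	"fmt"
	"time"
)

// balloonDefaults is a hypothetical holder for the three defaulted values.
type balloonDefaults struct {
	MinBytes        int64
	IdleTargetBytes int64
	Cooldown        time.Duration
}

// fillBalloonDefaults derives defaults from the configured VM memory:
// min = 25% of memory, idleTarget = 33% of memory, cooldown = 30s.
// Integer division truncates toward zero.
func fillBalloonDefaults(memoryBytes int64) balloonDefaults {
	return balloonDefaults{
		MinBytes:        memoryBytes * 25 / 100,
		IdleTargetBytes: memoryBytes * 33 / 100,
		Cooldown:        30 * time.Second,
	}
}

func main() {
	const gib = int64(1) << 30
	d := fillBalloonDefaults(8 * gib)
	fmt.Printf("min=%d idleTarget=%d cooldown=%s\n", d.MinBytes, d.IdleTargetBytes, d.Cooldown)
}
```

For an 8 GiB VM this yields a 2 GiB minimum and an idle target a little above 2.6 GiB.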

Configuration

vmOpts:
  vz:
    memoryBalloon:
      enabled: true
      min: "2GiB"
      idleTarget: "3GiB"
      cooldown: "30s"

All advanced tuning (thresholds, step sizes, guards) is hardcoded with sensible defaults to keep the user-facing API minimal.
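The first commit message states that validation enforces min < idleTarget <= memory and that cooldown must be a valid duration. A minimal sketch of that rule follows, assuming the size strings have already been converted to bytes; `validateBalloon` is a hypothetical name and the error wording is illustrative, not taken from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// validateBalloon checks the ordering rule min < idleTarget <= memory and
// that cooldown parses as a Go duration. All sizes are in bytes.
func validateBalloon(minBytes, idleTargetBytes, memoryBytes int64, cooldown string) error {
	if minBytes >= idleTargetBytes {
		return fmt.Errorf("min (%d) must be less than idleTarget (%d)", minBytes, idleTargetBytes)
	}
	if idleTargetBytes > memoryBytes {
		return fmt.Errorf("idleTarget (%d) must not exceed memory (%d)", idleTargetBytes, memoryBytes)
	}
	if _, err := time.ParseDuration(cooldown); err != nil {
		return fmt.Errorf("cooldown %q is not a valid duration: %w", cooldown, err)
	}
	return nil
}

func main() {
	const gib = int64(1) << 30
	// Mirrors the YAML example: min 2GiB, idleTarget 3GiB, cooldown 30s,
	// with an assumed VM memory of 4GiB.
	fmt.Println(validateBalloon(2*gib, 3*gib, 4*gib, "30s"))
	// Invalid: idleTarget larger than the VM memory.
	fmt.Println(validateBalloon(2*gib, 5*gib, 4*gib, "30s"))
}
```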

Key Design Decisions

  • 4 public config fields only — Advanced knobs (growStepPercent, shrinkStepPercent, pressure thresholds) are internal constants, not user-facing config.
  • cgroupfs for container metrics — Reads cgroup2 cpu.stat/io.stat directly instead of depending on Docker API.
  • Learned floor — Persists minimum observed working memory to ~/.lima/<instance>/balloon_floor across restarts.
  • Host pressure awareness — Monitors macOS memory_pressure to avoid shrinking guest when host is stressed.
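To illustrate the cgroupfs approach, here is a small sketch that pulls the cumulative `usage_usec` counter out of the contents of a cgroup v2 `cpu.stat` file. The function name `parseCPUUsageUsec` is hypothetical, and the caller is assumed to have read the file from wherever the container's cgroup lives; the PR's actual collector and paths are not shown here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUUsageUsec extracts the usage_usec counter (total CPU time in
// microseconds) from the text of a cgroup v2 cpu.stat file.
func parseCPUUsageUsec(content string) (uint64, error) {
	for _, line := range strings.Split(content, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "usage_usec" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("usage_usec not found in cpu.stat content")
}

func main() {
	// Sample content in the "key value" layout used by cpu.stat.
	sample := "usage_usec 1500000\nuser_usec 1000000\nsystem_usec 500000\n"
	usec, err := parseCPUUsageUsec(sample)
	fmt.Println(usec, err)
}
```

Because the counter is cumulative, a CPU percentage would come from the difference between two reads divided by the elapsed time.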

Testing

  • Unit tests for all new packages: validation, metrics parsing, controller state machine, host pressure, learned floor
  • ~2700 lines of test code covering states, transitions, edge cases, and guard conditions
  • Every commit builds, lints, and tests independently

Member

@AkihiroSuda AkihiroSuda left a comment


A PR of this size is not reviewable.

Please split the PR into multiple ones.
Notably, the auto-pause changes do not seem directly related to memory optimization; do they need to be bundled into this PR?

@jwehrlich
Author

> A PR of this size is not reviewable.
>
> Please split the PR into multiple ones. Notably, the auto-pause changes do not seem directly related to memory optimization; do they need to be bundled into this PR?

@AkihiroSuda, I was thinking the same thing, but I also wanted to push something up for initial validation and to see whether any tests failed. I'll try to break this up into multiple PRs over the next week.

string error = 3;
}

// MemoryMetrics contains guest memory statistics for the balloon controller.
Member

Comment thread cmd/lima-guestagent/daemon_linux.go Outdated
}
defer logrus.Debug("exiting lima-guestagent daemon")
return server.StartServer(ctx, l, &server.GuestServer{Agent: agent, TunnelS: portfwdserver.NewTunnelServer()})
dockerSocket := "/var/run/docker.sock"
Member

  • The path is different for rootless Docker and Podman
  • Why only support containerized processes?

Comment thread hack/bats/tests/alpine-docker-vz.bats Outdated
Member

Most tests seem irrelevant to memory optimization

Comment thread pkg/guestagent/metrics/collector.go Outdated
Member

Linux-specific code must have a _linux.go suffix

Comment thread pkg/guestagent/metrics/collector.go Outdated
// collectDockerStats queries the Docker socket for container count,
// aggregate CPU%, and aggregate IO bytes/sec. Containers are polled
// in parallel with a 3-second overall timeout. Returns zeros on error.
func (c *Collector) collectDockerStats(ctx context.Context) (count int, cpuPercent, ioBytesPerSec float64) {
Member

Why depend on the Docker API? Why not stat cgroupfs directly?

Comment thread pkg/limatype/lima_instance.go Outdated
Memory int64 `json:"memory,omitempty"` // bytes
Disk int64 `json:"disk,omitempty"` // bytes
Memory int64 `json:"memory,omitempty"` // bytes (configured)
PhysicalMemory int64 `json:"physicalMemory,omitempty"` // bytes (actual host footprint)
Member

Suggested change
- PhysicalMemory int64 `json:"physicalMemory,omitempty"` // bytes (actual host footprint)
+ PhysicalMemory *int64 `json:"physicalMemory,omitempty"` // bytes (actual host footprint)

As this information will be only available for specific VM drivers
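The practical difference between `int64` and `*int64` with `omitempty` can be seen in a small sketch. The `instance` struct below is a trimmed, hypothetical stand-in for the type in the diff: a nil pointer is omitted from the JSON (the driver has no data), while a non-nil pointer is always emitted, even when it points at zero.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// instance is a trimmed, hypothetical stand-in for the struct in the diff.
type instance struct {
	Memory         int64  `json:"memory,omitempty"`
	PhysicalMemory *int64 `json:"physicalMemory,omitempty"`
}

// marshal returns the JSON encoding of v as a string.
func marshal(v instance) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	zero := int64(0)
	// Driver without footprint data: the field is omitted entirely.
	fmt.Println(marshal(instance{Memory: 1024}))
	// Driver that measured a footprint of 0: the field is present.
	fmt.Println(marshal(instance{Memory: 1024, PhysicalMemory: &zero}))
}
```

With a plain `int64`, both cases would serialize identically, so consumers could not tell "unsupported" apart from "zero".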

Comment thread pkg/limatype/lima_yaml.go Outdated
FloorStaleness *string `yaml:"floorStaleness,omitempty" json:"floorStaleness,omitempty" jsonschema:"nullable"`
// EnableTrendDetection enables pre-emptive grow on rising PSI trend (avg10 > 1.5*avg60).
EnableTrendDetection *bool `yaml:"enableTrendDetection,omitempty" json:"enableTrendDetection,omitempty" jsonschema:"nullable"`
}
Member

Do you really need all these configuration knobs?

Comment thread pkg/store/physmem_darwin.go Outdated
// findDiskOwnerPID finds the PID of the process that has the given disk file open
// using the macOS `fuser` command.
func findDiskOwnerPID(diskPath string) (int, error) {
out, err := exec.CommandContext(context.Background(), "fuser", diskPath).CombinedOutput()
Member

Does this need LANG=C LC_ALL=C?
The same applies to the other commands.

Ideally this should rather call PROC_PIDLISTFDS
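A minimal sketch of the locale suggestion, assuming the goal is only to pin the C locale for any external command whose output is parsed. The helper names `cLocaleEnv` and `cLocaleCommand` and the disk path are hypothetical; the `PROC_PIDLISTFDS` alternative is not shown.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
)

// cLocaleEnv returns a copy of base with LANG=C and LC_ALL=C appended.
// When a key appears more than once in Cmd.Env, os/exec uses the last
// value, so these entries override any inherited locale settings.
func cLocaleEnv(base []string) []string {
	env := make([]string, 0, len(base)+2)
	env = append(env, base...)
	return append(env, "LANG=C", "LC_ALL=C")
}

// cLocaleCommand builds a command that runs under the C locale, so any
// text it prints is not localized.
func cLocaleCommand(ctx context.Context, name string, args ...string) *exec.Cmd {
	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Env = cLocaleEnv(os.Environ())
	return cmd
}

func main() {
	// The command is only constructed here, not executed.
	cmd := cLocaleCommand(context.Background(), "fuser", "/path/to/disk")
	fmt.Println(cmd.Args, cmd.Env[len(cmd.Env)-2:])
}
```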

Member

Seems too complicated.
I suggest moving it to a third-party repo: https://lima-vm.io/docs/templates/github/
e.g. github:nixos-lima https://github.com/nixos-lima/nixos-lima


### Memory Ballooning

| ⚡ Requirement | Lima >= 2.0.0, macOS >= 13.0, VZ backend only |
Member

Not in v2.0

Add the MemoryBalloon configuration struct to LimaYAML with four
user-facing fields: Enabled, Min, IdleTarget, and Cooldown.

Add validation ensuring min < idleTarget <= memory, cooldown is a
valid duration, and all size fields parse correctly. Include
comprehensive test coverage for valid and invalid configurations.

Add memory ballooning documentation to the VZ page.

Signed-off-by: Jason W. Ehrlich <jwehrlich@outlook.com>

Add a GetMemoryMetrics gRPC endpoint to the guest agent that reports
memory pressure and container activity to the host.

The collector reads /proc/meminfo, /proc/pressure/memory, and
/proc/vmstat to report memory availability, PSI pressure, swap rates,
and page fault rates. Container CPU and IO metrics are collected from
cgroupfs (systemd cgroup hierarchy) to avoid a Docker API dependency.

The collector is stateful, tracking deltas between polls to compute
per-second rates. Counter resets (for example, after a guest reboot)
produce zero deltas via safeDelta().

Non-Linux platforms return a stub error since /proc is unavailable.
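
The delta tracking described in this commit message can be sketched as follows. The name `safeDelta` comes from the message, but its signature and the `ratePerSecond` helper are assumptions made for illustration.

```go
package main

import "fmt"

// safeDelta returns cur - prev, or 0 when the counter went backwards
// (for example, because the guest rebooted and the counter restarted).
func safeDelta(prev, cur uint64) uint64 {
	if cur < prev {
		return 0
	}
	return cur - prev
}

// ratePerSecond converts the change in a cumulative counter into a
// per-second rate over the polling interval.
func ratePerSecond(prev, cur uint64, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 {
		return 0
	}
	return float64(safeDelta(prev, cur)) / elapsedSeconds
}

func main() {
	// Two polls 10 seconds apart.
	fmt.Println(ratePerSecond(100, 300, 10))
	// Counter reset between polls: the rate is reported as zero.
	fmt.Println(ratePerSecond(300, 20, 10))
}
```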

Signed-off-by: Jason W. Ehrlich <jwehrlich@outlook.com>

Add a BalloonController that manages VM memory allocation through a
six-state machine: Bootstrap, LearningDescend, Steady, OOMRecovery,
CircuitBreaker, and AgentFailure.

The controller uses guest metrics (PSI pressure, swap rates, container
activity) and host pressure (macOS memory_pressure) to decide when to
grow or shrink the balloon. Key behaviors:

- OOM detection triggers immediate 20% grow with circuit breaker
- PSI-based pressure monitoring with configurable thresholds
- MemAvailable heuristic when PSI is unavailable
- Shrink guards: swap-in rate, container CPU/IO, page faults
- Learned floor persisted to disk across instance restarts
- Host pressure integration prevents shrinking under host memory stress
- Cooldown enforcement between balloon actions
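
Two of the behaviors listed above, the 20% grow on OOM and the cooldown between actions, might look roughly like this. This is a sketch under assumptions: that the 20% applies to the current balloon target, that the result is capped at the configured VM memory, and that both function names are hypothetical. It does not reproduce the controller's state machine.

```go
package main

import (
	"fmt"
	"time"
)

// growOnOOM returns the balloon target after an OOM event: the current
// target increased by 20%, capped at maxBytes (the configured VM memory).
func growOnOOM(currentBytes, maxBytes int64) int64 {
	next := currentBytes + currentBytes/5
	if next > maxBytes {
		return maxBytes
	}
	return next
}

// cooldownElapsed reports whether at least cooldown has passed since the
// last balloon action.
func cooldownElapsed(sinceLastAction, cooldown time.Duration) bool {
	return sinceLastAction >= cooldown
}

func main() {
	const gib = int64(1) << 30
	fmt.Println(growOnOOM(2*gib, 4*gib))
	fmt.Println(cooldownElapsed(10*time.Second, 30*time.Second))
}
```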

Include comprehensive test coverage (~1300 lines) for all states,
transitions, edge cases, and guard conditions.

Signed-off-by: Jason W. Ehrlich <jwehrlich@outlook.com>

Add the Ballooner interface to the driver package with
SetBalloonTarget for adjusting guest memory at runtime.

Implement Ballooner in the VZ driver: store the VirtIO balloon device
reference during VM configuration, and expose SetBalloonTarget which
calls the Virtualization.framework API under a mutex.

Wire the balloon controller into hostagent: parseBalloonConfig
converts the 4 public YAML fields into internal BalloonConfig with
hardcoded operational defaults, setupBalloon initializes the
controller after SSH is ready, and runBalloonLoop polls guest metrics
every 10 seconds to feed the controller's Evaluate method.

Fill sensible defaults in the VZ driver's FillConfig: min at 25% of
memory, idleTarget at 33%, and cooldown at 30 seconds.

Signed-off-by: Jason W. Ehrlich <jwehrlich@outlook.com>
@jwehrlich jwehrlich force-pushed the optimized-macos-memory branch from af2b6d1 to 849a3f3 April 20, 2026 17:34
@jwehrlich jwehrlich changed the title from "Optimized macos memory" to "vz: add adaptive memory ballooning for macOS Virtualization Framework" Apr 20, 2026
@jwehrlich
Author

> A PR of this size is not reviewable.
>
> Please split the PR into multiple ones. Notably, the auto-pause changes do not seem directly related to memory optimization; do they need to be bundled into this PR?

@jwehrlich jwehrlich closed this Apr 20, 2026
@jwehrlich jwehrlich reopened this Apr 20, 2026
@thewesjohnson

@jwehrlich, I tested balloon reclamation on macOS 26.0.1 (M1 Max, 32GB, Colima 0.10.1, vz backend) and can confirm the host-side issue your controller is working around.

My test captures both sides: guest /proc/meminfo shows 108% memory recovery after free + drop_caches, but the RSS of the host com.apple.Virtualization.VirtualMachine process never decreases. It grew from 1731 to 1748 MB during a 120-second observation window with zero decreases.

Your adaptive controller's guest-side balloon inflation works correctly; the problem is entirely at Apple's layer. Once they fix the host-side madvise, your controller should just work.

Full test data and reproduction script: https://github.com/thewesjohnson/macos-virtio-balloon-test

Happy to run your branch against the same workload if that would help validate your state transitions (Bootstrap -> LearningDescend -> Steady -> OOMRecovery).

I've filed Apple Feedback FB22614752 with the full dataset.

Development

Successfully merging this pull request may close these issues.

Support memory ballooning in VMs

3 participants