Skip to content

Conversation

@GnatorX
Copy link

@GnatorX GnatorX commented Dec 1, 2025

Fixes #2655

Problem

Full deep copying of instance types caused excessive memory usage in deployments with many instance types and node pools (~1GB+ for 200 types × 5 pools).

Solution

Replace full deep copy with selective copying:

  • Share immutable fields (Requirements, Overhead)
  • Copy only modified fields (Capacity, affected Offerings)
  • New applyPriceOverlays() for granular offering copying

Before

overriddenInstanceType := it.DeepCopy()  // Always copies everything

After

overriddenInstanceType := &cloudprovider.InstanceType{
    Name:         it.Name,
    Requirements: it.Requirements, // Shared
    Overhead:     it.Overhead,     // Shared
}
// Only copy fields that need modification

Results

Before: red line, After: Orange
Screenshot 2025-12-02 at 2 13 51 PM

Before:

      === BEFORE (Deep Copy) - Memory Tests ===

      TestMemoryUsage_OverlayScenarios:
        no_overlays:             Total Allocs: 71,514     | Bytes: 12.75 MB    | Heap: 3.97 MB
        price_overlays_only:     Total Allocs: 2,184,576  | Bytes: 164.56 MB   | Heap: 3.99 MB
        capacity_overlays_only:  Total Allocs: 1,140,409  | Bytes: 297.21 MB   | Heap: 3.99 MB
        mixed_overlays:          Total Allocs: 11,137,342 | Bytes: 1,108.82 MB | Heap: 2.76 MB

      TestMemoryUsage_ScaleWithNodePools:
        nodepools_1:   Total Allocs: 2,213,176   | Bytes: 1,271.41 MB  | Heap: 4.00 MB
        nodepools_5:   Total Allocs: 11,137,216  | Bytes: 2,083.00 MB  | Heap: 4.00 MB
        nodepools_10:  Total Allocs: 22,274,324  | Bytes: 3,716.96 MB  | Heap: 4.00 MB
        nodepools_20:  Total Allocs: 44,547,761  | Bytes: 7,028.17 MB  | Heap: 4.01 MB
        nodepools_50:  Total Allocs: 111,365,788 | Bytes: 15,697.76 MB | Heap: 4.01 MB

      TestMemoryUsage_ScaleWithInstanceTypes:
        instances_50:  Total Allocs: 3,893,618  | Bytes: 15,981.94 MB | Heap: 2.77 MB
        instances_100: Total Allocs: 7,788,094  | Bytes: 16,550.38 MB | Heap: 2.77 MB
        instances_200: Total Allocs: 11,137,399 | Bytes: 17,363.31 MB | Heap: 2.78 MB
        instances_500: Total Allocs: 11,137,410 | Bytes: 18,176.25 MB | Heap: 2.78 MB

      === BEFORE (Deep Copy) - Benchmarks ===

      BenchmarkStoreApply/selective-copy-apply    568,892 ns/op    943,506 B/op    12,475 allocs/op
      BenchmarkStoreApplyNoOverlay                  5,525 ns/op          0 B/op         0 allocs/op
      BenchmarkStoreApplyPriceOnly              1,297,943 ns/op  2,191,611 B/op    30,403 allocs/op
      BenchmarkStoreApplyCapacityOnly             818,722 ns/op  1,912,016 B/op    15,800 allocs/op
      BenchmarkStoreApplyBothOverlays           1,376,852 ns/op  2,329,316 B/op    30,803 allocs/op
      BenchmarkStoreApplyAllScenario            7,406,884 ns/op 12,321,060 B/op   162,776 allocs/op
      BenchmarkNodeOverlayControllerScenario       75,832 ns/op     86,400 B/op     1,080 allocs/op

After:

=== AFTER (Selective Copy-on-Write) - Memory Tests ===
      TestMemoryUsage_OverlayScenarios:
      No overlays: Total Allocs: 71509 | Bytes: 12.75 MB | Heap: 3.97 MB | Objects: 27144
      price_overlays_only: Total Allocs: 1129755 | Bytes: 44.64 MB | Heap: 3.98 MB | Objects: 27160
      capacity_overlays_only: Total Allocs: 71501 | Bytes: 64.87 MB | Heap: 3.98 MB | Objects: 27157
      mixed_overlays:  Total Allocs: 6006405 | Bytes: 323.89 MB | Heap: 2.75 MB | Objects: 15904
     
      TestMemoryUsage_ScaleWithNodePools:
      nodepools_1: Total Allocs: 1186975 | Bytes: 375.95 MB | Heap: 3.99 MB | Objects: 27192
      nodepools_5: Total Allocs: 6006364 | Bytes: 634.95 MB | Heap: 3.99 MB | Objects: 27216
      nodepools_10:  Total Allocs: 12012637 | Bytes: 1163.79 MB | Heap: 3.99 MB | Objects: 27245
      nodepools_20:  Total Allocs: 24025139 | Bytes: 2265.00 MB | Heap: 4.00 MB | Objects: 27289
      nodepools_50:  Total Allocs: 60061914 | Bytes: 5410.40 MB | Heap: 4.00 MB | Objects: 27292
     
      TestMemoryUsage_ScaleWithInstanceTypes:
      instances_50: Total Allocs: 2100180 | Bytes: 5501.46 MB | Heap: 2.76 MB | Objects: 16011
      instances_100: Total Allocs: 4200300 | Bytes: 5683.51 MB | Heap: 2.77 MB | Objects: 16035
      instances_200: Total Allocs: 6006410 | Bytes: 5943.84 MB | Heap: 2.76 MB | Objects: 15985
      instances_500: Total Allocs: 6006397 | Bytes: 6204.17 MB | Heap: 2.76 MB | Objects: 16007
    
    === AFTER (Selective Copy-on-Write) - Benchmarks ===
 
     BenchmarkStoreApply/selective-copy-apply 	   	273716 ns/op	  299578 B/op	    6723 allocs/op
     BenchmarkStoreApplyNoOverlay                5619 ns/op	       0 B/op	       0 allocs/op
     BenchmarkStoreApplyPriceOnly	       593105 ns/op	  493055 B/op	   16400 allocs/op
     BenchmarkStoreApplyCapacityOnly	  	102446 ns/op	  294401 B/op	    1000 allocs/op
     BenchmarkStoreApplyBothOverlays            	    693806 ns/op	  768427 B/op	   17201 allocs/op
     BenchmarkStoreApplyAllScenario            	       3802529 ns/op	 4127444 B/op	   91365 allocs/op
     BenchmarkNodeOverlayControllerScenario  	 	77965 ns/op	   86400 B/op	    1080 allocs/op

Testing

make test-memory    # Run memory tests
make benchmark      # Run benchmarks

All tests pass. Comprehensive memory and benchmark tests added.

Files

  • pkg/controllers/nodeoverlay/store.go - Implementation
  • pkg/controllers/nodeoverlay/store_memory_test.go - Memory tests
  • pkg/controllers/nodeoverlay/store_bench_test.go - Benchmarks
  • Makefile - New test targets

Reviewers

Key areas:

  1. Field sharing logic in apply() method
  2. Selective copying in applyPriceOverlays()
  3. Memory test validation

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 1, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: GnatorX
Once this PR has been reviewed and has the lgtm label, please assign maciekpytel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 1, 2025
@GnatorX GnatorX changed the title add fix to node overlay memory Improve Karpenter memory when using Node Overlay with copy-on-write Dec 1, 2025
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 1, 2025
@GnatorX GnatorX changed the title Improve Karpenter memory when using Node Overlay with copy-on-write Fix: Improve Karpenter memory when using Node Overlay with copy-on-write Dec 1, 2025
@GnatorX GnatorX changed the title Fix: Improve Karpenter memory when using Node Overlay with copy-on-write fix: Improve Karpenter memory when using Node Overlay with copy-on-write Dec 1, 2025
@coveralls
Copy link

coveralls commented Dec 2, 2025

Pull Request Test Coverage Report for Build 20491526894

Details

  • 32 of 32 (100.0%) changed or added relevant lines in 1 file are covered.
  • 75 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.4%) to 80.058%

Files with Coverage Reduction New Missed Lines %
pkg/controllers/static/provisioning/controller.go 2 60.0%
pkg/controllers/provisioning/scheduling/preferences.go 7 88.76%
pkg/cloudprovider/zz_generated.deepcopy.go 30 0.0%
pkg/cloudprovider/types.go 36 82.01%
Totals Coverage Status
Change from base Build 20489292152: -0.4%
Covered Lines: 11963
Relevant Lines: 14943

💛 - Coveralls

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 3, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 3, 2025
@GnatorX GnatorX marked this pull request as ready for review December 3, 2025 00:28
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 3, 2025
@grosser
Copy link

grosser commented Dec 5, 2025

ran this and can confirm it works 🤞
image

@GnatorX
Copy link
Author

GnatorX commented Dec 5, 2025

Thanks for testing too!

@GnatorX
Copy link
Author

GnatorX commented Dec 8, 2025

A comment that came from working group meeting, @jmdeal, was that we want to ensure we won't affect the instance type cache of any provider. This code shouldn't because we are still making a copy (similar to the original implementation) but instead of a deep copy, we are doing a shallow copy. Because of this, we aren't changing any of the underlying objects directly but rather the copy of it.

The main difference here is that rather than recreating all parts of the object, we are copying the reference to parts of the object and only change the parts that have been change. This should not affect the underlying object in anyway.

Copy link
Member

@jmdeal jmdeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have too much feedback, the logic looks good to me and I appreciate the data you've included.

@GnatorX
Copy link
Author

GnatorX commented Dec 24, 2025

Updated based on comments. Let me know what you think.

@GnatorX
Copy link
Author

GnatorX commented Jan 9, 2026

Wanted to give a nudge @jmdeal and @engedaam to see if there is anything else we wanted to updated with the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Karpenter node overlay causing increased in memory usage

7 participants