Skip to content

Add metaldata#941

Open
maxmoehl wants to merge 1 commit into
mainfrom
maxmoehl/metaldata
Open

Add metaldata#941
maxmoehl wants to merge 1 commit into
mainfrom
maxmoehl/metaldata

Conversation

@maxmoehl

@maxmoehl maxmoehl commented Jun 10, 2026

Copy link
Copy Markdown
Member

Resolves: #932

Summary by CodeRabbit

  • New Features

    • Added metaldata service component that serves server metadata and user-data through REST endpoints
    • Added userDataRef field to ServerClaim allowing users to reference secrets containing user-data for claimed servers
  • Infrastructure

    • Configured metaldata component for deployment with appropriate RBAC permissions
    • Updated Docker build pipeline to include metaldata image
  • Documentation

    • Updated API reference documentation

Resolves: #932

Signed-off-by: Maximilian Moehl <maximilian@moehl.eu>
@maxmoehl maxmoehl requested a review from a team as a code owner June 10, 2026 09:20
@github-actions github-actions Bot added size/XXL api-change documentation Improvements or additions to documentation labels Jun 10, 2026
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces a complete metadata service (metaldata) that exposes bare-metal server metadata and user-data via a Kubernetes-native HTTP API, as specified in IEP-18. The implementation spans API extensions, an in-memory IP-indexed server registry, HTTP handlers with metadata-flavor authentication, Kubernetes deployment manifests, and build infrastructure.

Changes

Metaldata Service Implementation

Layer / File(s) Summary
API Extensions for User-Data References
api/v1alpha1/constants.go, api/v1alpha1/serverclaim_types.go, api/v1alpha1/server_types.go, api/v1alpha1/applyconfiguration/api/v1alpha1/serverclaimspec.go, api/v1alpha1/applyconfiguration/internal/internal.go, api/v1alpha1/zz_generated.deepcopy.go, config/crd/bases/metal.ironcore.dev_serverclaims.yaml, docs/api-reference/api.md
ServerClaim now includes optional UserDataRef field for secret references; new metadata constants (MetadataKeyPrefix, MetadataFlavorHeader, MetadataFlavorValue, SecretTypeUserData) standardize metadata key/header conventions; label constants refactored to use prefix constant.
IP-to-Metadata Index and Synchronization
internal/metaldata/index.go, internal/metaldata/index_test.go
Index maps server IPs to enriched ServerEntry objects by subscribing to Server cache events; merges static metadata with server annotations/labels (prefixed with MetadataKeyPrefix), tracks server claims, and maintains per-server IP lists for efficient removal.
HTTP Metadata Server and Endpoints
internal/metaldata/http.go, internal/metaldata/http_test.go
HTTP server exposes /v1/, /v1/{field}, /v1/user-data, /v1/user-data/{key} endpoints; enforces Metadata-Flavor: IronCore Metal header; resolves client IP to server metadata and fetches user-data from referenced Kubernetes secrets with type validation.
CLI Entry Point and Manager Wiring
cmd/metaldata/main.go
Main() parses static metadata flags, initializes controller-runtime manager with health/readiness checks, registers Server informer and attaches index event handler, wires HTTP server, and starts manager with proper exit codes.
Kubernetes Manifests and RBAC
config/metaldata/service_account.yaml, config/metaldata/role.yaml, config/metaldata/role_binding.yaml, config/metaldata/service.yaml, config/metaldata/daemonset.yaml, config/metaldata/kustomization.yaml, config/default/kustomization.yaml
Service account and RBAC bindings; LoadBalancer service; DaemonSet with security context, liveness/readiness probes, and resource limits; integrated into default kustomization.
Docker Build and Make Targets
Dockerfile, Makefile
New metaldata-builder Docker stage compiles binary with cache mounts; distroless runtime stage; Makefile defines METALDATA_IMG variable and adds docker-build-metaldata and corresponding push targets.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Flavor as Metadata-Flavor<br/>Middleware
  participant Handler as v1 Handler
  participant Index
  participant Manager as Manager<br/>(cache)
  participant APIReader as API Reader
  Client->>Flavor: GET /v1/rack + Header
  Flavor->>Flavor: validate Metadata-Flavor: IronCore Metal
  alt Header missing/invalid
    Flavor->>Client: 403 Forbidden
  else Valid
    Flavor->>Handler: proceed
    Handler->>Handler: parse client IP
    Handler->>Index: Lookup(IP)
    Index-->>Handler: ServerEntry (name, metadata, claimRef)
    Handler->>Handler: build response metadata
    alt user-data requested
      Handler->>Manager: Get ServerClaim by claimRef
      Manager-->>Handler: ServerClaim or NotFound
      alt claim has UserDataRef
        Handler->>APIReader: Get Secret by UserDataRef
        APIReader-->>Handler: Secret or NotFound
        alt Secret type == user-data
          Handler->>Handler: convert bytes to strings
          Handler->>Client: 200 JSON (metadata + user-data)
        else Wrong type
          Handler->>Client: 404 Not Found
        end
      else No UserDataRef
        Handler->>Client: 404 Not Found
      end
    else Metadata-only request
      Handler->>Client: 200 JSON (metadata)
    end
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

enhancement, api-change, size/XXL, area/metal-automation, documentation

Suggested reviewers

  • afritzler
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The description only contains a link reference without detailed explanation of the changes, missing key information about objectives, implementation details, and testing. Expand the description to explain the metaldata service purpose, key implementation details (HTTP endpoints, indexing, user-data support), and how the implementation satisfies IEP-18 requirements.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title "Add metaldata" is concise and directly reflects the main change, which is the introduction of a new metaldata HTTP service component with supporting infrastructure.
Linked Issues check ✅ Passed The PR successfully implements a metadata service for bare-metal servers per IEP-18, including HTTP server with metadata endpoints, index-based IP lookup, user-data secret references, CRD extensions, Kubernetes deployment manifests, and comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes directly support the metaldata service implementation: API enhancements for user-data secrets, build system updates for Docker images, Kubernetes deployment infrastructure, and HTTP server implementation with tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch maxmoehl/metaldata

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (8)
internal/metaldata/index_test.go (1)

116-176: ⚡ Quick win

Add a regression test for duplicate IP ownership transitions.

Please add a case where Server A and Server B share an IP, then A is deleted/updated; lookup should still return B. This prevents silent regressions in conflict handling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metaldata/index_test.go` around lines 116 - 176, Add a regression
test that verifies duplicate IP ownership transitions: create Server A and
Server B that both include the same IP (use newServer/addServer to add both),
then delete or update Server A (use deleteServer or updateServer) and assert
that idx.Lookup(theSharedIP) still returns an entry and that entry corresponds
to Server B; use idx.Lookup, addServer, deleteServer, updateServer, and
newServer to locate the test scaffolding and ensure the shared-IP case is
exercised for both delete and update transitions so the index does not drop B's
ownership when A is removed.
internal/metaldata/http.go (1)

1-280: Reminder: Run make lint-fix and make test after editing.

As per coding guidelines, Go source files should be validated with make lint-fix and make test after changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metaldata/http.go` around lines 1 - 280, The review is a reminder to
validate changes with linters and tests; run make lint-fix and make test and
address any issues they report for the modified symbols in this file (e.g.
Server.Start, NewServer, accessLogger, metadataFlavorMiddleware, and the
v1Handler methods like
serveFullMetadata/serveField/serveUserData/serveUserDataKey) until both lint and
test pass. Ensure any reported formatting/go vet/staticcheck errors are fixed,
unresolved nil/err handling or unused imports are corrected, and then re-run
make lint-fix and make test to confirm the file is clean.

Source: Coding guidelines

internal/metaldata/http_test.go (1)

1-294: Reminder: Run make test after editing.

As per coding guidelines, Go source files should be validated with make test after changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metaldata/http_test.go` around lines 1 - 294, The test suite must be
run and any failing tests fixed before merging: run `make test` and address
failures related to the HTTP metadata tests by inspecting and correcting the
test helpers and test logic (functions getMetadata, decodeJSON, readBody, claim,
userDataSecret and the Describe blocks for metadata-flavor, /v1/, /v1/{field},
/v1/user-data and /v1/user-data/{key}) so that all expectations pass; iterate
until `make test` completes successfully and then commit the green test results.

Source: Coding guidelines

cmd/metaldata/main.go (2)

1-154: Reminder: Run make lint-fix and make test after editing.

As per coding guidelines, Go source files should be validated with make lint-fix and make test after changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/metaldata/main.go` around lines 1 - 154, The PR needs linting and tests
run; run "make lint-fix" and "make test", fix any issues tools report
(formatting/gofmt, goimports, vet/lint warnings, and failing tests) and commit
the fixes; specifically run these against the edited symbols Main, main,
stringMapFlag.Set and stringMapFlag.String (and any changes in
cmd/metaldata/main.go) and address whatever the linters/tests point to (e.g.,
unused imports, formatting, error handling or test failures) until both commands
succeed.

Source: Coding guidelines


87-93: ⚡ Quick win

Make development mode configurable for production deployments.

The Development: true setting is hardcoded in the zap logger options. Development mode uses console encoding and adds stack traces, which are typically undesirable in production environments where structured JSON logging is preferred for log aggregation systems.

Consider adding a flag to control this behavior.

♻️ Proposed change to add development mode flag
 func Main() int {
 	var (
 		listenPort         int
 		metricsBindAddress string
 		healthBindAddress  string
 		staticMetadata     stringMapFlag
+		development        bool
 	)

 	flag.IntVar(&listenPort, "metaldata-port", 10002, "The port the metadata HTTP server listens on.")
 	flag.StringVar(&metricsBindAddress, "metrics-bind-address", ":8080",
 		"The address the metrics endpoint binds to. Use \"0\" to disable.")
 	flag.StringVar(&healthBindAddress, "health-probe-bind-address", ":8081",
 		"The address the health probe endpoint binds to. Use \"0\" to disable.")
 	flag.Var(&staticMetadata, "static-metadata",
 		"Static metadata key=value pair exposed to all servers. May be repeated.")
+	flag.BoolVar(&development, "development", false,
+		"Enable development mode for the logger (console encoding, stack traces).")

 	opts := zap.Options{
-		Development: true,
+		Development: development,
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/metaldata/main.go` around lines 87 - 93, The zap.Options currently
hardcodes Development: true; make this configurable by adding a boolean CLI flag
(e.g., dev or development) and use its value to set opts.Development after
flag.Parse. Locate the zap.Options struct named opts and the BindFlags call on
flag.CommandLine, add a flag definition before flag.Parse, then set
opts.Development = <flagValue> (or bind the flag into opts via BindFlags) and
keep the existing ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts))) so
production can default to false and dev can be enabled via the flag.
config/metaldata/daemonset.yaml (1)

54-65: ⚡ Quick win

Consider hardening the container security context.

The container security context correctly drops all capabilities and disables privilege escalation, but it is missing readOnlyRootFilesystem: true. Since metaldata is a stateless HTTP service that reads from the Kubernetes API and serves responses from an in-memory index, it does not require a writable filesystem.

🛡️ Recommended security hardening
         securityContext:
           allowPrivilegeEscalation: false
+          readOnlyRootFilesystem: true
           capabilities:
             drop:
             - "ALL"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/metaldata/daemonset.yaml` around lines 54 - 65, Add
readOnlyRootFilesystem: true to the container's securityContext in the DaemonSet
spec so the container runs with a read-only root filesystem; update the existing
securityContext block that contains allowPrivilegeEscalation and
capabilities.drop (the same block shown) to include readOnlyRootFilesystem:
true, ensuring it sits alongside allowPrivilegeEscalation and capabilities to
harden the container for this stateless metaldata HTTP service.
internal/metaldata/metaldata_suite_test.go (2)

71-75: 💤 Low value

Consider improving error propagation for faster failure detection.

If srv.Start(ctx) fails immediately (e.g., port binding error), the Eventually block will time out before the goroutine's Expect failure is reported, making debugging harder.

💡 Optional pattern for faster failure detection
+startedCh := make(chan struct{})
+errCh := make(chan error, 1)
 go func() {
 	defer GinkgoRecover()
-	Expect(srv.Start(ctx)).To(Succeed(), "failed to start metaldata server")
+	close(startedCh) // Signal that goroutine started
+	if err := srv.Start(ctx); err != nil {
+		errCh <- err
+	}
 }()

-Eventually(func() error {
-	_, err := http.Get(testServerURL)
-	return err
-}).Should(Succeed())
+Eventually(func() error {
+	select {
+	case err := <-errCh:
+		return fmt.Errorf("server failed to start: %w", err)
+	default:
+	}
+	_, err := http.Get(testServerURL)
+	return err
+}).Should(Succeed())

This allows early detection of startup failures rather than waiting for the full timeout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metaldata/metaldata_suite_test.go` around lines 71 - 75, The
goroutine-start pattern around srv.Start(ctx) can hide immediate startup errors
(like bind failures) until Eventually times out; modify the test to surface
startup errors quickly by running srv.Start(ctx) inside the goroutine but
sending its returned error (or nil) on a channel (e.g., errCh) and then
immediately select on that channel with a short timeout before proceeding to the
Eventually checks; update references to srv.Start, the anonymous goroutine, and
the Expect call to read from errCh and call Expect(errChVal).To(Succeed()) so
startup failures are asserted fast.

77-80: ⚡ Quick win

Consider adding an explicit timeout and polling a valid endpoint.

The Eventually block lacks an explicit timeout and polls the root path / which returns 404 (handlers are at /v1/...). While this works for checking server availability (connection success regardless of status code), it could be more explicit.

🔧 Suggested improvements
 Eventually(func() error {
-	_, err := http.Get(testServerURL)
+	resp, err := http.Get(testServerURL + "/v1/")
+	if err != nil {
+		return err
+	}
+	resp.Body.Close()
+	if resp.StatusCode >= 500 {
+		return fmt.Errorf("server error: %d", resp.StatusCode)
+	}
 	return err
-}).Should(Succeed())
+}, "10s", "100ms").Should(Succeed())

Changes:

  • Poll /v1/ instead of / to verify actual handler registration
  • Add explicit 10-second timeout and 100ms polling interval
  • Check for 5xx errors (server not ready) vs 4xx (expected for empty request)
  • Close response body to avoid resource leak
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metaldata/metaldata_suite_test.go` around lines 77 - 80, The
Eventually HTTP probe should target a valid handler and include timeout/polling
and response-body cleanup: change the probe to call http.Get on testServerURL +
"/v1/" (or the appropriate registered path), add an explicit timeout of 10s and
a 100ms polling interval to the Eventually call, inspect the returned
*http.Response to close resp.Body to avoid leaks and treat 5xx as retryable
errors while considering 4xx (e.g., 404/400) as acceptable success for handler
presence; update the block around Eventually, http.Get and response handling
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metaldata/http.go`:
- Around line 79-82: The http.Server instantiation for srv currently omits
timeout settings, leaving the service vulnerable to slow-client/resource
exhaustion; update the srv: &http.Server{ ... } block (where Addr, Handler are
set and accessLogger/mux are used) to include sensible ReadHeaderTimeout,
WriteTimeout, and IdleTimeout values (e.g., seconds-level durations) to cap how
long headers, responses, and idle connections can persist; ensure durations are
set via time.Second or time.Duration constants and choose values appropriate for
metadata request latency.

In `@internal/metaldata/index.go`:
- Around line 145-147: When removing addresses from the hosts map, only delete
an entry if the current owner still matches the server being removed to avoid
wiping a recently reassigned IP; for each addr in addrs, look up idx.hosts[addr]
and compare it to the identifier of the server being removed (the same ID used
when that server registered), and call delete(idx.hosts, addr) only when they
are equal. Ensure you reference the existing symbols idx.hosts and addrs and use
the server ID variable used in this function (e.g., the removal target’s ID) for
the comparison so concurrent reassignments aren’t lost.

In `@internal/metaldata/metaldata_suite_test.go`:
- Around line 24-25: Tests use a hardcoded port (30003) causing flakiness;
modify the suite to request port 0 from metaldata.NewServer so the OS assigns a
free port, start the server (via srv.Start or similar), then expose and read the
actual bound port from the server (add a method like Server.Port() or similar on
the metaldata.Server type that returns the bound port after Start/Listen begins)
and build testServerURL dynamically with fmt.Sprintf("http://127.0.0.1:%d",
srv.Port()); ensure Start is launched in a goroutine with GinkgoRecover and the
test waits for the server to be ready before using testServerURL.

---

Nitpick comments:
In `@cmd/metaldata/main.go`:
- Around line 1-154: The PR needs linting and tests run; run "make lint-fix" and
"make test", fix any issues tools report (formatting/gofmt, goimports, vet/lint
warnings, and failing tests) and commit the fixes; specifically run these
against the edited symbols Main, main, stringMapFlag.Set and
stringMapFlag.String (and any changes in cmd/metaldata/main.go) and address
whatever the linters/tests point to (e.g., unused imports, formatting, error
handling or test failures) until both commands succeed.
- Around line 87-93: The zap.Options currently hardcodes Development: true; make
this configurable by adding a boolean CLI flag (e.g., dev or development) and
use its value to set opts.Development after flag.Parse. Locate the zap.Options
struct named opts and the BindFlags call on flag.CommandLine, add a flag
definition before flag.Parse, then set opts.Development = <flagValue> (or bind
the flag into opts via BindFlags) and keep the existing
ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts))) so production can default to
false and dev can be enabled via the flag.

In `@config/metaldata/daemonset.yaml`:
- Around line 54-65: Add readOnlyRootFilesystem: true to the container's
securityContext in the DaemonSet spec so the container runs with a read-only
root filesystem; update the existing securityContext block that contains
allowPrivilegeEscalation and capabilities.drop (the same block shown) to include
readOnlyRootFilesystem: true, ensuring it sits alongside
allowPrivilegeEscalation and capabilities to harden the container for this
stateless metaldata HTTP service.

In `@internal/metaldata/http_test.go`:
- Around line 1-294: The test suite must be run and any failing tests fixed
before merging: run `make test` and address failures related to the HTTP
metadata tests by inspecting and correcting the test helpers and test logic
(functions getMetadata, decodeJSON, readBody, claim, userDataSecret and the
Describe blocks for metadata-flavor, /v1/, /v1/{field}, /v1/user-data and
/v1/user-data/{key}) so that all expectations pass; iterate until `make test`
completes successfully and then commit the green test results.

In `@internal/metaldata/http.go`:
- Around line 1-280: The review is a reminder to validate changes with linters
and tests; run make lint-fix and make test and address any issues they report
for the modified symbols in this file (e.g. Server.Start, NewServer,
accessLogger, metadataFlavorMiddleware, and the v1Handler methods like
serveFullMetadata/serveField/serveUserData/serveUserDataKey) until both lint and
test pass. Ensure any reported formatting/go vet/staticcheck errors are fixed,
unresolved nil/err handling or unused imports are corrected, and then re-run
make lint-fix and make test to confirm the file is clean.

In `@internal/metaldata/index_test.go`:
- Around line 116-176: Add a regression test that verifies duplicate IP
ownership transitions: create Server A and Server B that both include the same
IP (use newServer/addServer to add both), then delete or update Server A (use
deleteServer or updateServer) and assert that idx.Lookup(theSharedIP) still
returns an entry and that entry corresponds to Server B; use idx.Lookup,
addServer, deleteServer, updateServer, and newServer to locate the test
scaffolding and ensure the shared-IP case is exercised for both delete and
update transitions so the index does not drop B's ownership when A is removed.

In `@internal/metaldata/metaldata_suite_test.go`:
- Around line 71-75: The goroutine-start pattern around srv.Start(ctx) can hide
immediate startup errors (like bind failures) until Eventually times out; modify
the test to surface startup errors quickly by running srv.Start(ctx) inside the
goroutine but sending its returned error (or nil) on a channel (e.g., errCh) and
then immediately select on that channel with a short timeout before proceeding
to the Eventually checks; update references to srv.Start, the anonymous
goroutine, and the Expect call to read from errCh and call
Expect(errChVal).To(Succeed()) so startup failures are asserted fast.
- Around line 77-80: The Eventually HTTP probe should target a valid handler and
include timeout/polling and response-body cleanup: change the probe to call
http.Get on testServerURL + "/v1/" (or the appropriate registered path), add an
explicit timeout of 10s and a 100ms polling interval to the Eventually call,
inspect the returned *http.Response to close resp.Body to avoid leaks and treat
5xx as retryable errors while considering 4xx (e.g., 404/400) as acceptable
success for handler presence; update the block around Eventually, http.Get and
response handling accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd93f171-d195-480e-808f-1ea272d95020

📥 Commits

Reviewing files that changed from the base of the PR and between 80ddd83 and a3ada03.

⛔ Files ignored due to path filters (1)
  • dist/chart/templates/crd/metal.ironcore.dev_serverclaims.yaml is excluded by !**/dist/**
📒 Files selected for processing (23)
  • Dockerfile
  • Makefile
  • api/v1alpha1/applyconfiguration/api/v1alpha1/serverclaimspec.go
  • api/v1alpha1/applyconfiguration/internal/internal.go
  • api/v1alpha1/constants.go
  • api/v1alpha1/server_types.go
  • api/v1alpha1/serverclaim_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • cmd/metaldata/main.go
  • config/crd/bases/metal.ironcore.dev_serverclaims.yaml
  • config/default/kustomization.yaml
  • config/metaldata/daemonset.yaml
  • config/metaldata/kustomization.yaml
  • config/metaldata/role.yaml
  • config/metaldata/role_binding.yaml
  • config/metaldata/service.yaml
  • config/metaldata/service_account.yaml
  • docs/api-reference/api.md
  • internal/metaldata/http.go
  • internal/metaldata/http_test.go
  • internal/metaldata/index.go
  • internal/metaldata/index_test.go
  • internal/metaldata/metaldata_suite_test.go

Comment on lines +79 to +82
srv: &http.Server{
Addr: net.JoinHostPort("", strconv.Itoa(port)),
Handler: accessLogger(log.WithName("access"), mux),
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Configure server timeouts to prevent resource exhaustion.

The http.Server lacks timeout configurations (ReadHeaderTimeout, WriteTimeout, IdleTimeout). Without ReadHeaderTimeout, the server is vulnerable to Slowloris-style attacks where clients hold connections open by slowly sending headers. For a service handling metadata requests from potentially many bare-metal servers, these timeouts are essential for stability.

🛡️ Recommended timeout configuration
 	return &Server{
 		srv: &http.Server{
 			Addr:    net.JoinHostPort("", strconv.Itoa(port)),
 			Handler: accessLogger(log.WithName("access"), mux),
+			ReadHeaderTimeout: 10 * time.Second,
+			WriteTimeout:      30 * time.Second,
+			IdleTimeout:       90 * time.Second,
 		},
 		log: log,
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
srv: &http.Server{
Addr: net.JoinHostPort("", strconv.Itoa(port)),
Handler: accessLogger(log.WithName("access"), mux),
},
srv: &http.Server{
Addr: net.JoinHostPort("", strconv.Itoa(port)),
Handler: accessLogger(log.WithName("access"), mux),
ReadHeaderTimeout: 10 * time.Second,
WriteTimeout: 30 * time.Second,
IdleTimeout: 90 * time.Second,
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metaldata/http.go` around lines 79 - 82, The http.Server
instantiation for srv currently omits timeout settings, leaving the service
vulnerable to slow-client/resource exhaustion; update the srv: &http.Server{ ...
} block (where Addr, Handler are set and accessLogger/mux are used) to include
sensible ReadHeaderTimeout, WriteTimeout, and IdleTimeout values (e.g.,
seconds-level durations) to cap how long headers, responses, and idle
connections can persist; ensure durations are set via time.Second or
time.Duration constants and choose values appropriate for metadata request
latency.

Comment on lines +145 to +147
for _, addr := range addrs {
delete(idx.hosts, addr)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard IP removal by current owner to avoid deleting another Server’s mapping.

Line 145-147 deletes hosts[addr] unconditionally. After Line 127-132 overwrites an IP to a different server, deleting the old server will remove the new server’s live entry.

Proposed fix
 func (idx *Index) removeLocked(name string) {
 	addrs, ok := idx.serverIPs[name]
 	if !ok {
 		return
 	}
 	for _, addr := range addrs {
-		delete(idx.hosts, addr)
+		if entry, exists := idx.hosts[addr]; exists && entry.Name == name {
+			delete(idx.hosts, addr)
+		}
 	}
 	delete(idx.serverIPs, name)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metaldata/index.go` around lines 145 - 147, When removing addresses
from the hosts map, only delete an entry if the current owner still matches the
server being removed to avoid wiping a recently reassigned IP; for each addr in
addrs, look up idx.hosts[addr] and compare it to the identifier of the server
being removed (the same ID used when that server registered), and call
delete(idx.hosts, addr) only when they are equal. Ensure you reference the
existing symbols idx.hosts and addrs and use the server ID variable used in this
function (e.g., the removal target’s ID) for the comparison so concurrent
reassignments aren’t lost.

Comment on lines +24 to +25
testServerURL = "http://127.0.0.1:30003"
testListenPort = 30003

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use dynamic port allocation instead of hardcoded port 30003.

Hardcoded ports cause test flakiness from port conflicts and prevent parallel test execution. If port 30003 is already in use (by another test suite, service, or process), the suite will fail intermittently.

🔧 Recommended approach using OS-assigned port
  1. Pass 0 as the port to metaldata.NewServer to let the OS assign a free port
  2. After starting the server, retrieve the actual port from the listener
  3. Construct testServerURL dynamically

Example pattern:

// Let OS assign a free port
srv := metaldata.NewServer(GinkgoLogr, idx, reader, 0)
go func() {
    defer GinkgoRecover()
    Expect(srv.Start(ctx)).To(Succeed())
}()

// Get actual assigned port (requires exposing listener or port from Server)
actualPort := srv.Port() // Would need to add this method to Server
testServerURL := fmt.Sprintf("http://127.0.0.1:%d", actualPort)

Note: This requires the metaldata.Server type to expose its assigned port. If that's not currently available, consider adding a Port() int method that returns the actual bound port after ListenAndServe starts.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metaldata/metaldata_suite_test.go` around lines 24 - 25, Tests use a
hardcoded port (30003) causing flakiness; modify the suite to request port 0
from metaldata.NewServer so the OS assigns a free port, start the server (via
srv.Start or similar), then expose and read the actual bound port from the
server (add a method like Server.Port() or similar on the metaldata.Server type
that returns the bound port after Start/Listen begins) and build testServerURL
dynamically with fmt.Sprintf("http://127.0.0.1:%d", srv.Port()); ensure Start is
launched in a goroutine with GinkgoRecover and the test waits for the server to
be ready before using testServerURL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change documentation Improvements or additions to documentation size/XXL

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Add Metadata Support (IEP-18)

1 participant