Skip to content

metrics-proxy: add osVersion to the alive metric via a config-driven host-dimension mapping#37254

Merged
onurkaracali merged 3 commits into
masterfrom
onur/osversion-alive-metric
Jun 19, 2026
Merged

metrics-proxy: add osVersion to the alive metric via a config-driven host-dimension mapping#37254
onurkaracali merged 3 commits into
masterfrom
onur/osversion-alive-metric

Conversation

@onurkaracali

@onurkaracali onurkaracali commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Why

A tenant (Indeed) needs the host OS version on the alive (host_life) metric to correlate host liveness with the OS each host runs on.

osVersion is only known on host-admin (Vespa Cloud), but the host_life packet is generated locally inside metrics-proxy and carries only vespaVersion. Rather than a one-off osVersion hack, this adds a generic way to attach host-admin-provided dimensions to specific metrics, keeping cardinality off everything else.

What

A config-driven mapping deciding which host-level dimensions (provided by host-admin) are allowed on which metric:

  • Harvest — host-admin pushes host dimensions (host, parentHostname, now osVersion) on its packets via the setExtraMetrics RPC; metrics-proxy harvests them into a dedicated store, separate from the global {role, state} dimensions.
  • Apply — the locally-built alive/host_life packet gets the host dimensions allowed for its service.
  • Strip — host dimensions not allowed for a packet's service are removed from the carrier packets (vespa.node, vespa.sidecar, vespa.routing).

The allow-list is a config-def (metric-dimension-mapping.def) generated by config-model: default = {host, parentHostname}, host_life = {host, parentHostname, osVersion}.

Net effect (hosted): osVersion appears only on alive; host/parentHostname added to alive and kept where already native; role/state and internal Vespa metrics unchanged. Self-hosted emits an empty mapping → full no-op. Adding a host dimension to another metric later is now a config-model change, not source.

Commits

  1. add osVersion (and host/parentHostname) to the alive metric — harvest/apply/strip (mapping initially hardcoded)
  2. make the metric-to-dimension mapping config-driven — moves it to the config-def + MetricDimensionMapping component

Cross-repo / deploy order

host-admin must emit osVersion — companion cloud change: vespaai/cloud#3154. Deploy metrics-proxy BEFORE host-admin, else osVersion briefly appears on carriers until the strip is live.

Tests

ExternalMetricsTest, NodeMetricGathererTest, MetricsManagerTest (within-proxy end-to-end), MetricDimensionMappingTest, config-model MetricsProxyContainerClusterTest. metrics-proxy 155 green, config-model metricsproxy 23 green.

🤖 Generated with Claude Code

onurkaracali and others added 2 commits June 19, 2026 01:20
…etric

Operators (e.g. Indeed) need the host OS version on the `alive`
(host_life) metric to correlate host liveness with the OS each host
runs. osVersion is only known on host-admin, and the host_life packet is
generated locally inside metrics-proxy carrying only `vespaVersion`.

This makes metrics-proxy able to receive host-level dimensions from
host-admin and attach them to specific metrics (currently `alive` only),
while keeping them off all other metrics.

Mechanism:
- Harvest: host-admin pushes host-level dimensions (host, parentHostname,
  and now osVersion) on its metric packets via the setExtraMetrics RPC.
  ExternalMetrics.extractHostDimensions() harvests these into a dedicated
  `extraHostDimensions` store on MetricsManager, kept separate from the
  global {role,state} extraDimensions so the existing global spray and
  the getMetricsAsBuilders hot path are untouched.
- Apply: NodeMetricGatherer adds the harvested host dimensions to the
  host_life packet, filtered by a serviceId -> allowed-dimensions
  mapping (host_life -> {host, parentHostname, osVersion}).
- Strip: ExternalMetrics.setExtraMetrics removes host dimensions not
  allowed for a packet's service. The default allowed set is
  {host, parentHostname}, so the host-admin carrier packets (vespa.node,
  vespa.sidecar, vespa.routing) keep host/parentHostname but lose
  osVersion -- osVersion therefore ends up exclusively on `alive`.

Semantics: osVersion is "exclusive" (added to alive, stripped from
carriers); host/parentHostname are additive (added to alive, kept where
they already are natively). role/state are unchanged and remain global.

Supporting changes:
- MetricsPacket.Builder.getServiceId(), used to look up the mapping per
  packet.
- OS_VERSION and PARENT_HOSTNAME name constants in PublicDimensions.

The serviceId -> dimensions mapping is currently hardcoded in
ExternalMetrics; a follow-up will move it to a config-def generated by
config-model (the generic, configurable form).

Cross-repo: host-admin must emit osVersion (separate cloud change).
Deploy metrics-proxy BEFORE host-admin, otherwise osVersion briefly
appears on the carrier metrics until the strip is live.

Tests: ExternalMetricsTest (harvest + strip), NodeMetricGathererTest
(alive apply), MetricsManagerTest (incl. a within-proxy end-to-end test
asserting alive gets osVersion while the carrier is stripped). Full
metrics-proxy module green (154 tests).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The previous commit hardcoded the serviceId -> allowed-dimensions
mapping inside ExternalMetrics. This makes it generic and config-driven:
which host-admin dimensions land on which metric is now decided by
config-model, not by metrics-proxy source.

Config:
- New config-def metric-dimension-mapping.def:
    defaultDimension[]   -- host dimensions allowed on metrics whose
                            service is not listed below
    mapping[].service    -- per-service override
    mapping[].dimension[]
- New runtime component MetricDimensionMapping wraps the config and
  answers allowedFor(serviceId) (explicit mapping, else default) and
  managedDimensions() (the union the mapping governs; only these are
  harvested and stripped).

Wiring:
- ExternalMetrics now takes a MetricDimensionMapping instead of the
  hardcoded statics; allowed/harvest/strip read from it.
  MetricsManager.getExtraHostDimensions filters by the same mapping.
- config-model's MetricsProxyContainerCluster implements
  MetricDimensionMappingConfig.Producer and registers the component.
  In hosted zones it emits:
    default   = {host, parentHostname}
    host_life = {host, parentHostname, osVersion}
  Self-hosted gets an empty mapping -> harvest and strip are a full
  no-op, so self-hosted behaviour is unchanged.

This is the follow-up promised in the previous commit; runtime behaviour
(osVersion exclusively on alive, host/parentHostname additive) is
identical -- only the source of the mapping changed from code to config.

Tests: new MetricDimensionMappingTest (allowedFor default/explicit,
managedDimensions union); config-model MetricsProxyContainerClusterTest
(hosted emits the mapping, self-hosted empty); existing
ExternalMetrics/MetricsManager/NodeMetricGatherer tests migrated to the
config-built mapping via TestUtil.standardDimensionMapping(). metrics-proxy
module green (155), config-model metricsproxy green (23).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Set<DimensionId> allowed = allowedHostDimensions(packet.getServiceId());
Set<DimensionId> managed = dimensionMapping.managedDimensions();
Set<DimensionId> retained = packet.getDimensionIds();
retained.removeIf(id -> managed.contains(id) && ! allowed.contains(id));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mutates retained, which is just the return value of packet.getDimensionIds(). This method shouldn't assume that return value is mutable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair to flag, but I think this is within the method's documented contract rather than an assumption. getDimensionIds()'s javadoc says it

"Returns a modifiable copy of the dimension IDs of this builder, usually for use with retainDimensions(Collection)"

It hands back a fresh LinkedHashSet (not a view over the builder's internal map), intended for exactly this get → mutate → retainDimensions() pattern. So retained.removeIf(...) only mutates that copy, never the builder.

Existing block in PublicDimensionsProcessor does the same thing.

That said, if you'd rather this code not depend on that contract, I can to wrap it in a local new HashSet<>(...)..

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that's fine. Reviewing such code is harder since the code isn't obviously correct. So usually code does the wrapping client-side. It would also have been fine if the method name had been clearer (e.g. getMutableDimensionIds())

.service(HOST_LIFE_SERVICE)
.dimension(PublicDimensions.HOSTNAME)
.dimension(PublicDimensions.PARENT_HOSTNAME)
.dimension(PublicDimensions.OS_VERSION));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to add the OS version dimension to the host_life metric for everyone?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we want it to be a generic feature for everyone. Also it will be useful for us on our dashboards.

@onurkaracali onurkaracali requested a review from hakonhall June 19, 2026 12:28
hakonhall
hakonhall previously approved these changes Jun 19, 2026
Comment thread metrics-proxy/src/main/resources/configdefinitions/metric-dimension-mapping.def Outdated
Address gjoranv's review on metric-dimension-mapping.def:
- Change `mapping[].service` + `mapping[].dimension[]` to a map keyed by
  service name: `service{}.dimension[]`. Models service -> dimensions
  directly, drops the redundant service field, and makes duplicate-service
  entries impossible (the array form silently allowed duplicates).
- Use the generated Consumer<Builder> lambda overload at the three build
  sites (TestUtil.standardDimensionMapping, MetricDimensionMappingTest,
  MetricsProxyContainerCluster.getConfig) instead of building the inner
  struct explicitly.

No behaviour change: MetricDimensionMapping.allowedFor/managedDimensions and
all of ExternalMetrics/MetricsManager are unchanged; only how the config is
expressed and parsed changed.

metrics-proxy 155 green, config-model metricsproxy 23 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gjoranv gjoranv left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@onurkaracali onurkaracali merged commit e488082 into master Jun 19, 2026
3 checks passed
@onurkaracali onurkaracali deleted the onur/osversion-alive-metric branch June 19, 2026 16:16
esolitos added a commit that referenced this pull request Jun 22, 2026
* origin/master: (55 commits)
  Optimize sameElement over multivalue
  Allow configuring ONNX graph optimization for ONNX embedders
  metrics-proxy: add osVersion to the alive metric via a config-driven host-dimension mapping (#37254)
  Allow configuring optimization for ONNX models in services.xml
  Restrict tensorFromLabelsWithOffset to array attributes
  Use float cell type for tensorFromLabelsWithOffset
  Restrict tensorFromLabelsWithOffset to attribute source
  Add estimated flush duration to flush target interface.
  Add tensorFromLabelsWithOffset rank feature
  tweak abs cost of always_strict complex leafs
  Add method for getting only tenants with applications
  Use std::format
  Make ONNX graph optimization configurable per model (default off)
  Add OpenTelemetry API to the in-process container test classpath (#37251)
  Add dot product computation between two quantized vectors
  Expose disk index flush duration in proton state explorer.
  Expose reserved memory in proton state explorer.
  Install telemetry.def so the config server can resolve it (#37246)
  Expose reserved memory metric.
  Revert "Revert "Revert "Revert "Add OpenTelemetry tracing provider to the con…" (#37244)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants