metrics-proxy: add osVersion to the alive metric via a config-driven host-dimension mapping#37254
Conversation
…etric
Operators (e.g. Indeed) need the host OS version on the `alive`
(host_life) metric to correlate host liveness with the OS each host
runs. osVersion is only known on host-admin, and the host_life packet is
generated locally inside metrics-proxy carrying only `vespaVersion`.
This makes metrics-proxy able to receive host-level dimensions from
host-admin and attach them to specific metrics (currently `alive` only),
while keeping them off all other metrics.
Mechanism:
- Harvest: host-admin pushes host-level dimensions (host, parentHostname,
and now osVersion) on its metric packets via the setExtraMetrics RPC.
ExternalMetrics.extractHostDimensions() harvests these into a dedicated
`extraHostDimensions` store on MetricsManager, kept separate from the
global {role,state} extraDimensions so the existing global spray and
the getMetricsAsBuilders hot path are untouched.
- Apply: NodeMetricGatherer adds the harvested host dimensions to the
host_life packet, filtered by a serviceId -> allowed-dimensions
mapping (host_life -> {host, parentHostname, osVersion}).
- Strip: ExternalMetrics.setExtraMetrics removes host dimensions not
allowed for a packet's service. The default allowed set is
{host, parentHostname}, so the host-admin carrier packets (vespa.node,
vespa.sidecar, vespa.routing) keep host/parentHostname but lose
osVersion -- osVersion therefore ends up exclusively on `alive`.
Semantics: osVersion is "exclusive" (added to alive, stripped from
carriers); host/parentHostname are additive (added to alive, kept where
they already are natively). role/state are unchanged and remain global.
Supporting changes:
- MetricsPacket.Builder.getServiceId(), used to look up the mapping per
packet.
- OS_VERSION and PARENT_HOSTNAME name constants in PublicDimensions.
The serviceId -> dimensions mapping is currently hardcoded in
ExternalMetrics; a follow-up will move it to a config-def generated by
config-model (the generic, configurable form).
Cross-repo: host-admin must emit osVersion (separate cloud change).
Deploy metrics-proxy BEFORE host-admin, otherwise osVersion briefly
appears on the carrier metrics until the strip is live.
Tests: ExternalMetricsTest (harvest + strip), NodeMetricGathererTest
(alive apply), MetricsManagerTest (incl. a within-proxy end-to-end test
asserting alive gets osVersion while the carrier is stripped). Full
metrics-proxy module green (154 tests).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The previous commit hardcoded the serviceId -> allowed-dimensions
mapping inside ExternalMetrics. This makes it generic and config-driven:
which host-admin dimensions land on which metric is now decided by
config-model, not by metrics-proxy source.
Config:
- New config-def metric-dimension-mapping.def:
defaultDimension[] -- host dimensions allowed on metrics whose
service is not listed below
mapping[].service -- per-service override
mapping[].dimension[]
- New runtime component MetricDimensionMapping wraps the config and
answers allowedFor(serviceId) (explicit mapping, else default) and
managedDimensions() (the union the mapping governs; only these are
harvested and stripped).
Wiring:
- ExternalMetrics now takes a MetricDimensionMapping instead of the
hardcoded statics; allowed/harvest/strip read from it.
MetricsManager.getExtraHostDimensions filters by the same mapping.
- config-model's MetricsProxyContainerCluster implements
MetricDimensionMappingConfig.Producer and registers the component.
In hosted zones it emits:
default = {host, parentHostname}
host_life = {host, parentHostname, osVersion}
Self-hosted gets an empty mapping -> harvest and strip are a full
no-op, so self-hosted behaviour is unchanged.
This is the follow-up promised in the previous commit; runtime behaviour
(osVersion exclusively on alive, host/parentHostname additive) is
identical -- only the source of the mapping changed from code to config.
Tests: new MetricDimensionMappingTest (allowedFor default/explicit,
managedDimensions union); config-model MetricsProxyContainerClusterTest
(hosted emits the mapping, self-hosted empty); existing
ExternalMetrics/MetricsManager/NodeMetricGatherer tests migrated to the
config-built mapping via TestUtil.standardDimensionMapping(). metrics-proxy
module green (155), config-model metricsproxy green (23).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| Set<DimensionId> allowed = allowedHostDimensions(packet.getServiceId()); | ||
| Set<DimensionId> managed = dimensionMapping.managedDimensions(); | ||
| Set<DimensionId> retained = packet.getDimensionIds(); | ||
| retained.removeIf(id -> managed.contains(id) && ! allowed.contains(id)); |
There was a problem hiding this comment.
This mutates retained, which is just the return value of packet.getDimensionIds(). This method shouldn't assume that return value is mutable.
There was a problem hiding this comment.
Fair to flag, but I think this is within the method's documented contract rather than an assumption. getDimensionIds()'s javadoc says it
"Returns a modifiable copy of the dimension IDs of this builder, usually for use with retainDimensions(Collection)"
It hands back a fresh LinkedHashSet (not a view over the builder's internal map), intended for exactly this get → mutate → retainDimensions() pattern. So retained.removeIf(...) only mutates that copy, never the builder.
Existing block in PublicDimensionsProcessor does the same thing.
That said, if you'd rather this code not depend on that contract, I can to wrap it in a local new HashSet<>(...)..
There was a problem hiding this comment.
No that's fine. Reviewing such code is harder since the code isn't obviously correct. So usually code does the wrapping client-side. It would also have been fine if the method name had been clearer (e.g. getMutableDimensionIds())
| .service(HOST_LIFE_SERVICE) | ||
| .dimension(PublicDimensions.HOSTNAME) | ||
| .dimension(PublicDimensions.PARENT_HOSTNAME) | ||
| .dimension(PublicDimensions.OS_VERSION)); |
There was a problem hiding this comment.
We want to add the OS version dimension to the host_life metric for everyone?
There was a problem hiding this comment.
Yes, we want it to be a generic feature for everyone. Also it will be useful for us on our dashboards.
Address gjoranv's review on metric-dimension-mapping.def:
- Change `mapping[].service` + `mapping[].dimension[]` to a map keyed by
service name: `service{}.dimension[]`. Models service -> dimensions
directly, drops the redundant service field, and makes duplicate-service
entries impossible (the array form silently allowed duplicates).
- Use the generated Consumer<Builder> lambda overload at the three build
sites (TestUtil.standardDimensionMapping, MetricDimensionMappingTest,
MetricsProxyContainerCluster.getConfig) instead of building the inner
struct explicitly.
No behaviour change: MetricDimensionMapping.allowedFor/managedDimensions and
all of ExternalMetrics/MetricsManager are unchanged; only how the config is
expressed and parsed changed.
metrics-proxy 155 green, config-model metricsproxy 23 green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* origin/master: (55 commits) Optimize sameElement over multivalue Allow configuring ONNX graph optimization for ONNX embedders metrics-proxy: add osVersion to the alive metric via a config-driven host-dimension mapping (#37254) Allow configuring optimization for ONNX models in services.xml Restrict tensorFromLabelsWithOffset to array attributes Use float cell type for tensorFromLabelsWithOffset Restrict tensorFromLabelsWithOffset to attribute source Add estimated flush duration to flush target interface. Add tensorFromLabelsWithOffset rank feature tweak abs cost of always_strict complex leafs Add method for getting only tenants with applications Use std::format Make ONNX graph optimization configurable per model (default off) Add OpenTelemetry API to the in-process container test classpath (#37251) Add dot product computation between two quantized vectors Expose disk index flush duration in proton state explorer. Expose reserved memory in proton state explorer. Install telemetry.def so the config server can resolve it (#37246) Expose reserved memory metric. Revert "Revert "Revert "Revert "Add OpenTelemetry tracing provider to the con…" (#37244) ...
Why
A tenant (Indeed) needs the host OS version on the
alive(host_life) metric to correlate host liveness with the OS each host runs on.osVersionis only known on host-admin (Vespa Cloud), but thehost_lifepacket is generated locally inside metrics-proxy and carries onlyvespaVersion. Rather than a one-off osVersion hack, this adds a generic way to attach host-admin-provided dimensions to specific metrics, keeping cardinality off everything else.What
A config-driven mapping deciding which host-level dimensions (provided by host-admin) are allowed on which metric:
host,parentHostname, nowosVersion) on its packets via thesetExtraMetricsRPC; metrics-proxy harvests them into a dedicated store, separate from the global{role, state}dimensions.alive/host_lifepacket gets the host dimensions allowed for its service.vespa.node,vespa.sidecar,vespa.routing).The allow-list is a config-def (
metric-dimension-mapping.def) generated by config-model:default = {host, parentHostname},host_life = {host, parentHostname, osVersion}.Net effect (hosted):
osVersionappears only onalive;host/parentHostnameadded toaliveand kept where already native; role/state and internal Vespa metrics unchanged. Self-hosted emits an empty mapping → full no-op. Adding a host dimension to another metric later is now a config-model change, not source.Commits
MetricDimensionMappingcomponentCross-repo / deploy order
host-admin must emit
osVersion— companion cloud change: vespaai/cloud#3154. Deploy metrics-proxy BEFORE host-admin, elseosVersionbriefly appears on carriers until the strip is live.Tests
ExternalMetricsTest, NodeMetricGathererTest, MetricsManagerTest (within-proxy end-to-end), MetricDimensionMappingTest, config-model MetricsProxyContainerClusterTest. metrics-proxy 155 green, config-model metricsproxy 23 green.
🤖 Generated with Claude Code