[AutoScaler] implement prometheus metrics client #84
TomerShor merged 17 commits into v3io:development from
Conversation
Force-pushed from cf7e195 to 7f0f482
Force-pushed from 7f0f482 to 6482f50
Force-pushed from bd6ad0a to 3944b48
Force-pushed from 2f52fc9 to 1b9b74b
| "github.com/prometheus/common/model" | ||
| ) | ||
|
|
||
| type PrometheusClient struct { |
to align with K8sCustomMetricsClient:
- type PrometheusClient struct {
+ type PrometheusMetricsClient struct {
Fixed here - CR 1 - namings, consts and handler functions
		continue
	}

	resourceNameRegex := pc.buildResourceNameRegex(resources)
no need to run it on every iteration
Fixed here - CR 1 - namings, consts and handler functions
| "error", err) | ||
| continue | ||
| } | ||
|
|
This is not very good in terms of O complexity.
`resourceNameRegex := pc.buildResourceNameRegex(resources)` iterates over resources on each iteration of queryTemplates for no reason.
Then, in the windowSizes loop, it repeats what extractWindowSizesForMetric already does, but with even worse complexity, and on every iteration we iterate over everything. Iterate over this list once and put the data into a map where the key is the metric name; this way we can avoid unnecessary iterations.
Fixed here - CR 2 - one iteration over resources
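For reference, a minimal sketch of the single-pass grouping suggested above; the `Resource`/`ScaleResource` field names here are illustrative assumptions, not the actual scalertypes definitions:

```go
package metricsclient

// Minimal illustrative types; the real ones live in scalertypes and may differ.
type ScaleResource struct {
	MetricName string
	WindowSize string
}

type Resource struct {
	Name           string
	ScaleResources []ScaleResource
}

// groupResourcesByMetric builds metricName -> windowSize -> resources in a single pass,
// so the query loop can do O(1) lookups instead of re-scanning all resources for
// every query template.
func groupResourcesByMetric(resources []Resource) map[string]map[string][]Resource {
	lookup := map[string]map[string][]Resource{}
	for _, resource := range resources {
		for _, scaleResource := range resource.ScaleResources {
			byWindow, ok := lookup[scaleResource.MetricName]
			if !ok {
				byWindow = map[string][]Resource{}
				lookup[scaleResource.MetricName] = byWindow
			}
			byWindow[scaleResource.WindowSize] = append(byWindow[scaleResource.WindowSize], resource)
		}
	}
	return lookup
}
```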
func NewPrometheusClient(parentLogger logger.Logger, prometheusURL, namespace string, templates []scalertypes.QueryTemplate) (*PrometheusClient, error) {
	if len(templates) == 0 {
		return nil, errors.New("Failed to created Prometheus client: query template cannot be empty")
and in all the places below
| return nil, errors.New("Failed to created Prometheus client: query template cannot be empty") | |
| return nil, errors.New("Failed to create Prometheus client: query template cannot be empty") |
Fixed here - CR 1 - namings, consts and handler functions
labelNames := []model.LabelName{
	"function",     // For nuclio functions (nuclio_processor_handled_events_total) - maps to nucliofunction resource
	"service_name", // For deployments (num_of_requests, jupyter_kernel_busyness) - maps to deployment resource
	"pod",          // For pod-based metrics (DCGM_FI_DEV_GPU_UTIL) - maps to pod resource
Fixed here - CR 1 - namings, consts and handler functions
		return nil, errors.Wrapf(err, "Failed to render query for metricName=%s, windowSize=%s", metricName, windowSize)
	}

	rawResult, warnings, err := pc.apiClient.Query(context.Background(), query, time.Now())
Retry if failed?
Also, I'm not sure about recreating context.Background() every time, especially since it has no timeout (so no actual purpose).
I suggest creating one ctx object for the whole func and using a semaphore to send requests to Prometheus.
I fixed the context per iteration here - CR 3 - parallel querying and use single context per iteration
Regarding retries, I don’t think they’re needed at this level. If a request fails, it will be tried again in the next iteration anyway. The existing Kubernetes custom-metrics client also doesn’t use retries, so adding them here would make this implementation behave differently. WDYT?
@weilerN what do you mean by "adding them here would make this implementation behave differently"?
How often do we usually run? I think this flow is relatively critical, as it scales things that might use GPUs, which affects the final cost of used resources. So if something is temporarily wrong with the network, I think it's better to retry now than later, isn't it?
It runs by default every 1m (link).
I agree that adding a retry mechanism makes sense. That said, I think it’s out of scope for this CR, so I suggest we track it separately by opening a tech-debt ticket to address retries in a more comprehensive way.
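For reference, a minimal sketch of the "single context per iteration" approach referenced above; the function name and the decision to log-and-skip failures are illustrative assumptions, not the actual CR 3 code:

```go
package metricsclient

import (
	"context"
	"time"

	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// querySequentially sketches the idea discussed above: one context with a timeout
// covers every Prometheus query issued in a scrape iteration, instead of creating
// a fresh, deadline-less context.Background() per query.
// Retries are intentionally left out, per the tech-debt decision above.
func querySequentially(apiClient v1.API, queries []string, interval time.Duration) []error {
	ctx, cancel := context.WithTimeout(context.Background(), interval)
	defer cancel()

	var errs []error
	for _, query := range queries {
		// all queries share the same deadline
		if _, _, err := apiClient.Query(ctx, query, time.Now()); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}
```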
Force-pushed from ac8f30b to 549c27e
for resultChan != nil || errChan != nil {
	select {
	case result, ok := <-resultChan:
		if !ok {
			resultChan = nil
			continue
		}
Isn't it the same as:
- for resultChan != nil || errChan != nil {
- 	select {
- 	case result, ok := <-resultChan:
- 		if !ok {
- 			resultChan = nil
- 			continue
- 		}
+ for {
+ 	select {
+ 	case result, ok := <-resultChan:
+ 		if !ok {
+ 			return
+ 		}
It's not the same, because in your case the goroutine will exit on the first channel that is closed, regardless of the second channel, and I want it to exit only after both channels are closed.
I agree that `for {` is better than `for resultChan != nil || errChan != nil {`, so I will change it and add comments.
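For reference, a minimal, self-contained sketch of the drain-both-channels pattern being discussed (the channel names match the diff; the payload types are placeholders):

```go
package metricsclient

// drainBoth keeps receiving until *both* channels are closed. Setting a closed
// channel to nil makes its case block forever, so the select keeps serving the
// remaining open channel instead of exiting early.
func drainBoth(resultChan <-chan int, errChan <-chan error) (results []int, errs []error) {
	for resultChan != nil || errChan != nil {
		select {
		case result, ok := <-resultChan:
			if !ok {
				resultChan = nil // stop selecting on the closed channel
				continue
			}
			results = append(results, result)
		case err, ok := <-errChan:
			if !ok {
				errChan = nil
				continue
			}
			errs = append(errs, err)
		}
	}
	return results, errs
}
```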
case err, ok := <-errChan:
	if !ok {
		errChan = nil
		continue
	}
Pull request overview
This PR introduces a Prometheus-based metrics client implementation to support scaling decisions based on Prometheus metrics, refactoring the metrics collection architecture to be more flexible.
Changes:
- Added a new `PrometheusMetricsClient` that implements the `MetricsClient` interface with support for query templates, window size filtering, and automatic rounding of fractional values
- Refactored the `MetricsClient` interface to accept `Resource` objects instead of pre-computed metric names, moving metric name extraction logic into individual client implementations
- Exported helper functions (`ShortDurationString`, `GetKubernetesMetricName`) to support both K8s and Prometheus clients
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/scalertypes/types.go | Added QueryTemplate struct and methods, updated MetricsClient interface signature, exported utility functions for metric name generation |
| pkg/autoscaler/metricsclient/prometheus.go | New Prometheus client implementation with query templating, concurrent metric fetching, and value rounding logic |
| pkg/autoscaler/metricsclient/prometheus_test.go | Comprehensive unit tests for Prometheus client using mock HTTP server |
| pkg/autoscaler/metricsclient/k8s.go | Updated K8s custom metrics client to accept resources and extract metric names internally |
| pkg/autoscaler/metricsclient/factory.go | Added factory support for creating Prometheus client |
| pkg/autoscaler/autoscaler.go | Simplified by removing getMetricNames function and passing resources directly to metrics client |
| go.mod, go.sum | Added Prometheus client dependencies and updated transitive dependencies |
type windowSizeLookup map[string]map[string]struct{}

// metricLookup maps metricName → windowSizeLookup
type metricLookup map[string]windowSizeLookup
what is the point of adding these types instead of returning the native types directly?
I think it improves the readability of the nested structure.
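For illustration, the readability trade-off being debated; the function names here are hypothetical:

```go
package metricsclient

// Named helper types from the diff above:
type windowSizeLookup map[string]map[string]struct{}
type metricLookup map[string]windowSizeLookup

// With the named types, the signature reads at a glance:
func buildLookup() metricLookup { return metricLookup{} }

// The equivalent native signature spells out the full nesting every time:
func buildLookupNative() map[string]map[string]map[string]struct{} {
	return map[string]map[string]map[string]struct{}{}
}
```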
// GetResourceMetrics retrieves metrics for multiple resources
func (pc *PrometheusMetricsClient) GetResourceMetrics(resources []scalertypes.Resource) (map[string]map[string]int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), pc.interval)
So if it takes longer than an interval and the order in which we execute requests is always the same, are we at risk of failing every time with the same functions?
If a single function has a consistent error, it won't be STZ.
I updated the code so a single function failure won't fail all the other resources' STZ here - CR 5 - failure only if all goroutines are failing, log failure on the…
resultCh := make(chan result, 1)
defer close(resultCh)

go func() {
	metrics, err := pc.getResourceMetrics(ctx, metricToWindowSizes)
	resultCh <- result{metrics: metrics, err: err}
}()

select {
case <-ctx.Done():
	return nil, errors.Wrap(ctx.Err(), "timeout waiting for resource metrics")
case res := <-resultCh:
	return res.metrics, res.err
The need for this select is basically to limit the execution to the interval.
I added some comments in the code here - CR 5 - failure only if all goroutines are failing, log failure on the…
wg := sync.WaitGroup{}

for metricName, queryTemplate := range pc.queryTemplates {
	windowSizeToResources := metricToWindowSizes[metricName]

	for windowSize, resourcesInWindowSize := range windowSizeToResources {
		wg.Add(1)
Use a semaphore with a limited number of goroutines; there is no need to run them all at once. The Go scheduler will spend more time switching if the number of functions is large.
The number of goroutines is limited to the number of window sizes.
The default is 6 (link), so I would prefer to keep the WaitGroup use as is.
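For reference, a minimal sketch of the semaphore idea raised above, using a buffered channel to cap concurrency; maxConcurrentQueries and the surrounding names are illustrative assumptions:

```go
package metricsclient

import "sync"

// runWithSemaphore caps concurrency with a buffered channel acting as a counting
// semaphore: at most maxConcurrentQueries goroutines run at once, while the
// WaitGroup still waits for all of them to finish.
func runWithSemaphore(queries []string, maxConcurrentQueries int, queryFn func(string)) {
	semaphore := make(chan struct{}, maxConcurrentQueries)
	var wg sync.WaitGroup

	for _, query := range queries {
		wg.Add(1)
		semaphore <- struct{}{} // blocks while maxConcurrentQueries goroutines are in flight
		go func(query string) {
			defer wg.Done()
			defer func() { <-semaphore }() // release the slot
			queryFn(query)
		}(query)
	}

	wg.Wait()
}
```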
}

var collectedErrors []error
collectorDone := make(chan struct{})
If we only need one error to fail, why collect all the errors? I suggest having one atomic var (firstErr or whatever) and using CompareAndSwap to set it. If at least one error occurred, fail. Why do we store each error?
Hey all, I've summarized the feedback into the following decision: if there is a partial failure during execution (e.g., failing to query a specific window size, failing to parse a response, etc.), the error will be logged and the overall routine will continue. The metrics client will return an error only if all metrics fail, or if there is a conflict between two metrics for the same resource and it cannot determine which one to use. We can further discuss how to handle the second case if needed.
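For reference, a minimal sketch of the agreed policy (log partial failures, return an error only when every query fails); the names and the standard-library `errors`/`log` usage are assumptions, not the repo's actual error/logging packages:

```go
package metricsclient

import (
	"errors"
	"log"
)

// collectResults applies the agreed policy: partial failures are logged and
// skipped, and an error is returned only when every single query failed.
func collectResults(results []int, errs []error) ([]int, error) {
	for _, err := range errs {
		log.Printf("metric query failed, continuing: %v", err)
	}
	if len(results) == 0 && len(errs) > 0 {
		return nil, errors.New("all metric queries failed")
	}
	return results, nil
}
```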
Force-pushed from 4b3d22f to e89e0b1
### 📝 Description
These code changes support the metricsClient changes in the scaler repository. This PR introduces the configuration additions and initializes the scaler with the metrics client.

### 🛠️ Changes Made
- Add `MetricsClient` configuration to the platform config
- Update the `createAutoScaler` to use the metricsClient

### ✅ Checklist
- [ ] I updated the documentation (if applicable)
- [x] I have tested the changes in this PR

### 🧪 Testing
- Dev test
  - Backward compatibility: nothing is changed with the new scaler and no configuration changes
  - Prometheus client: added the relevant configurations and validated that functions STZ as expected

### 🔗 References
- Ticket link: https://iguazio.atlassian.net/browse/NUC-713
- Design docs links: https://iguazio.atlassian.net/wiki/spaces/PLAT/pages/627572831/Nuclio+STZ+using+Prometheus+client+-+HLD#Support-Prometheus-client-during-scaler-init
- Merge blocked by: v3io/scaler#84

### 🚨 Breaking Changes?
- [ ] Yes (explain below)
- [x] No

### 🔍️ Additional Notes
### 📝 Description
This PR introduces a Prometheus-based metrics client to the autoscaler and refactors the metrics collection flow to support it.

### 🛠️ Changes Made
- Added a `Prometheus client` that implements the `MetricsClient` interface; key functionalities include query templates, window-size filtering, and rounding of fractional values
- The metrics client now receives the resources nominated for STZ instead of precomputed metric names

### ✅ Checklist

### 🧪 Testing

### 🔗 References

### 🚨 Breaking Changes?

### 🔍️ Additional Notes