fix(azure/aks): price store fails to populate — API query times out after 15 minutes #864

@Duologic

Description

Summary

The AKS VM price store never populates successfully. The Azure Retail Prices API query times out after the 15-minute context deadline, leaving the price store empty. All VM pricing metrics are missing.

Symptoms

time=2026-03-30T00:05:13.773Z level=INFO  msg="populating price store"
time=2026-03-30T00:20:13.773Z level=ERROR msg="error populating prices" err="max retries reached: failed to advance page: context deadline exceeded"
time=2026-03-30T00:20:36.493Z level=ERROR msg="region not found in price map" region=westeurope
  • PopulatePriceStore starts and runs for exactly 15 minutes before hitting the context deadline
  • The retry mechanism fails to recover (see below)
  • The price store remains empty, causing "region not found in price map" errors on every scrape
  • Other stores (disk, machine) populate without issue

Root cause

The API filter in pkg/azure/aks/price_store.go is too broad:

serviceName eq 'Virtual Machines' and priceType eq 'Consumption'

This returns every VM-related SKU across all regions (including Cloud Services, Dedicated Hosts, Reserved instances, etc.), producing a result set too large to paginate within 15 minutes. The code then filters most of these out client-side in validateMachinePriceIsRelevantFromSku.

For comparison, the disk pricing store uses a more specific filter and fetches 9,010 prices in ~2 seconds.

A stricter server-side filter exists as a comment in the code (price_store.go:159-164):

serviceName eq 'Virtual Machines' and priceType eq 'Consumption' and contains(productName, 'Virtual Machines') and contains(skuName, 'Low Priority') ne true

This filter matches the client-side logic exactly, so it should return the same data. However, the comment (introduced by @logyball in #221 as part of the original implementation) claims this server-side filter is slower in practice. There is no evidence in the git history that this was tested — the comment shipped with the initial code. This claim needs to be validated, as the current broad filter cannot finish paginating in 15 minutes.

Additional issues

These are not the root cause but would reduce impact if the API call fails for other reasons:

Retries share a single expired context

All 5 try.Do retry attempts reuse the same pricingCtx created at the top of PopulatePriceStore. Once the 15-minute deadline expires, all retries fail instantly (~36ms total for all 5 attempts).

pricingCtx, cancel := context.WithTimeout(ctx, 15*time.Minute) // created once
// ...
err := try.Do(func(attempt int) (bool, error) {
    prices, err = p.azureClientWrapper.ListPrices(pricingCtx, opts) // same expired ctx
    // ...
})

No recovery until next 24-hour tick

After all retries fail, the next PopulatePriceStore call only happens when the 24-hour ticker fires. There is no shorter retry interval on failure.

Proposed fix

  1. Validate the stricter API filter — test whether the server-side filter from the comment is actually slower, or whether that assumption is outdated. If it performs acceptably, use it to reduce the result set.

To reduce impact if the API call fails for other reasons:

  1. Create a fresh context per retry attempt so retries are meaningful after a timeout.
  2. Add a shorter retry interval when population fails, instead of waiting 24 hours.

Affected files

  • pkg/azure/aks/price_store.go — filter, retry logic, context handling
  • pkg/azure/aks/aks.go — ticker goroutine
