Summary
The AKS VM price store never populates successfully. The Azure Retail Prices API query times out after the 15-minute context deadline, leaving the price store empty. All VM pricing metrics are missing.
Symptoms
time=2026-03-30T00:05:13.773Z level=INFO msg="populating price store"
time=2026-03-30T00:20:13.773Z level=ERROR msg="error populating prices" err="max retries reached: failed to advance page: context deadline exceeded"
time=2026-03-30T00:20:36.493Z level=ERROR msg="region not found in price map" region=westeurope
- PopulatePriceStore starts and runs for exactly 15 minutes before hitting the context deadline
- The retry mechanism fails to recover (see below)
- The price store remains empty, causing "region not found in price map" errors on every scrape
- Other stores (disk, machine) populate without issue
Root cause
The API filter in pkg/azure/aks/price_store.go is too broad:
serviceName eq 'Virtual Machines' and priceType eq 'Consumption'
This returns every VM-related SKU across all regions (including Cloud Services, Dedicated Hosts, Reserved instances, etc.), producing a result set too large to paginate within 15 minutes. The code then filters most of these out client-side in validateMachinePriceIsRelevantFromSku.
For comparison, the disk pricing store uses a more specific filter and fetches 9,010 prices in ~2 seconds.
A stricter server-side filter exists as a comment in the code (price_store.go:159-164):
serviceName eq 'Virtual Machines' and priceType eq 'Consumption' and contains(productName, 'Virtual Machines') and contains(skuName, 'Low Priority') ne true
This filter matches the client-side logic exactly, so it should return the same data. However, the comment (introduced by @logyball in #221 as part of the original implementation) claims this server-side filter is slower in practice. There is no evidence in the git history that this was tested; the comment shipped with the initial code. The claim needs to be validated, because the current broad filter cannot finish paginating in 15 minutes.
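One way to validate the claim is to time both filters directly against the public Retail Prices endpoint, outside the exporter. The sketch below is a rough standalone harness, not code from this repository: the helper names (firstPageURL, countItems), the page cap, and the 2-minute timeout are all assumptions chosen for illustration. It relies only on the Items and NextPageLink fields of the API response.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

const retailPricesURL = "https://prices.azure.com/api/retail/prices"

const (
	// broadFilter is the filter currently in use; strictFilter is the commented-out one.
	broadFilter  = "serviceName eq 'Virtual Machines' and priceType eq 'Consumption'"
	strictFilter = broadFilter + " and contains(productName, 'Virtual Machines') and contains(skuName, 'Low Priority') ne true"
)

// page holds the two response fields this sketch needs.
type page struct {
	Items        []json.RawMessage `json:"Items"`
	NextPageLink string            `json:"NextPageLink"`
}

// firstPageURL builds the initial request URL for an OData filter (hypothetical helper).
func firstPageURL(base, filter string) string {
	q := url.Values{}
	q.Set("$filter", filter)
	return base + "?" + q.Encode()
}

// countItems follows NextPageLink until it is empty, maxPages is reached, or ctx expires.
// It returns the number of items and pages seen so far, even on error.
func countItems(ctx context.Context, client *http.Client, start string, maxPages int) (int, int, error) {
	items, pages := 0, 0
	next := start
	for next != "" && pages < maxPages {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, next, nil)
		if err != nil {
			return items, pages, err
		}
		resp, err := client.Do(req)
		if err != nil {
			return items, pages, err
		}
		if resp.StatusCode != http.StatusOK {
			resp.Body.Close()
			return items, pages, fmt.Errorf("unexpected status %s", resp.Status)
		}
		var p page
		err = json.NewDecoder(resp.Body).Decode(&p)
		resp.Body.Close()
		if err != nil {
			return items, pages, err
		}
		items += len(p.Items)
		pages++
		next = p.NextPageLink
	}
	return items, pages, nil
}

func main() {
	filters := map[string]string{"broad": broadFilter, "strict": strictFilter}
	for name, filter := range filters {
		// Cap each run so the comparison finishes quickly; raise the limits for a full run.
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
		start := time.Now()
		items, pages, err := countItems(ctx, http.DefaultClient, firstPageURL(retailPricesURL, filter), 20)
		cancel()
		fmt.Printf("%s: items=%d pages=%d elapsed=%s err=%v\n", name, items, pages, time.Since(start), err)
	}
}
```

Comparing elapsed time per page for the same page cap separates two possible effects: the strict filter may be slower per page yet still finish far sooner overall if it returns many fewer pages.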
Additional issues
These are not the root cause, but fixing them would reduce the impact if the API call fails for other reasons:
Retries share a single expired context
All 5 try.Do retry attempts reuse the same pricingCtx created at the top of PopulatePriceStore. Once the 15-minute deadline expires, all retries fail instantly (~36ms total for all 5 attempts).
pricingCtx, cancel := context.WithTimeout(ctx, 15*time.Minute) // created once
// ...
err := try.Do(func(attempt int) (bool, error) {
prices, err = p.azureClientWrapper.ListPrices(pricingCtx, opts) // same expired ctx
// ...
})
No recovery until next 24-hour tick
After all retries fail, the next PopulatePriceStore call only happens when the 24-hour ticker fires. There is no shorter retry interval on failure.
Proposed fix
- Validate the stricter API filter — test whether the server-side filter from the comment is actually slower, or whether that assumption is outdated. If it performs acceptably, use it to reduce the result set.
To reduce impact if the API call fails for other reasons:
- Create a fresh context per retry attempt so retries are meaningful after a timeout.
- Add a shorter retry interval when population fails, instead of waiting 24 hours.
Affected files
pkg/azure/aks/price_store.go — filter, retry logic, context handling
pkg/azure/aks/aks.go — ticker goroutine