Author	BoKeum Kim (bkkim@lablup.com)
Status	Draft
Created	2025-02-20
Created-Version	26.3.0
Target-Version	26.3.0
Implemented-Version

Prometheus Query Preset System

Related Issues

JIRA: BA-4052
JIRA: BA-4040 (epic: Prometheus Client Extraction and Querier Interface Abstraction)

Motivation

In the current design introduced by BEP-1045, PromQL templates are hardcoded in the service layer. Adding a new metric query requires a code change, a review, and a release cycle — even when the underlying Prometheus metric already exists.

This proposal introduces a Prometheus Query Preset System that stores PromQL templates in the database and exposes them via API. Administrators can register, update, and remove query presets at runtime; authenticated users can execute presets by injecting parameters — without any code deployment.

Goals

Decouple PromQL template definitions from application code
Allow administrators to manage query presets via CRUD API at runtime
Provide a safe execution API that validates and renders templates using the existing MetricPreset infrastructure
Maintain injection safety via _escape_label_value() and label whitelisting

Current Design

BEP-1045 extracted reusable Prometheus components (MetricPreset, PrometheusClient, MetricQuerier), but the PromQL template selection logic remains hardcoded in the service layer. Templates are chosen by a match statement in the source code, so any new query pattern requires a code change and redeployment.

Problems

Template rigidity: New query patterns require code changes and redeployment
No runtime management: Administrators cannot add or modify queries without developer intervention
No label governance: Any label can be used without validation; there is no concept of which labels are filterable or groupable per metric

Proposed Design

DB Schema

Table: prometheus_query_presets

Column	Type	Constraints	Description
`id`	`UUID`	PK, default `uuid_generate_v4()`	Primary key
`name`	`VARCHAR(256)`	NOT NULL, UNIQUE	Human-readable preset identifier (the GraphQL execute query looks presets up by this name; the REST execute URL uses `id`)
`metric_name`	`VARCHAR(256)`	NOT NULL	Prometheus metric name (e.g., `backendai_container_utilization`)
`query_template`	`TEXT`	NOT NULL	PromQL template with `{metric_name}`, `{labels}`, `{window}`, `{group_by}` placeholders
`time_window`	`VARCHAR(32)`	NULLABLE	Preset-specific default window; falls back to server config `metric.timewindow` if NULL
`options`	`JSONB`	NOT NULL, default `'{"filter_labels":[],"group_labels":[]}'`	Preset options stored as `PydanticColumn(PresetOptions)`
`created_at`	`TIMESTAMPTZ`	NOT NULL, default `now()`	Creation timestamp
`updated_at`	`TIMESTAMPTZ`	NOT NULL, default `now()`	Last update timestamp

class PresetOptions(BaseModel):
    filter_labels: list[str]
    group_labels: list[str]

    model_config = ConfigDict(frozen=True)

The PresetOptions wrapper model allows adding future preset-level settings (e.g., caching, rate limiting) without schema migration.
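To illustrate why wrapping the label lists in one options object avoids schema migrations, here is a minimal sketch that uses a stdlib dataclass as a stand-in for the Pydantic model. The `cache_ttl_seconds` field and the `load_options()` helper are hypothetical and not part of this proposal; the point is only that rows written before a new field existed still load, because the new field carries a default.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetOptionsSketch:
    """Stand-in for PresetOptions; the real model is a Pydantic BaseModel."""

    filter_labels: tuple[str, ...] = ()
    group_labels: tuple[str, ...] = ()
    # Hypothetical future setting, introduced later with a default value.
    cache_ttl_seconds: int | None = None


def load_options(raw_json: str) -> PresetOptionsSketch:
    """Build options from the JSONB payload, tolerating missing newer keys."""
    data = json.loads(raw_json)
    return PresetOptionsSketch(
        filter_labels=tuple(data.get("filter_labels", [])),
        group_labels=tuple(data.get("group_labels", [])),
        cache_ttl_seconds=data.get("cache_ttl_seconds"),
    )


# A row stored before cache_ttl_seconds existed (this is the column default).
old_options = load_options('{"filter_labels":[],"group_labels":[]}')
# A row written after the hypothetical field was introduced.
new_options = load_options(
    '{"filter_labels":["kernel_id"],"group_labels":[],"cache_ttl_seconds":30}'
)
```

Both rows load against the same table definition; only the model changed.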

API Design

CRUD Endpoints (SUPERADMIN only)

Method	Path	Description
`POST`	`/resource/prometheus-query-presets`	Create a new preset
`GET`	`/resource/prometheus-query-presets`	List all presets
`GET`	`/resource/prometheus-query-presets/{id}`	Get preset by ID
`PATCH`	`/resource/prometheus-query-presets/{id}`	Modify a preset
`DELETE`	`/resource/prometheus-query-presets/{id}`	Delete a preset

Create Request:

{
  "name": "container_cpu_rate",
  "metric_name": "backendai_container_utilization",
  "query_template": "sum by ({group_by})(rate({metric_name}{{{labels}}}[{window}]))",
  "time_window": "5m",
  "options": {
    "filter_labels": ["container_metric_name", "kernel_id", "session_id", "value_type"],
    "group_labels": ["kernel_id", "session_id", "value_type"]
  }
}

Create / Get Response:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "container_cpu_rate",
  "metric_name": "backendai_container_utilization",
  "query_template": "sum by ({group_by})(rate({metric_name}{{{labels}}}[{window}]))",
  "time_window": "5m",
  "options": {
    "filter_labels": ["container_metric_name", "kernel_id", "session_id", "value_type"],
    "group_labels": ["kernel_id", "session_id", "value_type"]
  },
  "created_at": "2025-02-20T10:00:00Z",
  "updated_at": "2025-02-20T10:00:00Z"
}
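The proposal does not state whether templates are checked when a preset is created. One possible check, assuming templates are rendered with Python `str.format`-style placeholders (which the doubled braces around `{labels}` suggest), is to reject any placeholder outside the four used in this document. The names `ALLOWED_PLACEHOLDERS`, `template_placeholders()`, and `check_template()` are hypothetical.

```python
from __future__ import annotations

from string import Formatter

# Placeholders that appear in this proposal's templates.
ALLOWED_PLACEHOLDERS = frozenset({"metric_name", "labels", "group_by", "window"})


def template_placeholders(template: str) -> set[str]:
    """Return the replacement field names found in a str.format-style template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name is not None  # None marks literal text, including "{{" and "}}"
    }


def check_template(template: str) -> None:
    """Raise ValueError if the template uses a placeholder that cannot be filled."""
    unknown = template_placeholders(template) - ALLOWED_PLACEHOLDERS
    if unknown:
        raise ValueError(f"unknown placeholders: {sorted(unknown)}")


template = "sum by ({group_by})(rate({metric_name}{{{labels}}}[{window}]))"
check_template(template)  # passes: only known placeholders are used
```

Because field names such as `labels.attr` or an empty `{}` are not in the allowed set, this check would also reject attribute-access and positional fields.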

Execute Endpoint (All authenticated users)

Method	Path	Description
`POST`	`/resource/prometheus-query-presets/{id}/execute`	Execute a preset by ID

Execute Request:

{
  "labels": [
    {"key": "container_metric_name", "value": "cpu_util"},
    {"key": "kernel_id", "value": "abc-123"}
  ],
  "group_labels": ["kernel_id", "value_type"],
  "window": "5m",
  "time_range": {
    "start": "2025-02-20T09:00:00Z",
    "end": "2025-02-20T10:00:00Z",
    "step": "60s"
  }
}

Execute Response:

{
  "status": "success",
  "data": {
    "result_type": "matrix",
    "result": [
      {
        "metric": [
          {"key": "kernel_id", "value": "abc-123"},
          {"key": "value_type", "value": "current"}
        ],
        "values": [[1740042000, "0.85"], [1740042060, "0.72"]]
      }
    ]
  }
}

The metric field uses a key-value entries pattern instead of fixed fields, allowing each preset to return a different set of labels.
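As a rough sketch of the key-value entries pattern: the Prometheus HTTP API returns each series' labels as a JSON object, and the conversion below turns that object into the list form shown in the Execute Response. The function name `to_label_entries()` and the sorting by key are assumptions made for a stable output order, not something this proposal specifies.

```python
from __future__ import annotations


def to_label_entries(metric: dict[str, str]) -> list[dict[str, str]]:
    """Convert a Prometheus label map into a list of key-value entries."""
    # Sorting by key is an assumption; it only makes the output deterministic.
    return [{"key": key, "value": value} for key, value in sorted(metric.items())]


# One series as returned by Prometheus for a range query (matrix result).
prom_series = {
    "metric": {"value_type": "current", "kernel_id": "abc-123"},
    "values": [[1740042000, "0.85"], [1740042060, "0.72"]],
}

preset_result = {
    "metric": to_label_entries(prom_series["metric"]),
    "values": prom_series["values"],
}
```

A preset grouped by different labels would produce different entries with no change to the response types.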

REST

CRUD types:

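from pydantic import BaseModel

class PresetOptions(BaseModel):
    filter_labels: list[str]
    group_labels: list[str]

class PrometheusQueryPresetCreate(BaseModel):
    name: str
    metric_name: str
    query_template: str
    # Optional: NULL falls back to the server config metric.timewindow.
    time_window: str | None = None
    options: PresetOptions

class PrometheusQueryPresetModify(BaseModel):
    # PATCH body: every field defaults to None so that it can be omitted.
    name: str | None = None
    metric_name: str | None = None
    query_template: str | None = None
    time_window: str | None = None
    options: PresetOptions | None = None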

Execute types:

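from decimal import Decimal

from pydantic import BaseModel

# QueryTimeRange is the existing DTO from common/dto/clients/prometheus/request.py.

class MetricLabelEntry(BaseModel):
    key: str
    value: str

class ExecutePresetRequest(BaseModel):
    labels: list[MetricLabelEntry]
    group_labels: list[str]
    # Optional: falls back to the preset's time_window, then to metric.timewindow.
    window: str | None = None
    time_range: QueryTimeRange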

class PresetMetricResult(BaseModel):
    metric: list[MetricLabelEntry]
    values: list[tuple[Decimal, str]]

class PresetExecuteData(BaseModel):
    result_type: str
    result: list[PresetMetricResult]

class PresetExecuteResponse(BaseModel):
    status: str
    data: PresetExecuteData

GraphQL API

The preset system also exposes a Strawberry GraphQL interface following existing Backend.AI conventions (Node/Connection pattern, admin_ prefix for superadmin operations).

Types:

type PrometheusPresetOptionsGQL {
  filterLabels: [String!]!
  groupLabels: [String!]!
}

type PrometheusQueryPreset implements Node {
  id: ID!
  name: String!
  metricName: String!
  queryTemplate: String!
  timeWindow: String
  options: PrometheusPresetOptionsGQL!
  createdAt: DateTime!
  updatedAt: DateTime!
}

Admin Queries (SUPERADMIN only):

type Query {
  adminPrometheusQueryPreset(id: ID!): PrometheusQueryPreset
  adminPrometheusQueryPresets(
    filter: PrometheusQueryPresetFilter
    orderBy: [PrometheusQueryPresetOrderBy!]
    first: Int, after: String, last: Int, before: String
    limit: Int, offset: Int
  ): PrometheusQueryPresetConnection!
}

Admin Mutations (SUPERADMIN only):

type Mutation {
  adminCreatePrometheusQueryPreset(input: CreatePrometheusQueryPresetInput!): CreatePrometheusQueryPresetPayload!
  adminModifyPrometheusQueryPreset(id: ID!, input: ModifyPrometheusQueryPresetInput!): ModifyPrometheusQueryPresetPayload!
  adminDeletePrometheusQueryPreset(id: ID!): DeletePrometheusQueryPresetPayload!
}

Execute Query (all authenticated users):

type MetricLabelEntryGQL {
  key: String!
  value: String!
}

type MetricResultValueGQL {
  timestamp: Float!
  value: String!
}

type MetricResultGQL {
  metric: [MetricLabelEntryGQL!]!
  values: [MetricResultValueGQL!]!
}

type PrometheusQueryResultGQL {
  status: String!
  resultType: String!
  result: [MetricResultGQL!]!
}

input QueryTimeRangeInput {
  start: DateTime!
  end: DateTime!
  step: String!
}

input MetricLabelEntryInput {
  key: String!
  value: String!
}

type Query {
  prometheusQueryPresetResult(
    name: String!
    labels: [MetricLabelEntryInput!]
    groupLabels: [String!]
    window: String
    timeRange: QueryTimeRangeInput!
  ): PrometheusQueryResultGQL!
}

MetricLabelEntryGQL uses the key-value entries pattern (consistent with ImageV2LabelEntryGQL, ResourceOptsEntryGQL, etc.) so that each preset can return a different set of labels without requiring schema changes. QueryTimeRangeInput corresponds to the existing QueryTimeRange Pydantic model. The execute query is available to all authenticated users and is modeled as a Query because the operation is essentially a read against Prometheus.

CLI

The CLI follows existing Backend.AI Click-based conventions.

Admin CRUD (SUPERADMIN):

backend.ai admin prometheus-query-preset list
backend.ai admin prometheus-query-preset info <ID>
backend.ai admin prometheus-query-preset add \
    --name <NAME> \
    --metric-name <METRIC> \
    --query-template <TEMPLATE> \
    [--time-window <WINDOW>] \
    [--options <JSON>]
    # --options example: '{"filter_labels": ["kernel_id", "container_metric_name"], "group_labels": ["kernel_id"]}'
backend.ai admin prometheus-query-preset modify <ID> [--name ...] [--query-template ...] [--options ...]
backend.ai admin prometheus-query-preset delete <ID>

Execute (all authenticated users):

backend.ai prometheus-query-preset execute <ID> \
    --start <ISO8601> \
    --end <ISO8601> \
    --step <STEP> \
    [--label container_metric_name=cpu_util] \
    [--label kernel_id=abc-123] \
    [--group-labels label1,label2] \
    [--window <WINDOW>]

--label is a repeatable flag using key=value format. The execute command is a top-level (non-admin) command because it is available to all authenticated users.
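A minimal sketch of how the repeated `--label key=value` values could be turned into the `labels` entries of the execute request. It is written as a plain function rather than a Click callback, and `parse_label_options()` is a hypothetical name; splitting on the first `=` only is an assumption so that values may themselves contain `=`.

```python
from __future__ import annotations


def parse_label_options(raw_labels: list[str]) -> list[dict[str, str]]:
    """Turn ["k=v", ...] into [{"key": "k", "value": "v"}, ...]."""
    entries = []
    for raw in raw_labels:
        # partition splits on the first "=" only, so "a=b=c" gives key "a", value "b=c".
        key, separator, value = raw.partition("=")
        if not separator or not key:
            raise ValueError(f"expected key=value, got {raw!r}")
        entries.append({"key": key, "value": value})
    return entries


labels = parse_label_options(["container_metric_name=cpu_util", "kernel_id=abc-123"])
```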

Execute Flow

sequenceDiagram
    participant Client
    participant Handler as REST / GQL Handler
    participant Service as PresetService
    participant DB as Database
    participant Preset as MetricPreset
    participant Prom as PrometheusClient
    participant PromServer as Prometheus

    Client->>+Handler: Execute preset (id or name, labels, group_labels, window, time_range)
    Handler->>+Service: ExecutePresetAction
    Service->>+DB: Lookup preset by ID (REST, CLI) or name (GraphQL)
    DB-->>-Service: Preset + allowed labels
    Service->>Service: Validate labels & group_labels & window
    Service->>+Preset: Build MetricPreset (template, labels, group_by, window)
    Preset-->>-Service: Rendered PromQL
    Service->>+Prom: query_range(rendered_query, time_range)
    Prom->>+PromServer: HTTP GET /api/v1/query_range
    PromServer-->>-Prom: Time-series result
    Prom-->>-Service: PrometheusQueryRangeResponse
    Service-->>-Handler: ExecutePresetActionResult
    Handler-->>-Client: Query result

Validation rules:

Each label key in the request must exist in the preset's options.filter_labels list
Each entry in group_labels must exist in the preset's options.group_labels list
window must match ^\d+[smhdw]$ (one integer followed by one unit; compound durations such as 1h30m and other units such as ms or y are intentionally not supported); if absent, it falls back to the preset's time_window, then to the server config metric.timewindow
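The three rules above can be sketched as a single function. This is an illustration under stated assumptions, not the service implementation: `PresetValidationError` and `validate_execute_request()` are hypothetical names, and the server default of `1m` is made up for the example. The sketch uses `re.fullmatch` rather than `re.match` with `$`, because in Python `$` also matches just before a trailing newline.

```python
from __future__ import annotations

import re

# Same pattern as the rule above; anchoring is done by fullmatch().
WINDOW_PATTERN = re.compile(r"\d+[smhdw]")


class PresetValidationError(ValueError):
    """Raised when an execute request violates the preset's rules (hypothetical)."""


def validate_execute_request(
    label_keys: list[str],
    group_labels: list[str],
    window: str | None,
    *,
    filter_allowed: list[str],
    group_allowed: list[str],
    preset_window: str | None,
    server_window: str,
) -> str:
    """Check the request against the preset and return the effective window."""
    bad_filters = [key for key in label_keys if key not in filter_allowed]
    if bad_filters:
        raise PresetValidationError(f"labels not filterable: {bad_filters}")
    bad_groups = [label for label in group_labels if label not in group_allowed]
    if bad_groups:
        raise PresetValidationError(f"labels not groupable: {bad_groups}")
    # Request value first, then the preset default, then the server config.
    effective = window or preset_window or server_window
    if WINDOW_PATTERN.fullmatch(effective) is None:
        raise PresetValidationError(f"invalid window: {effective!r}")
    return effective


preset_rules = {
    "filter_allowed": ["container_metric_name", "kernel_id", "session_id", "value_type"],
    "group_allowed": ["kernel_id", "session_id", "value_type"],
    "preset_window": "5m",
    "server_window": "1m",  # made-up value for metric.timewindow
}
effective_window = validate_execute_request(
    ["container_metric_name", "kernel_id"], ["kernel_id", "value_type"], None, **preset_rules
)
```

With no window in the request, the preset's `5m` is used; a request window of `1h30m`, `500ms`, or `5m` followed by a newline is rejected.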

Security

Threat	Mitigation
Label value injection	Reuse `_escape_label_value()` from `preset.py` — escapes `\`, `"`, `\n`, `\r`
Arbitrary label keys	Only labels in `options.filter_labels` can be used in `{labels}`, only those in `options.group_labels` in `{group_by}`
Window format injection	Validate against `^\d+[smhdw]$` regex before substitution
Template modification	CRUD operations restricted to SUPERADMIN role
Metric name substitution	`{metric_name}` is resolved from the preset's `metric_name` field (DB-stored, admin-controlled), not from user input
Query resource exhaustion	Validate `time_range` duration and `step` size at the service layer — reject excessively large ranges or sub-second steps to prevent Prometheus overload
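For the resource-exhaustion row, a minimal sketch of a time-range guard might look as follows. The proposal does not fix concrete limits, so `MAX_RANGE` and `MIN_STEP` are hypothetical values, and `parse_duration()` and `check_time_range()` are hypothetical helpers that assume `step` uses the same single-unit format as `window`.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta, timezone

# Hypothetical limits; the proposal leaves the actual numbers open.
MAX_RANGE = timedelta(days=7)
MIN_STEP = timedelta(seconds=1)

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration(text: str) -> timedelta:
    """Parse a single-unit duration such as "60s" or "5m"."""
    match = re.fullmatch(r"(\d+)([smhdw])", text)
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(seconds=int(match.group(1)) * _UNIT_SECONDS[match.group(2)])


def check_time_range(start: datetime, end: datetime, step: str) -> int:
    """Reject oversized ranges or too-small steps; return the sample count per series."""
    if end <= start:
        raise ValueError("end must be after start")
    if end - start > MAX_RANGE:
        raise ValueError(f"time range exceeds {MAX_RANGE}")
    step_delta = parse_duration(step)
    if step_delta < MIN_STEP:
        raise ValueError("step must be at least one second")
    # Upper bound on samples per series: one per step, plus the start point.
    return (end - start) // step_delta + 1


start = datetime(2025, 2, 20, 9, 0, tzinfo=timezone.utc)
end = datetime(2025, 2, 20, 10, 0, tzinfo=timezone.utc)
points = check_time_range(start, end, "60s")  # one hour at 60-second steps
```

Returning the sample count lets a caller add a further cap on points per series if needed.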

Reused Infrastructure

Component	From	Usage
`MetricPreset`	`common/clients/prometheus/preset.py`	Template rendering with `render()`, `_escape_label_value()`
`PrometheusClient`	`common/clients/prometheus/client.py`	`query_range()` execution
`QueryTimeRange`	`common/dto/clients/prometheus/request.py`	Time range parameters for execute
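To make the rendering step concrete, the sketch below shows what escaping and template substitution could amount to. It is not the code in `preset.py`: `escape_label_value()` and `render_query()` are stand-ins written for this example, and rendering via `str.format` is an assumption based on the doubled braces in the example template. The escaped characters are the four listed in the Security table.

```python
from __future__ import annotations


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, newline and carriage return for PromQL."""
    return (
        value.replace("\\", "\\\\")  # backslash first, so later escapes are not doubled
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def render_query(
    template: str,
    *,
    metric_name: str,
    labels: dict[str, str],
    group_by: list[str],
    window: str,
) -> str:
    """Fill the four placeholders; label keys are assumed to be whitelisted already."""
    label_expr = ",".join(
        f'{key}="{escape_label_value(value)}"' for key, value in labels.items()
    )
    return template.format(
        metric_name=metric_name,
        labels=label_expr,
        group_by=",".join(group_by),
        window=window,
    )


template = "sum by ({group_by})(rate({metric_name}{{{labels}}}[{window}]))"
rendered = render_query(
    template,
    metric_name="backendai_container_utilization",
    labels={"container_metric_name": "cpu_util", "kernel_id": "abc-123"},
    group_by=["kernel_id", "value_type"],
    window="5m",
)
```

A label value containing a double quote stays inside its quoted string after escaping, so it cannot close the matcher and append further PromQL.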

Migration / Compatibility

Database Migration

One new table: prometheus_query_presets (with options JSONB column using PydanticColumn)
Alembic migration: CREATE TABLE only — no existing tables are modified
No data migration needed

Backward Compatibility

No breaking changes: Existing ContainerUtilizationMetricService continues to work with its hardcoded templates
Additive only: New REST endpoints, GraphQL fields, CLI commands, and one DB table; no modifications to existing APIs
Future migration path: Once presets are populated, ContainerUtilizationMetricService._build_preset() can be refactored to look up presets from the repository instead of using hardcoded templates. This is out of scope for this BEP.
