| Author | BoKeum Kim (bkkim@lablup.com) |
|---|---|
| Status | Draft |
| Created | 2025-02-20 |
| Created-Version | 26.3.0 |
| Target-Version | 26.3.0 |
| Implemented-Version | |

- JIRA: BA-4052
- JIRA: BA-4040 (epic: Prometheus Client Extraction and Querier Interface Abstraction)

In the current design introduced by BEP-1045, PromQL templates are hardcoded in the service layer. Adding a new metric query requires a code change, a review, and a release cycle — even when the underlying Prometheus metric already exists.
This proposal introduces a Prometheus Query Preset System that stores PromQL templates in the database and exposes them via API. Administrators can register, update, and remove query presets at runtime; authenticated users can execute presets by injecting parameters — without any code deployment.
- Decouple PromQL template definitions from application code
- Allow administrators to manage query presets via CRUD API at runtime
- Provide a safe execution API that validates and renders templates using the existing `MetricPreset` infrastructure
- Maintain injection safety via `_escape_label_value()` and label whitelisting

BEP-1045 extracted reusable Prometheus components (`MetricPreset`, `PrometheusClient`, `MetricQuerier`), but the PromQL template selection logic remains hardcoded in the service layer. Templates are chosen by a `match` statement in the source code, so any new query pattern requires a code change and redeployment.
- Template rigidity: New query patterns require code changes and redeployment
- No runtime management: Administrators cannot add or modify queries without developer intervention
- No label governance: Any label can be used without validation; there is no concept of which labels are filterable or groupable per metric

Table: `prometheus_query_presets`

| Column | Type | Constraints | Description |
|---|---|---|---|
| `id` | UUID | PK, default `uuid_generate_v4()` | Primary key |
| `name` | VARCHAR(256) | NOT NULL, UNIQUE | Human-readable preset identifier (used in execute URL) |
| `metric_name` | VARCHAR(256) | NOT NULL | Prometheus metric name (e.g., `backendai_container_utilization`) |
| `query_template` | TEXT | NOT NULL | PromQL template with `{labels}`, `{window}`, `{group_by}` placeholders |
| `time_window` | VARCHAR(32) | NULLABLE | Preset-specific default window; falls back to server config `metric.timewindow` if NULL |
| `options` | JSONB | NOT NULL, default `'{"filter_labels":[],"group_labels":[]}'` | Preset options stored as `PydanticColumn(PresetOptions)` |
| `created_at` | TIMESTAMPTZ | NOT NULL, default `now()` | Creation timestamp |
| `updated_at` | TIMESTAMPTZ | NOT NULL, default `now()` | Last update timestamp |

```python
from pydantic import BaseModel, ConfigDict


class PresetOptions(BaseModel):
    filter_labels: list[str]
    group_labels: list[str]

    model_config = ConfigDict(frozen=True)
```

The `PresetOptions` wrapper model allows adding future preset-level settings (e.g., caching, rate limiting) without schema migration.

| Method | Path | Description |
|---|---|---|
| POST | `/resource/prometheus-query-presets` | Create a new preset |
| GET | `/resource/prometheus-query-presets` | List all presets |
| GET | `/resource/prometheus-query-presets/{id}` | Get preset by ID |
| PATCH | `/resource/prometheus-query-presets/{id}` | Modify a preset |
| DELETE | `/resource/prometheus-query-presets/{id}` | Delete a preset |

Create Request:

```json
{
  "name": "container_cpu_rate",
  "metric_name": "backendai_container_utilization",
  "query_template": "sum by ({group_by})(rate({metric_name}{{{labels}}}[{window}]))",
  "time_window": "5m",
  "options": {
    "filter_labels": ["container_metric_name", "kernel_id", "session_id", "value_type"],
    "group_labels": ["kernel_id", "session_id", "value_type"]
  }
}
```

Create / Get Response:

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "container_cpu_rate",
  "metric_name": "backendai_container_utilization",
  "query_template": "sum by ({group_by})(rate({metric_name}{{{labels}}}[{window}]))",
  "time_window": "5m",
  "options": {
    "filter_labels": ["container_metric_name", "kernel_id", "session_id", "value_type"],
    "group_labels": ["kernel_id", "session_id", "value_type"]
  },
  "created_at": "2025-02-20T10:00:00Z",
  "updated_at": "2025-02-20T10:00:00Z"
}
```

| Method | Path | Description |
|---|---|---|
| POST | `/resource/prometheus-query-presets/{id}/execute` | Execute a preset by ID |

Execute Request:

```json
{
  "labels": [
    {"key": "container_metric_name", "value": "cpu_util"},
    {"key": "kernel_id", "value": "abc-123"}
  ],
  "group_labels": ["kernel_id", "value_type"],
  "window": "5m",
  "time_range": {
    "start": "2025-02-20T09:00:00Z",
    "end": "2025-02-20T10:00:00Z",
    "step": "60s"
  }
}
```

Execute Response:

```json
{
  "status": "success",
  "data": {
    "result_type": "matrix",
    "result": [
      {
        "metric": [
          {"key": "kernel_id", "value": "abc-123"},
          {"key": "value_type", "value": "current"}
        ],
        "values": [[1708412400, "0.85"], [1708412460, "0.72"]]
      }
    ]
  }
}
```

The `metric` field uses a key-value entries pattern instead of fixed fields, allowing each preset to return a different set of labels.

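
As an illustration of the key-value entries pattern, the sketch below converts one series from a raw Prometheus matrix response, where `metric` is a JSON object of label names to values, into the entries shape shown in the Execute Response. The helper names `to_label_entries` and `to_preset_result` are hypothetical, and sorting the entries by key is an assumption made here for stable output, not something the proposal specifies.

```python
from typing import Any


def to_label_entries(metric: dict[str, str]) -> list[dict[str, str]]:
    """Turn {"label": "value", ...} into [{"key": ..., "value": ...}, ...].

    Sorting by key is an assumption of this sketch, chosen for stable output.
    """
    return [{"key": key, "value": value} for key, value in sorted(metric.items())]


def to_preset_result(series: dict[str, Any]) -> dict[str, Any]:
    """Convert one raw Prometheus matrix series into the response shape."""
    return {
        "metric": to_label_entries(series["metric"]),
        # Sample pairs are passed through unchanged: [unix_timestamp, "value"].
        "values": series["values"],
    }


raw_series = {
    "metric": {"value_type": "current", "kernel_id": "abc-123"},
    "values": [[1708412400, "0.85"], [1708412460, "0.72"]],
}
result = to_preset_result(raw_series)
```

Because the entries are a plain list, a preset grouped by `session_id` and one grouped by `kernel_id` can share the same response type.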
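CRUD types:

```python
from pydantic import BaseModel


class PresetOptions(BaseModel):
    filter_labels: list[str]
    group_labels: list[str]


class PrometheusQueryPresetCreate(BaseModel):
    name: str
    metric_name: str
    query_template: str
    time_window: str | None = None  # NULL falls back to the server default
    options: PresetOptions


class PrometheusQueryPresetModify(BaseModel):
    # Defaults let a PATCH body omit fields. Telling "omitted" apart from
    # "explicitly null" (relevant for time_window) is left to the implementation.
    name: str | None = None
    metric_name: str | None = None
    query_template: str | None = None
    time_window: str | None = None
    options: PresetOptions | None = None
```

Execute types:
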
```python
from decimal import Decimal

from pydantic import BaseModel

# QueryTimeRange is the existing model in common/dto/clients/prometheus/request.py.


class MetricLabelEntry(BaseModel):
    key: str
    value: str


class ExecutePresetRequest(BaseModel):
    labels: list[MetricLabelEntry]
    group_labels: list[str]
    window: str | None = None  # absent: use the preset or server default
    time_range: QueryTimeRange


class PresetMetricResult(BaseModel):
    metric: list[MetricLabelEntry]
    values: list[tuple[Decimal, str]]  # (unix timestamp, sample value)


class PresetExecuteData(BaseModel):
    result_type: str
    result: list[PresetMetricResult]


class PresetExecuteResponse(BaseModel):
    status: str
    data: PresetExecuteData
```

The preset system also exposes a Strawberry GraphQL interface following existing Backend.AI conventions (Node/Connection pattern, `admin_` prefix for superadmin operations).

Types:

```graphql
type PrometheusPresetOptionsGQL {
  filterLabels: [String!]!
  groupLabels: [String!]!
}

type PrometheusQueryPreset implements Node {
  id: ID!
  name: String!
  metricName: String!
  queryTemplate: String!
  timeWindow: String
  options: PrometheusPresetOptionsGQL!
  createdAt: DateTime!
  updatedAt: DateTime!
}
```

Admin Queries (SUPERADMIN only):

```graphql
type Query {
  adminPrometheusQueryPreset(id: ID!): PrometheusQueryPreset
  adminPrometheusQueryPresets(
    filter: PrometheusQueryPresetFilter
    orderBy: [PrometheusQueryPresetOrderBy!]
    first: Int, after: String, last: Int, before: String
    limit: Int, offset: Int
  ): PrometheusQueryPresetConnection!
}
```

Admin Mutations (SUPERADMIN only):

```graphql
type Mutation {
  adminCreatePrometheusQueryPreset(input: CreatePrometheusQueryPresetInput!): CreatePrometheusQueryPresetPayload!
  adminModifyPrometheusQueryPreset(id: ID!, input: ModifyPrometheusQueryPresetInput!): ModifyPrometheusQueryPresetPayload!
  adminDeletePrometheusQueryPreset(id: ID!): DeletePrometheusQueryPresetPayload!
}
```

Execute Query (all authenticated users):

```graphql
type MetricLabelEntryGQL {
  key: String!
  value: String!
}

type MetricResultValueGQL {
  timestamp: Float!
  value: String!
}

type MetricResultGQL {
  metric: [MetricLabelEntryGQL!]!
  values: [MetricResultValueGQL!]!
}

type PrometheusQueryResultGQL {
  status: String!
  resultType: String!
  result: [MetricResultGQL!]!
}

input QueryTimeRangeInput {
  start: DateTime!
  end: DateTime!
  step: String!
}

input MetricLabelEntryInput {
  key: String!
  value: String!
}

type Query {
  prometheusQueryPresetResult(
    name: String!
    labels: [MetricLabelEntryInput!]
    groupLabels: [String!]
    window: String
    timeRange: QueryTimeRangeInput!
  ): PrometheusQueryResultGQL!
}
```

`MetricLabelEntryGQL` uses the key-value entries pattern (consistent with `ImageV2LabelEntryGQL`, `ResourceOptsEntryGQL`, etc.) so that each preset can return a different set of labels without requiring schema changes. `QueryTimeRangeInput` corresponds to the existing `QueryTimeRange` Pydantic model. The execute query is available to all authenticated users and is modeled as a Query because the operation is essentially a read against Prometheus.

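
The GraphQL result carries each sample as an object with `timestamp` and `value` fields, whereas the REST response uses `[timestamp, value]` pairs. A minimal sketch of that mapping follows; the dataclass `MetricResultValue` and the helper `to_result_values` are hypothetical stand-ins for the Strawberry types, not names from the codebase.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class MetricResultValue:
    """Stand-in for MetricResultValueGQL (hypothetical plain dataclass)."""

    timestamp: float
    value: str


def to_result_values(pairs: Iterable[Sequence]) -> list[MetricResultValue]:
    """Map [unix_timestamp, "value"] pairs to timestamp/value objects.

    The sample value stays a string so no precision is lost in transit.
    """
    return [MetricResultValue(timestamp=float(ts), value=str(val)) for ts, val in pairs]


values = to_result_values([[1708412400, "0.85"], [1708412460, "0.72"]])
```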
The CLI follows existing Backend.AI Click-based conventions.

Admin CRUD (SUPERADMIN):

```shell
backend.ai admin prometheus-query-preset list
backend.ai admin prometheus-query-preset info <ID>
backend.ai admin prometheus-query-preset add \
  --name <NAME> \
  --metric-name <METRIC> \
  --query-template <TEMPLATE> \
  [--time-window <WINDOW>] \
  [--options <JSON>]
# --options example: '{"filter_labels": ["kernel_id", "container_metric_name"], "group_labels": ["kernel_id"]}'
backend.ai admin prometheus-query-preset modify <ID> [--name ...] [--query-template ...] [--options ...]
backend.ai admin prometheus-query-preset delete <ID>
```

Execute (all authenticated users):

```shell
backend.ai prometheus-query-preset execute <ID> \
  --start <ISO8601> \
  --end <ISO8601> \
  --step <STEP> \
  [--label container_metric_name=cpu_util] \
  [--label kernel_id=abc-123] \
  [--group-labels label1,label2] \
  [--window <WINDOW>]
```

`--label` is a repeatable flag using `key=value` format. The execute command is a top-level (non-admin) command because it is available to all authenticated users.

```mermaid
sequenceDiagram
    participant Client
    participant Handler as REST / GQL Handler
    participant Service as PresetService
    participant DB as Database
    participant Preset as MetricPreset
    participant Prom as PrometheusClient
    participant PromServer as Prometheus
    Client->>+Handler: Execute preset (name, labels, group_labels, window, time_range)
    Handler->>+Service: ExecutePresetAction
    Service->>+DB: Lookup preset by ID
    DB-->>-Service: Preset + allowed labels
    Service->>Service: Validate labels & group_labels & window
    Service->>+Preset: Build MetricPreset (template, labels, group_by, window)
    Preset-->>-Service: Rendered PromQL
    Service->>+Prom: query_range(rendered_query, time_range)
    Prom->>+PromServer: HTTP GET /api/v1/query_range
    PromServer-->>-Prom: Time-series result
    Prom-->>-Service: PrometheusQueryRangeResponse
    Service-->>-Handler: ExecutePresetActionResult
    Handler-->>-Client: Query result
```

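
The "Build MetricPreset" step in the flow above turns the stored template and the request parameters into a PromQL string. The sketch below shows one way that substitution could work, using Python's `str.format`, in which `{{` and `}}` are literal braces, so `{{{labels}}}` renders as `{` + labels + `}`. Whether `MetricPreset.render()` uses `str.format` internally is an assumption; `render_promql` is a hypothetical helper, and the label values are assumed to be already validated and escaped.

```python
def render_promql(
    template: str,
    metric_name: str,
    labels: dict[str, str],
    group_by: list[str],
    window: str,
) -> str:
    """Fill the preset placeholders (hypothetical stand-in for MetricPreset.render)."""
    # Label values are assumed to have been validated and escaped beforehand.
    label_matchers = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return template.format(
        metric_name=metric_name,
        labels=label_matchers,
        group_by=",".join(group_by),
        window=window,
    )


TEMPLATE = "sum by ({group_by})(rate({metric_name}{{{labels}}}[{window}]))"

query = render_promql(
    TEMPLATE,
    metric_name="backendai_container_utilization",
    labels={"container_metric_name": "cpu_util", "kernel_id": "abc-123"},
    group_by=["kernel_id", "value_type"],
    window="5m",
)
# query is:
# sum by (kernel_id,value_type)(rate(backendai_container_utilization{container_metric_name="cpu_util",kernel_id="abc-123"}[5m]))
```

Substituted values are not re-parsed by `str.format`, so a brace inside a label value cannot introduce a new placeholder.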
Validation rules:

- Each label key in the request must exist in the preset's `options.filter_labels` list
- Each entry in `group_labels` must exist in the preset's `options.group_labels` list
- `window` must match `^\d+[smhdw]$` (a single integer followed by one unit; compound durations like `1h30m` and millisecond values like `500ms` are intentionally not supported); if absent, it falls back to the preset's `time_window` and then to the server config `metric.timewindow`

| Threat | Mitigation |
|---|---|
| Label value injection | Reuse _escape_label_value() from preset.py — escapes \, ", \n, \r |
| Arbitrary label keys | Only labels in options.filter_labels can be used in {labels}, only those in options.group_labels in {group_by} |
| Window format injection | Validate against ^\d+[smhdw]$ regex before substitution |
| Template modification | CRUD operations restricted to SUPERADMIN role |
| Metric name substitution | {metric_name} is resolved from the preset's metric_name field (DB-stored, admin-controlled), not from user input |
| Query resource exhaustion | Validate time_range duration and step size at the service layer — reject excessively large ranges or sub-second steps to prevent Prometheus overload |
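
Two of the mitigations above can be sketched in a few lines. `escape_label_value` below is a hypothetical reimplementation of the escaping the table describes (backslash, double quote, newline, carriage return), not the code from `preset.py`. The time-range check assumes `step` uses the same single-unit format as `window`; the limits `MAX_RANGE` and `MAX_POINTS` are invented for illustration, since the proposal does not fix concrete values.

```python
import re
from datetime import datetime, timedelta


def escape_label_value(value: str) -> str:
    """Escape a value for use inside a double-quoted PromQL label matcher."""
    # Backslashes must be escaped first so later escapes are not doubled.
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


_STEP_PATTERN = re.compile(r"^(\d+)([smhdw])$", re.ASCII)
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

MAX_RANGE = timedelta(days=7)  # hypothetical limit
MAX_POINTS = 11_000  # hypothetical cap on samples per series


def step_seconds(step: str) -> int:
    """Convert a step such as "60s" to seconds, rejecting zero-length steps."""
    match = _STEP_PATTERN.fullmatch(step)
    if match is None:
        raise ValueError(f"invalid step: {step!r}")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    if seconds < 1:
        raise ValueError("step must be at least one second")
    return seconds


def check_time_range(start: datetime, end: datetime, step: str) -> None:
    """Reject ranges that are inverted, too long, or too finely sampled."""
    if end <= start:
        raise ValueError("end must be after start")
    span = end - start
    if span > MAX_RANGE:
        raise ValueError("time range too large")
    if span.total_seconds() / step_seconds(step) > MAX_POINTS:
        raise ValueError("too many points requested")
```

With this escaping, a value such as `abc"} or up{x="` stays inside the quoted matcher instead of terminating it.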

| Component | From | Usage |
|---|---|---|
| `MetricPreset` | `common/clients/prometheus/preset.py` | Template rendering with `render()`, `_escape_label_value()` |
| `PrometheusClient` | `common/clients/prometheus/client.py` | `query_range()` execution |
| `QueryTimeRange` | `common/dto/clients/prometheus/request.py` | Time range parameters for execute |

- One new table: `prometheus_query_presets` (with an `options` JSONB column using `PydanticColumn`)
- Alembic migration: `CREATE TABLE` only; no existing tables are modified
- No data migration needed
- No breaking changes: the existing `ContainerUtilizationMetricService` continues to work with its hardcoded templates
- Additive only: new REST endpoints and DB tables; no modifications to existing APIs
- Future migration path: once presets are populated, `ContainerUtilizationMetricService._build_preset()` can be refactored to look up presets from the repository instead of using hardcoded templates. This is out of scope for this BEP.