Commit 0890244

[streams][features] schema changes (elastic#251196)
## Summary

- Rename `name` into `subtype`, as the former is not aligned with the values we currently populate (e.g. `logging_agent`, `operating_system`, ...)
- Delete `name`
- Remove the deterministic `_id` generation and use a UUID instead
- Create an LLM-generated `id` field that we'll use to deduplicate features. We currently deduplicate in storage (overwrite if it already exists). We can remove that step if/when we need history
- Rename `value` into `properties`

<img width="588" height="803" alt="features-schema-change" src="https://github.com/user-attachments/assets/bac641e8-c851-4323-bdf3-cfb0c2bbda7e" />

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
1 parent d58b022 commit 0890244
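
A minimal sketch of the storage-side deduplication the summary describes: features are keyed by the LLM-generated `id`, and a write with an existing `id` overwrites the stored entry. Everything here is hypothetical and for illustration only: `FeatureStore`, the in-memory `Map`, the injected `newUuid` generator (in practice something like a random UUID function), and in particular the choice to keep the existing `uuid` on overwrite, which the commit does not state.

```typescript
// Hypothetical shapes, loosely following the fields named in the summary.
interface ExtractedFeature {
  id: string; // LLM-generated, used for deduplication
  type: string;
  subtype?: string;
  properties: Record<string, string>;
}

interface StoredFeature extends ExtractedFeature {
  uuid: string; // storage identity, no longer derived from the content
  last_seen: string;
}

// In-memory stand-in for the real storage layer (assumption).
class FeatureStore {
  private readonly byId = new Map<string, StoredFeature>();
  private readonly newUuid: () => string;

  constructor(newUuid: () => string) {
    this.newUuid = newUuid;
  }

  // Overwrite when a feature with the same LLM-generated id already exists.
  // Reusing the existing uuid is an assumption of this sketch.
  upsert(feature: ExtractedFeature, now: Date): StoredFeature {
    const existing = this.byId.get(feature.id);
    const stored: StoredFeature = {
      ...feature,
      uuid: existing !== undefined ? existing.uuid : this.newUuid(),
      last_seen: now.toISOString(),
    };
    this.byId.set(feature.id, stored);
    return stored;
  }

  size(): number {
    return this.byId.size;
  }
}
```

Because the overwrite keeps only the latest entry per `id`, no history is retained, which matches the note that this step can be removed if/when history is needed.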

16 files changed

Lines changed: 214 additions & 173 deletions

x-pack/platform/packages/shared/kbn-streams-ai/src/features/prompt.ts

Lines changed: 11 additions & 6 deletions
@@ -20,21 +20,25 @@ const featuresSchema = {
   items: {
     type: 'object',
     properties: {
+      id: {
+        type: 'string',
+        description: 'Unique identifier for the feature.',
+      },
       type: {
         type: 'string',
       },
-      description: {
+      subtype: {
         type: 'string',
-        description: 'A summary of the feature.',
       },
-      name: {
+      description: {
         type: 'string',
+        description: 'A summary of the feature.',
       },
       title: {
         type: 'string',
         description: 'Very short human-readable title for UI (e.g. table, flyout header).',
       },
-      value: {
+      properties: {
         type: 'object',
         properties: {},
       },
@@ -65,11 +69,12 @@ const featuresSchema = {
       },
     },
     required: [
+      'id',
       'type',
+      'subtype',
       'description',
-      'name',
       'title',
-      'value',
+      'properties',
       'confidence',
       'evidence',
       'tags',

x-pack/platform/packages/shared/kbn-streams-ai/src/features/system_prompt.text

Lines changed: 57 additions & 44 deletions
@@ -2,68 +2,75 @@ You are extracting **features** from log data. Features are stable facts about r

 Every feature you output MUST include ALL required fields:
 - `type` (string): one of `infrastructure`, `technology`, `dependency`
-- `name` (string): generic snake_case name
+- `subtype` (string): categorization within the type (e.g. `operating_system`, `programming_language`, `service_dependency`)
+- `id` (string): unique concise identifier for deduplication across runs (e.g. "aws-deployment", "log4j-2.14.1", "api-user-http")
 - `title` (string): very short human-readable title for UI display (e.g. "Ubuntu 20.04", "Log4j 2.14.1", "api-service → user-service"). Keep it to a few words; it can summarize type + key value (e.g. technology + version or source → target for dependencies).
 - `description` (string): a short summary of what the feature represents
-- `value` (object): stable, low-cardinality properties for deduplication
+- `properties` (object): stable, low-cardinality key facts for deduplication
 - `confidence` (number): 0–100
 - `evidence` (array of strings): supporting evidence from logs (2–5 items)
 - `tags` (array of strings): descriptive tags
-- `meta` (object): high-cardinality or variable data (can be `{}`)
+- `meta` (object): supplementary information that doesn't fit in `properties` - use for high-cardinality data, contextual notes, or interesting observations (can be `{}`)

 ## Feature Types

 Extract features in three categories:
 - **infrastructure**: cloud provider/deployment, container orchestration, operating systems, networking, hardware
-- Example names: `cloud_deployment`, `container_orchestration`, `operating_system`
+- Example subtypes: `cloud_deployment`, `container_orchestration`, `operating_system`
 - **technology**: programming languages, web servers, databases, libraries, frameworks
-- Example names: `programming_language`, `web_server`, `database`, `logging_library`
+- Example subtypes: `programming_language`, `web_server`, `database`, `logging_library`
 - **dependency**: explicit relationships between systems (service-to-service calls, DB connections, API integrations)
-- Example names: `service_dependency`, `database_connection`, `api_integration`
+- Example subtypes: `service_dependency`, `database_connection`, `api_integration`

 ## Consolidation Rules

 **Consolidate** when properties belong to the same entity and appear together in logs:
-- Good: A single cloud deployment feature with provider in `value` and regions/zones in `meta`
-- Good: A single container orchestration feature for Kubernetes with stable platform/version in `value`
+- Good: A single cloud deployment feature with provider in `properties` and regions/zones in `meta`
+- Good: A single container orchestration feature for Kubernetes with stable platform/version in `properties`

 **Separate** distinct technologies even if related:
 - Good: Separate features for `web_server` (nginx), `database` (postgresql), `cache` (redis)
 - Bad: Do not combine multiple distinct technologies into one feature

-Also: **do not emit multiple features with the same (`type`, `name`, `value`) tuple**. Merge evidence/tags/meta instead.
+Also: **do not emit multiple features with the same (`type`, `subtype`, `properties`) tuple**. Merge evidence/tags/meta instead.

 ## Naming Conventions

-Use **generic names** with specific values in the `value` object:
-- Good: `{ "name": "programming_language", "value": { "language": "java", "version": "11" } }`
-- Bad: `{ "name": "java_runtime", "value": { "version": "11" } }`
+Use **generic subtypes** with specific values in the `properties` object:
+- Good: `{ "subtype": "programming_language", "properties": { "language": "java", "version": "11" } }`
+- Bad: `{ "subtype": "java_runtime", "properties": { "version": "11" } }`

-Rules:
-- Use `snake_case` for `name`
-- Keep names descriptive but concise
-- Put specificity in `value`, not in `name`
+Subtype rules:
+- Use `snake_case` for `subtype`
+- Keep subtypes descriptive but concise
+- Put specificity in `properties`, not in `subtype`

-## Value vs Meta Fields
+ID rules:
+- Generate a short, stable identifier based on key properties
+- Use hyphens for readability (e.g. "aws-deployment", "log4j-2.14.1", "api-user-http")
+- Include distinguishing characteristics (provider, name, version, or source/target for dependencies)
+- Keep it concise (typically 2-5 tokens)

-The `value` field MUST contain **stable, low-cardinality properties** that enable deduplication across many log lines and deployments.
+## Properties vs Meta Fields

-- Use `value` for: cloud provider (`aws`, `gcp`, `azure`), technology/library name, protocol, major/normalized version, stable service names
-- Use `meta` for: regions/availability zones, hostnames, instance IDs, pod/container names, IPs, URLs/paths, request/trace IDs, endpoint lists, build hashes/edition labels
+The `properties` field MUST contain **stable, low-cardinality key facts** that enable deduplication across many log lines and deployments.
+
+- Use `properties` for: cloud provider (`aws`, `gcp`, `azure`), technology/library name, protocol, major/normalized version, stable service names
+- Use `meta` for: supplementary details like regions/availability zones, hostnames, instance IDs, pod/container names, IPs, URLs/paths, request/trace IDs, endpoint lists, build hashes/edition labels, contextual notes, security observations, or any interesting information that doesn't fit as a stable property

 Example (cloud deployment):
-- Good: `value: { "provider": "aws" }` (stable)
-- Good: `meta: { "regions": ["eu-west-1"], "availability_zones": ["eu-west-1a"] }` (variable/high-cardinality)
+- Good: `properties: { "provider": "aws" }` (stable)
+- Good: `meta: { "regions": ["eu-west-1"], "availability_zones": ["eu-west-1a"], "note": "Multi-region deployment pattern observed" }` (variable/high-cardinality + contextual info)

 **Conflict & cardinality rules**:
-- If multiple values are observed for the same property, prefer the most frequently supported value in `value` and record alternates in `meta.observed_*` with evidence.
+- If multiple values are observed for the same property, prefer the most frequently supported value in `properties` and record alternates in `meta.observed_*` with evidence.
 - Avoid emitting separate features that differ only by high-cardinality metadata. Merge into one feature and store varying details in `meta`.

 ## Inference & Confidence

 **One-level inference is allowed** when strong patterns exist, but use lower confidence and clearly label it:
 - Tag inferred features with `"inferred"` in `tags`
-- Explain the inference briefly in `meta.notes`
+- Explain the inference briefly in `meta.note`

 Confidence bands:
 - **90–100**: explicit, unambiguous evidence
@@ -81,21 +88,22 @@ Evidence requirements:
 - Evidence must directly support the feature claim

 Version formatting rules (important for CVE/vulnerability analysis):
-- Prefer normalized numeric versions in `value.version` (e.g., `"11"`, `"11.0"`, `"11.0.2"`)
+- Prefer normalized numeric versions in `properties.version` (e.g., `"11"`, `"11.0"`, `"11.0.2"`)
 - Strip leading `v` and surrounding text; keep only the numeric portion when possible
-- If the original version contains labels (LTS/Enterprise/codename/build metadata), store the normalized numeric part in `value.version` and store the original in `meta.raw_version`
-- Remove codenames/release names/edition labels from `value.version`
+- If the original version contains labels (LTS/Enterprise/codename/build metadata), store the normalized numeric part in `properties.version` and store the original in `meta.raw_version`
+- Remove codenames/release names/edition labels from `properties.version`

 ## Examples

 **Example 1 - Infrastructure with clean version**
 ```
 {
   "type": "infrastructure",
-  "name": "operating_system",
+  "subtype": "operating_system",
+  "id": "ubuntu-20.04.6",
   "title": "Ubuntu 20.04.6",
   "description": "Ubuntu Linux operating system version 20.04.6",
-  "value": {
+  "properties": {
     "os": "ubuntu",
     "version": "20.04.6"
   },
@@ -111,14 +119,15 @@ Version formatting rules (important for CVE/vulnerability analysis):
 }
 ```

-**Example 2 - Infrastructure showing value vs meta**
+**Example 2 - Infrastructure showing properties vs meta**
 ```
 {
   "type": "infrastructure",
-  "name": "cloud_deployment",
+  "subtype": "cloud_deployment",
+  "id": "aws-deployment",
   "title": "AWS",
   "description": "AWS cloud deployment observed across one or more regions/availability zones",
-  "value": {
+  "properties": {
     "provider": "aws"
   },
   "confidence": 92,
@@ -130,7 +139,8 @@ Version formatting rules (important for CVE/vulnerability analysis):
   "tags": ["infrastructure", "cloud"],
   "meta": {
     "regions": ["eu-west-1"],
-    "availability_zones": ["eu-west-1a", "eu-west-1b"]
+    "availability_zones": ["eu-west-1a", "eu-west-1b"],
+    "note": "Multi-AZ deployment pattern observed"
   }
 }
 ```
@@ -139,10 +149,11 @@ Version formatting rules (important for CVE/vulnerability analysis):
 ```
 {
   "type": "technology",
-  "name": "logging_library",
+  "subtype": "logging_library",
+  "id": "log4j-2.14.1",
   "title": "Log4j 2.14.1",
   "description": "Apache Log4j logging library version 2.14.1",
-  "value": {
+  "properties": {
     "library": "log4j",
     "version": "2.14.1"
   },
@@ -153,7 +164,7 @@ Version formatting rules (important for CVE/vulnerability analysis):
   ],
   "tags": ["technology", "library", "logging"],
   "meta": {
-    "security_note": "Library versions can be used for CVE/vulnerability queries."
+    "note": "Version 2.14.1 may have known CVEs; suitable for vulnerability queries"
   }
 }
 ```
@@ -162,10 +173,11 @@ Version formatting rules (important for CVE/vulnerability analysis):
 ```
 {
   "type": "technology",
-  "name": "programming_language",
+  "subtype": "programming_language",
+  "id": "java",
   "title": "Java",
   "description": "Java programming language (inferred from exception patterns)",
-  "value": {
+  "properties": {
     "language": "java"
   },
   "confidence": 45,
@@ -175,19 +187,20 @@ Version formatting rules (important for CVE/vulnerability analysis):
   ],
   "tags": ["technology", "inferred"],
   "meta": {
-    "notes": "Inferred from Java exception class names and .java stack trace references."
+    "note": "Inferred from Java exception class names and .java stack trace references"
   }
 }
 ```

-**Example 5 - Dependency feature showing value vs meta + aggregation/capping**
+**Example 5 - Dependency feature showing properties vs meta + aggregation/capping**
 ```
 {
   "type": "dependency",
-  "name": "service_dependency",
+  "subtype": "service_dependency",
+  "id": "api-user-http",
   "title": "api-service → user-service",
   "description": "Service-to-service HTTP dependency from api-service to user-service",
-  "value": {
+  "properties": {
     "source": "api-service",
     "target": "user-service",
     "protocol": "http"
@@ -201,7 +214,7 @@ Version formatting rules (important for CVE/vulnerability analysis):
   "meta": {
     "endpoints": ["/users", "/users/:id", "/users/:id/profile", "/users/search"],
     "methods": ["GET", "POST", "PUT"],
-    "notes": "Aggregate endpoints under one dependency; cap the list (e.g., max 10) and summarize additional entries."
+    "note": "Aggregate endpoints under one dependency; cap the list (e.g., max 10) and summarize additional entries"
   }
 }
 ```
@@ -210,7 +223,7 @@ Version formatting rules (important for CVE/vulnerability analysis):

 - Extract all features that meet the confidence threshold and have supporting evidence.
 - Prefer fewer, higher-confidence features over many speculative ones.
-- Sort features by descending `confidence` (and within ties, stable alphabetical order by `type`, then `name`).
+- Sort features by descending `confidence` (and within ties, stable alphabetical order by `type`, then `subtype`).
 - Dependency anti-spam: only emit a dependency feature when logs contain explicit evidence of a relationship; aggregate endpoints in `meta.endpoints` and cap the list.

 Extract all features that meet the confidence threshold and have supporting evidence. Use the finalize_features tool to return the results.
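
The version formatting rules in this prompt are instructions to the LLM, not code in this commit. As a rough illustration of what they ask for, here is a sketch of the same normalization done programmatically. `normalizeVersion` and its return shape are hypothetical names introduced here, and the regular expressions are one possible reading of "strip leading `v` and surrounding text", not the project's implementation.

```typescript
// Hypothetical helper; the name and return shape are illustrative only.
interface NormalizedVersion {
  version?: string; // candidate for properties.version
  rawVersion?: string; // candidate for meta.raw_version
}

function normalizeVersion(raw: string): NormalizedVersion {
  const trimmed = raw.trim();
  // Strip a leading "v" that is directly followed by a digit ("v11" -> "11").
  const withoutPrefix = trimmed.replace(/^v(?=\d)/i, '');
  // Take the first dotted numeric run that is not glued to a preceding
  // word character, so the "4" in "log4j 2.14.1" is skipped.
  const match = withoutPrefix.match(/(?<![\w.])\d+(?:\.\d+)*/);
  if (match === null) {
    // Nothing numeric to normalize (e.g. only a codename): keep the original.
    return { rawVersion: trimmed };
  }
  const version = match[0];
  // Keep the original only when something was stripped.
  return version === trimmed ? { version } : { version, rawVersion: trimmed };
}
```

Under these assumptions, `"20.04.6 LTS"` yields `version: "20.04.6"` with the original kept as `rawVersion`, while an already clean `"2.14.1"` yields only `version`.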

x-pack/platform/packages/shared/kbn-streams-schema/src/feature.ts

Lines changed: 8 additions & 6 deletions
@@ -13,30 +13,32 @@ export type FeatureStatus = (typeof featureStatus)[number];
 export const featureStatusSchema = z.enum(featureStatus);

 export interface BaseFeature {
+  id: string;
   type: string;
-  name: string;
+  subtype?: string;
   title?: string;
   description: string;
-  value: Record<string, any>;
+  properties: Record<string, string>;
   confidence: number;
   evidence: string[];
   tags: string[];
   meta: Record<string, any>;
 }

 export interface Feature extends BaseFeature {
-  id: string;
+  uuid: string;
   status: FeatureStatus;
   last_seen: string;
   expires_at?: string;
 }

 export const baseFeatureSchema: z.Schema<BaseFeature> = z.object({
+  id: z.string(),
   type: z.string(),
-  name: z.string(),
+  subtype: z.string().optional(),
   title: z.string().optional(),
   description: z.string(),
-  value: z.record(z.string(), z.any()),
+  properties: z.record(z.string(), z.string()),
   confidence: z.number().min(0).max(100),
   evidence: z.array(z.string()),
   tags: z.array(z.string()),
@@ -45,7 +47,7 @@ export const baseFeatureSchema: z.Schema<BaseFeature> = z.object({

 export const featureSchema: z.Schema<Feature> = baseFeatureSchema.and(
   z.object({
     uuid: z.string(),
-    id: z.string(),
+    uuid: z.string(),
     status: featureStatusSchema,
     last_seen: z.string(),
     expires_at: z.string().optional(),
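
To make the tightened shape concrete, here is a dependency-free sketch of a runtime check that mirrors the new `BaseFeature` fields shown in the diff above. It is not the zod schema from the commit: `BaseFeatureLike` and `isBaseFeatureLike` are hypothetical names, and treating `meta` as a plain object is an assumption, since its schema line falls outside the visible hunks.

```typescript
// Hypothetical mirror of the new BaseFeature shape (not the real export).
interface BaseFeatureLike {
  id: string;
  type: string;
  subtype?: string;
  title?: string;
  description: string;
  properties: Record<string, string>;
  confidence: number;
  evidence: string[];
  tags: string[];
  meta: Record<string, unknown>;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === 'string');
}

function isOptionalString(value: unknown): value is string | undefined {
  return value === undefined || typeof value === 'string';
}

function isBaseFeatureLike(value: unknown): value is BaseFeatureLike {
  if (!isRecord(value)) return false;
  const { id, type, subtype, title, description, properties, confidence, evidence, tags, meta } =
    value;
  return (
    typeof id === 'string' &&
    typeof type === 'string' &&
    isOptionalString(subtype) &&
    isOptionalString(title) &&
    typeof description === 'string' &&
    // properties is now string-to-string, so non-string values are rejected.
    isRecord(properties) &&
    Object.values(properties).every((item) => typeof item === 'string') &&
    typeof confidence === 'number' &&
    confidence >= 0 &&
    confidence <= 100 &&
    isStringArray(evidence) &&
    isStringArray(tags) &&
    isRecord(meta) // assumption: meta stays a free-form object
  );
}
```

One practical consequence of the `Record<string, string>` change is that a numeric `version: 11` no longer passes; it has to be the string `"11"`, which lines up with the quoted versions in the prompt examples.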
