You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
## Summary
- Rename name into subtype as the former is not aligned with the value
we currently populate (eg logging_agent, operating_system, ...)
- Delete name
- Remove the deterministic _id generation. Uses an uuid instead
- Create an LLM-generated id field that we'll use to deduplicate
features. We currently deduplicate in storage (overwrite if it already
exists). We can remove that step if/when we need history
- Rename value into properties
<img width="588" height="803" alt="features-schema-change"
src="https://github.com/user-attachments/assets/bac641e8-c851-4323-bdf3-cfb0c2bbda7e"
/>
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Copy file name to clipboardExpand all lines: x-pack/platform/packages/shared/kbn-streams-ai/src/features/system_prompt.text
+57-44Lines changed: 57 additions & 44 deletions
Original file line number
Diff line number
Diff line change
@@ -2,68 +2,75 @@ You are extracting **features** from log data. Features are stable facts about r
2
2
3
3
Every feature you output MUST include ALL required fields:
4
4
- `type` (string): one of `infrastructure`, `technology`, `dependency`
5
-
- `name` (string): generic snake_case name
5
+
- `subtype` (string): categorization within the type (e.g. `operating_system`, `programming_language`, `service_dependency`)
6
+
- `id` (string): unique concise identifier for deduplication across runs (e.g. "aws-deployment", "log4j-2.14.1", "api-user-http")
6
7
- `title` (string): very short human-readable title for UI display (e.g. "Ubuntu 20.04", "Log4j 2.14.1", "api-service → user-service"). Keep it to a few words; it can summarize type + key value (e.g. technology + version or source → target for dependencies).
7
8
- `description` (string): a short summary of what the feature represents
8
-
- `value` (object): stable, low-cardinality properties for deduplication
9
+
- `properties` (object): stable, low-cardinality key facts for deduplication
9
10
- `confidence` (number): 0–100
10
11
- `evidence` (array of strings): supporting evidence from logs (2–5 items)
11
12
- `tags` (array of strings): descriptive tags
12
-
- `meta` (object): high-cardinality or variable data (can be `{}`)
13
+
- `meta` (object): supplementary information that doesn't fit in `properties` - use for high-cardinality data, contextual notes, or interesting observations (can be `{}`)
The `properties` field MUST contain **stable, low-cardinality key facts** that enable deduplication across many log lines and deployments.
57
+
58
+
- Use `properties` for: cloud provider (`aws`, `gcp`, `azure`), technology/library name, protocol, major/normalized version, stable service names
59
+
- Use `meta` for: supplementary details like regions/availability zones, hostnames, instance IDs, pod/container names, IPs, URLs/paths, request/trace IDs, endpoint lists, build hashes/edition labels, contextual notes, security observations, or any interesting information that doesn't fit as a stable property
- If multiple values are observed for the same property, prefer the most frequently supported value in `value` and record alternates in `meta.observed_*` with evidence.
66
+
- If multiple values are observed for the same property, prefer the most frequently supported value in `properties` and record alternates in `meta.observed_*` with evidence.
60
67
- Avoid emitting separate features that differ only by high-cardinality metadata. Merge into one feature and store varying details in `meta`.
61
68
62
69
## Inference & Confidence
63
70
64
71
**One-level inference is allowed** when strong patterns exist, but use lower confidence and clearly label it:
65
72
- Tag inferred features with `"inferred"` in `tags`
66
-
- Explain the inference briefly in `meta.notes`
73
+
- Explain the inference briefly in `meta.note`
67
74
68
75
Confidence bands:
69
76
- **90–100**: explicit, unambiguous evidence
@@ -81,21 +88,22 @@ Evidence requirements:
81
88
- Evidence must directly support the feature claim
82
89
83
90
Version formatting rules (important for CVE/vulnerability analysis):
- Strip leading `v` and surrounding text; keep only the numeric portion when possible
86
-
- If the original version contains labels (LTS/Enterprise/codename/build metadata), store the normalized numeric part in `value.version` and store the original in `meta.raw_version`
87
-
- Remove codenames/release names/edition labels from `value.version`
93
+
- If the original version contains labels (LTS/Enterprise/codename/build metadata), store the normalized numeric part in `properties.version` and store the original in `meta.raw_version`
94
+
- Remove codenames/release names/edition labels from `properties.version`
88
95
89
96
## Examples
90
97
91
98
**Example 1 - Infrastructure with clean version**
92
99
```
93
100
{
94
101
"type": "infrastructure",
95
-
"name": "operating_system",
102
+
"subtype": "operating_system",
103
+
"id": "ubuntu-20.04.6",
96
104
"title": "Ubuntu 20.04.6",
97
105
"description": "Ubuntu Linux operating system version 20.04.6",
98
-
"value": {
106
+
"properties": {
99
107
"os": "ubuntu",
100
108
"version": "20.04.6"
101
109
},
@@ -111,14 +119,15 @@ Version formatting rules (important for CVE/vulnerability analysis):
111
119
}
112
120
```
113
121
114
-
**Example 2 - Infrastructure showing value vs meta**
122
+
**Example 2 - Infrastructure showing properties vs meta**
115
123
```
116
124
{
117
125
"type": "infrastructure",
118
-
"name": "cloud_deployment",
126
+
"subtype": "cloud_deployment",
127
+
"id": "aws-deployment",
119
128
"title": "AWS",
120
129
"description": "AWS cloud deployment observed across one or more regions/availability zones",
121
-
"value": {
130
+
"properties": {
122
131
"provider": "aws"
123
132
},
124
133
"confidence": 92,
@@ -130,7 +139,8 @@ Version formatting rules (important for CVE/vulnerability analysis):
"notes": "Aggregate endpoints under one dependency; cap the list (e.g., max 10) and summarize additional entries."
217
+
"note": "Aggregate endpoints under one dependency; cap the list (e.g., max 10) and summarize additional entries"
205
218
}
206
219
}
207
220
```
@@ -210,7 +223,7 @@ Version formatting rules (important for CVE/vulnerability analysis):
210
223
211
224
- Extract all features that meet the confidence threshold and have supporting evidence.
212
225
- Prefer fewer, higher-confidence features over many speculative ones.
213
-
- Sort features by descending `confidence` (and within ties, stable alphabetical order by `type`, then `name`).
226
+
- Sort features by descending `confidence` (and within ties, stable alphabetical order by `type`, then `subtype`).
214
227
- Dependency anti-spam: only emit a dependency feature when logs contain explicit evidence of a relationship; aggregate endpoints in `meta.endpoints` and cap the list.
215
228
216
229
Extract all features that meet the confidence threshold and have supporting evidence. Use the finalize_features tool to return the results.
0 commit comments