
Commit 95ad291

mnonnenmacher authored and sschuberth committed
feat(elasticsearch): Make the provider compatible with JSON logging
Make the Elasticsearch log file provider compatible with the recently introduced JSON logging. This simplifies using the provider in any environment where the default dynamic mapping is enabled. If JSON logging is not enabled, Logstash must be configured to map the log lines to an equivalent structure.

Also add support for a custom prefix for the ingested logs, which is typically used to separate the application logs from other log entries based on the environment. For the same reason, the property that contains the Kubernetes namespace is also made configurable.

The log lines are rendered as `<timestamp> <level> <message>` with the optional throwable appended. This avoids cluttering them with redundant information or technical details that are not relevant for users. The schema can later be extended based on demand.

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@doubleopen.io>
1 parent 62292b8 commit 95ad291

7 files changed

Lines changed: 291 additions & 127 deletions

logaccess/elasticsearch/README.md

Lines changed: 100 additions & 73 deletions
@@ -6,34 +6,50 @@ This module provides an implementation of the [Log access abstraction](../README
66

77
The `ElasticsearchLogFileProvider` sends requests against the Elasticsearch Search API to retrieve the logs of a
88
specific ORT run and component. Results are fetched in chronological order and paginated using Elasticsearch's
9-
`search_after` mechanism with a stable sort on `time` and `sortId`.
9+
`search_after` mechanism with a stable sort on `timestamp` and `sequenceNumber`.
1010

1111
The provider assumes a canonical indexed schema that ORT Server can query independent of the concrete collector used
1212
to ship logs to Elasticsearch. Deployments are responsible for creating compatible Elasticsearch mappings and for
1313
normalizing metadata and extracted log fields into this shape before documents are indexed.
1414

15-
The provider expects log documents to contain these fields:
16-
17-
| Field | Expected Elasticsearch type | Purpose |
18-
|-------|-----------------------------|---------|
19-
| `namespace` | `keyword` | Exact-match deployment namespace filter; must match `elasticsearchNamespace`. |
20-
| `component` | `keyword` | Exact-match ORT component filter using ORT Server component names. |
21-
| `ortRunId` | `keyword` with optional `long` subfield, for example `ortRunId.numeric` | Exact-match ORT run ID filter. |
22-
| `level` | `keyword` | Exact-match log level filter using ORT Server log level names. |
23-
| `time` | `date` | ORT log event timestamp used for range queries and primary sorting. |
24-
| `sortId` | `keyword` | Secondary sort key for stable `search_after` pagination. |
25-
| `message` | present in `_source`; `text` recommended for Kibana searches | Rendered log line written to the downloaded log file. |
26-
27-
The `message` field is written to the downloaded log file unchanged, one line per hit. It does not need to be indexed
28-
for search by the provider, but it must be present in `_source`. Indexing `message` as a `text` field is recommended for
29-
Kibana, so users can search log lines, exceptions, request paths, and other free-form text. A `.keyword` subfield can be
30-
useful for exact matches, sorting, or aggregations, but should usually have an `ignore_above` limit to avoid indexing
31-
very large log lines as keyword terms.
32-
33-
Although `ortRunId` contains numeric values, the provider treats it as an identifier and queries it as a keyword. This
34-
matches Elasticsearch's guidance for numeric-looking identifiers that are primarily used in term queries. A numeric
35-
multi-field can still be added for future range queries or numeric sorting without changing the provider's exact-match
36-
query behavior.
15+
The provider expects log documents to follow the structured JSON schema produced by the
16+
[`OrtServerJsonEncoder`](../../utils/logging/src/main/kotlin/OrtServerJsonEncoder.kt) (the same field names emitted by
17+
the Logback JSON encoder, such as `formattedMessage`, `level`, `timestamp`, `sequenceNumber`, and the MDC entries under
18+
`mdc.*`). The relevant fields are:
19+
20+
| Field | Used as | Expected Elasticsearch type | Purpose |
21+
|--------------------|----------------------------------------|--------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
22+
| `namespace` | filter | `keyword` | Exact-match deployment namespace filter; must match `elasticsearchNamespace`. The field name is configurable via `elasticsearchNamespaceField`. |
23+
| `mdc.component` | filter (`mdc.component.keyword`) | `text` with `keyword` subfield | Exact-match ORT component filter using ORT Server component names. |
24+
| `mdc.ortRunId` | filter (`mdc.ortRunId.keyword`) | `text` with `keyword` subfield | Exact-match ORT run ID filter. |
25+
| `level` | filter (`level.keyword`) and `_source` | `text` with `keyword` subfield | Exact-match log level filter; the level value is also prepended to each written log line. |
26+
| `timestamp` | range filter, primary sort, `_source` | `date` (`epoch_millis`) | Log event timestamp; used for range queries, primary sorting, and rendered as the leading timestamp of each log line. |
27+
| `sequenceNumber` | secondary sort | `long` | Stable tie-breaker for `search_after` pagination among hits that share the same `timestamp`. |
28+
| `formattedMessage` | `_source` | `text` recommended | Rendered log line written to the downloaded log file. |
29+
| `throwable` | `_source` | `text` recommended | Optional rendered throwable; appended after the message when present. |
30+
31+
Filtering and sorting use the indexed (`keyword`) fields, while the log content is read from `_source`. The provider
32+
therefore requests only `level`, `formattedMessage`, `throwable`, and `timestamp` in `_source` and uses the indexed
33+
fields for the remaining filters and sorts. Each downloaded log line is rendered as `<timestamp> <level> <message>`,
34+
with the `throwable` (if present) appended on the following line.
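
The rendering rule above can be sketched as follows. This is a minimal illustration and not the provider's actual code: the `LogHit` class and the `renderLogLine` function are hypothetical names, and the ISO-8601 timestamp format is an assumption, since the README only states that the timestamp leads the line.

```kotlin
import java.time.Instant

/** Hypothetical holder for the `_source` fields the provider requests. */
data class LogHit(
    val timestamp: Long, // epoch milliseconds, as stored in the `timestamp` field
    val level: String,
    val formattedMessage: String?,
    val throwable: String? = null
)

/**
 * Render a hit as `<timestamp> <level> <message>` and append the throwable on the following
 * line if present. Return null for hits without a message, which are skipped.
 * Assumption: the timestamp is printed in ISO-8601 format.
 */
fun renderLogLine(hit: LogHit): String? {
    val message = hit.formattedMessage ?: return null
    val line = "${Instant.ofEpochMilli(hit.timestamp)} ${hit.level} $message"
    return if (hit.throwable.isNullOrEmpty()) line else "$line\n${hit.throwable}"
}
```

A hit without `formattedMessage` yields `null` here, mirroring the statement that such hits are skipped.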
35+
36+
The `formattedMessage` field is written to the downloaded log file, one line per hit. It does not need to be indexed
37+
for search by the provider, but it must be present in `_source`. Indexing `formattedMessage` as a `text` field is
38+
recommended for Kibana, so users can search log lines, exceptions, request paths, and other free-form text. A
39+
`.keyword` subfield can be useful for exact matches, sorting, or aggregations, but should usually have an
40+
`ignore_above` limit to avoid indexing very large log lines as keyword terms.
41+
42+
Although `mdc.ortRunId` contains numeric values, the provider treats it as an identifier and queries its `keyword`
43+
subfield. This matches Elasticsearch's guidance for numeric-looking identifiers that are primarily used in term
44+
queries.
45+
46+
### Field prefix
47+
48+
If the indexing pipeline nests all log-line fields under a common prefix (for example, when Logstash is configured to
49+
add a custom prefix to all fields during indexing), set `elasticsearchFieldPrefix`. The provider then prepends this
50+
prefix to every field taken from the log line. The namespace field is *not* prefixed, because it is derived
51+
from deployment metadata rather than from the log line. Prefixed fields are returned by Elasticsearch as nested
52+
objects and resolved accordingly when reading values from `_source`.
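
As a rough sketch of the prefix handling described above, the following helpers build a prefixed field name and resolve a dotted path against a nested `_source` map. The function names `prefixed` and `resolve` are hypothetical, and joining prefix and field name with a dot is an assumption based on the statement that prefixed fields come back as nested objects.

```kotlin
/**
 * Prepend the optional prefix to a log-line field name.
 * Assumption: prefix and field name are joined with a dot.
 */
fun prefixed(field: String, prefix: String?): String =
    if (prefix.isNullOrEmpty()) field else "$prefix.$field"

/** Walk a dotted path through nested `_source` maps and return null if any segment is missing. */
fun resolve(source: Map<String, Any?>, path: String): Any? {
    var current: Any? = source
    for (key in path.split('.')) {
        current = (current as? Map<*, *>)?.get(key) ?: return null
    }
    return current
}
```

With `elasticsearchFieldPrefix = "ortserver"`, a filter on `level.keyword` would then target `ortserver.level.keyword`, while the namespace field keeps its configured name.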
3753

3854
## Configuration
3955

@@ -45,23 +61,26 @@ logFileService {
4561
elasticsearchServerUrl = "https://elasticsearch.example.org"
4662
elasticsearchIndex = "ort-server-logs-*"
4763
elasticsearchNamespace = "prod"
64+
elasticsearchNamespaceField = "namespace"
65+
elasticsearchFieldPrefix = "ortserver"
4866
elasticsearchPageSize = 1000
4967
elasticsearchApiKey = "base64-api-key"
5068
}
5169
```
5270

5371
Supported properties:
5472

55-
| Property | Variable | Description | Default | Secret |
56-
|----------|----------|-------------|---------|--------|
57-
| `elasticsearchServerUrl` | `ELASTICSEARCH_SERVER_URL` | Base URL of the Elasticsearch instance. | mandatory | no |
58-
| `elasticsearchIndex` | `ELASTICSEARCH_INDEX` | Index or index pattern to query. | mandatory | no |
59-
| `elasticsearchNamespace` | `ELASTICSEARCH_NAMESPACE` | Namespace label used to restrict queries. | mandatory | no |
60-
| `elasticsearchPageSize` | `ELASTICSEARCH_PAGE_SIZE` | Number of hits to fetch per search request. | `1000` | no |
61-
| `elasticsearchUsername` | `ELASTICSEARCH_USERNAME` | Optional username for Basic Auth. Ignored when an API key is configured. | undefined | no |
62-
| `elasticsearchPassword` | `ELASTICSEARCH_PASSWORD` | Optional password for Basic Auth. | undefined | yes |
63-
| `elasticsearchApiKey` | `ELASTICSEARCH_API_KEY` | Optional Elasticsearch API key. Takes precedence over Basic Auth. | undefined | yes |
64-
| `elasticsearchTimeoutSec` | `ELASTICSEARCH_TIMEOUT_SEC` | Optional request timeout in seconds. | `30` | no |
73+
| Property | Variable | Description | Default | Secret |
74+
|-------------------------------|---------------------------------|-------------------------------------------------------------------------------------------|-------------|--------|
75+
| `elasticsearchServerUrl` | `ELASTICSEARCH_SERVER_URL` | Base URL of the Elasticsearch instance. | mandatory | no |
76+
| `elasticsearchIndex` | `ELASTICSEARCH_INDEX` | Index or index pattern to query. | mandatory | no |
77+
| `elasticsearchNamespace` | `ELASTICSEARCH_NAMESPACE` | Namespace label used to restrict queries. | mandatory | no |
78+
| `elasticsearchNamespaceField` | `ELASTICSEARCH_NAMESPACE_FIELD` | Name of the field that holds the namespace value used by the namespace filter. | `namespace` | no |
79+
| `elasticsearchFieldPrefix` | `ELASTICSEARCH_FIELD_PREFIX` | Optional prefix prepended to all log-line fields (everything except the namespace field). | undefined | no |
80+
| `elasticsearchPageSize` | `ELASTICSEARCH_PAGE_SIZE` | Number of hits to fetch per search request. | `1000` | no |
81+
| `elasticsearchUsername` | `ELASTICSEARCH_USERNAME` | Optional username for Basic Auth. Ignored when an API key is configured. | undefined | no |
82+
| `elasticsearchPassword` | `ELASTICSEARCH_PASSWORD` | Optional password for Basic Auth. | undefined | yes |
83+
| `elasticsearchApiKey` | `ELASTICSEARCH_API_KEY` | Optional Elasticsearch API key. Takes precedence over Basic Auth. | undefined | yes |
6584

6685
If both Basic Auth credentials and an API key are configured, the API key is used and Basic Auth is ignored.
6786

@@ -71,74 +90,82 @@ In addition, default properties of the HTTP client that is used to send requests
7190

7291
Queries use a bool filter with:
7392

74-
- namespace term filter on `namespace`
75-
- component term filter on `component`
76-
- ORT run ID term filter on `ortRunId`
77-
- log level terms filter on `level`
78-
- time range filter on `time`
93+
- namespace term filter on the configured namespace field (default `namespace`)
94+
- component term filter on `mdc.component.keyword`
95+
- ORT run ID term filter on `mdc.ortRunId.keyword`
96+
- log level terms filter on `level.keyword`
97+
- time range filter on `timestamp`
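
The filter clauses listed above could be assembled roughly as shown below. This is a sketch under assumptions and not the provider's implementation: `buildFilters` is a hypothetical name, the inclusive `gte`/`lte` range bounds are assumed, and the nested maps only mirror the shape of Elasticsearch's `term`, `terms`, and `range` queries without serializing them.

```kotlin
/**
 * Build the clauses of the bool filter as nested maps that mirror the query DSL.
 * Assumption: both bounds of the time range are inclusive.
 */
fun buildFilters(
    namespaceField: String,
    namespace: String,
    component: String,
    ortRunId: Long,
    levels: Set<String>,
    fromMillis: Long,
    toMillis: Long
): List<Map<String, Any>> = listOf(
    mapOf("term" to mapOf(namespaceField to namespace)),
    mapOf("term" to mapOf("mdc.component.keyword" to component)),
    // The run ID is matched as a keyword, so it is sent as a string.
    mapOf("term" to mapOf("mdc.ortRunId.keyword" to ortRunId.toString())),
    mapOf("terms" to mapOf("level.keyword" to levels.sorted())),
    mapOf("range" to mapOf("timestamp" to mapOf("gte" to fromMillis, "lte" to toMillis)))
)
```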
7998

80-
Results are sorted ascending by `time` and `sortId` and fetched via Elasticsearch's `search_after` mechanism until all
81-
matching hits have been retrieved. Instead of using offset-based paging, `search_after` asks Elasticsearch for the next
82-
page after the sort values of the last hit from the previous page. This avoids deep paging limits.
99+
Results are sorted ascending by `timestamp` and `sequenceNumber` and fetched via Elasticsearch's `search_after`
100+
mechanism until all matching hits have been retrieved. Instead of using offset-based paging, `search_after` asks
101+
Elasticsearch for the next page after the sort values of the last hit from the previous page. This avoids deep paging
102+
limits.
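
The paging loop described above can be illustrated with a small sketch. The `Hit` class and the `fetchAll` function are hypothetical, and the `search` parameter stands in for one Search API request; stopping at the first page that is shorter than the page size is an assumption about the termination condition.

```kotlin
/** Hypothetical hit: the sort values returned by Elasticsearch plus the rendered line. */
data class Hit(val sort: List<Long>, val line: String)

/**
 * Fetch all hits page by page. [search] stands in for one search request and receives the
 * sort values of the last hit of the previous page, or null for the first page.
 * Assumption: a page with fewer than [pageSize] hits is the last one.
 */
fun fetchAll(pageSize: Int, search: (searchAfter: List<Long>?) -> List<Hit>): List<Hit> {
    require(pageSize > 0) { "The page size must be positive." }

    val all = mutableListOf<Hit>()
    var searchAfter: List<Long>? = null

    while (true) {
        val page = search(searchAfter)
        all += page
        if (page.size < pageSize) return all
        // Continue after the (timestamp, sequenceNumber) pair of the last hit.
        searchAfter = page.last().sort
    }
}
```

Because each request only depends on the sort values of the last hit, no offset has to be tracked between pages.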
83103

84-
The secondary `sortId` field is required because multiple log lines can share the same `time` value. It should be a
85-
keyword-compatible value that is stable for the indexed log document and unique enough to distinguish log lines with the
86-
same timestamp. The value does not need to be meaningful to users.
87-
88-
Good sources for `sortId`, in order of preference, are:
89-
90-
- a collector-provided event ID, sequence number, or log-file offset if available
91-
- a deterministic hash of stable event identity fields such as pod UID, container ID, stream, log-file offset,
92-
high-precision event timestamp, and, if needed, the log message
93-
- for the local Docker Compose setup, a hash of `container_id`, `created`, and `message`
94-
95-
Avoid deriving `sortId` only from low-cardinality fields such as `component`, only from `time`, or from values that can
96-
change between reprocessing attempts. A UUID or generated event ID is acceptable if it is stored with the document and
97-
remains unchanged on retries or reindexing. A deterministic hash is usually easier to reproduce when debugging.
104+
The secondary `sequenceNumber` field is required because multiple log lines can share the same `timestamp`. It must be
105+
present on every document and provide a stable, monotonically increasing tie-breaker for log lines with the same
106+
timestamp. ORT Server's logging stack emits this value automatically; deployments only need to ensure it is indexed in
107+
a sortable numeric type such as `long`.
98108

99109
## Collector Normalization
100110

101-
ORT Server currently queries the canonical field names listed above and does not support configuring alternate
102-
Elasticsearch field names per deployment. Collector pipelines must therefore normalize platform-specific metadata and
103-
log content into this schema before documents are indexed.
111+
ORT Server queries a fixed set of field names derived from its structured log schema. The only per-deployment
112+
customizations are the namespace field name (`elasticsearchNamespaceField`) and an optional global field prefix
113+
(`elasticsearchFieldPrefix`); the remaining field names are fixed. Collector pipelines must therefore normalize
114+
platform-specific metadata and log content into this schema before documents are indexed.
104115

105-
For Kubernetes deployments, this typically means deriving `namespace` and `component` from pod metadata, extracting
106-
`ortRunId`, `level`, `time`, and `message` from the structured ORT Server log line, and adding `sortId` as described in
107-
the query behavior section.
116+
For Kubernetes deployments, this typically means deriving `namespace` from pod metadata, and ensuring that JSON logging
117+
is enabled by setting the `LOG_FORMAT` environment variable to `json` in all ORT Server containers.
108118

109119
If the filter or sort fields are missing or mapped incompatibly, ORT Server queries will not find the expected log
110-
documents. Hits without `_source.message` are skipped because there is no rendered log line to write to the downloaded
111-
file.
120+
documents. Hits without `_source.formattedMessage` are skipped because there is no rendered log line to write to the
121+
downloaded file.
112122

113-
An Elasticsearch index template for the required fields can look like this:
123+
An Elasticsearch index template for the required fields can look like this. The `.keyword` subfields on
124+
`mdc.component`, `mdc.ortRunId`, and `level` are what the provider filters and sorts on; they correspond to
125+
Elasticsearch's default dynamic mapping for string fields.
114126

115127
```json
116128
{
117129
"mappings": {
118130
"properties": {
119131
"namespace": { "type": "keyword" },
120-
"component": { "type": "keyword" },
121-
"ortRunId": {
122-
"type": "keyword",
132+
"mdc": {
133+
"properties": {
134+
"component": {
135+
"type": "text",
136+
"fields": {
137+
"keyword": { "type": "keyword", "ignore_above": 256 }
138+
}
139+
},
140+
"ortRunId": {
141+
"type": "text",
142+
"fields": {
143+
"keyword": { "type": "keyword", "ignore_above": 256 }
144+
}
145+
}
146+
}
147+
},
148+
"level": {
149+
"type": "text",
123150
"fields": {
124-
"numeric": { "type": "long" }
151+
"keyword": { "type": "keyword", "ignore_above": 256 }
125152
}
126153
},
127-
"level": { "type": "keyword" },
128-
"time": {
154+
"timestamp": {
129155
"type": "date",
130156
"format": "strict_date_optional_time||epoch_millis"
131157
},
132-
"sortId": { "type": "keyword" },
133-
"message": {
158+
"sequenceNumber": { "type": "long" },
159+
"formattedMessage": {
134160
"type": "text",
135161
"fields": {
136162
"keyword": {
137163
"type": "keyword",
138164
"ignore_above": 8191
139165
}
140166
}
141-
}
167+
},
168+
"throwable": { "type": "text" }
142169
}
143170
}
144171
}

logaccess/elasticsearch/src/main/kotlin/ElasticsearchConfig.kt

Lines changed: 18 additions & 0 deletions
@@ -22,6 +22,7 @@ package org.eclipse.apoapsis.ortserver.logaccess.elasticsearch
2222
import org.eclipse.apoapsis.ortserver.config.ConfigManager
2323
import org.eclipse.apoapsis.ortserver.config.Path
2424
import org.eclipse.apoapsis.ortserver.utils.config.getIntOrDefault
25+
import org.eclipse.apoapsis.ortserver.utils.config.getStringOrDefault
2526
import org.eclipse.apoapsis.ortserver.utils.config.getStringOrNull
2627

2728
/**
@@ -44,6 +45,15 @@ data class ElasticsearchConfig(
4445
*/
4546
val namespace: String,
4647

48+
/** The path of the log field that contains the Kubernetes namespace. */
49+
val namespaceField: String,
50+
51+
/**
52+
* A prefix to prepend to the log fields taken from the log lines. This is required if logstash is configured to add
53+
* a custom prefix to all fields during indexing.
54+
*/
55+
val fieldPrefix: String?,
56+
4757
/**
4858
* The maximum number of hits requested per search call.
4959
* If more hits are available, the provider continues with `search_after` pagination using the last hit's sort
@@ -70,6 +80,12 @@ data class ElasticsearchConfig(
7080
/** The configuration property that defines the namespace label used in search filters. */
7181
private const val NAMESPACE_PROPERTY = "elasticsearchNamespace"
7282

83+
/** The configuration property that defines the path of the namespace field. Defaults to "namespace". */
84+
private const val NAMESPACE_FIELD_PROPERTY = "elasticsearchNamespaceField"
85+
86+
/** The configuration property that defines the field prefix. */
87+
private const val FIELD_PREFIX_PROPERTY = "elasticsearchFieldPrefix"
88+
7389
/** The configuration property that defines the page size used for search requests. */
7490
private const val PAGE_SIZE_PROPERTY = "elasticsearchPageSize"
7591

@@ -106,6 +122,8 @@ data class ElasticsearchConfig(
106122
serverUrl = configManager.getString(SERVER_URL_PROPERTY),
107123
index = configManager.getString(INDEX_PROPERTY),
108124
namespace = configManager.getString(NAMESPACE_PROPERTY),
125+
namespaceField = configManager.getStringOrDefault(NAMESPACE_FIELD_PROPERTY, "namespace"),
126+
fieldPrefix = configManager.getStringOrNull(FIELD_PREFIX_PROPERTY),
109127
pageSize = configManager.getIntOrDefault(PAGE_SIZE_PROPERTY, DEFAULT_PAGE_SIZE),
110128
username = username,
111129
password = password,
