
New component: processor/drain — log template annotation via the Drain algorithm #47235

@MikeGoldsmith

Description


The problem

Large-scale deployments routinely ingest millions of log records per minute, where a small number of structural patterns accounts for the majority of volume. Without a way to group logs by their underlying pattern, operators cannot:

  • Identify which log classes are generating the most volume
  • Write reliable filter rules that survive log message variations (e.g. different IP addresses, user names, request IDs)
  • Build cardinality-safe dashboards — grouping by raw log body is impractical

Existing approaches require operators to write and maintain regular expressions by hand, which doesn't scale and misses patterns they haven't anticipated.

Proposed solution

A new processor/drain that applies the Drain log clustering algorithm to log records as they pass through the pipeline. Drain builds a parse tree from log token structure and automatically derives template strings (e.g. "user <*> logged in from <*>") by replacing variable tokens with wildcards as similar lines accumulate.

The processor annotates each record with the following attribute:

Attribute            Type    Example
log.record.template  string  "user <*> logged in from <*>"

This aligns with the proposed OTel semantic convention in open-telemetry/semantic-conventions#1283 and #2064.

The processor annotates only — it does not filter. Downstream processors (e.g. filter) act on the attributes, keeping concerns separated and the processor composable.

Key features

  • Configurable Drain parse tree parameters (depth, similarity threshold, max clusters with LRU eviction)
  • Pre-seeding via known template strings or example log lines for stable templates across restarts
  • passthrough warmup mode (default): annotates immediately from the first record
  • buffer warmup mode: holds records until the tree has stabilized, then flushes with abstracted templates applied
  • Optional body_field for pipelines where the log body is a structured map and the message field cannot be promoted to a plain string body upstream. Pipelines that do control their intake should promote the field with a move operator instead.
  • Internal telemetry: processor_drain_clusters_active gauge, processor_drain_log_records_annotated and processor_drain_log_records_unannotated counters
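For the structured-body case, a body_field configuration could look like the sketch below. The key name "message" is illustrative, not prescribed; it names whichever top-level body key holds the log text:

```yaml
processors:
  drain:
    # Read the template source from body["message"] instead of the string body.
    body_field: message
```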

Example

processors:
  drain:
    log_cluster_depth: 4
    sim_threshold: 0.4
    seed_templates:
      - "user <*> logged in from <*>"
      - "connected to <*>"
    warmup_mode: buffer
    warmup_min_clusters: 20
    warmup_buffer_max_logs: 5000

  filter/drop_noisy:
    error_mode: ignore
    logs:
      log_record:
        - attributes["log.record.template"] == "heartbeat ping <*>"

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [drain, filter/drop_noisy]
      exporters: [otlp]

Alternatives considered

  • transform processor + OTTL: can match patterns but requires operators to enumerate every pattern manually as regex rules. Doesn't discover new patterns automatically.
  • attributes processor: attribute renaming only; no clustering capability.

Intentional scope limitations (deferred)

  • body_field supports only a single top-level key. Full OTTL path expressions are a natural follow-on but are out of scope for the initial implementation.
  • Snapshot persistence (save/restore the Drain tree across restarts) would eliminate the need for seeding. The internal drain package is designed to support this, but the plumbing into the collector lifecycle is deferred.
  • Multi-instance synchronization for consistent templates across horizontally scaled deployments.

Telemetry data types

Logs only.

Code owners

@MikeGoldsmith

Labels

Sponsor Needed (New component seeking sponsor)
