
[processor/transform] - Log processing capabilities #9410


Description

@djaglowski

This is a very rough draft analysis of how the transformprocessor could be enhanced to support the log processing capabilities of the log-collection library. Certainly more careful design would be warranted, but the suggestions and examples are a starting point for conversation.

Path expressions vs "field syntax"

log-collection defines a field syntax that is very similar to transformprocessor's "path expression". However, it also allows referring to nested fields within attributes and body. This capability would be very important for parity.
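For illustration (the nested key names here are hypothetical), log-collection can address a value nested inside body or attributes directly, whereas the current path expression syntax only addresses a top-level key:

    # log-collection field syntax: nested access into body (hypothetical keys)
    body.wrapper.inner.time

    # transformprocessor path expression today: top-level key only
    attributes["some.attr"]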

Expressions

log-collection exposes an expression engine. This could be represented as a new function called expr(), which would typically be composed into other functions:

  • set(attributes["some.attr"], expr(foo ? 'foo' : 'no foo'))

Alternately, it may be possible to provide equivalent functions for the same capabilities.
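For comparison, the existing expression support in log-collection is reached through operator config, roughly like the following add operator (a sketch; field names are approximate and not verified against a specific release):

    # log-collection: compute a value with the expression engine
    - type: add
      field: attributes.some_attr
      value: "EXPR(body.foo != nil ? 'foo' : 'no foo')"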

Parsers

log-collection's generic parsers like json, regex, csv, and keyvalue could be represented as functions. These all produce a map[string]interface{}, which could then be set as desired:

  • parse_json(body, attributes)
  • parse_regex(body, '/^.....$/', attributes["tmp"])

A common pattern is to "embed" subparsers into these generic parsers, with the primary benefit being that they only execute if the primary parsing operation succeeded. Possibly this could be represented with some kind of conditional sub-query concept:

  • parse_json(body, attributes)
    • strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local")
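In log-collection, this embedding looks roughly like the following json_parser with an embedded timestamp block, which only runs if the JSON parse succeeds (a sketch; exact fields approximate):

    - type: json_parser
      parse_from: body
      parse_to: attributes
      # embedded sub-parser: only applied when JSON parsing succeeded
      timestamp:
        parse_from: attributes.time
        layout_type: strptime
        layout: '%a %b %e %H:%M:%S %Z %Y'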

Moving values around

  • set is equivalent to add
  • retain is roughly equivalent to keep_keys, but there appear to be some nuanced differences. Need to look into this more.
  • copy(from, to)
  • move(from, to)
  • remove(attributes["one"], attributes["two"])
  • flatten(attributes["multilayer.map"])
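For reference, the corresponding log-collection operators are configured roughly as follows (a sketch; field and key names are approximate/hypothetical):

    - type: move
      from: attributes.uid
      to: resource.user_id
    - type: copy
      from: attributes.ip
      to: attributes.host_ip
    - type: remove
      field: attributes.password
    - type: flatten
      field: attributes.multilayer_map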

Timestamps

log-collection supports multiple timestamp parsing syntaxes, namely strptime, gotime, and epoch (unix). These would translate fairly easily to functions:

  • strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local")
  • gotime(attributes["time"], "Jan 2 15:04:05 MST 2006")
  • unixtime(concat(attributes["seconds"], ".", attributes["nanos"]), "s.ns")
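The same operations in log-collection today are configured roughly as follows (a sketch; fields approximate):

    # strptime-style layout
    - type: time_parser
      parse_from: attributes.time
      layout_type: strptime
      layout: '%a %b %e %H:%M:%S %Z %Y'
    # epoch-style layout (unix seconds)
    - type: time_parser
      parse_from: attributes.seconds
      layout_type: epoch
      layout: s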

Severity

log-collection provides a very robust mechanism for interpreting severities, which may be difficult to represent in the syntax of this processor. The main idea of the system is that severity is interpreted according to a mapping. Several out-of-the-box mappings are available, and the user can layer additional mappings on top as needed. This gives a concise configuration, and the implementation can be highly optimized (a single map lookup, instead of iteration over many queries).

One way to represent the same capabilities would be a class of functions that produce and/or mutate severity mappings:

  • sevmap_default()
  • sevmap_with(sevmap_empty(), as_warn("w", "warn", "warning", "hey!"), as_error("e", "error", "err", "noo!"))
  • sevmap_with(sevmap_http(), as_fatal(404))
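For comparison, log-collection expresses this as a declarative mapping on the severity parser, layered on top of a built-in mapping, which the functions above would instead construct programmatically (a rough sketch; fields approximate):

    - type: severity_parser
      parse_from: attributes.sev
      # user-defined values layered on top of the built-in mapping
      mapping:
        warn:
          - w
          - warning
          - hey!
        error:
          - e
          - err
          - noo!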

Conditionality

  • This is likely planned, but additional matching operators would be necessary to reach parity, specifically regex matching.
  • It would possibly be nice to provide a way to apply a where clause over multiple related queries:
  • if(condition, run(q1, q2, q3))
  • if(parse_json(body, attributes), run(strptime(...), severity(...)))

Routing

log-collection supports a router capability, which allows users to apply alternate processing paths based on any criteria that can be evaluated against individual logs. A brute force equivalent would be to apply the same where clause repeatedly:

  • strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y") where body ~= some_regex
  • severity(attributes["sev"], stdsevmap()) where body ~= some_regex
  • gotime(attributes["timestamp"], "Jan 2 15:04:05 MST 2006") where body ~= other_regex
  • severity(attributes["status.code"], httpsevmap()) where body ~= other_regex

Resource and scope challenges

Logs often contain information about the resource and/or scope which must be parsed from text. Isolating and setting these values is fairly straightforward when working with a flat data model, such as the one used in log-collection, but it's not clear to me how well the pdata model handles this.

For example, suppose a log format is shaped like resource_name,scope_name,message. Should/does transformprocessor create a new pdata.ResourceLogs each time a resource attribute is isolated? Should it cross reference with existing resources in the pdata.ResourceLogsSlice and combine them? Could it do this performantly? How many log processing functions could trigger this kind of complication? (e.g. move(attributes["resource_name"], resource["name"])).

Need to give more thought to this area especially.
