
[processor/transform] - Log processing capabilities #9410


Description

@djaglowski

This is a very rough draft analysis of how the transformprocessor could be enhanced to support the log processing capabilities of the log-collection library. Certainly more careful design would be warranted, but the suggestions and examples are a starting point for conversation.

Path expressions vs "field syntax"

log-collection defines a field syntax that is very similar to transformprocessor's "path expression". However, it also allows referring to nested fields within attributes and body. This capability would be very important for parity.
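For illustration (the nested key names here are hypothetical), log-collection can address a value nested inside body or attributes directly, whereas the current path expression syntax only addresses a top-level key:

    # log-collection field syntax: nested access into body (hypothetical keys)
    body.wrapper.inner.time

    # transformprocessor path expression today: top-level key only
    attributes["some.attr"]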

Expressions

log-collection exposes an expression engine. This could be represented as a new function called expr(), which would typically be composed into other functions:

  • set(attributes["some.attr"], expr(foo ? 'foo' : 'no foo'))

Alternately, it may be possible to provide equivalent functions for the same capabilities.
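For comparison, the existing expression support in log-collection is reached through operator config, roughly like the following add operator (a sketch; field names are approximate and not verified against a specific release):

    # log-collection: compute a value with the expression engine
    - type: add
      field: attributes.some_attr
      value: "EXPR(body.foo != nil ? 'foo' : 'no foo')"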

Parsers

log-collection's generic parsers like json, regex, csv, and keyvalue could be represented as functions. These all produce a map[string]interface{}, which could then be set as desired:

  • parse_json(body, attributes)
  • parse_regex(body, '/^.....$/', attributes["tmp"])

A common pattern is to "embed" subparsers into these generic parsers, with the primary benefit being that they only execute if the primary parsing operation succeeded. Possibly this could be represented with some kind of conditional sub-query concept:

  • parse_json(body, attributes)
    • strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local")
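In log-collection, this embedding looks roughly like the following json_parser with an embedded timestamp block, which only runs if the JSON parse succeeds (a sketch; exact fields approximate):

    - type: json_parser
      parse_from: body
      parse_to: attributes
      # embedded sub-parser: only applied when JSON parsing succeeded
      timestamp:
        parse_from: attributes.time
        layout_type: strptime
        layout: '%a %b %e %H:%M:%S %Z %Y'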

Moving values around

  • set is equivalent to add
  • retain is roughly equivalent to keep_keys, but there appear to be some nuanced differences. Need to look into this more.
  • copy(from, to)
  • move(from, to)
  • remove(attributes["one"], attributes["two"])
  • flatten(attributes["multilayer.map"])
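For reference, the corresponding log-collection operators are configured roughly as follows (a sketch; field and key names are approximate/hypothetical):

    - type: move
      from: attributes.uid
      to: resource.user_id
    - type: copy
      from: attributes.ip
      to: attributes.host_ip
    - type: remove
      field: attributes.password
    - type: flatten
      field: attributes.multilayer_map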

Timestamps

log-collection supports multiple timestamp parsing syntaxes, namely strptime, gotime, and epoch (unix). These would translate fairly easily to functions:

  • strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local")
  • gotime(attributes["time"], "Jan 2 15:04:05 MST 2006")
  • unixtime(concat(attributes["seconds"], ".", attributes["nanos"]), "s.ns")
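The same operations in log-collection today are configured roughly as follows (a sketch; fields approximate):

    # strptime-style layout
    - type: time_parser
      parse_from: attributes.time
      layout_type: strptime
      layout: '%a %b %e %H:%M:%S %Z %Y'
    # epoch-style layout (unix seconds)
    - type: time_parser
      parse_from: attributes.seconds
      layout_type: epoch
      layout: s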

Severity

log-collection provides a very robust mechanism for interpreting severities, which may be difficult to represent in the syntax of this processor. The main idea of the system is that severity is interpreted according to a mapping. Several out-of-the-box mappings are available, and the user can layer additional mappings on top as needed. This gives a concise configuration, and the implementation can be highly optimized (a single map lookup, instead of iteration over many queries).

One way to represent the same capabilities would be a class of functions that produce and/or mutate severity mappings:

  • sevmap_default()
  • sevmap_with(sevmap_empty(), as_warn("w", "warn", "warning", "hey!"), as_error("e", "error", "err", "noo!"))
  • sevmap_with(sevmap_http(), as_fatal(404))
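For comparison, log-collection expresses this as a declarative mapping on the severity parser, layered on top of a built-in mapping, which the functions above would instead construct programmatically (a rough sketch; fields approximate):

    - type: severity_parser
      parse_from: attributes.sev
      # user-defined values layered on top of the built-in mapping
      mapping:
        warn:
          - w
          - warning
          - hey!
        error:
          - e
          - err
          - noo!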

Conditionality

  • This is likely planned, but additional matching operators would be necessary to reach parity, specifically regex matching.
  • It would possibly be nice to provide a way to apply a where clause over multiple related queries:
  • if(condition, run(q1, q2, q3))
  • if(parse_json(body, attributes), run(strptime(...), severity(...)))

Routing

log-collection supports a router capability, which allows users to apply alternate processing paths based on any criteria that can be evaluated against individual logs. A brute force equivalent would be to apply the same where clause repeatedly:

  • strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y") where body ~= some_regex
  • severity(attributes["sev"], stdsevmap()) where body ~= some_regex
  • gotime(attributes["timestamp"], "Jan 2 15:04:05 MST 2006") where body ~= other_regex
  • severity(attributes["status.code"], httpsevmap()) where body ~= other_regex

Resource and scope challenges

Logs often contain information about the resource and/or scope which must be parsed from text. Isolating and setting these values is fairly straightforward when working with a flat data model, such as the one used in log-collection, but it's not clear to me how well the pdata model handles this.

For example, suppose a log format is shaped like resource_name,scope_name,message. Should/does transformprocessor create a new pdata.ResourceLogs each time a resource attribute is isolated? Should it cross reference with existing resources in the pdata.ResourceLogsSlice and combine them? Could it do this performantly? How many log processing functions could trigger this kind of complication? (e.g. move(attributes["resource_name"], resource["name"])).

Need to give more thought to this area especially.
