Description
This is a very rough draft analysis of how the transformprocessor could be enhanced to support the log processing capabilities of the log-collection library. Certainly more careful design would be warranted, but the suggestions and examples are a starting point for conversation.
Path expressions vs "field syntax"
log-collection defines a field syntax that is very similar to transformprocessor's "path expression". However, it also allows referring to nested fields in attributes and body, which would be a very important capability for parity.
Expressions
log-collection exposes an expression engine. This could be represented as a new function called expr(), which would typically be composed into other functions:
set(attributes["some.attr"], expr(foo ? 'foo' : 'no foo'))
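For context on what an expr() function would be wrapping, here is a minimal sketch of evaluating such an expression with the antonmedv/expr package (which I believe is what log-collection's expression support is built on; the environment and field names are made up):

```go
package main

import (
	"fmt"

	"github.com/antonmedv/expr"
)

func main() {
	// Hypothetical environment assembled from a log record's fields.
	env := map[string]interface{}{
		"foo": "bar",
	}

	// Roughly the expression from the example above, written as a
	// presence check so it evaluates cleanly against the map env.
	program, err := expr.Compile(`foo != nil ? 'foo' : 'no foo'`, expr.Env(env))
	if err != nil {
		panic(err)
	}

	out, err := expr.Run(program, env)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // foo
}
```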
Alternately, it may be possible to provide equivalent functions for the same capabilities.
Parsers
log-collection's generic parsers like json, regex, csv, and keyvalue could be represented as functions. These all produce a map[string]interface{}, which could then be set as desired:
parse_json(body, attributes)
parse_regex(body, '/^.....$/', attributes["tmp"])
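To make that concrete, here is a rough sketch (standard library only, all function names hypothetical) of what json and regex parsing functions could look like, each producing a map[string]interface{} that the processor would then write to the chosen destination:

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// parseJSON parses a JSON object from the body into a generic map.
func parseJSON(body string) (map[string]interface{}, error) {
	out := map[string]interface{}{}
	if err := json.Unmarshal([]byte(body), &out); err != nil {
		return nil, err
	}
	return out, nil
}

// parseRegex extracts named capture groups from the body into a generic map.
func parseRegex(body string, re *regexp.Regexp) (map[string]interface{}, error) {
	match := re.FindStringSubmatch(body)
	if match == nil {
		return nil, fmt.Errorf("no match")
	}
	out := map[string]interface{}{}
	for i, name := range re.SubexpNames() {
		if i != 0 && name != "" {
			out[name] = match[i]
		}
	}
	return out, nil
}

func main() {
	m, _ := parseJSON(`{"time": "2021-01-01T00:00:00Z", "msg": "hello"}`)
	fmt.Println(m["msg"]) // hello

	re := regexp.MustCompile(`^(?P<level>\w+): (?P<msg>.*)$`)
	m2, _ := parseRegex("ERROR: something broke", re)
	fmt.Println(m2["level"], m2["msg"]) // ERROR something broke
}
```

The error return is also the natural hook for the "only run embedded sub-parsers on success" behavior described next.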
A common pattern is to "embed" subparsers into these generic parsers, with the primary benefit being that they only execute if the primary parsing operation succeeds. Possibly this could be represented with some kind of conditional sub-query concept:
parse_json(body, attributes)
strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local")
Moving values around
- set is equivalent to add
- retain is roughly equivalent to keep_keys, but there appear to be some nuanced differences. Need to look into this more.
- copy(from, to)
- move(from, to)
- remove(attributes["one"], attributes["two"])
- flatten(attributes["multilayer.map"]) (sketched below)
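flatten is the only one of these without an obvious transformprocessor counterpart, so here is a rough sketch of the behavior using a plain map and a hypothetical helper (whether promoted keys keep their names or get dot-prefixed is a detail to match against log-collection's operator):

```go
package main

import "fmt"

// flattenField promotes the children of attrs[field] up one level and
// removes the original field. Hypothetical helper for illustration.
func flattenField(attrs map[string]interface{}, field string) {
	nested, ok := attrs[field].(map[string]interface{})
	if !ok {
		return
	}
	for k, v := range nested {
		attrs[k] = v
	}
	delete(attrs, field)
}

func main() {
	attrs := map[string]interface{}{
		"multilayer.map": map[string]interface{}{
			"one": 1,
			"two": 2,
		},
		"other": "value",
	}
	flattenField(attrs, "multilayer.map")
	fmt.Println(attrs) // map[one:1 other:value two:2]
}
```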
Timestamps
log-collection supports multiple timestamp parsing syntaxes, namely strptime, gotime, and epoch (unix). These would translate fairly easily to functions:
strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local")
gotime(attributes["time"], "Jan 2 15:04:05 MST 2006")
unixtime(concat(attributes["seconds"], ".", attributes["nanoseconds"]), "s.ns")
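For reference, the gotime and epoch cases map almost directly onto the standard library (the values below are made up; strptime would additionally need a layer that translates C-style directives into Go layouts):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

func main() {
	// gotime: the format argument is a Go reference-time layout.
	t, err := time.Parse("Jan 2 15:04:05 MST 2006", "Feb 3 01:02:03 UTC 2021")
	if err != nil {
		panic(err)
	}
	fmt.Println(t.UTC())

	// epoch "s.ns": seconds and nanoseconds joined with a dot, as in the
	// concat() example above.
	parts := strings.SplitN("1612314123.500000000", ".", 2)
	sec, _ := strconv.ParseInt(parts[0], 10, 64)
	nsec, _ := strconv.ParseInt(parts[1], 10, 64)
	fmt.Println(time.Unix(sec, nsec).UTC())
}
```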
Severity
log-collection provides a very robust mechanism for interpreting severities, which may be difficult to represent in the syntax of this processor. The main idea of the system is that severity is interpreted according to a mapping. Several out-of-the-box mappings are available, and the user can layer on additional mappings as needed. This gives a concise configuration, and the implementation can be highly optimized (a single map lookup, instead of iteration over many queries).
One way to represent these same capabilities would be a class of functions that produce and/or mutate severity mappings:
sevmap_default()
sevmap_with(sevmap_empty(), as_warn("w", "warn", "warning", "hey!"), as_error("e", "error", "err", "noo!"))
sevmap_with(sevmap_http(), as_fatal(404))
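To illustrate the single-map-lookup point, here is a minimal sketch of how a layered mapping could be built once at startup and applied per record (type and function names are made up; the numeric values roughly follow the OpenTelemetry severity number ranges):

```go
package main

import (
	"fmt"
	"strings"
)

type Severity int

const (
	SeverityUndefined Severity = 0
	SeverityWarn      Severity = 13
	SeverityError     Severity = 17
	SeverityFatal     Severity = 21
)

// SevMap maps normalized input values to severities.
type SevMap map[string]Severity

// With layers additional entries on top of an existing mapping.
func (m SevMap) With(sev Severity, values ...string) SevMap {
	for _, v := range values {
		m[strings.ToLower(v)] = sev
	}
	return m
}

// Lookup costs a single map access per log record.
func (m SevMap) Lookup(value string) Severity {
	return m[strings.ToLower(value)]
}

func main() {
	m := SevMap{}.
		With(SeverityWarn, "w", "warn", "warning", "hey!").
		With(SeverityError, "e", "error", "err", "noo!").
		With(SeverityFatal, "404")

	fmt.Println(m.Lookup("WARNING")) // 13
	fmt.Println(m.Lookup("noo!"))    // 17
}
```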
Conditionality
- This is likely planned, but additional matching operators would be necessary to reach parity, specifically regex matching.
- It would possibly be nice to provide a way to apply a where clause over multiple related queries:
if(condition, run(q1, q2, q3))
if(parse_json(body, attributes), run(strptime(...), severity(...)))
Routing
log-collection supports a router capability, which allows users to apply alternate processing paths based on any criteria that can be evaluated against individual logs. A brute force equivalent would be to apply the same where clause repeatedly:
strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y") where body ~= some_regex
severity(attributes["sev"], stdsevmap()) where body ~= some_regex
gotime(attributes["timestamp"], "Jan 2 15:04:05 MST 2006") where body ~= other_regex
severity(attributes["status.code"], httpsevmap()) where body ~= other_regex
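A less repetitive shape would compile each route's condition once and attach the related queries to it. A rough sketch, with everything (types, route conditions, actions) made up for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// route pairs a compiled condition with the queries that should run
// when a log record's body matches it.
type route struct {
	match   *regexp.Regexp
	actions []func(string)
}

func main() {
	routes := []route{
		{
			match: regexp.MustCompile(`^apache: `),
			actions: []func(string){
				func(body string) { fmt.Println("strptime + http severity for:", body) },
			},
		},
		{
			match: regexp.MustCompile(`^syslog: `),
			actions: []func(string){
				func(body string) { fmt.Println("gotime + std severity for:", body) },
			},
		},
	}

	for _, body := range []string{"apache: GET /", "syslog: kernel oops"} {
		for _, r := range routes {
			if r.match.MatchString(body) {
				for _, action := range r.actions {
					action(body)
				}
				break // only the first matching route applies
			}
		}
	}
}
```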
Resource and scope challenges
Logs often contain information about the resource and/or scope which must be parsed from text. Isolating and setting these values is fairly straightforward when working with a flat data model, such as the one used in log-collection, but it's not clear to me whether the pdata model will struggle with this.
For example, suppose a log format is shaped like resource_name,scope_name,message. Should/does transformprocessor create a new pdata.ResourceLogs each time a resource attribute is isolated? Should it cross reference with existing resources in the pdata.ResourceLogsSlice and combine them? Could it do this performantly? How many log processing functions could trigger this kind of complication? (e.g. move(attributes["resource_name"], resource["name"])).
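To make the "cross reference and combine" question concrete, here is a heavily simplified sketch of grouping records by an extracted resource value, using plain Go types rather than pdata (everything here is made up for illustration); the open question is whether the processor should do something equivalent against pdata.ResourceLogsSlice, and at what cost:

```go
package main

import (
	"fmt"
	"strings"
)

// record is a stand-in for a log record; the real processor would be
// operating on pdata types instead.
type record struct {
	resourceName string
	scopeName    string
	message      string
}

func main() {
	lines := []string{
		"checkout,http,request handled",
		"checkout,grpc,call failed",
		"payments,http,request handled",
	}

	// Combine records that share the same extracted resource value,
	// i.e. one bucket per distinct resource.
	groups := map[string][]record{}
	for _, line := range lines {
		parts := strings.SplitN(line, ",", 3)
		r := record{resourceName: parts[0], scopeName: parts[1], message: parts[2]}
		groups[r.resourceName] = append(groups[r.resourceName], r)
	}

	for name, recs := range groups {
		fmt.Printf("resource %q -> %d record(s)\n", name, len(recs))
	}
}
```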
Need to give more thought to this area especially.