Description
Context
It's not uncommon for documents to have fields with dots in them:
{
"nested": {
"a.b.c": "This is a test"
}
}
When specifying a field in a processor (e.g. grok, rename, or others), it is currently not possible to target these fields, because dots are always interpreted as nested objects. For example, { "grok": { "field": "nested.a.b.c" }} only works on { "nested": { "a": { "b": { "c": "This is a test" } } } }.
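To make the limitation concrete, here is a minimal sketch of path resolution that always splits on dots. It is a simplified model for illustration, not the actual ingest implementation, and `resolve_strict` is a hypothetical name.

```python
def resolve_strict(doc, path):
    """Walk `path` one dot-separated segment at a time.

    Simplified model (assumption): every dot is treated as a step into a
    nested object. Returns None when a segment cannot be found.
    """
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


dotted = {"nested": {"a.b.c": "This is a test"}}
nested = {"nested": {"a": {"b": {"c": "This is a test"}}}}

# The dotted key "a.b.c" is never reached, because the path is split
# into the segments "nested", "a", "b", "c".
print(resolve_strict(dotted, "nested.a.b.c"))  # None
print(resolve_strict(nested, "nested.a.b.c"))  # This is a test
```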
This is important for the streams effort, which aims to offer a simplified UI to manage data in the Elastic Stack and simulates changes to ingest pipelines using the source of existing documents. With synthetic source, the distinction between dots in field names and nested objects is not preserved. Making ingest pipelines agnostic to this distinction makes it possible to gain confidence in pipeline changes before applying them.
Solution
While in some cases it can be preferable to explicitly specify which fields are dotted and which are nested, this is often cumbersome and doesn't matter much to a user.
A new syntax should be introduced that allows accessing these fields in the same way the fields API does in Painless script processors. If the whole field name is wrapped in $(' and '), the field name is interpreted the way the fields API interprets it in a Painless processor:
{ "grok": { "field": "$('nested.a.b.c')" }}
The syntax and behavior are identical to those of the fields API, which makes them simple for users to understand.
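A rough sketch of what "agnostic to dotted vs. nested" could mean for lookup is shown below. The name `resolve_flexible`, the `MISSING` sentinel, and the resolution order (exact key first, then progressively longer prefixes) are assumptions made for illustration; the fields API may resolve ambiguous documents differently.

```python
MISSING = object()  # sentinel so that a stored None is distinguishable from "not found"


def resolve_flexible(doc, path):
    """Find `path` whether its dots are nesting steps or part of a key.

    Assumed order: try the whole remaining path as a literal key first,
    then try each prefix as a key whose value is an object and recurse.
    """
    if not isinstance(doc, dict):
        return MISSING
    if path in doc:
        return doc[path]
    parts = path.split(".")
    for i in range(1, len(parts)):
        head = ".".join(parts[:i])
        rest = ".".join(parts[i:])
        if isinstance(doc.get(head), dict):
            found = resolve_flexible(doc[head], rest)
            if found is not MISSING:
                return found
    return MISSING


dotted = {"nested": {"a.b.c": "This is a test"}}
nested = {"nested": {"a": {"b": {"c": "This is a test"}}}}

# Both shapes resolve to the same value.
print(resolve_flexible(dotted, "nested.a.b.c"))  # This is a test
print(resolve_flexible(nested, "nested.a.b.c"))  # This is a test
```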
Quotes inside the field name can be escaped with \ so that field names containing $(' can still be accessed:
$('$(\'fieldname\')') // matches { "$('fieldname')": "..." }
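The wrapping and escaping rules above can be sketched as a small parser. `parse_field_reference` is a hypothetical helper, not part of any existing API, and it only handles the \' escape described here; other escape sequences and malformed input are left out of scope.

```python
PREFIX = "$('"
SUFFIX = "')"


def parse_field_reference(spec):
    """Return (field_name, flexible) for a processor `field` value.

    `flexible` is True when the whole value is wrapped in $(' and '),
    in which case escaped quotes (\\') inside are unescaped.
    Otherwise the value is returned unchanged with flexible=False.
    """
    wrapped = (
        len(spec) >= len(PREFIX) + len(SUFFIX)
        and spec.startswith(PREFIX)
        and spec.endswith(SUFFIX)
    )
    if not wrapped:
        return spec, False
    inner = spec[len(PREFIX):-len(SUFFIX)]
    return inner.replace("\\'", "'"), True


print(parse_field_reference("nested.a.b.c"))        # ('nested.a.b.c', False)
print(parse_field_reference("$('nested.a.b.c')"))   # ('nested.a.b.c', True)
```

With this sketch, the escaped example `$('$(\'fieldname\')')` yields the field name `$('fieldname')`.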
Open questions
How does this syntax interact with Mustache templates, which are supported in some cases? For the scope of the observability team, it would be OK not to support them initially - this could be added later on.
Breaking change
This feature constitutes a change of behavior: a field name in an ingest pipeline that starts with $(' and ends with ') is currently allowed, and these sequences are treated as regular characters. However, such cases are expected to be very rare.
Draft for breaking change proposal: https://github.com/elastic/dev/issues/3092
Why not dot_expander?
The dot_expander processor addresses a similar need by normalizing the data instead of letting the user specify the difference. However, it has some downsides that are unacceptable in some cases:
- A prefix of a dotted field name cannot also hold a primitive value (a common format, especially in OTel):
{
"host": "abc",
"host.name": "def" // can't be dot-expanded without breaking host
}
- Possible collisions
{
"host": { "name": "abc" },
"host.name": "def"
}
- Different from OTTL, which allows this style of access
- Changes the shape of the data, which loses information: it becomes impossible to tell the difference between dotted field names and nested field names
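The downsides listed above can be illustrated with a naive dot-expansion sketch. This is a simplified model written for this issue, not the dot_expander implementation; in particular, raising on conflicts is an assumption of the sketch, and the real processor's conflict handling may differ.

```python
def dot_expand(doc):
    """Naively expand dotted keys into nested objects.

    Assumption of this sketch: any conflict raises ValueError.
    """
    out = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            value = dot_expand(value)
        parts = key.split(".")
        target = out
        for part in parts[:-1]:
            existing = target.setdefault(part, {})
            if not isinstance(existing, dict):
                # e.g. "host" is "abc", so "host.name" has nowhere to go
                raise ValueError(f"cannot expand {key!r}: {part!r} holds a primitive value")
            target = existing
        if parts[-1] in target:
            # e.g. {"host": {"name": ...}} and "host.name" target the same path
            raise ValueError(f"collision while expanding {key!r}")
        target[parts[-1]] = value
    return out


# Information loss: both shapes normalize to the same document.
print(dot_expand({"a.b": 1}) == dot_expand({"a": {"b": 1}}))  # True

for doc in ({"host": "abc", "host.name": "def"},
            {"host": {"name": "abc"}, "host.name": "def"}):
    try:
        dot_expand(doc)
    except ValueError as err:
        print("conflict:", err)
```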
References
POC: #125804
Discussion: https://github.com/elastic/streams-program/discussions/224