Description
Problem
We use many different filtering interfaces in the collector configuration where we need to provide users with an option to limit an external set of data. The interfaces differ pretty significantly, making it confusing for the end user. Moreover, developers may find it unclear which interface to adopt for their components moving forward.
This issue covers only the filtering of external data sets, not pdata objects, which are supposed to be fully handled by OTTL.
There are two main use cases:
1. Just filter external data sets prior to further processing.
This scenario arises when users need to limit the external dataset before the collector proceeds with specific processing.
1.1 List of glob file paths:
Example: pkg/stanza/fileconsumer/matcher.Include/Exclude taking slices of glob strings to match against files to watch
include:
- /var/log/pods/*/*/*.log
exclude:
- /var/log/pods/*/otel-collector/*.log
While effective for file paths, this method might not be universally applicable to other data sources.
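As a rough illustration of include/exclude glob semantics, here is a minimal Go sketch. It uses the standard library's path.Match rather than the actual fileconsumer matcher, and the helper names matchAny and selected are hypothetical; the real component may resolve globs differently (for example by walking the filesystem).

```go
package main

import (
	"fmt"
	"path"
)

// matchAny reports whether name matches at least one glob pattern.
// Malformed patterns are treated as non-matching in this sketch.
func matchAny(patterns []string, name string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

// selected keeps a name that matches an include pattern and no exclude pattern.
// Hypothetical helper, not the fileconsumer API.
func selected(include, exclude []string, name string) bool {
	return matchAny(include, name) && !matchAny(exclude, name)
}

func main() {
	include := []string{"/var/log/pods/*/*/*.log"}
	exclude := []string{"/var/log/pods/*/otel-collector/*.log"}
	for _, f := range []string{
		"/var/log/pods/ns_pod_uid/app/0.log",
		"/var/log/pods/ns_pod_uid/otel-collector/0.log",
	} {
		fmt.Println(f, selected(include, exclude, f))
	}
}
```

Note that path.Match never lets * cross a / separator, which is part of why this style fits file paths better than arbitrary strings.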
1.2 Filtering by strict/regex rules provided explicitly
This method is widely employed in the filter processor, soon to be substituted by OTTL. Additionally, it has been implemented in the hostmetrics receiver with this structure:
<include_devices|exclude_devices>:
  devices: [ <device name>, ... ]
  match_type: <strict|regexp>
It was introduced in open-telemetry/opentelemetry-collector#522 not to break the interfaces that were already built by that time. This will be fully replaced by OTTL in the filter processor and by #25134 in hostmetrics receiver.
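To make the strict/regexp structure concrete, the following Go sketch compiles such a set into a predicate. The filterSet type and its compile method are illustrative assumptions, not the hostmetrics receiver's actual code.

```go
package main

import (
	"fmt"
	"regexp"
)

// filterSet mirrors the shape of the config above; the type and field
// names are illustrative only.
type filterSet struct {
	Names     []string
	MatchType string // "strict" or "regexp"
}

// compile turns the set into a predicate over candidate names.
func (f filterSet) compile() (func(string) bool, error) {
	switch f.MatchType {
	case "strict":
		set := make(map[string]struct{}, len(f.Names))
		for _, n := range f.Names {
			set[n] = struct{}{}
		}
		return func(s string) bool { _, ok := set[s]; return ok }, nil
	case "regexp":
		res := make([]*regexp.Regexp, 0, len(f.Names))
		for _, n := range f.Names {
			re, err := regexp.Compile(n)
			if err != nil {
				return nil, fmt.Errorf("invalid regexp %q: %w", n, err)
			}
			res = append(res, re)
		}
		// MatchString is unanchored: patterns need ^ and $ for a full match.
		return func(s string) bool {
			for _, re := range res {
				if re.MatchString(s) {
					return true
				}
			}
			return false
		}, nil
	default:
		return nil, fmt.Errorf("unknown match_type %q", f.MatchType)
	}
}

func main() {
	match, err := filterSet{Names: []string{"^sd[a-z]$"}, MatchType: "regexp"}.compile()
	if err != nil {
		panic(err)
	}
	fmt.Println(match("sda"), match("loop0"))
}
```

One match_type applies to the whole list, so mixing strict names and regular expressions in a single set is not possible with this shape.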
1.3 Dynamically deducing the regexp/glob/strict type from the string
This approach is used by the excluded_images option of the dockerstats receiver:
excluded_images:
- undesired-container
- /.*undesired.*/
- another-*-container
Strings wrapped in / are treated as regular expressions, strings prefixed with ! are negated, and a string is treated as a glob if it contains any of the characters *?[]{}!.
This approach is the least verbose but requires additional escaping rules, e.g. for a strict match against /some-string/ or !string. Also, the unclear distinction between glob and regular strings can cause confusion.
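The inference rules can be sketched in Go as follows. This is only an approximation of the behavior described above: the classify function, the kind type, and the order in which the checks are applied are assumptions, not the receiver's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

type kind string

const (
	kindStrict kind = "strict"
	kindGlob   kind = "glob"
	kindRegexp kind = "regexp"
)

// classify infers the matching mode from the shape of the string.
// It returns the mode, the pattern body, and whether the rule is negated.
// The precedence (negation, then regexp, then glob) is an assumption.
func classify(s string) (k kind, body string, negated bool) {
	if strings.HasPrefix(s, "!") {
		negated = true
		s = s[1:]
	}
	if len(s) >= 2 && strings.HasPrefix(s, "/") && strings.HasSuffix(s, "/") {
		return kindRegexp, s[1 : len(s)-1], negated
	}
	if strings.ContainsAny(s, "*?[]{}!") {
		return kindGlob, s, negated
	}
	return kindStrict, s, negated
}

func main() {
	for _, s := range []string{
		"undesired-container",
		"/.*undesired.*/",
		"another-*-container",
		"/some-string/", // intended as a literal, but classified as a regexp
	} {
		k, body, neg := classify(s)
		fmt.Printf("%q kind=%s body=%q negated=%v\n", s, k, body, neg)
	}
}
```

The last input shows the escaping problem: a literal value that happens to be wrapped in slashes cannot be expressed without an extra escaping rule.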
2. Convert external maps into a new set of attributes
This functionality is needed when users desire a configuration interface to map external maps into resource attributes. Examples include:
- A subset of Kubernetes pod/namespace annotations and labels mapped to resource attributes
- A subset of container labels and environment variables mapped to resource attributes
The filtering part of this is the same as in the use case (1), but we also need to provide a way for users to remap keys from the original map to the attribute keys. This cannot be done in the downstream component because different sources (like pod annotations and labels) can have the same key, so users need a way to avoid keys from one map being overridden by another map. The mapping part can also be solved as a separate configuration option, but it's nice to do it at once, especially when we need to map regex-mapped groups. For example, k8sattributes processor currently provides the following interface for this:
extract:
  labels:
    - key_regex: (service-.*)
      from: pod
      tag_name: "k8s.pod.labels.$1"
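The remapping step can be sketched with Go's regexp package, whose ExpandString substitutes $1-style references into a template. The mapLabels helper is hypothetical, and anchoring the key pattern to the whole key is an assumption made for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// mapLabels copies every label whose key matches keyRegex into a new map,
// naming the target key by expanding capture-group references in tagName.
// The pattern is anchored to the whole key (an assumption for this sketch).
func mapLabels(labels map[string]string, keyRegex, tagName string) (map[string]string, error) {
	re, err := regexp.Compile("^(?:" + keyRegex + ")$")
	if err != nil {
		return nil, err
	}
	out := map[string]string{}
	for k, v := range labels {
		m := re.FindStringSubmatchIndex(k)
		if m == nil {
			continue // key filtered out
		}
		name := re.ExpandString(nil, tagName, k, m)
		out[string(name)] = v
	}
	return out, nil
}

func main() {
	labels := map[string]string{"service-a": "x", "team": "y"}
	attrs, err := mapLabels(labels, "(service-.*)", "k8s.pod.labels.$1")
	if err != nil {
		panic(err)
	}
	fmt.Println(attrs)
}
```

Prefixing the target key per source is what keeps a pod label and a pod annotation with the same key from overwriting each other.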
Goal
Resolving this issue does not require changing all the existing filtering interfaces. The goal is to establish a prescribed way to define the filtering interfaces for external datasets for any future use cases. This issue would also be a prerequisite for resolving the following ones:
- [receiver/dockerstats] Revisit configuration for setting attributes from container labels and env vars #13848
- [processor/k8sattributes] Refactor FieldExtractConfig #25135
- [cmd/mdatagen] Ability to filter by resource attribute values #25134
Possible solutions
Option A
Keep using approach 1.3 (dynamically deducing the regexp/glob/strict type from the string).
Pros
- It's concise. We can support both use cases (1) and (2), where, for the use case (1), we provide a config interface as a simple list:
filter:
- exact-string
- /service-.*/
For the use case (2), we cannot use a map because we need to preserve the order of the filtering/mapping rules evaluation, so we need to keep it as a list of structs with two fields (filter_and_map is just a placeholder):
filter_and_map:
  - source_key: source-string
    target_key: target-string
  - source_key: /service-(.*)/
    target_key: $1
Cons
- Requires a way to escape the wrapping / for regexp and the ! negation prefix. Potentially, we can avoid supporting the ! negation sign because negation can always be handled by a regexp. Also, we can recommend using a regex for strings wrapped in /'s, e.g. /\/...\//.
- Unclear distinction between strict and glob modes, e.g. users can add a single ? sign as part of the string they want to match, but it'll match any character. Potentially we can remove glob expression support and keep only /regex/ and strict-string matching.
Option B
Make every filtering item a struct that includes the matching type as a field name: strict, glob, or regexp. Setting more than one of them shouldn't pass validation. To support use case (2), we can simply add another optional field, target_key, for example (note that filter and filter_and_map are just placeholders, they can be anything else like include_container_values):
filter:
  - strict: my-service
  - regexp: service-v1.(.*)
filter_and_map:
  # match only the "source-key" key and assign its value to the "target-key" key in the target map
  - strict: source-key
    target_key: target-key
  # identify keys that match the regular expression 'service-(.*)' and assign
  # their values to keys named after the first captured group in the target map
  - regexp: service-(.*)
    target_key: $1
  # identify keys matching the regular expression 'app-.*'
  # and set the corresponding value under the same keys in the target map
  - regexp: app-.*
Pros
- Explicit about a matching method
Cons
- More verbose than Option A
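The validation rule described for Option B (exactly one matching type per item) could look roughly like this in Go. The rule struct and its field names are hypothetical placeholders that follow the YAML keys above, not an existing collector type.

```go
package main

import "fmt"

// rule is a hypothetical Go counterpart of one list item in Option B.
type rule struct {
	Strict    string
	Glob      string
	Regexp    string
	TargetKey string // optional, only meaningful for use case (2)
}

// validate rejects items that set zero or more than one matching type.
func (r rule) validate() error {
	n := 0
	for _, v := range []string{r.Strict, r.Glob, r.Regexp} {
		if v != "" {
			n++
		}
	}
	if n != 1 {
		return fmt.Errorf("exactly one of strict, glob, regexp must be set, got %d", n)
	}
	return nil
}

func main() {
	rules := []rule{
		{Strict: "source-key", TargetKey: "target-key"},
		{Regexp: "service-(.*)", TargetKey: "$1"},
		{Strict: "a", Glob: "a*"}, // invalid: two matching types
	}
	for i, r := range rules {
		fmt.Println(i, r.validate())
	}
}
```

Because the matching type is a field name, no escaping rules are needed and the check stays a simple count.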
Option C
Utilize OTTL for the external datasets in addition to OTLP data. This would require introducing additional identifiers for the external items. It needs more discussion to determine its pros and cons.