Consistent config interface for filtering external datasets #25161

Open
@dmitryax

Problem

We use many different filtering interfaces in the collector configuration wherever users need an option to limit an external set of data. The interfaces differ significantly, which is confusing for end users. Moreover, developers may find it unclear which interface to adopt for their components going forward.

Here we are talking only about filtering external datasets, not pdata objects, which are supposed to be fully handled by OTTL.

There are two main use cases:

1. Just filter external data sets prior to further processing.

This scenario arises when users need to limit the external dataset before the collector proceeds with specific processing.

1.1 List of glob file paths:

Example: pkg/stanza/fileconsumer/matcher.Include/Exclude taking slices of glob strings to match against files to watch

include:
  - /var/log/pods/*/*/*.log
exclude:
  - /var/log/pods/*/otel-collector/*.log

While effective for file paths, this method might not be universally applicable to other data sources.
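To make the include/exclude semantics concrete, here is a minimal Python sketch (not the fileconsumer implementation, which is written in Go). The helper names glob_match and select are hypothetical, and the sketch assumes that * never crosses a / and that ** is not supported.

```python
from fnmatch import fnmatchcase


def glob_match(pattern: str, path: str) -> bool:
    """Match segment by segment so that '*' never crosses a '/'."""
    pattern_parts = pattern.split("/")
    path_parts = path.split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(
        fnmatchcase(segment, pat)
        for pat, segment in zip(pattern_parts, path_parts)
    )


def select(paths, include, exclude=()):
    """Keep paths that match any include glob and no exclude glob."""
    return [
        p
        for p in paths
        if any(glob_match(g, p) for g in include)
        and not any(glob_match(g, p) for g in exclude)
    ]


paths = [
    "/var/log/pods/default_app_1/app/0.log",
    "/var/log/pods/default_otel_2/otel-collector/0.log",
    "/var/log/pods/default_app_1/app/0.txt",
]
selected = select(
    paths,
    include=["/var/log/pods/*/*/*.log"],
    exclude=["/var/log/pods/*/otel-collector/*.log"],
)
```

Only the first path survives: the second is excluded, and the third does not match the include glob.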

1.2 Filtering by strict/regex rules provided explicitly

This method is widely employed in the filtering processor, soon to be substituted by OTTL. Additionally, it has been implemented in the hostmetrics receiver with this structure:

<include_devices|exclude_devices>:
  devices: [ <device name>, ... ]
  match_type: <strict|regexp>

It was introduced in open-telemetry/opentelemetry-collector#522 so as not to break the interfaces that already existed at that time. This will be fully replaced by OTTL in the filter processor and by #25134 in the hostmetrics receiver.
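As an illustration of how an explicit match_type drives the matching, here is a rough Python sketch. The functions build_matcher and filter_devices are hypothetical names, and the sketch assumes that regexp patterns are unanchored (a search, not a full match); the actual receiver may behave differently.

```python
import re


def build_matcher(names, match_type):
    """Return a predicate for one include/exclude block."""
    if match_type == "strict":
        allowed = set(names)
        return lambda value: value in allowed
    if match_type == "regexp":
        compiled = [re.compile(n) for n in names]
        # Assumption: unanchored search, so "^sd" must anchor itself.
        return lambda value: any(r.search(value) for r in compiled)
    raise ValueError(f"unsupported match_type: {match_type!r}")


def filter_devices(devices, include=None, exclude=None):
    """Apply the include block first, then drop anything the exclude block matches."""
    result = list(devices)
    if include is not None:
        keep = build_matcher(include["devices"], include["match_type"])
        result = [d for d in result if keep(d)]
    if exclude is not None:
        drop = build_matcher(exclude["devices"], exclude["match_type"])
        result = [d for d in result if not drop(d)]
    return result


devices = ["sda", "sdb", "loop0", "loop1"]
kept = filter_devices(
    devices,
    include={"devices": ["^sd"], "match_type": "regexp"},
    exclude={"devices": ["sdb"], "match_type": "strict"},
)
```

Note that match_type applies to the whole list, so strict and regexp entries cannot be mixed within one include or exclude block.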

1.3 Dynamically deducing regexp/glob/strict type from the string

This approach is used in the dockerstats receiver's excluded_images option:

excluded_images:
  - undesired-container
  - /.*undesired.*/
  - another-*-container

Strings wrapped in / represent regex, strings prefixed with ! represent negation, and glob strings are automatically deduced if any of the *?[]{}! characters are found in the string.

This approach is the least verbose but requires adding some additional escaping rules, e.g. if we want to do a strict match against /some-string/ or !string. Also, the unclear distinction between glob and regular strings can cause confusion.
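The deduction rules can be sketched in a few lines of Python. This is not the receiver's code: classify is a hypothetical helper, and the order of checks (negation prefix first, then the / wrapping, then glob characters) is an assumption.

```python
GLOB_CHARS = set("*?[]{}!")


def classify(item: str):
    """Return (kind, negated, body) for one entry of the list."""
    negated = item.startswith("!")
    body = item[1:] if negated else item
    if len(body) >= 2 and body.startswith("/") and body.endswith("/"):
        return ("regexp", negated, body[1:-1])
    if any(ch in GLOB_CHARS for ch in body):
        return ("glob", negated, body)
    return ("strict", negated, body)


kinds = [
    classify(item)
    for item in ["undesired-container", "/.*undesired.*/", "another-*-container"]
]
```

The ambiguity shows up immediately: classify("/some-string/") yields a regexp and classify("what?") yields a glob, so neither literal string can be matched strictly without an extra escaping rule.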

2. Convert external maps into a new set of attributes

This functionality is needed when users desire a configuration interface to map external maps into resource attributes. Examples include:

  • A subset of Kubernetes pod/namespace annotations and labels mapped to resource attributes
  • A subset of container labels and environment variables mapped to resource attributes

The filtering part of this is the same as in use case (1), but we also need to provide a way for users to remap keys from the original map to the attribute keys. This cannot be done in a downstream component because different sources (like pod annotations and labels) can have the same key, so users need a way to prevent keys from one map being overridden by another map. The mapping part could also be solved as a separate configuration option, but it's nice to do both at once, especially when we need to map regex capture groups. For example, the k8sattributes processor currently provides the following interface for this:

extract:
  labels:
    - key_regex: (service-.*)
      from: pod
      tag_name: "k8s.pod.labels.$1"
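A minimal Python sketch of what such a rule does to a label map follows. The function extract_labels is hypothetical, and two details are assumptions rather than confirmed processor behavior: the regex must match the whole key, and $N in tag_name refers to the N-th capture group.

```python
import re


def extract_labels(labels, key_regex, tag_name):
    """Map every label whose key matches key_regex to a renamed attribute."""
    pattern = re.compile(key_regex)
    attributes = {}
    for key, value in labels.items():
        match = pattern.fullmatch(key)  # assumption: whole-key match
        if match is None:
            continue
        # Expand $1, $2, ... in the template using the capture groups.
        target = re.sub(
            r"\$(\d+)",
            lambda ref: match.group(int(ref.group(1))),
            tag_name,
        )
        attributes[target] = value
    return attributes


pod_labels = {"service-name": "checkout", "team": "payments"}
attrs = extract_labels(pod_labels, r"(service-.*)", "k8s.pod.labels.$1")
```

Because the target key carries the k8s.pod.labels. prefix, a pod annotation with the same key could be mapped under a different prefix without overriding it.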

Goal

Resolving this issue does not require changing all the existing filtering interfaces. The goal is to establish a prescribed way to define the filtering interfaces for external datasets for any future use cases. This issue would also be a prerequisite for resolving the following ones:

Possible solutions

Option A

Keep using approach 1.3 "Dynamically deducing regexp/glob/strict type from the string".

Pros

  • It's concise. We can support both use cases (1) and (2), where, for the use case (1), we provide a config interface as a simple list:

filter:
  - exact-string
  - /service-.*/

For the use case (2), we cannot use a map because we need to preserve the order of the filtering/mapping rules evaluation, so we need to keep it as a list of structs with two fields (filter_and_map is just a placeholder):

filter_and_map:
  - source_key: source-string
    target_key: target-string
  - source_key: /service-(.*)/
    target_key: $1
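To show why the list order matters, here is a rough Python sketch of evaluating such rules. It is only an illustration: filter_and_map and expand are hypothetical names, and the sketch assumes first-match-wins evaluation, whole-key regex matching, and no glob support.

```python
import re


def expand(template, match):
    """Replace $1, $2, ... in template with the regex capture groups."""
    return re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), template)


def filter_and_map(source, rules):
    """Apply rules in order to each source key; the first matching rule wins."""
    target = {}
    for key, value in source.items():
        for rule in rules:
            src = rule["source_key"]
            if len(src) >= 2 and src.startswith("/") and src.endswith("/"):
                match = re.fullmatch(src[1:-1], key)
                if match is None:
                    continue
                target[expand(rule["target_key"], match)] = value
                break
            if src == key:
                target[rule["target_key"]] = value
                break
    return target


rules = [
    {"source_key": "source-string", "target_key": "target-string"},
    {"source_key": "/service-(.*)/", "target_key": "$1"},
]
source = {"source-string": "a", "service-name": "b", "other": "c"}
mapped = filter_and_map(source, rules)
```

Keys that match no rule, such as other, are dropped, which is the filtering half of the use case.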

Cons

  • Requires a way to escape the wrapping / for regexp and the ! negation prefix. Potentially, we can avoid supporting the ! negation sign because negation can always be handled by a regexp. Also, we can recommend using a regex for strings wrapped in /'s, e.g. /\/...\//.
  • Unclear distinction between strict and glob modes, e.g. users can add a single ? sign as part of the string they want to match, but it will match any character. Potentially, we can remove glob expression support and keep only /regex/ and strict-string matching.

Option B

Make every filtering item a struct that includes the matching type as a field name: strict, glob, or regexp. Setting more than one of them shouldn't pass validation. To support use case (2), we can simply add another optional field, target_key, for example (note that filter and filter_and_map are just placeholders; they can be anything else, like include_container_values):

filter:
  - strict: my-service
  - regexp: service-v1.(.*)
filter_and_map:
  # match only "source-key" key and assign its value to the "target-key" key in the target map
  - strict: source-key
    target_key: target-key
  # identify keys that match the regular expression 'service-(.*)' and assign
  # their corresponding values to keys named after the first capture group in the target map
  - regexp: service-(.*)
    target_key: $1
  # identify keys matching the regular expression 'app-.*'
  # and set the corresponding value under the same keys in the target map
  - regexp: app-.*
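The validation and matching described for this option could look roughly like the Python sketch below. The names validate and apply_rule are hypothetical, and the sketch assumes whole-key matching for regexp and fnmatch-style semantics for glob.

```python
import re
from fnmatch import fnmatchcase

MATCH_TYPES = ("strict", "glob", "regexp")


def validate(rule):
    """Return the rule's match type; reject rules that set zero or several."""
    present = [t for t in MATCH_TYPES if t in rule]
    if len(present) != 1:
        raise ValueError(f"exactly one of {MATCH_TYPES} must be set, got {present}")
    return present[0]


def apply_rule(rule, key):
    """Return the target key for a matching source key, or None if it does not match."""
    kind = validate(rule)
    pattern = rule[kind]
    template = rule.get("target_key")
    if kind == "strict":
        matched = key == pattern
    elif kind == "glob":
        matched = fnmatchcase(key, pattern)
    else:
        match = re.fullmatch(pattern, key)
        if match is not None and template is not None:
            # Expand $1, $2, ... using the capture groups of the regexp.
            return re.sub(
                r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), template
            )
        matched = match is not None
    if not matched:
        return None
    # Without target_key the value stays under its original key.
    return template if template is not None else key
```

For example, apply_rule({"regexp": "service-(.*)", "target_key": "$1"}, "service-name") yields "name", while a rule that sets both strict and regexp fails validation.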

Pros

  • Explicit about a matching method

Cons

  • More verbose than Option A

Option C

Use OTTL for external datasets in addition to OTLP data. This would require introducing additional identifiers for the external items. It needs more discussion to determine its pros and cons.

Metadata

Labels: discussion needed, enhancement, never stale, pkg/ottl, roadmapping
