Skip to content

pkg/docker: Add event-based Docker container info caching#1990

Open
florianl wants to merge 21 commits into
open-telemetry:mainfrom
florianl:docker-metadata-cache
Open

pkg/docker: Add event-based Docker container info caching#1990
florianl wants to merge 21 commits into
open-telemetry:mainfrom
florianl:docker-metadata-cache

Conversation

@florianl

@florianl florianl commented May 4, 2026

Copy link
Copy Markdown
Member

Summary

Implement in-process caching of Docker container metadata in ContainerStore with event-driven invalidation. When containers die or are destroyed, Docker events trigger immediate cache eviction, eliminating stale metadata and reducing repeated API calls.

This reduces CPU consumption in span enrichment pipeline by avoiding redundant Docker API calls. Flamegraph showed ~13.4% CPU in dockerEnricher; caching eliminates most calls for containers that host multiple processes.

Validation

@florianl florianl requested a review from a team as a code owner May 4, 2026 09:45
@codecov

codecov Bot commented May 4, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 63.39869% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.24%. Comparing base (3929e7b) to head (600f825).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/docker/docker_api_client.go 62.93% 47 Missing and 6 partials ⚠️
pkg/transform/transform_docker.go 66.66% 2 Missing ⚠️
pkg/appolly/discover/docker_decorator.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1990      +/-   ##
==========================================
+ Coverage   69.23%   69.24%   +0.01%     
==========================================
  Files         320      320              
  Lines       42323    42460     +137     
==========================================
+ Hits        29301    29401     +100     
- Misses      11325    11349      +24     
- Partials     1697     1710      +13     
Flag Coverage Δ
integration-test 51.03% <19.04%> (+0.16%) ⬆️
integration-test-arm 28.16% <16.99%> (-0.22%) ⬇️
integration-test-vm-5.15-lts ?
integration-test-vm-6.18-lts ?
k8s-integration-test 38.05% <2.38%> (+0.13%) ⬆️
oats-test 35.95% <19.04%> (-0.03%) ⬇️
unittests 62.20% <70.63%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

CI Supervisor

Workflow Job Last state Re-running? Attempt
PR OATS test kafka failure Yes 1/2
PR OATS test ai failure Yes 1/2
PR OATS test amqp failure Yes 1/2
PR OATS test redis failure Yes 1/2
PR OATS test memcached failure Yes 1/2
PR OATS test http failure Yes 1/2
PR OATS test sql failure Yes 1/2
PR OATS test nats failure Yes 1/2
PR OATS test mongo failure Yes 1/2
Pull request checks Generate and checks failure Yes 1/2
Pull request integration tests ARM shard-0 (1 tests) failure Yes 1/2
Pull request integration tests ARM shard-1 (1 tests) failure Yes 1/2
Pull request integration tests ARM shard-2 (1 tests) failure Yes 1/2
Pull request integration tests ARM shard-3 (1 tests) failure Yes 1/2
Pull request K8s integration tests informer_cache failure Yes 1/2
Pull request K8s integration tests netolly failure Yes 1/2
Pull request K8s integration tests daemonset_multi_node failure Yes 1/2
Pull request K8s integration tests otel failure Yes 1/2
Pull request K8s integration tests restrict_local_node failure Yes 1/2
Pull request K8s integration tests disable_informers failure Yes 1/2
Pull request K8s integration tests netolly_multizone failure Yes 1/2
Pull request K8s integration tests netolly_promexport failure Yes 1/2
Pull request K8s integration tests sharedpidns failure Yes 1/2
Pull request K8s integration tests daemonset_python failure Yes 1/2
Pull request K8s integration tests netolly_multizone_prom failure Yes 1/2
Pull request K8s integration tests owners failure Yes 1/2
Pull request K8s integration tests netolly_tc_promexport failure Yes 1/2
Pull request K8s integration tests netolly_dropexternal failure Yes 1/2
Pull request K8s integration tests daemonset failure Yes 1/2
Pull request K8s integration tests prom failure Yes 1/2
Test Docker build build (Dockerfile, linux/arm64, ubuntu-24.04-arm) failure No 2/2
Pull request integration tests shard-6 (11 tests) failure No 2/2

@florianl florianl marked this pull request as draft May 4, 2026 10:58
@florianl florianl force-pushed the docker-metadata-cache branch from 2195482 to 6ae924c Compare May 4, 2026 11:33
Implement in-process caching of Docker container metadata in ContainerStore
with event-driven invalidation. When containers die or are destroyed, Docker
events trigger immediate cache eviction, eliminating stale metadata and
reducing repeated API calls.

This reduces CPU consumption in span enrichment pipeline by avoiding
redundant Docker API calls. Flamegraph showed ~13.4% CPU in dockerEnricher;
caching eliminates most calls for containers that host multiple processes.
@florianl florianl force-pushed the docker-metadata-cache branch from 6ae924c to 398570b Compare May 4, 2026 11:34
@florianl florianl marked this pull request as ready for review May 4, 2026 11:58

@grcevski grcevski left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Pretty cool! I'll let @mariomac review this since I'm only partially familiar with this code, but looks great from my side.

MrAlias
MrAlias previously requested changes May 4, 2026

@MrAlias MrAlias left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caching direction makes sense, but I found one correctness issue in the invalidation path that looks blocking.

ContainerStore stores cache entries in byID using the shortened 12-character container ID derived from inspectInfo.ID, but the Docker event stream delivers the full container ID in msg.Actor.ID. In the current code that means the invalidation lookup never finds the cached entry, so die / destroy events do not actually evict stale metadata.

I think the fix should be to key invalidation with the full container ID internally, and only keep the 12-character form for exported metadata if that abbreviated form is still desired.

Comment thread pkg/docker/docker_api_client.go Outdated
Comment thread pkg/docker/docker_api_client_test.go Outdated
@mariomac

mariomac commented May 8, 2026

Copy link
Copy Markdown
Contributor

Nice! Thank you!

In principle LGTM! But I have a question: are the performance tests taken in production/dev scenarios, or are they micro-benchmarks ran with go test -bench?

I would expect that a cache at this level is not needed because the docker_decorator.go file already has a containerByPID cache that is populated when the container data is found, and removed when an EventDeleted event is received. Also keeping the cache there removes the need for any mutex.

florianl added 3 commits May 12, 2026 14:19
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
@florianl

Copy link
Copy Markdown
Member Author

But I have a question: are the performance tests taken in production/dev scenarios, or are they micro-benchmarks ran with go test -bench?

I noticed this issue, when experimenting with OBI and ebpf-profiler in the OTel demo using something like this config:

receivers:
  profiling:
    obi_process_ctx: true
    samples_per_second: 73
    off_cpu_threshold: 0.1
  obi:
    open_port: "1-65535"

So it's not a micro-benchmark but more a learning from a demo environment.

@MrAlias MrAlias left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One follow-up correctness concern on the shared Docker metadata cache: the new store-level PID cache can outlive the process lifecycle unless it also gets explicit PID-scoped eviction. The container-event invalidation path makes sense, but I think the store still needs a helper for process-deletion paths so a reused PID cannot pick up stale metadata from a previous container. After that, the existing lifecycle handlers in pkg/appolly/discover/docker_decorator.go on EventDeleted and pkg/transform/transform_docker.go on ProcessEventTerminated should call it so the shared cache stays aligned with PID lifetime.

Comment thread pkg/docker/docker_api_client.go Outdated
Comment thread pkg/docker/docker_api_client_test.go Outdated
florianl and others added 3 commits May 13, 2026 13:18
Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

@MrAlias MrAlias left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two correctness issues still look unresolved on the current head, and I left one test suggestion to make the remaining PID-lifecycle bug reproducible.

Comment thread pkg/docker/docker_api_client.go Outdated
Comment thread pkg/docker/docker_api_client.go Outdated
Comment thread pkg/docker/docker_api_client.go
Comment thread pkg/docker/docker_api_client_test.go Outdated
florianl and others added 4 commits May 18, 2026 16:55
Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
@florianl florianl force-pushed the docker-metadata-cache branch from f70d6b7 to 2e4dbef Compare May 19, 2026 08:26
florianl added 4 commits May 19, 2026 11:05
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
@florianl florianl requested a review from mariomac June 2, 2026 08:44

@mariomac mariomac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM despite I'd still remove the ID string field to ensure data consistency, as it's just a substring of FullID

@@ -167,3 +206,101 @@ func ContainerMetadata[T ~string](dst map[T]string, ci *ContainerMeta, stringer
out[stringer(attr.ContainerID)] = ci.ID

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for data consistency, I'd remove ID field and replace it here by attr.ContainerId[:abbreviationLength]

@MrAlias MrAlias dismissed their stale review June 3, 2026 17:12

Stale

@MrAlias MrAlias left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One remaining non-blocking cache-efficiency concern inline.

Comment thread pkg/docker/docker_api_client.go
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

@MrAlias MrAlias left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two follow-up cache lifecycle comments inline. One is an adjacent stale-cache concern in the span path; the other is a current race in the new fullID cache write path.

Comment thread pkg/transform/transform_docker.go
Comment thread pkg/docker/docker_api_client.go
florianl and others added 2 commits June 5, 2026 15:07
Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

@MrAlias MrAlias left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the review churn here. I should have caught these cache lifecycle issues in earlier passes, and I appreciate your patience as we get this tightened up. I found two remaining races that can leave stale Docker metadata in the shared store, plus one lower-severity watcher enablement issue.

Comment thread pkg/docker/docker_api_client.go Outdated
Comment thread pkg/docker/docker_api_client.go
Comment thread pkg/instrumenter/instrumenter.go Outdated
)

ctxInfo.DockerMetadata = docker.NewStore()
ctxInfo.DockerMetadata.Start(ctx)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting the Docker watcher here makes it run for every OBI process as soon as common context is built. In Kubernetes mode, or on hosts where Docker metadata is not otherwise used, watchContainerEvents() still retries Docker initialization once per second even though the Docker decorators later bypass themselves.

I think this should be started lazily only after Docker metadata is actually selected/enabled, or guarded by the same Docker-vs-Kubernetes enablement decision used by the decorators.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added K8sInformer.IsKubeEnabled(), like it is used in other places. Hope this is fine.

florianl added 2 commits June 15, 2026 14:03
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
@florianl

Copy link
Copy Markdown
Member Author

It seems like, the failing CI tests are related to #2304.

@grcevski

Copy link
Copy Markdown
Contributor

Hi @florianl, do you mind syncing with main, the test failures have been resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants