Optimize fileset matching with hash-based indexing for performance improvement at high file counts #45203

iblancasa · 2025-12-31T16:55:20Z

Description

Matching files by fingerprint with a linear scan became a bottleneck at high file counts, because each poll iterated through all readers for every match operation.

I implemented several changes that should help reduce CPU usage:

Bucket maps for common fingerprint sizes (~1000 bytes) are preallocated with capacity 64.
Replaced the reflect.ValueOf() comparison dispatch (which allocated ~48 bytes per call) with a simple CompareMode enum.
Fingerprint bytes are converted to strings with unsafe.String, without copying, and the results are cached per fingerprint.
Files are indexed in buckets by fingerprint length and prefix, so a match operation now does two map lookups instead of thousands of comparisons.
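
To illustrate the indexing idea in the list above, here is a minimal Go sketch. It is not the code in this PR. The names `entry`, `index`, `key`, `newIndex`, `Add`, and `Match`, and the two `CompareMode` values, are assumptions made for the example; only `CompareMode`, `unsafe.String`, and the capacity of 64 come from the description. The sketch copies the key on insert and uses the zero-copy view only for lookups, which is a choice of this sketch and may differ from the actual change.

```go
package main

import (
	"fmt"
	"unsafe"
)

// CompareMode replaces reflection-based dispatch with a plain enum.
// The two values below are assumptions for this sketch.
type CompareMode int

const (
	CompareEqual      CompareMode = iota // stored fingerprint equals the query
	CompareStartsWith                    // query starts with the stored fingerprint
)

// entry is a hypothetical stand-in for a tracked reader.
type entry struct {
	Path        string
	Fingerprint []byte
}

// key views b as a string without copying (Go 1.20+).
// b must not be modified while the returned string is in use.
func key(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// index buckets entries by fingerprint length, then by fingerprint contents.
type index struct {
	buckets map[int]map[string][]*entry
}

func newIndex() *index {
	return &index{buckets: make(map[int]map[string][]*entry)}
}

// Add stores e. The key is copied so the index owns its map keys.
func (ix *index) Add(e *entry) {
	n := len(e.Fingerprint)
	b, ok := ix.buckets[n]
	if !ok {
		b = make(map[string][]*entry, 64) // preallocated, as in the description
		ix.buckets[n] = b
	}
	k := string(e.Fingerprint)
	b[k] = append(b[k], e)
}

// Match returns a stored entry for fp, or nil if none matches.
func (ix *index) Match(fp []byte, mode CompareMode) *entry {
	switch mode {
	case CompareEqual:
		// Two map lookups: by length, then by contents.
		if es := ix.buckets[len(fp)][key(fp)]; len(es) > 0 {
			return es[0]
		}
	case CompareStartsWith:
		// One lookup per distinct stored length; the longest match wins.
		var best *entry
		for n, b := range ix.buckets {
			if n == 0 || n > len(fp) || (best != nil && n <= len(best.Fingerprint)) {
				continue
			}
			if es := b[key(fp[:n])]; len(es) > 0 {
				best = es[0]
			}
		}
		return best
	}
	return nil
}

func main() {
	ix := newIndex()
	ix.Add(&entry{Path: "a.log", Fingerprint: []byte("alpha")})
	ix.Add(&entry{Path: "b.log", Fingerprint: []byte("beta line 1\n")})

	if e := ix.Match([]byte("alpha"), CompareEqual); e != nil {
		fmt.Println("equal match:", e.Path)
	}
	// b.log has grown, so its new fingerprint starts with the stored one.
	if e := ix.Match([]byte("beta line 1\nbeta line 2\n"), CompareStartsWith); e != nil {
		fmt.Println("prefix match:", e.Path)
	}
}
```

In this sketch the cost of a match depends on the number of distinct fingerprint lengths in the index, not on the number of tracked files, which is the property the description relies on.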

There may be more improvements we can make, or this PR may spark ideas for other enhancements.

Link to tracking issue

Fixes #27404

Testing

Added tests and benchmarks:

go test ./pkg/stanza/fileconsumer -bench BenchmarkPollManyFiles -benchmem

Files watched	Baseline ns/op	Optimized ns/op	CPU improvement
100	2,124,092	1,941,299	+8.6 %
500	13,721,717	8,458,379	+38.4 %
1,000	38,285,195	18,616,022	+51.4 %
2,000	116,922,611	38,755,824	+66.9 %
2,500	170,480,583	49,986,882	+70.7 %
3,000	226,774,400	60,671,268	+73.2 %
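
For reference, the last column can be reproduced from the two ns/op columns as the relative reduction in time per operation. The helper name `improvement` below is an assumption for this small sketch, and the inputs are the 100-file and 3,000-file rows of the table.

```go
package main

import "fmt"

// improvement returns the percentage reduction in ns/op
// when going from baseline to optimized.
func improvement(baseline, optimized float64) float64 {
	return (baseline - optimized) / baseline * 100
}

func main() {
	fmt.Printf("100 files:   +%.1f %%\n", improvement(2124092, 1941299))
	fmt.Printf("3,000 files: +%.1f %%\n", improvement(226774400, 60671268))
}
```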

atoulme · 2026-01-01T23:03:59Z

Please address the CI failures and mark the PR as ready for review.

github-actions · 2026-01-24T05:21:53Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

iblancasa requested review from a team and andrzej-stencel as code owners December 31, 2025 16:55

github-actions bot assigned braydonk Dec 31, 2025

github-actions bot added the pkg/stanza/fileconsumer label Dec 31, 2025

atoulme added the waiting-for-code-owners label Jan 1, 2026

atoulme marked this pull request as draft January 1, 2026 23:04

Optimize fileset matching with hash-based indexing for performance im…

b41855e

…provement at high file counts Signed-off-by: Israel Blancas <[email protected]>

iblancasa force-pushed the 27404 branch from 82705d8 to b41855e Compare January 7, 2026 18:09

Merge branch 'main' into 27404

1a8d3a7

iblancasa marked this pull request as ready for review January 8, 2026 09:30

github-actions bot assigned evan-bradley Jan 8, 2026

Merge branch 'main' into 27404

57c6679

github-actions bot added the Stale label Jan 24, 2026

paulojmdias removed the Stale label Jan 24, 2026

iblancasa added 5 commits January 26, 2026 17:54

Merge branch 'main' of github.com:open-telemetry/opentelemetry-collec…

475feaf

…tor-contrib into 27404

Merge branch 'main' into 27404

7703fdf

Merge branch 'main' into 27404

449c591

Merge branch 'main' into 27404

f75df71

Merge branch 'main' into 27404

c5e1bc6
