Skip to content

fix: protect RepositoryTimesForMetrics from concurrent map access#314

Open
MartinBasti wants to merge 1 commit into
konflux-ci:mainfrom
MartinBasti:fix-race
Open

fix: protect RepositoryTimesForMetrics from concurrent map access#314
MartinBasti wants to merge 1 commit into
konflux-ci:mainfrom
MartinBasti:fix-race

Conversation

@MartinBasti

Copy link
Copy Markdown
Contributor

The global map was read/written/deleted from multiple reconciler goroutines without synchronization, which can cause a fatal panic. Replace the exported map with mutex-protected helper functions.

The global map was read/written/deleted from multiple reconciler
goroutines without synchronization, which can cause a fatal panic.
Replace the exported map with mutex-protected helper functions.

Signed-off-by: Martin Basti <mbasti@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Martin Basti <mbasti@redhat.com>
@MartinBasti MartinBasti requested a review from a team as a code owner May 20, 2026 16:32
@qodo-code-review

Copy link
Copy Markdown

Review Summary by Qodo

Protect RepositoryTimesForMetrics from concurrent map access

🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Replace exported map with mutex-protected helper functions
• Prevent concurrent map access race condition in metrics
• Add three new thread-safe accessor functions for map operations
• Include comprehensive unit tests for concurrency safety
Diagram
flowchart LR
  A["Exported map<br/>RepositoryTimesForMetrics"] -->|"Replace with"| B["Private map +<br/>sync.Mutex"]
  B -->|"Provide access via"| C["SetRepositoryTimeIfAbsent"]
  B -->|"Provide access via"| D["GetRepositoryTime"]
  B -->|"Provide access via"| E["DeleteRepositoryTime"]
  C -->|"Used by"| F["imagerepository_controller.go"]
  D -->|"Used by"| F
  E -->|"Used by"| F

Loading

File Changes

1. pkg/metrics/metrics.go 🐞 Bug fix +24/-1

Add mutex-protected accessor functions for metrics map

• Changed RepositoryTimesForMetrics from exported to private map
• Added sync.Mutex for thread-safe access synchronization
• Implemented SetRepositoryTimeIfAbsent() function with lock protection
• Implemented GetRepositoryTime() function with lock protection
• Implemented DeleteRepositoryTime() function with lock protection

pkg/metrics/metrics.go


2. internal/controller/imagerepository_controller.go 🐞 Bug fix +6/-9

Replace direct map access with helper functions

• Replaced direct map access with SetRepositoryTimeIfAbsent() call
• Replaced direct map read with GetRepositoryTime() call
• Replaced direct map delete with DeleteRepositoryTime() call
• Updated all four locations where metrics map was accessed

internal/controller/imagerepository_controller.go


3. pkg/metrics/metrics_test.go 🧪 Tests +77/-0

Add comprehensive unit tests for thread safety

• Added clearRepositoryTimes() helper function for test cleanup
• Added TestSetRepositoryTimeIfAbsent() to verify idempotent insertion
• Added TestGetRepositoryTime() to verify retrieval and missing keys
• Added TestDeleteRepositoryTime() to verify deletion behavior
• Added TestRepositoryTimeConcurrency() with 100 goroutines for race detection

pkg/metrics/metrics_test.go


Grey Divider

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented May 20, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)

Grey Divider


Advisory comments

1. Serialized metric reads 🐞 Bug ➹ Performance
Description
GetRepositoryTime uses an exclusive mutex Lock/Unlock, so concurrent reads are serialized and can
add avoidable contention in concurrently-running reconciles. Switching to sync.RWMutex and using
RLock/RUnlock in GetRepositoryTime preserves correctness while allowing parallel reads.
Code

pkg/metrics/metrics.go[R48-53]

Evidence
The metrics helper read path takes an exclusive lock, and the reconciler calls it from reconcile
flows that can run concurrently across resources, so read operations will block each other even
though they don’t mutate state.

pkg/metrics/metrics.go[36-53]
internal/controller/imagerepository_controller.go[208-215]
internal/controller/imagerepository_controller.go[469-477]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`GetRepositoryTime` currently takes an exclusive mutex lock, which serializes all readers and can create unnecessary contention when multiple reconciler goroutines call into metrics concurrently.

### Issue Context
This PR correctly fixes the fatal concurrent map access by introducing a mutex, but the current implementation uses `sync.Mutex` and `Lock()` even for reads.

### Fix Focus Areas
- pkg/metrics/metrics.go[36-59]
 - Change `repositoryTimesForMetricsMu` from `sync.Mutex` to `sync.RWMutex`.
 - Update `GetRepositoryTime` to use `RLock()/RUnlock()`.
 - Keep `SetRepositoryTimeIfAbsent` and `DeleteRepositoryTime` using `Lock()/Unlock()` (write lock).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

@MartinBasti

Copy link
Copy Markdown
Contributor Author

More context from AI

Yes, this is a legitimate fix. Here's the analysis:

  The bug is real: RepositoryTimesForMetrics is a plain map[string]time.Time accessed without any synchronization from the Reconcile method.
  In Go, concurrent reads and writes to a map cause a fatal runtime panic (not just a data race — the runtime detects it and crashes the
  process).
  
  The codebase does use concurrent goroutines that hit this map. Even though SetupWithManager doesn't explicitly set MaxConcurrentReconciles,
  the controller-runtime default is 1 worker per controller. However, there are 3 separate controllers registered in cmd/main.go:213-232, and
  the ImageRepositoryReconciler.Reconcile method is called from the controller-runtime's worker goroutine pool. More importantly:

  1. controller-runtime's default of 1 concurrent reconcile only guarantees that one reconcile runs at a time for that specific controller —
  but the map is a package-level global, so if any other code path (or a future bump to >1 worker) touches it, you get a crash.
  2. Even with a single worker, the go func() goroutine in pkg/metrics/metrics.go:64 runs concurrently and could theoretically interact with
  the same package's state.
  3. The Go race detector would flag this regardless of actual concurrency — it's a correctness issue by Go's memory model.

  The fix is clean and correct: wrapping the map in a sync.Mutex with Set/Get/Delete helpers is the standard approach. Using sync.Mutex (not
  sync.RWMutex) is fine here since operations are cheap and contention is low. The test includes a proper concurrency stress test
  (TestRepositoryTimeConcurrency).

@konflux-ci-qe-bot

Copy link
Copy Markdown

Scenario: konflux-e2e-image-controller
@MartinBasti: The following test has Failed, say /retest to rerun failed tests.

PipelineRun Name Status Rerun command Build Log Test Log
konflux-e2e-image-controller-n725n Failed /retest View Pipeline Log View Test Logs

Inspecting Test Artifacts

To inspect your test artifacts, follow these steps:

  1. Install ORAS (see the ORAS installation guide).
  2. Download artifacts with the following commands:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/konflux-test-storage/konflux-team/image-controller:konflux-e2e-image-controller-n725n

Test results analysis

🚨 Error occurred while running the E2E tests, list of failed Spec(s):

➡️ [failed] [It] Image Controller E2E tests PaC component build when a new component without specified branch is created and with visibility private correctly targets the default branch (that is not named 'main') with PaC [image-controller, github-webhook, pac-build, pipeline, pac-custom-default-branch]

Click to view logs

Unexpected error:
    <*errors.errorString | 0xc0003f36b0>: 
    failed to create test namespace build-e2e-nozh: timeout waiting for service account konflux-integration-runner to be created in namespace build-e2e-nozh with error: context deadline exceeded
    {
        s: "failed to create test namespace build-e2e-nozh: timeout waiting for service account konflux-integration-runner to be created in namespace build-e2e-nozh with error: context deadline exceeded",
    }
occurred

OCI Artifact Browser URL

View in Artifact Browser

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants