fix(informer): use ReadWriteLock in CacheImpl to prevent index read inconsistency #7558

Merged
manusa merged 7 commits into fabric8io:main from Desel72:fix/issue-#7265
Apr 2, 2026

Conversation

@Desel72
Contributor

Desel72 commented Mar 12, 2026

Description

Adds a disabled concurrency test that reproduces the race condition described in #7265.

CacheImpl.updateIndices() performs a two-step operation when updating an object's index entry: first removing the old entry, then adding the new one. While write methods (put(), remove()) are synchronized, read methods (byIndex(), indexKeys()) are not, allowing concurrent readers to observe the intermediate state where an item has been removed but not yet re-added.

This test verifies that index reads never observe partially-updated state during concurrent writes. It is @Disabled until a follow-up PR provides the fix.
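
A minimal sketch of the read/write-lock idea named in the PR title, which the thread later sets aside in favor of #7575. `GuardedIndex` and its fields are hypothetical stand-ins, not the fabric8 `CacheImpl` API. The point is only that the remove-then-add update and the index read take opposite sides of one `ReentrantReadWriteLock`, so a reader cannot observe the state between the two steps.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical, simplified stand-in for an indexed cache; not the fabric8 CacheImpl. */
class GuardedIndex {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  // index value -> keys of the objects currently filed under that value
  private final Map<String, Set<String>> index = new HashMap<>();
  // object key -> index value it is currently filed under
  private final Map<String, String> current = new HashMap<>();

  /** Files a key under an index value; the remove and the add happen under one write lock. */
  void put(String key, String indexValue) {
    lock.writeLock().lock();
    try {
      String old = current.put(key, indexValue);
      if (old != null) {
        Set<String> oldKeys = index.get(old);
        oldKeys.remove(key); // step 1: drop the old entry
        if (oldKeys.isEmpty()) {
          index.remove(old);
        }
      }
      // step 2: add the new entry
      index.computeIfAbsent(indexValue, v -> new HashSet<>()).add(key);
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Returns a snapshot of the keys under an index value, taken under the read lock. */
  Set<String> byIndex(String indexValue) {
    lock.readLock().lock();
    try {
      Set<String> keys = index.get(indexValue);
      return keys == null ? Collections.emptySet() : new HashSet<>(keys);
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

In this sketch `byIndex` returns a copy, so callers never iterate a set that a writer is mutating. The cost is that readers block while a writer holds the lock, which is the trade-off the benchmark discussion later in this thread is about.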

Changes

  • CacheImplConcurrencyTest.java: New concurrent test that reproduces the race condition (4 writer + 8 reader threads)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Chore (non-breaking change which doesn't affect codebase;
    test, version modification, documentation, etc.)

Checklist

  • Code contributed by me aligns with current project license: Apache 2.0
  • I added a CHANGELOG entry regarding this change
  • I have implemented unit tests to cover my changes
  • I have added/updated the javadocs and other documentation accordingly
  • No new bugs, code smells, etc. in SonarCloud report
  • I tested my code in Kubernetes
  • I tested my code in OpenShift

@Desel72
Contributor Author

Desel72 commented Mar 12, 2026

@manusa Could you review this PR? I've added the test.

@Desel72
Contributor Author

Desel72 commented Mar 13, 2026

Hi @manusa Could you help me so that I can test? I can't find what the issue is. Thanks

@Desel72
Contributor Author

Desel72 commented Mar 14, 2026

Hi @manusa, could you review this PR please?

@Desel72
Contributor Author

Desel72 commented Mar 15, 2026

Hi @manusa I've updated. Could you please review this?

@Desel72
Contributor Author

Desel72 commented Mar 17, 2026

Hi @manusa @ash-thakur-rh @shawkins
Could you please review this PR?

knative.dev/eventing-couchdb v0.28.0
knative.dev/eventing-github v0.46.3
knative.dev/eventing-gitlab v0.46.3
knative.dev/eventing-gitlab v0.48.0
Member

@Desel72 Please revert the changes to the go.mod file. These changes are out of scope for this fix.

<jackson.bundle.version.annotations>2.21</jackson.bundle.version.annotations>
<jetty.version>11.0.26</jetty.version>
<maven-core.version>3.9.13</maven-core.version>
<maven-core.version>3.9.14</maven-core.version>
Member

Same here: please revert the dependency updates. They are also out of scope for this fix. Was there a specific reason for making these changes?


Mockito.doAnswer(invocation -> {
assertTrue(Thread.holdsLock(podCache.getLockObject()));
assertTrue(((java.util.concurrent.locks.ReentrantReadWriteLock) podCache.getLock()).isWriteLockedByCurrentThread());
Member

Minor request: use a specific import instead of the fully qualified name (FQN).

Contributor Author

Desel72 commented Mar 17, 2026

@ash-thakur-rh Thanks for your feedback. I will check soon.

Desel72 and others added 2 commits March 17, 2026 23:57
@Desel72
Contributor Author

Desel72 commented Mar 17, 2026

Hi @ash-thakur-rh @manusa I've resolved this. Can you please review it?

Member

ash-thakur-rh left a comment

Looks good overall! Just one change request in the test. Also, please add an entry for the fix to the CHANGELOG file.

private static final String LABEL_INDEX = "label-index";

@Test
void byIndexShouldNeverMissObjectDuringConcurrentUpdates() throws InterruptedException {
Member

These two tests are nearly identical; they could be a single parameterized test or share a helper method.

Contributor Author

Desel72 commented Mar 18, 2026

Hi @ash-thakur-rh Thanks for your feedback. I've done that. Can you please review this?

Desel72 and others added 2 commits March 18, 2026 16:32
- Replaced two nearly-identical test methods with a single parameterized
  test using @ValueSource and a shared helper method
- Added CHANGELOG entry for issue fabric8io#7265

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Member

ash-thakur-rh left a comment

LGTM! Thanks!

@Desel72
Contributor Author

Desel72 commented Mar 19, 2026

Thanks @ash-thakur-rh

@Desel72
Contributor Author

Desel72 commented Mar 19, 2026

@manusa @shawkins Could you please review this?

@shawkins
Contributor

The changes look good if we're ok with introducing more locking.

It was an intentional choice to initially keep the implementation as lock free as possible. If we want to remain as lock free as possible, we should use something like #7575 - the read of an index will still not be fully consistent but there won't be the problem described in #7265

The test is a slippery slope: the number of iterations here is small, but we'll generally want to avoid trying to test concurrency this way in unit tests.

@Desel72
Contributor Author

Desel72 commented Mar 20, 2026

> The changes look good if we're ok with introducing more locking.
>
> It was an intentional choice to initially keep the implementation as lock free as possible. If we want to remain as lock free as possible, we should use something like #7575 - the read of an index will still not be fully consistent but there won't be the problem described in #7265
>
> The test is a slippery slope: the number of iterations here is small, but we'll generally want to avoid trying to test concurrency this way in unit tests.

@shawkins Do you mean this should be updated? If so, could you please let me know what I should do? I will not give up until it is solved.

@shawkins
Contributor

> @shawkins Do you mean this should be updated? If so, could you please let me know what I should do? I will not give up until it is solved.

We just need to have consensus on which direction to go. @manusa @csviri @metacosm @ash-thakur-rh, do you want to keep the minimally locking behavior via the draft shown in #7575?

Or go with fully consistent index reads via a read/write lock?

I vaguely remember some user complaints about the old fully locking behavior, and I don't know whether a read/write lock would have satisfied them (obviously that depends on how frequent the events are).

@csviri
Contributor

csviri commented Mar 20, 2026

Not sure how feasible it is to measure the performance degradation. But the lack of consistency in those indexes made them quite hard to reason about. Resources that were occasionally missing from an index made it pretty much unusable for JOSDK internal purposes, so we ended up maintaining our own index.

@Desel72
Contributor Author

Desel72 commented Mar 20, 2026

@shawkins @csviri @manusa
I think this PR can be merged. Thanks for your effort.

@shawkins
Contributor

shawkins commented Mar 20, 2026

> Not sure how feasible it is to measure the performance degradation.

Adapting the test included here a little (JDK 25, 8,000,000 iterations, 8 readers, 1 writer, because that matches our usage, writing indefinitely with 2 ms between writes):

This PR: ~6 to 8.5 seconds per test

Under this scenario a read/write lock seems to perform worse than full synchronization. That probably relates to the cost of the indexing function and how many values it returns; this example is simple, so the index function cost is minimal. Since we have no control over what users may be doing with those functions, if we want full consistency it's best to stick with a read/write lock.

#7575: ~0.4 to 1.1 seconds per test

> But the lack of consistency in those indexes made them quite hard to reason about. Resources that were occasionally missing from an index made it pretty much unusable for JOSDK internal purposes, so we ended up maintaining our own index.

Sorry I didn't pay attention to this before.

To double check, are you good to just address the ephemeral removal, or do you want fully consistent reads from the indexes?

Edit: to elaborate on the current state of #7575:

  • There can be a delta (depending on the cost of the indexing functions) between when items that need to be newly added to indexes are seen in the indexes and when the item is put into the ItemStore. This was the intent of the original changes in "Reduce CacheImpl lock contention" (#5973), but it was not implemented correctly.
  • If an item has already been indexed and its resource version does not match the current cache item, we double-check that it still belongs in the index before returning it. This prevents stale reads with respect to the current state of the cache.

@csviri I believe this level of consistency is good enough for your usage - events aren't emitted until after the CacheImpl put call completes, at which point the indexes are up-to-date. If that is correct, then we should proceed with #7575.

If not, then this PR is good.

@Desel72
Contributor Author

Desel72 commented Mar 23, 2026

Hi @csviri I think this PR can be merged. Could you please review this?

@csviri
Contributor

csviri commented Mar 24, 2026

@shawkins @Desel72 sorry, I've had a few busy days; I will take a look tomorrow.

@Desel72
Contributor Author

Desel72 commented Mar 25, 2026

Hi @csviri, how are you doing? Are you busy now?

@Desel72
Contributor Author

Desel72 commented Mar 27, 2026

Hi @csviri how are you? Are you busy now?

Contributor

csviri left a comment

LGTM, but @shawkins @manusa should make the final decision regarding which version to proceed with.

@Desel72
Contributor Author

Desel72 commented Mar 30, 2026

Hi @shawkins @manusa @ash-thakur-rh I think this can be merged. Please check this. thanks @csviri

@shawkins
Contributor

@Desel72 see #7575 (comment). I believe that @csviri is okay with the concurrency described there, so I will refine that PR to close #7265.

@Desel72
Contributor Author

Desel72 commented Mar 30, 2026

@shawkins Do you mean my PR will be closed?

@shawkins
Contributor

> @shawkins Do you mean my PR will be closed?

How about I squash the changes from the other PR, then you cherry-pick that into this one and add whatever other tests seem appropriate?

@Desel72
Contributor Author

Desel72 commented Mar 31, 2026

Thanks @shawkins. I totally agree with you. I think this PR should be merged perfectly. I appreciate this.

@manusa
Member

manusa commented Mar 31, 2026

Thanks for working on this @Desel72, the concurrency test you've put together is really valuable — it clearly reproduces the race condition from #7265.

However, after reviewing both this PR and the alternative approach in #7575, and considering the performance implications (@shawkins' benchmarks show a 6-10x regression with ReadWriteLock), I think we should go with the approach in #7575.

That said, the CacheImplConcurrencyTest you wrote is exactly what we need to validate the fix. Could you reduce this PR to just the reproducer test (removing the ReadWriteLock changes)? The test should be disabled (e.g., @Disabled("https://github.com/fabric8io/kubernetes-client/issues/7265")) since it will fail until #7575 lands the actual fix.

To summarize:

This way your contribution is preserved and provides the foundation for validating the actual fix. Thanks for your patience and persistence on this!

@Desel72
Contributor Author

Desel72 commented Mar 31, 2026

Hi @manusa I got it. Thanks

Revert ReadWriteLock changes to CacheImpl, ProcessorStore, and
ProcessorStoreTest. Remove CHANGELOG entry. Keep CacheImplConcurrencyTest
as a disabled reproducer for fabric8io#7265, to be enabled once fabric8io#7575 lands.
@Desel72
Contributor Author

Desel72 commented Apr 2, 2026

Hi @manusa @shawkins How are you? I've made the changes. Is this the right approach? Please check and let me know. Thanks for your feedback.

Member

manusa left a comment

Thanks for splitting this out @Desel72, the reproducer test looks good overall.

One thing to address: doneLatch.await(30, TimeUnit.SECONDS) and executor.awaitTermination(5, TimeUnit.SECONDS) return values are not checked. If threads hang or deadlock, the test silently passes because missDetected defaults to false. This matters especially when the follow-up fix enables the test — a deadlock would go unnoticed.

Please assert completion, e.g.:

assertThat(doneLatch.await(30, TimeUnit.SECONDS))
    .as("All threads should complete within timeout")
    .isTrue();

Same for awaitTermination.
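
A minimal plain-Java sketch of the check requested above, assuming nothing about the PR's actual test body. `CompletionCheck` and `runAll` are hypothetical names; the real test would assert on the two boolean results with AssertJ inside a JUnit method, as in the snippet above. The sketch only shows that both `CountDownLatch.await` and `ExecutorService.awaitTermination` report a timeout through their return value, so ignoring it hides hung threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; not part of the fabric8 test suite. */
class CompletionCheck {

  /** Returns true only if every task finished and the pool terminated within the timeout. */
  static boolean runAll(int threads, Runnable task, long timeoutSeconds) {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    CountDownLatch doneLatch = new CountDownLatch(threads);
    for (int i = 0; i < threads; i++) {
      executor.submit(() -> {
        try {
          task.run();
        } finally {
          doneLatch.countDown(); // count down even if the task throws
        }
      });
    }
    try {
      // await() returns false on timeout instead of throwing, so the result must be kept
      boolean finished = doneLatch.await(timeoutSeconds, TimeUnit.SECONDS);
      executor.shutdownNow(); // interrupt anything still running
      boolean terminated = executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
      return finished && terminated;
    } catch (InterruptedException e) {
      executor.shutdownNow();
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }
}
```

A test built on this shape fails when `runAll` returns false, so a deadlock surfaces as a failure rather than as a pass with `missDetected` still at its default.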

Check return values of doneLatch.await() and executor.awaitTermination()
so that a deadlock or hung thread fails the test instead of silently
passing.
@Desel72
Contributor Author

Desel72 commented Apr 2, 2026

Is this okay?

Member

manusa left a comment

LGTM, thx!

@manusa manusa added this to the 7.7.0 milestone Apr 2, 2026 — with automated-tasks
@manusa manusa merged commit 19d49dc into fabric8io:main Apr 2, 2026
18 of 19 checks passed
@Desel72
Contributor Author

Desel72 commented Apr 2, 2026

Perfect! @manusa, are there any other issues I could work on? If so, please let me know what else I can do. I really want to contribute.
