Skip to content

Resolve concurrency issues that were discovered when upgrading the ws package to the latest version. #7068

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

kryon-7
Copy link
Contributor

@kryon-7 kryon-7 commented Apr 9, 2025

This PR contains:

  1. update ws package to 18.8.1 problems found during the ws package upgrade that make ci error #7067
  2. resolve concurrency issues

Describe the problem you have without this PR

After update ws to 18.81 version, it will make test case doing insert after subscribe should end with the correct results error.
Through my investigation, this is a concurrency issue that has existed since the beginning. The upgrade of the ws package only changed the message transfer speed, making the concurrency problem more likely to surface. When I manually slowed down the query execution to an extremely low speed, the original version of ws also exhibited the same concurrency issue, causing the client to lose necessary data when processing the changesStream.

The cause of the problem

image First, let me explain the root cause of the concurrency issue. We categorize the data in the server database into three distinct phases: v1, v2, and v3.
  1. During database v1 phase, the client sends a query request to the server.
  2. The server receives the query request and begins fetching data from the now-updated v2 version of the database.
  3. Between the v2 to v3 transition period, the server may send query results to the client (this is possible because server operations are non-atomic - for example, after obtaining v2 data from storage, time-consuming operations like filtering and sorting occur). By the time the client actually receives the query results, the server's database version has already progressed to v3.
  4. Between database versions v1 to v2 and v2 to v3, other modification requests were received and the changesStream was sent.

So by the time the client obtained the query results and began processing the changesStream, it was already too late. The query results we obtained were from version 2, but I started getting the changesStream from the version 3 database.

What I do

The most ideal situation is that we get the changeStream starting from v2. However, the client and the server are two separate entities, so we can only find a checkPoint close to v2 to start merging data.

  1. Starting from v3 would lead to data loss.
  2. It is also possible for the server to return the query results along with the v2 checkPoint. However, this is equivalent to changing the transmission protocol between queries, and concurrency issues would still exist. This is because obtaining the checkPoint and obtaining the query results are two separate operations, not an atomic operation combined together. There could still be edge cases.
  3. Therefore, we can only start getting the changesStream from v1.

When rxQuery mustReExec, it records the changePoint. Let changePoint = _changeEventBuffer.getCounter(), which records the current database version. When the query results are obtained, the changesEvent is obtained starting from the recorded changePoint. Of course, this will cause data redundancy, but data redundancy is much easier to handle than data loss. All we need to do is to implement idempotency checks.

And then the issue has been resolved, and the CI is currently functioning normally. CI Result

@kryon-7
Copy link
Contributor Author

kryon-7 commented Apr 14, 2025

Hi @pubkey. I've submitted a PR #7075 to reproduce this bug, but the test case requires at least 6 seconds to complete. However, some test scripts have a timeout setting of only 3 seconds. During my testing, if I adjust parameters (like insertion count and query time) to make the test complete within 3 seconds, the bug might not be reproducible.

Could you clarify: Were these 3-second timeout configurations intentionally set this way, or was it perhaps an oversight and they should be 10 seconds?

I'm happy to address any questions about this PR and make modifications.

@pubkey
Copy link
Owner

pubkey commented Apr 16, 2025

For now I "fixed" the issue by making the ws not bundling messages together and confusing the order: #7082

This at least lets us upgrade to the latest ws version without the security issue.
You described problem might still exists and should be investigated/reproduced/fixed.

@kryon-7
Copy link
Contributor Author

kryon-7 commented Apr 16, 2025

For now I "fixed" the issue by making the ws not bundling messages together and confusing the order: #7082

This at least lets us upgrade to the latest ws version without the security issue. You described problem might still exists and should be investigated/reproduced/fixed.

Got it. I wasn't fully aware of the community guidelines before, but I'll make sure to follow them going forward.

@kryon-7
Copy link
Contributor Author

kryon-7 commented Apr 16, 2025

hi @pubkey However, the issue still persists in the latest version and has been successfully reproduced. You should take a look at this. CI

This is based on the code from this PR: #7075 . The issue can be reliably reproduced in MongoDB because it operates slightly slower than local storage due to additional disk I/O and network latency.

@kryon-7
Copy link
Contributor Author

kryon-7 commented Apr 18, 2025

Hi @pubkey , I've set a flag for computing the changes. The idempotency check will only be triggered when recreate queries to prevent data modification loss, while remaining inactive in other scenarios without any system performance impact. The tests have passed successfully, including the merged longquery tests.

Regarding the denoky storage issue that also occasionally occurs on the main branch: since I have limited knowledge about this database, I haven't identified the root cause yet.

Would you mind checking if I've missed anything or if anything needs tweaking?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants