Skip to content

fix(ui): prevent race conditions after move row by only replacing state for latest change #8961

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

dpnolte
Copy link

@dpnolte dpnolte commented Oct 31, 2024

What?

Race conditions occur when dragging blocks around with the result the wrong data is stored in the block.

To replicate this issue:

  1. Create a collection with two blocks and auto save on. The blocks should preferably have different field names in groups to make the issue more noticeable. And it seems that the chance of this happening increases when you add a conditional field on a group field (see also the reproducing video & code below)
  2. Create document with two blocks containing different data and make sure it is saved
  3. Set throttling to 3g
  4. Drag the block so that the order is switched. You might need to do this multiple times to catch the race condition.
  5. Drag again quickly so that the blocks are again in original order. It should be dragged fast, other wise the race conditions doesn't occur.
  6. When expanding the block, we've seen the wrong contents in the block: block 1 contains the content of block 2 and vice versa.

To illustrate this, please watch the following video:
https://github.com/user-attachments/assets/671b0488-16ef-4ab0-b52c-289d6986efe9

At the start of the video, each block the properly field values stored.. After the dragging with throttling, the top two blocks contain no values anymore. And the bottom block contains incorrect values; these incorrect values are actually originally from a different block.

Code used in video (just an empty Payload project withFruits collection added):
https://github.com/dpnolte/payload-blocks-drag-issue

Why?

This can have serious impact by loosing the canges you've made. We've seen this happening a couple of times in our setup. This happens more often if pages contain more blocks since many blocks will be moved in a longer list and when a block contains additional async fetches to get more data. So it would be really nice to have this fixed.

How?

The underlying cause seems to be that the execute on change processes any handlers as promises. And when the on change handlers are processed, the state is updated again with REPLACE_STATE to incorporate any form data from the server (e.g., validation). Since the on change handlers are async, other form state change can be processed which updates the form fields in the mean time. It seems that this can happen at least by dispatching multiple MOVE_ROW in short timespan. The consqeuence is that when changing the order of the blocks with MOVE_ROW in the race condition time window, the replace state will overwrite the contents of the blocks based on the incorrect positioning of the blocks .

To fix this, this PR checks if the form state has not changed after the async change handlers are processed. If the form state is still the same, it will continue as it does now and dispatch a REPLACE_STATE. If not, it will early exit and no REPLACE_STATE will be dispatched to prevent setting a new state that is derived from an outdated source. Since the form fields have been changed, the onchange event will be triggered again so it will dispatch REPLACE_STATE to merge server form state into it.

@dpnolte dpnolte force-pushed the fix/prevent-race-conditions-after-move-row branch 3 times, most recently from 5e4ce72 to 015c17f Compare October 31, 2024 10:21
@dpnolte dpnolte force-pushed the fix/prevent-race-conditions-after-move-row branch 3 times, most recently from 96c1675 to a562670 Compare November 2, 2024 13:43
@dpnolte
Copy link
Author

dpnolte commented Nov 2, 2024

As mentioned in the PR description, i've set up minimal reproducible example at https://github.com/dpnolte/payload-blocks-drag-issue/tree/main .
On the main branch, I can break it while dragging a couple of times as shown in the video.
I've also added now a patch branch. This branch includes a patch with the fix in this PR. Using this branch, I can no longer reproduce the issue. I've also added some console logs to get an idea what the impact of the fix is:
image

@paulpopus
Copy link
Contributor

I'm struggling to replicate your issue on the latest beta. I've followed the steps and the exact configuration and I can't see any data loss when moving items around. I tried variations of this and made sure to return the items to the original position but it all looks fine

@dpnolte
Copy link
Author

dpnolte commented Nov 6, 2024

I'm struggling to replicate your issue on the latest beta. I've followed the steps and the exact configuration and I can't see any data loss when moving items around. I tried variations of this and made sure to return the items to the original position but it all looks fine

Thanks for looking into it. The video that I recorded was on the latest beta, see also the code here: https://github.com/dpnolte/payload-blocks-drag-issue/blob/main/package.json. The issue becomes more prevalent when using group fields and conditional field on the groups (as the fruits collection in the repro). But you said that you incorporated the same config right?

Perhaps it has more to do with quickly dragging the blocks around and the right throttlign. That is, it helps to expand all blocks first, drag the blocks around multiple times (like 20x times) and quickly with throttling to 3g to replicate the issue.

This is a bug in the category that most of the time it will work fine. But our users already experienced this a couple of times. And it has then serious impact because it loses their changes with autosave. If you want, I can show it to you in a video call?

@dpnolte
Copy link
Author

dpnolte commented Nov 6, 2024

I've just recorded a new video after a fresh pnpm install with Payload's version set to "beta" to illustrate that the issue still persists:

Screen.Recording.2024-11-06.at.07.15.06.mp4

This video also starts by creating the document to show which data I've used. After dragging a lot and expanding/collapsing, you'll see that the blocks contain the wrong values. The Pear block has no values at all anymore and the apple block contains the wrong values.
PNPM version:
image

@paulpopus
Copy link
Contributor

Thanks for the follow up video, managed to reproduce! Will check the PR and see if it's the root cause

@dpnolte
Copy link
Author

dpnolte commented Nov 6, 2024 via email

@paulpopus
Copy link
Contributor

I can confirm the PR works, though I'll put it on ice for a few days until we the upcoming RSC PR Is merged, since it will touch many files including this one, it may also fix the problem in another way. I'll revisit this at that point, thank you for your patience

@dpnolte
Copy link
Author

dpnolte commented Nov 8, 2024

All right, I can imagine you want to let the assumingly big PR land first :) Could you please link the PR for RSC into this conversation so that it can be traced?
It should be easy to check if this fix is still needed. The race condition is still there if async onChange handlers are followed by the REPLACE_STATE dispatch in the client component.

@paulpopus
Copy link
Contributor

All right, I can imagine you want to let the assumingly big PR land first :) Could you please link the PR for RSC into this conversation so that it can be traced? It should be easy to check if this fix is still needed. The race condition is still there if async onChange handlers are followed by the REPLACE_STATE dispatch in the client component.

This is the PR #8364 , we're expecting to merge it very soon so one of the other reasons we've put some PRs on hold is to avoid merge conflicts

@dpnolte
Copy link
Author

dpnolte commented Dec 9, 2024

@paulpopus the PR has ben merged for a while now. But the issue still persist since there respective code has not changed. Could we please get this PR merged in the so that we can get this issue resolved?

@paulpopus
Copy link
Contributor

@dpnolte Following the same video I haven't been able to reproduce this on the latest of v3. The data does eventually show as expected, it's just a bit slow in doing that versus what it did before with never showing up

@dpnolte
Copy link
Author

dpnolte commented Dec 11, 2024

@paulpopus I've just reproduced it again with payload 3.5.0. Please see the video below. There is no code change so I also don't expected it to fixed?

Screen.Recording.2024-12-11.at.21.34.08.mov

@dpnolte
Copy link
Author

dpnolte commented Dec 11, 2024

Just also reproduced on payload 3.6.0. The code has not changed. I really hope we can get this merged now please.

Screen.Recording.2024-12-11.at.21.52.29.mov

@paulpopus
Copy link
Contributor

Thank you for checking again. We're looking into this ASAP

@paulpopus paulpopus added status: verified If an issue has been reproduced prioritized labels Dec 13, 2024
@jacobsfletch
Copy link
Member

We're going to need some tests written for this to prevent future regression, @dpnolte would be able to do this?

@dpnolte
Copy link
Author

dpnolte commented Dec 19, 2024

I could have a go at it. Hopefully, I'll find the time soon.

I think it will be tricky to create the right preconditions to test the race condition properly. Namely, at the edit page, the Edit's view abort controller in the on change handler often prevents the race condition by aborting the previous form state call (see relevant code).
However, the race condition can still occur because useDebouncedEffect hook is used to process the on change event handlers (as seen here). This hook uses a setTimeout which makes it being run in the next event loop tick and run after a certain period of time (see relevant code line). This leaves the time window open for the race condition to occur. That is, the fields could have already changed while a previous onChange handler function is still being processed and this ultimately overrides it with the wrong state.

It could therefore be better to not test this with a default edit page, but with a custom view that uses the Form component with a custom onChange handler as property. It would circumvent the race condition prevention that the Edit's abort controller provides. Does that make sense? Any recommendations perhaps to setup such a custom view (perhaps there is a similar test already that I can refer to)? Thanks!

@jacobsfletch
Copy link
Member

@dpnolte I'm unable to reproduce this no matter how hard I try. I have your repo pulled down and running locally. Still no luck. I've been playing with this for quite some time, following your instructions as exactly as possible. I even bumped down to v3.5.0 to ensure that #9892 wasn't the fix for this. Here's a screen recording from my end:

Untitled.mp4

Regardless, thank you for your diligence on this issue so far. I really appreciate all the detail you've provided. It's very helpful. Hopefully we can achieve a consistent reproduction here and get this one squared away.

@leon-marzahn
Copy link

leon-marzahn commented Dec 25, 2024

I have a decently complex page in payload and on the latest version i get this literally all the time
image

It never loads, it just stuck like this until i refresh the page. It seems to only happen when i add a new block rather than move them.

The difference for me is, i use quite a lot of drafts, so maybe reproducing it with drafts would be easier?

@jacobsfletch
Copy link
Member

@leon-marzahn your issue sounds more like #10070, which is that your blocks are stuck in an infinite loading state when adding new rows. This is fixed with #10175. This particular issue, while similar, describes how rearranging blocks can randomly misplace the block state.

Can anyone confirm that this issue still exists? I'm so far unable to recreate this, see my comment here.

@dpnolte
Copy link
Author

dpnolte commented Dec 28, 2024

Thanks @jacobsfletch for delving into it and a belated merry christmas!

I understand this is hard to reproduce. And admittedly, it now way harder to reproduce than the beta version we were on previously. So things are definitely improving 🚀

I've made quite some effort now to showcase the issue. I hope you appreciate it and we can finally land this.

First, I would like to point out. That I could still reproduce this issue even with the latest Payload's 3.11.0 version. I could only reproduce it though manually if I drag the blocks when they are expanded:

Screen.Recording.2024-12-25.at.10.59.16.mov

Another thing what would have helped if every block type was different in the example repo. But even then, the reproducing steps are difficult because the race condition is caused by out of order form state processing that is caused by dragging blocks in the time window that the race condition can occur. This is the time window that the onChange handler is not yet processed and the next MOVE_ROW is already processed. This time window is now even smaller since the onChange handler is now debounced at 250ms and earlier it was 150ms. The time window has thus gotten smaller, but it is still there. Because of this small time window and there were only 2 block types instead of 3, it takes some time to reproduce it with manual dragging. It took me 30+ minutes or so.

Small diagram to illustrate what I'm trying to save above:
image

EDIT: please see comment below for a smaller test case as reproducible that is part of this PR.

To be able to reproduce the issue everytime and to showcase the issue, I've created an automated e2e test in my payload fork that recreates the right pre-conditions for this race condition to occur. Please checkout branch fix/drag-issue:
https://github.com/dpnolte/payload/tree/fix/drag-issue

This test drags the blocks around randomly and with random time wait time in between. The following video demonstrates the test with first fix applied and then without fix:

Screen.Recording.2024-12-28.at.14.34.22.mov

The relevant e2e test can be found here:
https://github.com/dpnolte/payload/blob/fix/drag-issue/test/fields-drag-blocks/e2e.spec.ts

You can run the test yourself by checking out that branch. This branch is rebased on latest of main (at the moment of writing) and includes the fix that is included with this PR.

To run the test yourself, please follow the following steps:

I've run this test countless times. In my results, the test fails every time as some of the form fields have missing values. If you run this test in headed mode, be careful not to move the mouse inside browser as that affects the test.

To review that the fix solves this issue, uncomment the lines 668-670 in @payloadcms/ui's Form component.
If you run it, the test should pass. I've also run this test with fix countless times. And when the test passes, it means that the issue didn't occur anymore even after dragging 100 times the blocks around. The test also verifies whether the data is correctly stored since it verifies the block field contain the latest values after awaiting an autosave and reloading the page.

All in all, with your improvements of Payload, the probability of this race condition happening is getting lower and lower. But it can still happen and the likelihood goes up if you have a page with many blocks. It also increases if you deploy it on vercel/serverless since the get form state request could experience a cold start that delays the on change rquest handling. And, unfortunately, if it happens, the impact is high because the data can become lost and the wrong form state can be stored with an autosave. This has happened to some of our users :(

Please let me know if anything is unclear. As a next step, if you want, I could also add the test to this PR but then simplified so that it takes less time to run.

@dpnolte dpnolte force-pushed the fix/prevent-race-conditions-after-move-row branch from a562670 to 069c086 Compare December 30, 2024 08:38
@dpnolte dpnolte requested a review from denolfe as a code owner December 30, 2024 08:38
@dpnolte
Copy link
Author

dpnolte commented Dec 30, 2024

@jacobsfletch I've tried to find the smallest reproducible test case for you. I've refined the test quite a bit and have managed to get a reproducible test case without 3g network emulation and only 3 drags. I've added it to this PR as an e2e test, called "should be able to drag blocks during get form state without race condition". You can find it in the fixup commit. It is the smallest test case that I could find to reproduce the issue. The key part of the test is that it perform a drag at the moment the form state is requested.

To reproduce the issue, please follow the following steps:

  • Checkout my PR branch
  • Disable the fix by commenting out lines the lines 668-670 in @payloadcms/ui's Form component.
  • Run the dev server pnpm dev drag-and-drop-blocks
  • Run the specific e2e test. I did this by running npx playwright test --headed -g "should be able to drag blocks during get form state without race condition" in the tests folder. It should fail. If you undo the comment, the test should pass.

Just in case I've also recorded it in a video:

Screen.Recording.2024-12-30.at.09.28.35.mov

Note: I've also added an e2e test case for dragging polymorphic blocks around since I couldn't find such a test yet. And I thought it would be a nice addition?

@dpnolte
Copy link
Author

dpnolte commented Jan 10, 2025

@paulpopus @jacobsfletch could you please have a look? With this merged PR, the probability of this happening went up again since it now more often to be set to be changed.

@dpnolte dpnolte force-pushed the fix/prevent-race-conditions-after-move-row branch from 069c086 to 38e982f Compare March 1, 2025 13:32
@dpnolte
Copy link
Author

dpnolte commented Mar 1, 2025

I've just checked again by running the e2e test without the if (contextRef.current.fields !== stateAtStartOfChange) check. And the issue still persist after rebasing on main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
prioritized status: verified If an issue has been reproduced
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants