Skip to content

RTC: Fix race condition on room creation which can cause a split update log#77675

Open
danluu wants to merge 3 commits intoWordPress:trunkfrom
danluu:danluu/fix-rtc-room-split
Open

RTC: Fix race condition on room creation which can cause a split update log#77675
danluu wants to merge 3 commits intoWordPress:trunkfrom
danluu:danluu/fix-rtc-room-split

Conversation

@danluu
Copy link
Copy Markdown

@danluu danluu commented Apr 25, 2026

What?

This is part of a series of bug reports and PRs from an AI fuzzing project. See #77532 for more details on that. I'm filing a few more of these after a discussion with @alecgeatches on how the fuzzer found a bug that I hadn't file that ended up taking some developer time to chase down and repro.

first-room-storage-live-repro.mp4
first-room-storage-document-edit-repro.mp4

BEGIN AI GENERATED TEXT

The post-meta sync storage backend creates one wp_sync_storage post per sync room and uses md5( $room ) as the post slug. On first access to a new room, two concurrent writers can both miss the existing-storage lookup and both call wp_insert_post() with the same slug. WordPress keeps slugs unique, so one request receives the canonical slug and the other receives a suffixed slug such as <hash>-2.

There is one trunk/product bug covered by this fix:

First-writer split storage bug. The original storage creation path trusted the inserted post ID without checking whether WordPress kept the exact room-hash slug. If a request received a suffixed post, it could acknowledge awareness or sync updates written to the suffixed lineage. A fresh storage instance later looks up only the canonical room hash, so those acknowledged first-access writes are unreachable. Source: get_storage_post_id() creates and caches the storage post. Repro/test: test_first_access_race_does_not_split_room_storage().

Branch Layout

The PR branch danluu/fix-rtc-room-split intentionally contains only the PHPUnit regression tests and the storage fix. This explanation is kept on the separate danluu/rtc-room-split-explanation branch so it does not appear in the PR diff.

There is no top-level Playwright/e2e test in the PR branch.

Reproduction Layers

PHPUnit storage repro

phpunit/tests/collaboration/wpSyncPostMetaStorage.php::test_first_access_race_does_not_split_room_storage

This test installs a wp_insert_post_data filter that injects a competing wp_insert_post() exactly between the storage layer's miss and its insert. It then asserts:

  • only one published wp_sync_storage lineage exists for the room hash;
  • the surviving lineage uses the exact room hash slug;
  • a fresh WP_Sync_Post_Meta_Storage instance can read the acknowledged update.

On the vulnerable implementation, the test fails because WordPress creates both <hash> and <hash>-2, and the update can be written to the suffixed post.

Browser/video repro

The local browser/video harness uses editor-visible presence to expose the split. It creates a draft post, enables a test-only race injector for the post's sync room, opens the post as admin, and delays the initial empty sync polls until the admin's presence payload contains visible collaborator metadata. The first meaningful admin presence poll then travels through the production /wp-sync/v1/updates endpoint and triggers the storage race.

After that acknowledged first write, the test pauses admin polling so the normal retry loop cannot immediately heal the symptom. A collaborator opens the same post and polls the canonical room. The test then holds the collaborator's next poll and releases the admin poll. On the vulnerable implementation, the admin screen learns that the collaborator joined, but the collaborator screen still has no Collaborators list button, because the admin's acknowledged first presence write was written to the suffixed storage lineage.

The deterministic browser harness still drives the production editor and sync REST path. The artificial parts are the scheduler that forces the otherwise timing-dependent first-writer race and the short admin-poll pause that keeps the user-visible presence loss observable long enough for the assertion and video. The diagnostic REST endpoint verifies the supporting storage invariant. The browser/video harness is not part of the PR branch.

Annotated videos for this bug:

  • /Users/danluu/conductor/workspaces/gutenberg-v1/providence/.context/rtc-issue-08/first-room-storage-live-repro.mp4
  • /Users/danluu/conductor/workspaces/gutenberg-v1/providence/.context/rtc-issue-08/first-room-storage-document-edit-repro.mp4

Observed vulnerable-branch result:

Expected lineage_count: 1
Received lineage_count: 2
Admin screen shows collaborator presence
Collaborator screen has no Collaborators list button

Real-vs-False-Positive Analysis

Oracle validity

The oracle is valid. A sync room must have one reachable storage lineage because fresh readers find storage by the canonical room hash. If an acknowledged write lands in a suffixed lineage, it is not reachable through normal get_storage_post_id() lookup in a new request.

Generated shape validity

The generated room shape is valid. The low-level repro uses postType/post:<id>:first-access-race, which matches the production REST room pattern. The browser repro uses the normal editor room postType/post:<id>.

Helper misuse

The failing path does not rely on private test-only storage calls for the browser repro. It opens normal editor sessions and lets the production sync endpoint call WP_Sync_Post_Meta_Storage. The PHPUnit repro uses the storage class directly to minimize the race window, but the same get-or-insert code is used by production REST requests.

Environment contamination

The tests reset the static room cache before fresh-reader checks. The room hash includes the test post ID and a unique suffix, and lineage counting filters by post type, publish status, exact room-hash slug, and suffixed room-hash slugs. The failure is not caused by stale cache state or older storage rows.

Race-injection-only behavior

The injection forces an interleaving that can occur with two concurrent HTTP requests first accessing a new room: both miss lookup, both insert, and WordPress uniquifies one slug. The injected writer uses wp_insert_post() with the same post type, status, title, and slug as production storage creation.

Known-fixed bug masking

The issue is independent of previously fixed cursor and compaction races. Those bugs affect update retrieval and deletion after storage already exists. This bug happens before the room has canonical storage and can lose the very first awareness or sync update into a suffixed lineage.

Fix Plan

The storage layer must treat the exact room-hash slug as the only canonical storage lineage:

  • After inserting a storage post, resolve the canonical exact-slug post before returning or caching a post ID.
  • If the inserted post is the only candidate and no exact slug exists, promote that post to the exact room-hash slug before using it.
  • If a suffixed duplicate is observed while resolving the race, merge its metadata into the exact-slug post.
  • Do not delete a suffixed storage post if moving its postmeta rows fails.
  • Cache only the resolved canonical post ID.
  • Cover the first-access race in PHPUnit.

The PR branch intentionally does not include this explanation, the browser-only diagnostic mu-plugin, or an e2e repro harness.

END AI GENERATED TEXT

Use of AI Tools

The code here is all AI generated. As noted above, the bug finding came from an AI fuzzing project.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 25, 2026

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: danluu <danluu@git.wordpress.org>
Co-authored-by: dmsnell <dmsnell@git.wordpress.org>
Co-authored-by: peterwilsoncc <peterwilsoncc@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@github-actions github-actions Bot added the First-time Contributor Pull request opened by a first-time contributor to Gutenberg repository label Apr 25, 2026
@github-actions
Copy link
Copy Markdown

👋 Thanks for your first Pull Request and for helping build the future of Gutenberg and WordPress, @danluu! In case you missed it, we'd love to have you join us in our Slack community.

If you want to learn more about WordPress development in general, check out the Core Handbook full of helpful information.

@danluu danluu changed the title Fix sync storage first room race RTC: Fix race condition on room creation which can cause a split update log Apr 25, 2026
@danluu danluu force-pushed the danluu/fix-rtc-room-split branch from 6a3168f to a02b370 Compare April 25, 2026 23:32
@t-hamano t-hamano added [Type] Bug An existing feature does not function as intended [Feature] Real-time Collaboration Phase 3 of the Gutenberg roadmap around real-time collaboration labels Apr 27, 2026
return null;
}

clean_post_cache( $post_id );
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven’t followed the post-cache issues too closely, but I wonder if this is going to be problematic as well.

I would expect that promoting a non-canonical storage meta will occur rarely (vs. multiple times per second as was occurring in trunk for edits), but I would like some double-checking on this to make sure it would not cause thrashing on the site with other posts.

@peterwilsoncc this PR might seem a bit random, but are you able to speak to the use of clean_post_cache() here? the situation is that two sessions opened a post and due to a race condition, created two separate wp_sync_storage rows. this code is resolving that race.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can use wp_update_post()

self::POST_TYPE,
$room_hash,
$wpdb->esc_like( $room_hash . '-' ) . '%'
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the concern may be minor here, but I wanted to ask if this OR clause is hiding a needless table scan. I asked an LLM to analyze it and it suggested that a UNION instead could introduce two index-scans to replace a single table scan.

Here is MariaDB’s EXPLAIN for the two versions:

MariaDB [wordpress]> EXPLAIN SELECT ID FROM wp_posts WHERE post_type = "wp_sync_storage" AND post_status = "publish" AND ( post_name = "abc123" OR post_name LIKE "abc123-" ) ORDER BY ID ASC;
+------+-------------+----------+-------+-----------------------------------------------+---------+---------+------+------+-------------+
| id   | select_type | table    | type  | possible_keys                                 | key     | key_len | ref  | rows | Extra       |
+------+-------------+----------+-------+-----------------------------------------------+---------+---------+------+------+-------------+
|    1 | SIMPLE      | wp_posts | index | post_name,type_status_date,type_status_author | PRIMARY | 8       | NULL | 35   | Using where |
+------+-------------+----------+-------+-----------------------------------------------+---------+---------+------+------+-------------+
1 row in set (0.003 sec)

MariaDB [wordpress]> EXPLAIN SELECT ID FROM wp_posts WHERE post_type = "wp_sync_storage" AND post_status = "publish" AND post_name = "abc123" UNION SELECT ID FROM wp_posts WHERE post_type = "wp_sync_storage" AND post_status = "publish" AND post_name LIKE "abc123-" ORDER BY ID ASC;
+------+--------------+------------+-------+-----------------------------------------------+-----------+---------+-------+------+----------------+
| id   | select_type  | table      | type  | possible_keys                                 | key       | key_len | ref   | rows | Extra          |
+------+--------------+------------+-------+-----------------------------------------------+-----------+---------+-------+------+----------------+
|    1 | PRIMARY      | wp_posts   | ref   | post_name,type_status_date,type_status_author | post_name | 766     | const | 1    | Using where    |
|    2 | UNION        | wp_posts   | range | post_name,type_status_date,type_status_author | post_name | 766     | NULL  | 1    | Using where    |
| NULL | UNION RESULT | <union1,2> | ALL   | NULL                                          | NULL      | NULL    | NULL  | NULL | Using filesort |
+------+--------------+------------+-------+-----------------------------------------------+-----------+---------+-------+------+----------------+
3 rows in set (0.002 sec)

Granted, for my little test site with two posts both are going to be roughly equivalent, but if we hit a site with thousands of posts, I wonder if this could introduce noticeable DB overhead.

Not being sure how to interpret the Using filesort I also asked about eliminating that by sorting in PHP and there was no clear guidance other than to say that when the expected set is small (which it is here) there’s no practical difference.

@peterwilsoncc
Copy link
Copy Markdown
Contributor

@danluu Are you able to provide more information about how the race condition is being forced, so I can do so locally.

The reason I am asking is that I'd like to know how real world this is, the additional work being done will slow down requests which would make any race condition more likely to occur.

@dmsnell The clear_post_cache call is correct as so get_posts() queries up-to-date data on follow up calls.

Copy link
Copy Markdown
Contributor

@peterwilsoncc peterwilsoncc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some first view notes inline but my main question is in the comment above re: reproduction steps.

I know from conversations with Matt that he is pretty keen on avoiding the post/post_meta approach so I'd really prefer to focus our energies there rather than on code that's unlikely to make a WordPress release.

As the updates are added to a row identified by the room name, the race condition isn't able to occur with the improved storage structure.

Aside: while post_meta has been the prompt for Core-64696 (don't read all of it), using posts for rooms is an abuse of both the posts and posts_meta tables in my view.

Comment on lines +364 to +370
$move_result = $wpdb->update(
$wpdb->postmeta,
array( 'post_id' => $canonical_post_id ),
array( 'post_id' => $duplicate_id ),
array( '%d' ),
array( '%d' )
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may result in two copies of awareness post_meta stored against a single post ID.

Comment on lines +396 to +402
$post_id = $wpdb->get_var(
$wpdb->prepare(
"SELECT ID FROM {$wpdb->posts} WHERE post_type = %s AND post_status = 'publish' AND post_name = %s ORDER BY ID ASC LIMIT 1",
self::POST_TYPE,
$room_hash
)
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can use the get_posts API.

$posts = get_posts(
				array(
					'post_type'      => self::POST_TYPE,
					'posts_per_page' => 1,
					'post_status'    => 'publish',
					'name'           => $room_hash,
					'fields'         => 'ids',
					'orderby'        => 'ID',
					'order'          => 'ASC',
				)
			);

return null;
}

clean_post_cache( $post_id );
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can use wp_update_post()

@danluu
Copy link
Copy Markdown
Author

danluu commented Apr 28, 2026

@peterwilsoncc Thanks for taking a look! Specifically on the question of the repro, the checked in e2e test forced a race by creating a wp_insert_post_data filter that does a wp_insert_post (with a check to prevent infinite recursion).

For a less artificial (but non-deterministic) repro, this script attempts to demonstrate the race: https://github.com/danluu/gutenberg/blob/danluu/rtc-room-split-explanation/docs/explanations/fuzzer-bugs/rtc-issue-08-normal-rest-repro.sh. The script creates concurrent draft posts across users and then checks if there was a room split. It works on my machine fairly consistently, but it may be less consistent in a different environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

[Feature] Real-time Collaboration Phase 3 of the Gutenberg roadmap around real-time collaboration First-time Contributor Pull request opened by a first-time contributor to Gutenberg repository [Type] Bug An existing feature does not function as intended

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants