RTC: Fix race condition on room creation which can cause a split update log#77675
RTC: Fix race condition on room creation which can cause a split update log#77675danluu wants to merge 3 commits intoWordPress:trunkfrom
Conversation
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message. To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
|
👋 Thanks for your first Pull Request and for helping build the future of Gutenberg and WordPress, @danluu! In case you missed it, we'd love to have you join us in our Slack community. If you want to learn more about WordPress development in general, check out the Core Handbook full of helpful information. |
6a3168f to
a02b370
Compare
| return null; | ||
| } | ||
|
|
||
| clean_post_cache( $post_id ); |
There was a problem hiding this comment.
I haven’t followed the post-cache issues too closely, but I wonder if this is going to be problematic as well.
I would expect that promoting a non-canonical storage meta will occur rarely (vs. multiple times per second as was occurring in trunk for edits), but I would like some double-checking on this to make sure it would not cause thrashing on the site with other posts.
@peterwilsoncc this PR might seem a bit random, but are you able to speak to the use of clean_post_cache() here? the situation is that two sessions opened a post and due to a race condition, created two separate wp_sync_storage rows. this code is resolving that race.
There was a problem hiding this comment.
I think this can use wp_update_post()
| self::POST_TYPE, | ||
| $room_hash, | ||
| $wpdb->esc_like( $room_hash . '-' ) . '%' | ||
| ) |
There was a problem hiding this comment.
I think the concern may be minor here, but I wanted to ask if this OR clause is hiding a needless table scan. I asked an LLM to analyze it and it suggested that a UNION instead could introduce two index-scans to replace a single table scan.
Here is MariaDB’s EXPLAIN for the two versions:
MariaDB [wordpress]> EXPLAIN SELECT ID FROM wp_posts WHERE post_type = "wp_sync_storage" AND post_status = "publish" AND ( post_name = "abc123" OR post_name LIKE "abc123-" ) ORDER BY ID ASC;
+------+-------------+----------+-------+-----------------------------------------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+-------+-----------------------------------------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | wp_posts | index | post_name,type_status_date,type_status_author | PRIMARY | 8 | NULL | 35 | Using where |
+------+-------------+----------+-------+-----------------------------------------------+---------+---------+------+------+-------------+
1 row in set (0.003 sec)
MariaDB [wordpress]> EXPLAIN SELECT ID FROM wp_posts WHERE post_type = "wp_sync_storage" AND post_status = "publish" AND post_name = "abc123" UNION SELECT ID FROM wp_posts WHERE post_type = "wp_sync_storage" AND post_status = "publish" AND post_name LIKE "abc123-" ORDER BY ID ASC;
+------+--------------+------------+-------+-----------------------------------------------+-----------+---------+-------+------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------+------------+-------+-----------------------------------------------+-----------+---------+-------+------+----------------+
| 1 | PRIMARY | wp_posts | ref | post_name,type_status_date,type_status_author | post_name | 766 | const | 1 | Using where |
| 2 | UNION | wp_posts | range | post_name,type_status_date,type_status_author | post_name | 766 | NULL | 1 | Using where |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | Using filesort |
+------+--------------+------------+-------+-----------------------------------------------+-----------+---------+-------+------+----------------+
3 rows in set (0.002 sec)Granted, for my little test site with two posts both are going to be roughly equivalent, but if we hit a site with thousands of posts, I wonder if this could introduce noticeable DB overhead.
Not being sure how to interpret the Using filesort I also asked about eliminating that by sorting in PHP and there was no clear guidance other than to say that when the expected set is small (which it is here) there’s no practical difference.
|
@danluu Are you able to provide more information about how the race condition is being forced, so I can do so locally. The reason I am asking is that I'd like to know how real world this is, the additional work being done will slow down requests which would make any race condition more likely to occur. @dmsnell The |
peterwilsoncc
left a comment
There was a problem hiding this comment.
I've added some first view notes inline but my main question is in the comment above re: reproduction steps.
I know from conversations with Matt that he is pretty keen on avoiding the post/post_meta approach so I'd really prefer to focus our energies there rather than on code that's unlikely to make a WordPress release.
As the updates are added to a row identified by the room name, the race condition isn't able to occur with the improved storage structure.
Aside: while post_meta has been the prompt for Core-64696 (don't read all of it), using posts for rooms is an abuse of both the posts and posts_meta tables in my view.
| $move_result = $wpdb->update( | ||
| $wpdb->postmeta, | ||
| array( 'post_id' => $canonical_post_id ), | ||
| array( 'post_id' => $duplicate_id ), | ||
| array( '%d' ), | ||
| array( '%d' ) | ||
| ); |
There was a problem hiding this comment.
This may result in two copies of awareness post_meta stored against a single post ID.
| $post_id = $wpdb->get_var( | ||
| $wpdb->prepare( | ||
| "SELECT ID FROM {$wpdb->posts} WHERE post_type = %s AND post_status = 'publish' AND post_name = %s ORDER BY ID ASC LIMIT 1", | ||
| self::POST_TYPE, | ||
| $room_hash | ||
| ) | ||
| ); |
There was a problem hiding this comment.
This can use the get_posts API.
$posts = get_posts(
array(
'post_type' => self::POST_TYPE,
'posts_per_page' => 1,
'post_status' => 'publish',
'name' => $room_hash,
'fields' => 'ids',
'orderby' => 'ID',
'order' => 'ASC',
)
);| return null; | ||
| } | ||
|
|
||
| clean_post_cache( $post_id ); |
There was a problem hiding this comment.
I think this can use wp_update_post()
|
@peterwilsoncc Thanks for taking a look! Specifically on the question of the repro, the checked in e2e test forced a race by creating a For a less artificial (but non-deterministic) repro, this script attempts to demonstrate the race: https://github.com/danluu/gutenberg/blob/danluu/rtc-room-split-explanation/docs/explanations/fuzzer-bugs/rtc-issue-08-normal-rest-repro.sh. The script creates concurrent draft posts across users and then checks if there was a room split. It works on my machine fairly consistently, but it may be less consistent in a different environment. |
What?
This is part of a series of bug reports and PRs from an AI fuzzing project. See #77532 for more details on that. I'm filing a few more of these after a discussion with @alecgeatches on how the fuzzer found a bug that I hadn't file that ended up taking some developer time to chase down and repro.
first-room-storage-live-repro.mp4
first-room-storage-document-edit-repro.mp4
BEGIN AI GENERATED TEXT
The post-meta sync storage backend creates one
wp_sync_storagepost per sync room and usesmd5( $room )as the post slug. On first access to a new room, two concurrent writers can both miss the existing-storage lookup and both callwp_insert_post()with the same slug. WordPress keeps slugs unique, so one request receives the canonical slug and the other receives a suffixed slug such as<hash>-2.There is one trunk/product bug covered by this fix:
First-writer split storage bug. The original storage creation path trusted the inserted post ID without checking whether WordPress kept the exact room-hash slug. If a request received a suffixed post, it could acknowledge awareness or sync updates written to the suffixed lineage. A fresh storage instance later looks up only the canonical room hash, so those acknowledged first-access writes are unreachable. Source:
get_storage_post_id()creates and caches the storage post. Repro/test:test_first_access_race_does_not_split_room_storage().Branch Layout
The PR branch
danluu/fix-rtc-room-splitintentionally contains only the PHPUnit regression tests and the storage fix. This explanation is kept on the separatedanluu/rtc-room-split-explanationbranch so it does not appear in the PR diff.There is no top-level Playwright/e2e test in the PR branch.
Reproduction Layers
PHPUnit storage repro
phpunit/tests/collaboration/wpSyncPostMetaStorage.php::test_first_access_race_does_not_split_room_storageThis test installs a
wp_insert_post_datafilter that injects a competingwp_insert_post()exactly between the storage layer's miss and its insert. It then asserts:wp_sync_storagelineage exists for the room hash;WP_Sync_Post_Meta_Storageinstance can read the acknowledged update.On the vulnerable implementation, the test fails because WordPress creates both
<hash>and<hash>-2, and the update can be written to the suffixed post.Browser/video repro
The local browser/video harness uses editor-visible presence to expose the split. It creates a draft post, enables a test-only race injector for the post's sync room, opens the post as admin, and delays the initial empty sync polls until the admin's presence payload contains visible collaborator metadata. The first meaningful admin presence poll then travels through the production
/wp-sync/v1/updatesendpoint and triggers the storage race.After that acknowledged first write, the test pauses admin polling so the normal retry loop cannot immediately heal the symptom. A collaborator opens the same post and polls the canonical room. The test then holds the collaborator's next poll and releases the admin poll. On the vulnerable implementation, the admin screen learns that the collaborator joined, but the collaborator screen still has no
Collaborators listbutton, because the admin's acknowledged first presence write was written to the suffixed storage lineage.The deterministic browser harness still drives the production editor and sync REST path. The artificial parts are the scheduler that forces the otherwise timing-dependent first-writer race and the short admin-poll pause that keeps the user-visible presence loss observable long enough for the assertion and video. The diagnostic REST endpoint verifies the supporting storage invariant. The browser/video harness is not part of the PR branch.
Annotated videos for this bug:
/Users/danluu/conductor/workspaces/gutenberg-v1/providence/.context/rtc-issue-08/first-room-storage-live-repro.mp4/Users/danluu/conductor/workspaces/gutenberg-v1/providence/.context/rtc-issue-08/first-room-storage-document-edit-repro.mp4Observed vulnerable-branch result:
Real-vs-False-Positive Analysis
Oracle validity
The oracle is valid. A sync room must have one reachable storage lineage because fresh readers find storage by the canonical room hash. If an acknowledged write lands in a suffixed lineage, it is not reachable through normal
get_storage_post_id()lookup in a new request.Generated shape validity
The generated room shape is valid. The low-level repro uses
postType/post:<id>:first-access-race, which matches the production REST room pattern. The browser repro uses the normal editor roompostType/post:<id>.Helper misuse
The failing path does not rely on private test-only storage calls for the browser repro. It opens normal editor sessions and lets the production sync endpoint call
WP_Sync_Post_Meta_Storage. The PHPUnit repro uses the storage class directly to minimize the race window, but the same get-or-insert code is used by production REST requests.Environment contamination
The tests reset the static room cache before fresh-reader checks. The room hash includes the test post ID and a unique suffix, and lineage counting filters by post type, publish status, exact room-hash slug, and suffixed room-hash slugs. The failure is not caused by stale cache state or older storage rows.
Race-injection-only behavior
The injection forces an interleaving that can occur with two concurrent HTTP requests first accessing a new room: both miss lookup, both insert, and WordPress uniquifies one slug. The injected writer uses
wp_insert_post()with the same post type, status, title, and slug as production storage creation.Known-fixed bug masking
The issue is independent of previously fixed cursor and compaction races. Those bugs affect update retrieval and deletion after storage already exists. This bug happens before the room has canonical storage and can lose the very first awareness or sync update into a suffixed lineage.
Fix Plan
The storage layer must treat the exact room-hash slug as the only canonical storage lineage:
The PR branch intentionally does not include this explanation, the browser-only diagnostic mu-plugin, or an e2e repro harness.
END AI GENERATED TEXT
Use of AI Tools
The code here is all AI generated. As noted above, the bug finding came from an AI fuzzing project.