fix: ensure navigation sidebar serves fresh data after course publish #38785
wgu-taylor-payne wants to merge 1 commit into
After a course publish in Studio, the CourseNavigationBlocksView can cache stale block structure data for up to 1 hour. This happens because the block structure rebuild task runs with a 30-second delay, but the navigation view may be hit during that window, read the old block structure from its cache, and store the stale result under the new course_version key. The fix adds an update_collected_if_needed() call on cache miss, ensuring the block structure is fresh before we build and cache the navigation tree. This only runs on cache misses and adds negligible overhead for the common case (block structure already up-to-date).
Summary
After a course publish, there is a ~30-second window where the navigation sidebar endpoint serves stale block structure data. Worse, the stale response is cached for 1 hour, extending the staleness far beyond the initial window.
This PR adds a synchronous staleness check before reading the block structure on a navigation cache miss.
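As a rough illustration of what a synchronous staleness check on a cache miss buys, here is a minimal toy model. It is a sketch under stated assumptions, not the edx-platform implementation: `ToyModulestore`, `ToyBlockStructureCache`, and `navigation_blocks` are hypothetical stand-ins, and only the method names `is_up_to_date` and `update_collected_if_needed` are borrowed from this PR's description.

```python
# Toy model only: every class and function here is a hypothetical stand-in,
# not edx-platform code. It mirrors the behavior described in this PR.
from dataclasses import dataclass, field


@dataclass
class ToyModulestore:
    """Source of truth; the version is bumped on every publish."""
    version: int = 1
    units: list = field(default_factory=lambda: ["unit-a", "unit-b"])

    def publish_delete(self, unit):
        self.units = [u for u in self.units if u != unit]
        self.version += 1


@dataclass
class ToyBlockStructureCache:
    """Collected block structure, normally refreshed by a delayed task."""
    modulestore: ToyModulestore
    version: int = 0
    units: list = field(default_factory=list)
    rebuilds: int = 0

    def rebuild(self):
        self.units = list(self.modulestore.units)
        self.version = self.modulestore.version
        self.rebuilds += 1

    def is_up_to_date(self):
        return self.version == self.modulestore.version

    def update_collected_if_needed(self):
        # Rebuild synchronously only when the cached version is behind.
        if not self.is_up_to_date():
            self.rebuild()


def navigation_blocks(nav_cache, store, block_cache, check_staleness):
    """Return the navigation tree, caching it under a version-based key."""
    key = f"nav:{store.version}"
    cached = nav_cache.get(key)
    if cached is not None:
        return cached
    # Cache miss: optionally make sure the block structure is fresh first.
    if check_staleness:
        block_cache.update_collected_if_needed()
    tree = list(block_cache.units)
    nav_cache[key] = tree
    return tree


if __name__ == "__main__":
    for check in (False, True):
        store = ToyModulestore()
        blocks = ToyBlockStructureCache(store)
        blocks.rebuild()
        nav_cache = {}
        store.publish_delete("unit-b")  # delayed rebuild has not run yet
        print(check, navigation_blocks(nav_cache, store, blocks, check))
```

In this model the check costs one version comparison when the block structure is already current, and a rebuild happens only when the versions differ.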
Problem
This issue surfaced while testing an internal Open edX instance: after a unit was deleted in Studio, refreshing the course in the learning MFE still showed the unit in the course outline.
In a Verawood sandbox, I deleted a unit in Studio, waited 30+ seconds, and then refreshed; the unit was removed from the outline in the learning MFE. I then deleted another unit in Studio and refreshed the course page within a few seconds, and the deleted unit still showed. It still showed when I refreshed again before an hour had passed. After an hour, a refresh no longer showed it in the outline.
Three components interact to create a race condition:
1. Block structure rebuild is delayed 30s after publish
The `course_published` signal handler queues the rebuild task with a 30s countdown.
2. Navigation cache key includes course version, causing a miss on publish
The cache key uses `course_version`, so after a publish the version changes and the previous cached response no longer matches. This is a cache miss.
3. On cache miss, stale block structure data is read and cached for 1 hour
The `if not course_blocks` branch calls `get_course_outline_block_tree()`, which reads from the (still stale) block structure cache, then stores the result for 1 hour.
Timeline:
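The interaction of the three components can be replayed with a small clock-driven simulation. This is a hedged sketch, not code from the PR: `simulate`, `REBUILD_DELAY`, and `NAV_CACHE_TTL` are hypothetical names, and the only values taken from the description are the 30-second countdown and the 1-hour cache lifetime.

```python
# Hypothetical simulation of the race; names and structure are illustrative only.
REBUILD_DELAY = 30      # seconds, the countdown described above
NAV_CACHE_TTL = 3600    # seconds (1 hour), the navigation cache lifetime


def simulate(request_times, publish_time=0):
    """Replay navigation requests after a single publish at publish_time.

    Returns a list of (time, "stale" | "fresh") pairs, one per request.
    """
    course_version = 2              # version right after the publish
    block_structure_version = 1     # still the pre-publish structure
    rebuild_at = publish_time + REBUILD_DELAY
    nav_cache = {}                  # key -> (expires_at, version of cached data)
    seen = []
    for now in sorted(request_times):
        if now >= rebuild_at:
            # The delayed rebuild task has run by this point.
            block_structure_version = course_version
        key = ("nav", course_version)
        entry = nav_cache.get(key)
        if entry is None or now >= entry[0]:
            # Cache miss: read whatever the block structure holds right now
            # and store it for a full TTL, with no staleness check.
            entry = (now + NAV_CACHE_TTL, block_structure_version)
            nav_cache[key] = entry
        seen.append((now, "stale" if entry[1] < course_version else "fresh"))
    return seen


if __name__ == "__main__":
    # First request lands inside the 30s window.
    print(simulate([5, 60, 3500, 3606]))
    # First request lands after the rebuild has run.
    print(simulate([45, 100]))
```

A first request 5 seconds after publish pins the stale tree until its entry expires at 3605 seconds, while a first request at 45 seconds is fresh from the start. That matches the sandbox observations described earlier.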
Fix
Before reading the block structure on a cache miss, call `update_collected_if_needed()`. This compares the cached block structure version against the modulestore version and synchronously rebuilds only if stale.
Performance impact
`is_up_to_date` check (DB read + version comparison).
Testing
To run the automated test:
Manual steps:
AI Usage
Used Kiro with model set to auto to aid in the discovery of the root cause of the error, talk through potential fixes, come up with a test that addresses the issue being fixed, and write this PR summary.