Skip to content

Commit c060a97

Browse files
Cdn cache improvements 3 (#2860)
* Doc: propose model-callback CDN purge as future work Section 9.3 honestly noted that purging is wired into the controller update/destroy actions and the bulk recalculation, but not a model callback, so a future write path that changes a displayed project field could leave the cached anonymous show page stale until Surrogate-Control max-age expires. Add Section 11 recording, as a deliberate (not-yet-implemented) proposal, an after_commit model callback that purges on any project change. Includes the exact callback, the implementation details that keep it correct (notably gating on skip_callbacks so the bulk recalc's single purge_all is not turned into thousands of per-project purges, after_commit vs after_save, enqueue vs synchronous purge, scope to update/destroy, removing the now-redundant controller purges, and purge-on-any-update over a fragile column allowlist), the pros and cons, the residual cross-model (user_display_name) limitation, and tests. Doc-only; no behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com> * Make CDN re-purge unconditional on project update/destroy projects#update previously scheduled the delayed re-purge only inside the "@project.save succeeded" branch. But update_additional_rights writes the AdditionalRight table (rendered on the anonymous /permissions page) in its own transaction, which commits independently of @project.save and is not rolled back if the save later fails. On a save-fails-after-rights-changed path, only the early immediate purge fired -- losing the delayed re-purge that closes the TCP-reorder race, so the CDN could hold obsolete collaborator data until max-age. Schedule the delayed re-purge unconditionally on entry to update (after the request is authorized), alongside the early immediate purge, and drop the now -redundant post-save delayed purge (keeping the post-save immediate purge for freshest-data eviction). Extra purges are harmless and the request is already authorized by that point. Apply the same immediate+delayed pair to destroy, which previously did only an immediate purge: a stale re-cache of a deleted project's page is exactly the case worth the extra re-purge. Also correct an outdated comment: ActiveJob is backed by solid_queue (a database queue) in production, so the delayed purge is durable and survives a restart mid-wait -- it is not "stored in RAM" as the old comment claimed. Add a regression test (CdnCachingTest): a permissions-only update (changing only additional_rights, project row unchanged) must enqueue PurgeCdnProjectJob with the project's record_key. Guards against re-gating the purge on @project.saved_changes?, which would silently serve stale rights. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com> --------- Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 441e7f4 commit c060a97

3 files changed

Lines changed: 216 additions & 24 deletions

File tree

app/controllers/projects_controller.rb

Lines changed: 41 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -690,8 +690,34 @@ def create
690690
def update
691691
# Only accept updates if there's no repo_url change OR if change is ok
692692
if repo_url_unchanged_or_change_allowed?
693-
# Send CDN purge early, to give it time to distribute purge request
693+
# Purge this project from the CDN as soon as the request is authorized,
694+
# BEFORE saving, doing BOTH an immediate purge and a delayed re-purge --
695+
# and do BOTH unconditionally here, not only after a successful save.
696+
#
697+
# Why a delayed re-purge at all: the server always has the newest data
698+
# once committed, but TCP/IP does not guarantee the CDN receives our
699+
# replies in send order. The CDN can receive an *old* in-flight response
700+
# *after* our purge and cache it, holding stale data until max-age. A
701+
# second purge a few seconds later evicts that straggler; the next
702+
# request then repopulates the cache with correct data.
703+
#
704+
# Why unconditionally (rather than only on @project.save success):
705+
# update_additional_rights (below) writes the AdditionalRight table in
706+
# its own transaction -- data the anonymous /permissions page renders --
707+
# and that write commits independently of @project.save and is NOT
708+
# rolled back if the save later fails. So a save-fails-after-rights-
709+
# changed path still changes what anonymous users see; scheduling the
710+
# delayed re-purge here closes the TCP-reorder race for that path too.
711+
# Extra purges are harmless (no long-term effect), and by this point the
712+
# request is already authorized.
713+
#
714+
# Note: in production ActiveJob is backed by solid_queue (a database
715+
# queue), so the delayed purge is durable -- it survives a restart
716+
# during its wait, and the race-closer is not lost.
694717
@project.purge_cdn_project
718+
PurgeCdnProjectJob.set(
719+
wait: BADGE_PURGE_DELAY.seconds
720+
).perform_later(@project.record_key)
695721
# Capture the level being worked on (baseline or traditional badge)
696722
old_badge_level = current_working_level(@criteria_level, @project)
697723
final_project_params = project_params
@@ -744,29 +770,11 @@ def update
744770
# after saving.
745771
if @project.save
746772
successful_update(format, old_badge_level, @criteria_level)
747-
# We must send a purge later, not just now, due to a subtle race
748-
# condition. Here's what is going on.
749-
# The server and the CDN communicate over TCP/IP. This *server*
750-
# will always produce the newest information once it's committed.
751-
# However, TCP/IP does *NOT* guarantee that different replies
752-
# from a server will be received (by the CDN) in the same order that
753-
# they were sent. This means that the CDN can receive *old* data
754-
# after # receiving a purge request and newer data, resulting in
755-
# a CDN caches with obsolete data that will be held for a long time.
756-
# A solution: Wait a short time, then send *another* purge. That way
757-
# even if the CDN receives updates out-of-order, that old data will
758-
# be purged. The next request following this additional purge will
759-
# receive the updated data, and then the CDN will have correct data.
760-
#
761-
# Note: ActiveJob by default stores jobs in RAM. If the system is
762-
# restarted while a job is active, and jobs are stored in RAM, the
763-
# job will be lost and not executed. The long-term solution is to put
764-
# jobs in the database.
765-
PurgeCdnProjectJob.set(
766-
wait: BADGE_PURGE_DELAY.seconds
767-
).perform_later(@project.record_key)
768-
# Also send CDN purge last, to increase likelihood of being purged
769-
# and replaced with correct data even before the delayed purpose.
773+
# Final immediate purge after the commit, so the freshest data
774+
# evicts any stale copy as soon as possible. The delayed re-purge
775+
# that closes the TCP-reorder race was already scheduled
776+
# unconditionally on entry (see the comment there), so we do not
777+
# schedule another one here.
770778
@project.purge_cdn_project
771779
else
772780
format.html { render :edit, criteria_level: @criteria_level }
@@ -811,7 +819,16 @@ def destroy
811819
end
812820
format.json { head :no_content }
813821
end
822+
# Purge the deleted project from the CDN, both immediately and on a delay.
823+
# The delayed re-purge closes the same TCP-reorder race as in update (see
824+
# the long comment there): an old, in-flight show response could reach the
825+
# CDN just after the immediate purge and re-cache a deleted project's page
826+
# for up to max-age. record_key still resolves after destroy! (the id
827+
# remains in memory). solid_queue makes the delayed job durable.
814828
@project.purge_cdn_project
829+
PurgeCdnProjectJob.set(
830+
wait: BADGE_PURGE_DELAY.seconds
831+
).perform_later(@project.record_key)
815832
end
816833
# rubocop:enable Metrics/MethodLength, Metrics/AbcSize
817834

docs/cdn-cache-not-logged-in.md

Lines changed: 152 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -993,6 +993,18 @@ that exact key.
993993
end
994994
```
995995

996+
* **Known limitation — non-controller writes.** Purging is wired into the
997+
controller `update`/`destroy` actions
998+
([`projects_controller.rb:694,765-770,814`](../app/controllers/projects_controller.rb))
999+
and the bulk recalculation (`Project.update_all_badge_percentages`, which
1000+
ends with `FastlyRails.purge_all`) — **not** into a model callback. Any
1001+
*other* path that changes a displayed project field via `save` /
1002+
`update_column` without purging would leave the cached show page stale until
1003+
`Surrogate-Control`'s `max-age` expires (10 days). No such path is known
1004+
today, but the safety net is fragile. Moving the purge into an `after_commit`
1005+
model callback would make purging robust against *any* write path; the design,
1006+
with its pros and cons, is in **Section 11**.
1007+
9961008
### 9.4 Page variance between users
9971009

9981010
Two distinct variance axes must both be safe.
@@ -1251,3 +1263,143 @@ new paths (e.g. `/en` and `/en/criteria`), confirming first-`MISS`-then-`HIT`,
12511263
no `Set-Cookie`, and that a session-cookie request returns `private, no-store`.
12521264
If anything misbehaves, set `BADGEAPP_CACHE_MISC_PAGES=false` to disable
12531265
instantly.
1266+
1267+
---
1268+
1269+
## 11. Potential future work: purge the CDN from a model callback
1270+
1271+
> **Status: not implemented; a proposal.** Today CDN purging is triggered
1272+
> explicitly from the controller `update`/`destroy` actions and from the bulk
1273+
> recalculation (`Project.update_all_badge_percentages`). This section records
1274+
> the design for making purging a model-level `after_commit` callback — a more
1275+
> robust "single source of truth" — together with its pros, cons, and the
1276+
> pitfalls that make a *naive* callback worse than the status quo. It is written
1277+
> down so the decision can be made deliberately later; it is **not** required
1278+
> for the project-show caching to be correct as shipped.
1279+
1280+
### 11.1 Motivation
1281+
1282+
Purging is currently wired into specific code paths rather than the data model:
1283+
1284+
* Controller `update`/`destroy` call `@project.purge_cdn_project` and schedule
1285+
a delayed `PurgeCdnProjectJob`
1286+
([`projects_controller.rb:694,765-770,814`](../app/controllers/projects_controller.rb)).
1287+
* The bulk recalculation issues one `FastlyRails.purge_all`
1288+
([`project.rb`](../app/models/project.rb), `update_all_badge_percentages`).
1289+
1290+
Both known write paths are covered, so the shipped feature is correct. The
1291+
weakness is structural: nothing *guarantees* that a **future** code path which
1292+
changes a displayed project field (name, badge percentages,
1293+
`achieved_passing_at` / `lost_passing_at`, any criterion answer or
1294+
justification) also purges. Such a path would leave the cached anonymous show
1295+
page stale until `Surrogate-Control`'s `max-age` expires (10 days). Because the
1296+
show page renders almost every project field, the set of "fields that matter" is
1297+
effectively "the whole record", so the safe rule is *purge whenever the project
1298+
changes*. An `after_commit` callback expresses exactly that, independent of who
1299+
performed the save.
1300+
1301+
### 11.2 Proposed implementation
1302+
1303+
Add a single callback to [`app/models/project.rb`](../app/models/project.rb),
1304+
reusing the existing `record_key` and `PurgeCdnProjectJob`:
1305+
1306+
```ruby
1307+
# Purge this project's cached CDN resources (badge, JSON, and the anonymous
1308+
# show HTML -- all tagged with record_key) whenever the project changes by
1309+
# ANY path, not just controller edits. Skipped during bulk recalculation,
1310+
# which issues a single purge_all itself (update_all_badge_percentages).
1311+
after_commit :enqueue_cdn_purge, on: %i[update destroy], unless: :skip_callbacks
1312+
1313+
private
1314+
1315+
def enqueue_cdn_purge
1316+
PurgeCdnProjectJob.perform_later(record_key)
1317+
# Delayed re-purge closes the rolling-deploy / read-repopulation race --
1318+
# the same recovery the controller schedules today.
1319+
PurgeCdnProjectJob
1320+
.set(wait: ApplicationController::BADGE_PURGE_DELAY.seconds)
1321+
.perform_later(record_key)
1322+
end
1323+
```
1324+
1325+
The implementation details that make this correct (a naive callback is *worse*
1326+
than today without them):
1327+
1328+
1. **Gate on `skip_callbacks` — the critical pitfall.**
1329+
`update_all_badge_percentages` sets the `cattr_accessor :skip_callbacks`
1330+
true, saves 10,000+ projects, and then issues **one** `FastlyRails.purge_all`.
1331+
Without `unless: :skip_callbacks`, the callback would fire on every one of
1332+
those commits and enqueue 20,000+ individual purge jobs in place of that
1333+
single purge_all — a severe regression. The flag is still true during each
1334+
in-loop commit, so the guard suppresses it correctly.
1335+
2. **Use `after_commit`, not `after_save`.** Purge only after the data is
1336+
durably committed; purging inside the transaction lets a concurrent anonymous
1337+
read re-populate the cache with pre-commit (or about-to-roll-back) content.
1338+
3. **Enqueue the job; do not purge synchronously in the callback.** A
1339+
synchronous `FastlyRails.purge_by_key` adds a Fastly network round-trip (10 s
1340+
timeout) to *every* commit on *every* path. `PurgeCdnProjectJob` is async
1341+
with retry/backoff and is the better fit now that purging fires everywhere.
1342+
(This is a slight behavior change from the controller's current synchronous
1343+
pre/post-save purge.)
1344+
4. **Scope to `update`/`destroy`.** A newly created project has no cached page
1345+
yet, so a create-time purge is wasted work. `record_key` still resolves in
1346+
`after_commit` on destroy (the id remains in memory).
1347+
5. **Remove the now-redundant controller purges** (the calls at
1348+
`projects_controller.rb:694,765-770,814`), or every controller edit purges
1349+
twice. Collapsing three call sites into one definition is the main
1350+
maintainability payoff.
1351+
6. **Purge on *any* update; do not build a "displayed-columns changed"
1352+
allowlist.** Because the show page renders nearly every field, an allowlist
1353+
is fragile and would risk re-introducing the exact staleness bug this change
1354+
exists to prevent. Reads vastly outnumber writes here, so an occasional purge
1355+
for a bookkeeping-only update is cheap insurance — correctness over hit rate.
1356+
1357+
### 11.3 Pros and cons
1358+
1359+
**Pros**
1360+
1361+
* **Robust by construction.** Any current or future write path — console
1362+
fix-ups, rake tasks, background jobs, new controllers — purges automatically.
1363+
Closes the Section 9.3 "non-controller writes" limitation.
1364+
* **DRY.** One definition replaces three controller call sites; the purge
1365+
policy lives next to the data it protects.
1366+
* **Reuses existing machinery** (`record_key`, `PurgeCdnProjectJob`,
1367+
`BADGE_PURGE_DELAY`); no new infrastructure.
1368+
1369+
**Cons**
1370+
1371+
* **A new global invariant with a sharp edge.** The correctness of the
1372+
high-volume bulk path now depends on the `skip_callbacks` guard. If a future
1373+
refactor introduces another bulk write without that flag, it could trigger a
1374+
purge storm. (Mitigated by a test; see Section 11.4.)
1375+
* **More purges overall.** Every update — including ones that touch only
1376+
non-displayed bookkeeping columns — now enqueues two purge jobs. Negligible
1377+
given the read/write ratio, but non-zero load on the job queue and Fastly API.
1378+
* **Slight timing change.** Moving from synchronous to job-based purging adds
1379+
sub-second job-pickup latency before eviction; the delayed re-purge already
1380+
tolerates this.
1381+
* **Not a full solution to cross-model staleness** (next subsection), so it can
1382+
create a false sense of "all staleness is handled".
1383+
1384+
### 11.4 Residual limitation: cross-model changes
1385+
1386+
The show page renders `@project.user_display_name`. A model callback on
1387+
`Project` does **not** fire when the owning `User` renames themselves, so that
1388+
project's cached page would still show the old name until the next project edit
1389+
or `max-age` expiry. Fully closing this would need an `after_commit` on `User`
1390+
that purges every `record_key` of that user's projects — a larger change with
1391+
its own performance considerations (a prolific owner could trigger many
1392+
purges). It is almost certainly not worth it for a display-name change, but it
1393+
should be acknowledged rather than silently assumed handled.
1394+
1395+
### 11.5 Tests
1396+
1397+
* Assert a plain change enqueues the purge:
1398+
`assert_enqueued_with(job: PurgeCdnProjectJob, args: [project.record_key])`
1399+
after `project.update!(name: 'x')`.
1400+
* Assert the **bulk path does not** enqueue per-project purges (guarding the
1401+
`skip_callbacks` pitfall in 11.2.1): run `update_all_badge_percentages` and
1402+
assert zero `PurgeCdnProjectJob` enqueues from the loop (the single
1403+
`purge_all` is a separate call).
1404+
* Keep the Section 9.3 `Surrogate-Key` test, which ties the cached page to the
1405+
key the callback purges.

test/integration/cdn_caching_test.rb

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,8 @@
77

88
# rubocop:disable Metrics/ClassLength
99
class CdnCachingTest < ActionDispatch::IntegrationTest
10+
include ActiveJob::TestHelper
11+
1012
setup do
1113
@project = projects(:one)
1214
end
@@ -158,6 +160,27 @@ class CdnCachingTest < ActionDispatch::IntegrationTest
158160
response.headers['Surrogate-Key']
159161
end
160162

163+
# A permissions-only edit changes the AdditionalRight table -- which the
164+
# anonymous /permissions page renders via additional_rights_to_s -- without
165+
# necessarily changing the projects row. The cached page must still be
166+
# purged. projects#update schedules the delayed re-purge UNCONDITIONALLY on
167+
# entry (not gated on @project.save), so even a rights-only change (and the
168+
# save-fails-after-rights-changed path) enqueues a purge of the project's
169+
# surrogate key. This guards against a future refactor that re-gates the
170+
# purge on @project.saved_changes? and would silently serve stale rights.
171+
test 'permissions-only update enqueues a CDN purge of the project key' do
172+
log_in_as(@project.user)
173+
assert_enqueued_with(
174+
job: PurgeCdnProjectJob, args: [@project.record_key]
175+
) do
176+
patch "/en/projects/#{@project.id}", params: {
177+
# Leave the project row unchanged; exercise the rights-only path.
178+
project: { name: @project.name },
179+
additional_rights_changes: "+ #{users(:test_user_mark).id}"
180+
}
181+
end
182+
end
183+
161184
def with_forgery_protection
162185
original = ActionController::Base.allow_forgery_protection
163186
ActionController::Base.allow_forgery_protection = true

0 commit comments

Comments
 (0)