DM-54684: Docverse Cloudflare cache purging #183

@jonathansick

Metadata

Jira Key: DM-54684
Jira URL: https://rubinobs.atlassian.net/browse/DM-54684

Problem Statement

When Docverse publishes an edition, it writes a new pointer to Cloudflare Workers KV so the Worker serves a new build for that edition. The Cloudflare CDN, however, also caches Worker responses at the edge. Today the Worker sets Cache-Control: public, max-age=60, which limits staleness to about a minute but also keeps us from raising edge TTLs to reduce R2 egress and Worker invocation cost. Without cache purging on publish, a longer edge TTL would leave readers seeing stale documentation for up to the full TTL window after every edition update.

Separately, editions differ in how much they gain from edge caching. Some (development branches, preview editions) change frequently and don't benefit from it, so purging on each of their publishes would waste Cloudflare purge API calls. Others (stable releases, the project's main edition) are read-heavy and would benefit most from long edge caching with targeted purge-on-update.

The work depends on DM-54685 (project and edition dashboard files) to be testable end-to-end for dashboard purging; that ticket should land before this one.

Solution

From the documentation maintainer's perspective: when I publish a new build to the main edition of my project, readers see the new content within seconds globally, even though the edge was serving the previous build with a long cache lifetime. For a development or preview edition, publishes complete just as quickly as today, with no behavior change for me. My stable-release documentation loads faster worldwide because more requests are served from nearby Cloudflare POPs instead of hitting R2.

From the platform operator's perspective: the Cloudflare stack serves the flagship Docverse organization with tiered edge caching — stable editions cache for up to a day at the edge, dev editions cache for a minute. Publishing a stable edition triggers an automatic hostname-wide Cloudflare cache purge. Test organizations that run on workers.dev (with no zone) continue to work — the purge step becomes a logged no-op. Edition publishing and cache purging are separate concerns behind separate protocols, so the CDN layer can evolve independently of the KV layer.

User Stories

  1. As a documentation maintainer, I want newly published main-edition builds to be visible globally within seconds, so I don't have to wait for the edge TTL to expire.
  2. As a documentation reader, I want stable release pages to load fast from a nearby edge POP, so I don't wait on R2 origin fetches.
  3. As a documentation reader, I want browser-cached pages to refresh frequently (within minutes), so I see corrections quickly without hard-refresh.
  4. As a platform operator, I want to raise Cloudflare's edge TTL for stable editions to a day, so R2 egress and Worker invocation counts drop sharply.
  5. As a platform operator, I want development-edition publishes to skip the purge API call, so we don't spend purge quota on editions that aren't long-cached.
  6. As a platform operator, I want purges to happen automatically on every qualifying publish, so no manual intervention is required.
  7. As a platform operator, I want purge failures to not fail the overall publish, so a transient Cloudflare issue produces stale cache rather than a publishing outage.
  8. As a platform operator, I want purge failures to be logged with full project/edition/build context, so I can diagnose and remediate them.
  9. As a platform operator, I want to run test organizations on workers.dev without configuring a zone_id, so I can spin up ephemeral sandboxes quickly.
  10. As a platform operator, I want absence of zone_id to be a logged no-op rather than an error, so missing-zone orgs are usable without special casing.
  11. As a developer, I want the cache purger to be a separate protocol from the edition publisher, so CDN and KV concerns can evolve independently (e.g. swapping providers, reusing the purger for rollback).
  12. As a developer, I want the cache purger to resolve its configuration from the existing cloudflare_workers service row, so operators manage one Cloudflare service rather than two.
  13. As a developer, I want the purger factory to return a no-op implementation for configurations that don't support purging, so callers don't need to branch on zone_id presence.
  14. As a developer, I want the cache_profile (long vs short) to be derived from the edition's existing EditionKind, so no new schema columns or admin UI are required to roll this out.
  15. As an operator of the flagship deployment, I want the main edition to cache long at the edge (up to a day), so its popular paths are served from the edge.
  16. As an operator, I want pinned release editions (e.g. v24.1.0) to cache long because their content never changes in practice.
  17. As an operator, I want development and preview editions to cache short (~1 minute) so maintainers see their changes quickly without waiting for a purge.
  18. As a developer, I want the Worker to derive Cache-Control from a cache_profile flag in the KV value, so policy changes on the Python side don't require Worker redeploys.
  19. As a developer, I want the Worker to continue emitting the ETag header, so browsers can do cheap conditional revalidation.
  20. As a developer, I want the KV value schema change to be additive (new cache_profile field), so older worker code continues to serve valid responses if cache_profile is missing.
  21. As a platform operator, I want the Worker to default to the short profile if cache_profile is missing from the KV value, so any ambiguous state is safe.
  22. As a documentation maintainer, I want the project dashboard to refresh quickly when a new edition is added, so readers see the new edition.
  23. As a documentation maintainer, I want the project dashboard to refresh quickly when a build is promoted on an edition, so the dashboard lists the latest build info.
  24. As an operator, I want dashboard refreshes to use purge-by-URL rather than hostname-wide purges, so they don't invalidate long-cached release documentation.
  25. As a developer, I want hostname derivation (for the purge scope) to be a pure function honoring url_scheme and slug_rewrite_rules, so it's trivially testable and matches what the Worker actually serves.
  26. As a developer, I want cache-profile derivation to be a pure function, so it's easy to unit-test across every EditionKind.
  27. As a developer, I want purge orchestration to live in EditionPublishingService, so the publish worker remains a thin wrapper and the domain logic stays in one place.
  28. As a developer, I want the CdnCachePurger to be an async context manager matching the existing EditionPublisher pattern, so it's familiar to contributors.
  29. As a developer, I want to unit-test the Cloudflare purger against a mocked httpx transport, so we don't need a live Cloudflare account in CI.
  30. As an operator, I want the purger to reuse the cloudflare_workers service credential (api_token), so we don't store a second token per org.
  31. As a developer, I want clear structured logs on every purge attempt (success or failure), so we can build dashboards on purge latency and failure rates later.
  32. As a developer, I want the Cache-Control TTL values to be constants in one place (Worker + Python), so tuning them is a one-liner.
  33. As a platform operator, I want to defer Enterprise-only purge strategies (by prefix, by Cache-Tag) to a future story, so we ship on all Cloudflare plan tiers now.
  34. As a developer, I want the story to explicitly note its dependency on DM-54685, so implementation ordering is clear.

Implementation Decisions

Purge mechanism. Purge-by-hostname for editions (available on all Cloudflare plan tiers). Purge-by-URL for dashboard files. Enterprise-only mechanisms (purge-by-prefix, purge-by-Cache-Tag) are explicitly out of scope.

Protocol split. A new CdnCachePurger protocol is introduced alongside the existing EditionPublisher protocol. EditionPublisher keeps its current shape (KV pointer writes only). CdnCachePurger exposes purge_hostname(hostname) and purge_urls(urls), both async methods, and is itself an async context manager like EditionPublisher. Concrete implementation: CloudflareCachePurger hitting https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache with a Bearer token.
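A minimal sketch of how the protocol and the Cloudflare request could look, assuming the purge body uses the `hosts` and `files` keys of Cloudflare's purge_cache endpoint. The `PurgeRequest` dataclass and the `build_purge_request` helper are hypothetical names introduced only for illustration. The HTTP call itself is left out so that the request shape can be checked without a network.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, Sequence


class CdnCachePurger(Protocol):
    """Assumed shape of the purger protocol: an async context manager
    exposing the two purge methods named in this PRD."""

    async def __aenter__(self) -> "CdnCachePurger": ...

    async def __aexit__(self, *exc_info: object) -> None: ...

    async def purge_hostname(self, hostname: str) -> None: ...

    async def purge_urls(self, urls: Sequence[str]) -> None: ...


@dataclass(frozen=True)
class PurgeRequest:
    """Hypothetical value object describing one purge_cache POST."""

    url: str
    headers: dict[str, str]
    json: dict[str, list[str]]


def build_purge_request(
    zone_id: str,
    api_token: str,
    *,
    hosts: Sequence[str] = (),
    files: Sequence[str] = (),
) -> PurgeRequest:
    """Build the POST for either a hostname purge or a URL-list purge."""
    if bool(hosts) == bool(files):
        raise ValueError("Provide exactly one of hosts or files")
    body = {"hosts": list(hosts)} if hosts else {"files": list(files)}
    return PurgeRequest(
        url=f"https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache",
        headers={"Authorization": f"Bearer {api_token}"},
        json=body,
    )
```

Keeping request construction separate from transport is one way to make the "what URL, what body, what auth header" assertions from the testing section cheap to write.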

Factory and graceful degradation. A create_cdn_cache_purger() factory resolves a purger from the org's cloudflare_workers service config. When zone_id is missing (the workers.dev case), it returns a NoopCdnCachePurger that logs at INFO and does nothing on purge_hostname/purge_urls. This keeps zone_id out of _REQUIRED_CONFIG_KEYS and avoids any error path for test orgs.
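The degradation path could be sketched as follows. The factory signature (a provider string, a config mapping, and a token) and the logger name are assumptions, since this PRD does not fix them. `CloudflareCachePurger` is reduced to a placeholder that only stores its settings.

```python
import asyncio
import logging
from typing import Mapping, Sequence

logger = logging.getLogger("docverse.cdn")  # logger name is an assumption


class NoopCdnCachePurger:
    """Purger for orgs without a zone (e.g. workers.dev): logs and returns."""

    async def __aenter__(self) -> "NoopCdnCachePurger":
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        return None

    async def purge_hostname(self, hostname: str) -> None:
        logger.info("No zone_id configured; skipping purge of %s", hostname)

    async def purge_urls(self, urls: Sequence[str]) -> None:
        logger.info("No zone_id configured; skipping purge of %d URLs", len(urls))


class CloudflareCachePurger:
    """Placeholder for the real purger; HTTP calls are omitted here."""

    def __init__(self, *, zone_id: str, api_token: str) -> None:
        self.zone_id = zone_id
        self.api_token = api_token


def create_cdn_cache_purger(
    provider: str, config: Mapping[str, str], api_token: str
):
    """Return a real purger when zone_id is set, otherwise the no-op."""
    if provider != "cloudflare_workers":
        raise ValueError(f"Unsupported provider: {provider}")
    zone_id = config.get("zone_id")
    if not zone_id:
        return NoopCdnCachePurger()
    return CloudflareCachePurger(zone_id=zone_id, api_token=api_token)
```

Because both return values answer to the same method names, callers never need to check for zone_id themselves.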

Purge orchestration. EditionPublishingService.publish() is extended to:

  1. Call publisher.publish() (unchanged).
  2. Compute cache_profile from the edition's EditionKind.
  3. If profile is "long", resolve a CdnCachePurger via a new CdnCachePurgerProvider callable (mirroring EditionPublisherProvider), compute the project hostname from the org, and call purge_hostname().
  4. Log and swallow purge exceptions — publish does not fail on purge failure.
  5. Mark edition/history as published.

Short-profile editions skip step 3 entirely, so no purge API call is made.
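The steps above can be sketched as a single coroutine. This is an illustration under stated assumptions, not the actual EditionPublishingService code: `publish_edition`, `FakePublisher`, and `FakePurger` are hypothetical names, the cache profile and hostname are passed in rather than computed, and the returned string stands in for marking the edition and history as published.

```python
import asyncio
import logging

logger = logging.getLogger("docverse.publishing")  # logger name is an assumption


class FakePublisher:
    """Stand-in for EditionPublisher: records KV pointer writes."""

    def __init__(self):
        self.published = []

    async def publish(self, edition_slug, build_id):
        self.published.append((edition_slug, build_id))


class FakePurger:
    """Stand-in for CdnCachePurger that can simulate a Cloudflare failure."""

    def __init__(self, fail=False):
        self.fail = fail
        self.hostnames = []

    async def purge_hostname(self, hostname):
        if self.fail:
            raise RuntimeError("simulated Cloudflare 500")
        self.hostnames.append(hostname)


async def publish_edition(
    *, publisher, purger, cache_profile, hostname, edition_slug, build_id
):
    # Step 1: write the KV pointer (unchanged behavior).
    await publisher.publish(edition_slug, build_id)
    # Steps 2-3: only long-profile editions trigger a hostname purge.
    if cache_profile == "long":
        try:
            await purger.purge_hostname(hostname)
        except Exception:
            # Step 4: log with context and swallow; the publish must not fail.
            logger.exception(
                "CDN purge failed for edition=%s build=%s hostname=%s",
                edition_slug, build_id, hostname,
            )
    # Step 5: stand-in for marking edition/history as published.
    return "published"
```

A failing purger still yields a published edition, and a short-profile edition never touches the purger at all.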

Cache profile derivation. Pure function compute_cache_profile(edition_kind: EditionKind) -> str:

  • EditionKind.main → "long"
  • EditionKind.release → "long"
  • all others → "short"
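A minimal sketch of this mapping follows. Only the `main` and `release` members are named in this PRD. The other `EditionKind` members shown here (`draft`, `major`, `minor`) are placeholders for illustration and may not match the real enum.

```python
from enum import Enum


class EditionKind(Enum):
    main = "main"
    release = "release"
    # The members below are hypothetical stand-ins for "all others".
    draft = "draft"
    major = "major"
    minor = "minor"


# Kinds whose content is read-heavy and changes rarely.
_LONG_PROFILE_KINDS = frozenset({EditionKind.main, EditionKind.release})


def compute_cache_profile(edition_kind: EditionKind) -> str:
    """Map an edition kind to its cache profile, defaulting to "short"."""
    return "long" if edition_kind in _LONG_PROFILE_KINDS else "short"
```

Defaulting to "short" means any kind added to the enum later is cached conservatively until someone opts it into the long profile.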

Cache TTL values. Worker emits:

  • "long": Cache-Control: public, max-age=300, s-maxage=86400 (5 min browser, 1 day edge)
  • "short": Cache-Control: public, max-age=60 (current behavior)

TTL constants live in one module each on the Python side (for consistency in logs/metrics) and the TypeScript side (Worker).
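On the Python side, the constants module might look like the sketch below. The constant and function names are assumptions. The Worker would hold an equivalent table in TypeScript, which this Python mirror only illustrates.

```python
from typing import Optional

# TTLs in seconds; values taken from the profiles listed above.
LONG_BROWSER_TTL = 300      # 5 minutes in the browser
LONG_EDGE_TTL = 86400       # 1 day at the Cloudflare edge
SHORT_TTL = 60              # current behavior

CACHE_CONTROL_BY_PROFILE = {
    "long": f"public, max-age={LONG_BROWSER_TTL}, s-maxage={LONG_EDGE_TTL}",
    "short": f"public, max-age={SHORT_TTL}",
}


def cache_control_for(profile: Optional[str]) -> str:
    """Return the Cache-Control value, falling back to the short profile."""
    return CACHE_CONTROL_BY_PROFILE.get(
        profile or "short", CACHE_CONTROL_BY_PROFILE["short"]
    )
```

With the header strings derived from the three constants, tuning a TTL is a one-line change.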

KV value schema extension. The KV JSON payload written by the publisher gains a cache_profile field of type "long" | "short". This is additive — if the Worker encounters a KV value without cache_profile (e.g., during deploy transition), it defaults to "short", which is safe.
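To illustrate why the change is additive, the sketch below writes and reads a KV value. The helper names and the `build_id` key used in the usage example are hypothetical. Only `cache_profile` and its two allowed values come from this PRD.

```python
import json

VALID_PROFILES = ("long", "short")


def with_cache_profile(payload: dict, cache_profile: str) -> str:
    """Serialize an existing KV payload with the new field added."""
    if cache_profile not in VALID_PROFILES:
        raise ValueError(f"Unknown cache profile: {cache_profile!r}")
    return json.dumps({**payload, "cache_profile": cache_profile})


def read_cache_profile(raw_kv_value: str) -> str:
    """Read the profile as the Worker would, treating a missing or
    unrecognized value as "short"."""
    value = json.loads(raw_kv_value)
    profile = value.get("cache_profile")
    return profile if profile in VALID_PROFILES else "short"
```

For example, `read_cache_profile('{"build_id": "b1"}')` yields the short profile, which is the behavior an old KV value would get during the deploy transition.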

Hostname derivation. Pure function compute_project_hostname(org: Organization, project_slug: str) -> str:

  • If org.url_scheme == subdomain: {maybe_rewritten_slug}.{org.base_domain} — honors slug_rewrite_rules if any.
  • If org.url_scheme == path_prefix: org.base_domain.
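The two branches could be sketched as below. The `Organization` dataclass is a simplified stand-in for the real model, and representing slug_rewrite_rules as (regex pattern, replacement) pairs is an assumption made for illustration, since this PRD does not define the rule format.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class UrlScheme(Enum):
    subdomain = "subdomain"
    path_prefix = "path_prefix"


@dataclass(frozen=True)
class Organization:
    """Simplified stand-in for the real organization model."""

    base_domain: str
    url_scheme: UrlScheme
    # Assumed format: (regex pattern, replacement) pairs applied in order.
    slug_rewrite_rules: Tuple[Tuple[str, str], ...] = ()


def compute_project_hostname(org: Organization, project_slug: str) -> str:
    """Return the hostname whose cache a project's publish should purge."""
    if org.url_scheme is UrlScheme.path_prefix:
        # All projects share the base domain under a path prefix.
        return org.base_domain
    slug = project_slug
    for pattern, replacement in org.slug_rewrite_rules:
        slug = re.sub(pattern, replacement, slug)
    return f"{slug}.{org.base_domain}"
```

Note that under path_prefix the purge scope is the whole base domain, so a hostname purge there affects every project in the org.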

Service config. zone_id remains an optional key on the cloudflare_workers service config. It is not added to _REQUIRED_CONFIG_KEYS. The existing api_token credential is reused for purge API calls.

Dashboard purge. The dashboard-generation service (from DM-54685) will, on completing a dashboard rebuild, resolve a CdnCachePurger and invoke purge_urls() with the specific dashboard URLs for that project/org. This PRD defines the purger interface and the purge_urls integration point; the set of URLs and the generation trigger are determined by DM-54685.

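Failure handling. Purge failures are logged at ERROR with full context (org, project, edition, build, zone, status code, response body). The purger raises on failure, and EditionPublishingService.publish() catches and swallows the exception so the publish itself never fails. Retries are not implemented in this story.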

Worker code change. cloudflare-worker/src/resolver.ts reads cache_profile from the parsed KV value and selects the matching Cache-Control string. types.ts extends EditionMapping with the new field.

Testing Decisions

A good test targets external observable behavior — what URL the purger hits, what body it sends, what Cache-Control the Worker emits — and does not assert on private state, call counts on mocked internals, or exception stack traces.

Python modules with tests:

  • compute_cache_profile — parameterized test across every EditionKind value.
  • compute_project_hostname, covering the subdomain and path_prefix schemes, with and without slug_rewrite_rules applied.
  • CloudflareCachePurger — via mocked httpx.AsyncClient transport: verify POST target, body for hostname purge, body for URL-list purge, Bearer auth header, behavior on 4xx/5xx responses (logs error, raises for caller to swallow).
  • NoopCdnCachePurger — calls return None without side-effects.
  • create_cdn_cache_purger — returns a real CloudflareCachePurger when zone_id is present; returns NoopCdnCachePurger when absent; raises on unsupported provider.
  • EditionPublishingService.publish() — integration-style test with fake publisher and fake purger. Assert: purger called exactly once for main/release editions; purger not called for other kinds; purge failure does not propagate; edition/history transitioned to published in both cases.

Worker module with tests (TypeScript, cloudflare-worker/test/):

  • Worker emits Cache-Control: public, max-age=300, s-maxage=86400 when KV value carries cache_profile: "long".
  • Worker emits Cache-Control: public, max-age=60 when cache_profile is "short" or absent.
  • ETag and Content-Type behavior unchanged across both profiles.

Prior art:

  • tests/storage/editionpublisher/ — existing tests for the KV publisher (use same httpx mocking patterns).
  • tests/services/test_edition_publishing.py — existing orchestrator tests (extend with purger fakes).
  • tests/storage/editionpublisher/factory_test.py — factory validation patterns.
  • cloudflare-worker/test/resolver.test.ts — existing Worker tests (extend for cache_profile branches).

Out of Scope

  • Purge-by-prefix (Cloudflare Enterprise).
  • Purge-by-Cache-Tag (Cloudflare Enterprise) and the corresponding Worker Cache-Tag emission.
  • Generation of project/edition dashboard files themselves (tracked in DM-54685) — this PRD only integrates with them via CdnCachePurger.purge_urls().
  • Automatic retry of failed purges (first pass is log-and-move-on).
  • Operator-facing API to trigger manual cache purges.
  • Per-organization or per-edition overrides of the cache-profile mapping.
  • Fingerprinted or content-hashed URLs for browser cache busting.
  • Metrics/alerting on purge latency or failure rate (can be layered on the structured logs later).
  • Rollback-triggered purges — the same EditionPublishingService call path is expected to cover rollback-as-publish, but no separate rollback-specific purge semantics are added here.

Further Notes

  • Dependency on DM-54685. That ticket creates the actual dashboard files. This story defines the purge_urls() interface the dashboard service will call, but end-to-end testing of dashboard purging is blocked on DM-54685 landing first. Recommend scheduling DM-54685 before (or in parallel with) implementation of this story.
  • Workers.dev sandboxes. The flagship Docverse org will use a Cloudflare zone; test and sandbox orgs running on workers.dev subdomains have no zone and therefore cannot purge. The no-op purger makes this invisible to callers.
  • Future Enterprise upgrade path. If the Cloudflare plan is upgraded to Enterprise, CloudflareCachePurger can be swapped (or internally branched) to use purge-by-prefix or purge-by-tag. Because the protocol exposes only purge_hostname and purge_urls, callers don't need to change.
  • Cache TTL tuning. The chosen values (max-age=300, s-maxage=86400 for long; max-age=60 for short) are a starting point informed by SQR-112's guidance. They can be tuned post-rollout based on observed R2 egress / Worker invocation metrics.
  • Additive KV schema. Because the Worker defaults to "short" when cache_profile is missing, rollout can proceed safely: deploy the new Worker first (still-safe default), then deploy the Python change that starts writing cache_profile.
