Add ADR to precompute API cache for resilience #677
Draft
edavey wants to merge 1 commit into main from add-adr-14-to-precompute-api-cache-for-resilience
+172
−0
172 changes: 172 additions & 0 deletions
docs/architecture/decisions/0014-precompute-api-cache-for-resilience.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,172 @@ | ||
| # 14. Precompute API cache for resilience | ||
|
|
||
| Date: 2026-05-19 | ||
|
|
||
| Status: Draft | ||
|
|
||
| ## Context | ||
|
|
||
| [ADR 13: Add an API for Content Block Manager][] establishes that Content Block Manager (CBM) will expose a | ||
| rendering API (`POST /api/blocks/render`) which Publishing API and other applications will call instead of rendering | ||
| content blocks locally via the `content_block_tools` gem. | ||
|
|
||
| This will introduce a network dependency where none exists today. There are two republication scenarios that | ||
| exercise this dependency differently: | ||
|
|
||
| ### 1. Routine republishing: a new edition of a block | ||
|
|
||
| When an editor updates a content block, all documents embedding it are republished with freshly rendered HTML. | ||
| Typically this affects 5-100 documents, each currently containing 1-5 embed codes, though as Content Block Manager | ||
| becomes more popular these counts will increase significantly. This republishing must complete promptly so editors see their changes propagate. | ||
|
|
||
| ### 2. Rare bulk republishing: publishing app performs mass republishing | ||
|
|
||
| From time to time, a publishing app like Whitehall republishes a large category of its documents. Whitehall has approximately 500k documents of which ~400k are live. A category for republication might be "all publications" (~250k documents). See [Whitehall's republishing helper][whitehall-republishing]. | ||
|
|
||
| In a few years, once content blocks are widely adopted, a mass republication might involve 100k documents with between 1 and 50 embed codes each — up to 5 million embed code resolutions in a single burst. | ||
|
|
||
| ### The publish-render mechanism | ||
|
|
||
| Each republication triggers a `DownstreamLiveJob` in Publishing API, which calls `ContentEmbedPresenter` to render | ||
| embedded content. Today this renders locally via the gem (sub-millisecond). Under the new architecture, each job | ||
| will make a network call to CBM's render API. | ||
|
|
||
| ### Objective | ||
|
|
||
| This ADR proposes a strategy to ensure the render API is: | ||
|
|
||
| 1. performant under both routine and bulk republication, and | ||
| 2. resilient to transient failures | ||
|
|
||
| ## Decision | ||
|
|
||
| ### 1. Precompute rendered output on publish, with lazy fallback | ||
|
|
||
| When an edition is published, CBM will render all known embed code variants for that block and store the results in | ||
| cache. The render API will resolve each embed code from cache. Cache misses will fall back to live rendering and | ||
| will be cached for subsequent requests. | ||
|
|
||
| #### Cache granularity | ||
|
|
||
| The grain of the cache will be the individual embed code, rather than the entire HTTP request. A | ||
| `POST /api/blocks/render` request will contain an arbitrary combination of embed codes (whichever appear in a | ||
| particular document). Caching whole requests would result in many misses since most documents will have a unique | ||
| combination of embed codes. Caching per embed code means a rendered | ||
| `{{embed:content_block_pension:state-pension/rates/weekly-rate/amount}}` is stored once and reused across all | ||
| documents that embed it, e.g. | ||
|
|
||
| ``` | ||
| Cache key: "render:#{embed_code}:#{edition_id}" | ||
| TTL: no expiry (invalidated explicitly on publish) | ||
| ``` | ||
|
|
||
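A minimal sketch of per-embed-code caching with lazy fallback, assuming a plain Hash stands in for `Rails.cache` and the renderer is any callable. `EmbedCodeCache` and its method names are hypothetical and not taken from CBM; only the key format comes from the ADR text above.

```ruby
# Hypothetical sketch: a per-embed-code cache with lazy fallback on a miss.
class EmbedCodeCache
  attr_reader :hits, :misses

  # store:    Hash-like object (stand-in for a shared cache such as Redis)
  # renderer: callable taking (embed_code, edition_id) and returning HTML
  def initialize(renderer:, store: {})
    @renderer = renderer
    @store = store
    @hits = 0
    @misses = 0
  end

  # Key format as given in the ADR.
  def key_for(embed_code, edition_id)
    "render:#{embed_code}:#{edition_id}"
  end

  # Return the cached HTML, rendering and caching it on a miss.
  def fetch(embed_code, edition_id)
    key = key_for(embed_code, edition_id)
    if @store.key?(key)
      @hits += 1
      return @store[key]
    end

    @misses += 1
    @store[key] = @renderer.call(embed_code, edition_id)
  end

  # Resolve each embed code in a request independently.
  # codes_with_editions: Hash of embed_code => edition_id
  def render_all(codes_with_editions)
    codes_with_editions.to_h { |code, edition_id| [code, fetch(code, edition_id)] }
  end
end
```

Two documents that share an embed code would trigger a single render under this sketch; the second lookup is served from the store.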
| #### Cache invalidation | ||
|
|
||
| When a new edition is published, we will bust the cache for that block's embed codes and repopulate it via a | ||
| precompute routine. | ||
|
|
||
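One possible reading of the invalidation step, sketched under assumptions: because the key format above includes the edition id, a new edition writes to fresh keys, so "busting" amounts to deleting the previous edition's entries. The method name, its parameters, and the Hash store are all hypothetical.

```ruby
# Hypothetical publish hook: write entries for the new edition, then drop
# the previous edition's entries for the same embed codes.
def precompute_on_publish(store:, embed_codes:, new_edition_id:, renderer:, old_edition_id: nil)
  embed_codes.each do |code|
    store["render:#{code}:#{new_edition_id}"] = renderer.call(code, new_edition_id)

    # Only delete when there is a distinct previous edition.
    if old_edition_id && old_edition_id != new_edition_id
      store.delete("render:#{code}:#{old_edition_id}")
    end
  end
  store
end
```

Writing the new entry before deleting the old one means a concurrent reader never finds both editions missing.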
| #### Rationale | ||
|
|
||
| CBM owns the schemas (including field paths and format metadata), so it knows all valid embed code variants at publish time. The number of unique embed codes remains small relative to the number of documents that reference them — this asymmetry will make caching effective. | ||
|
|
||
| ### 2. Use Redis as the cache backing store | ||
|
|
||
| We will use `ActiveSupport::Cache::RedisCacheStore` via `Rails.cache`. Redis is already in use for Sidekiq, so this | ||
| adds no new infrastructure. Because the cache is shared, the precompute process populates it once and all CBM processes/pods benefit immediately. | ||
|
|
||
| ### 3. Rely on Sidekiq retry for resilience (initially) | ||
|
|
||
| If the render API is unavailable or times out, `DownstreamLiveJob` in Publishing API will fail and retry via | ||
| Sidekiq's built-in exponential backoff (25 retries over ~21 days). This provides resilience without application-level retry logic. | ||
|
|
||
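As a rough check on the "25 retries over ~21 days" figure, the sketch below sums the deterministic part of the backoff delay described in Sidekiq's error-handling documentation, `retry_count**4 + 15` seconds. The random jitter term is omitted, and the exact formula should be confirmed against the Sidekiq version in use.

```ruby
# Deterministic part of Sidekiq's documented retry delay (jitter omitted).
MAX_RETRIES = 25
SECONDS_PER_DAY = 86_400.0

def base_delay_seconds(retry_count)
  (retry_count**4) + 15
end

delays = (0...MAX_RETRIES).map { |n| base_delay_seconds(n) }
total_seconds = delays.sum
total_days = total_seconds / SECONDS_PER_DAY

puts format("first retry after %ds, last gap %.1f days, total %.1f days",
            delays.first, delays.last / SECONDS_PER_DAY, total_days)
```

Most of the total comes from the last few retries, so a short CBM outage would be absorbed by the early, closely spaced attempts.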
| More sophisticated patterns (e.g. client retry, circuit breaker) are noted as future options if monitoring reveals frequent transient failures causing unacceptable republication delays. However, the complexity of implementing and tuning these strategies may not be justified. | ||
|
|
||
| ### 4. Strategy pattern for gem migration | ||
|
|
||
| We will retain the `ContentBlockTools::ContentBlock#render` interface, which applications (Publishing API, Whitehall, | ||
| Mainstream Publisher, Smart Answers) are using via the `content_block_tools` gem. We will use a "strategy" pattern to switch between: | ||
|
|
||
| - `LegacyRenderingStrategy` — current behaviour (local gem rendering) | ||
| - `ApiRenderingStrategy` — calls CBM's render API via `gds-api-adapters` | ||
|
|
||
| `ContentEmbedPresenter` in Publishing API will continue to call `ContentBlock#render` unchanged. The switch is a configuration change, not a code change in consuming apps. | ||
|
|
||
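The switch described above could look roughly like this. It is a sketch only: the two strategy class names echo the ADR, but their bodies, `SketchContentBlock`, `strategy_for`, and the `"legacy"`/`"api"` config values are hypothetical, and the callables stand in for the gem's local renderer and a `gds-api-adapters` client.

```ruby
# Hypothetical sketch of the strategy switch; not the gem's actual code.
class LegacyRenderingStrategy
  def initialize(local_renderer)
    @local_renderer = local_renderer
  end

  def render(embed_code)
    @local_renderer.call(embed_code)
  end
end

class ApiRenderingStrategy
  def initialize(api_client)
    @api_client = api_client
  end

  def render(embed_code)
    @api_client.call(embed_code)
  end
end

# Callers use #render and never see which strategy is active.
class SketchContentBlock
  def initialize(embed_code, strategy:)
    @embed_code = embed_code
    @strategy = strategy
  end

  def render
    @strategy.render(@embed_code)
  end
end

# Pick a strategy from a config value (for example an environment variable).
def strategy_for(name, local_renderer:, api_client:)
  case name
  when "legacy" then LegacyRenderingStrategy.new(local_renderer)
  when "api" then ApiRenderingStrategy.new(api_client)
  else raise ArgumentError, "unknown rendering strategy: #{name}"
  end
end
```

Since both strategies expose the same `render` method, rolling back during the transition period is a matter of changing the config value.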
| #### Migration sequence | ||
|
|
||
| 1. Deploy with `LegacyRenderingStrategy` (no change to current behaviour) | ||
| 2. Switch to `ApiRenderingStrategy` via config/feature flag | ||
| 3. Monitor: compare render output, latency, error rates | ||
| 4. Once proven, remove `LegacyRenderingStrategy` and all rendering code from the gem (ViewComponents, FieldPresenters, etc) | ||
|
|
||
| ### 5. Observability | ||
|
|
||
| We will include instrumentation on the rendering API to monitor: | ||
|
|
||
| - **Per-call latency**: P50, P95, P99 response times, distinguishing cache hits from cache misses | ||
| - **Cache hit rate**: ratio of cache hits to total lookups | ||
| - **Error rate**: 5xx responses, timeouts | ||
|
|
||
| These metrics should provide early warning of degradation and inform decisions: | ||
|
|
||
| - in the transition period: whether to switch from the `ApiRenderingStrategy` back to the `LegacyRenderingStrategy` | ||
| - once fully migrated: whether to investigate client-level retry or circuit breaker patterns | ||
|
|
||
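To illustrate the latency and hit-rate metrics listed above, here is a small sketch using the nearest-rank percentile method. The helper names are hypothetical; in practice these figures would more likely come from the metrics backend than from application code.

```ruby
# Nearest-rank percentile over a list of latency samples (any numeric unit).
def percentile(samples, pct)
  raise ArgumentError, "no samples" if samples.empty?

  sorted = samples.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Ratio of cache hits to total lookups; 0.0 when there were no lookups.
def hit_rate(hits, misses)
  total = hits + misses
  total.zero? ? 0.0 : hits.to_f / total
end
```

Tracking percentiles separately for hits and misses, as the ADR proposes, keeps a small number of slow misses from being hidden inside an overall average.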
| ## Consequences | ||
| ### Performance characteristics | ||
|
|
||
| With precomputed caching and Redis we expect to observe: | ||
|
|
||
| - **Cache hit (expected case):** Redis lookup + network round-trip ≈ 5-20ms | ||
| - **Cache miss (rare, lazy fallback):** DB query + component render + cache write ≈ 50-200ms | ||
|
|
||
| **Routine republishing** (5-100 documents after a block edit): will complete in seconds. Precompute will run at | ||
| publish time, so all embed codes for the updated block will have been cached before republication begins. This should result in a near-100% cache hit rate. | ||
|
|
||
| **Bulk republishing** (100k documents, 1-50 embed codes each, 20 Sidekiq threads): the cache should be effective | ||
| because the number of **unique** embed codes is small relative to the number of documents. However, bulk | ||
| republishing is not triggered by a block being published, so the precompute process may not have run for every block | ||
| referenced. Cache misses will be filled lazily and cached for subsequent requests. In the worst case (cold cache), the first request for each unique embed code will take 50-200ms. Subsequent requests will hit cache at 5-20ms. | ||
|
|
||
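A back-of-envelope estimate for the bulk case, using the figures quoted in this ADR. It rests on loud assumptions that are not in the source: one render call per document, per-call latency independent of how many embed codes the call carries, perfect parallelism across the 20 threads, and a purely hypothetical count of 2,000 unique embed codes.

```ruby
DOCUMENTS = 100_000
MAX_EMBEDS_PER_DOC = 50
SIDEKIQ_THREADS = 20
HIT_LATENCY_S = (0.005..0.020)   # cache hit, from the ADR
MISS_LATENCY_S = (0.050..0.200)  # cache miss, from the ADR
ASSUMED_UNIQUE_EMBED_CODES = 2_000 # hypothetical, for illustration only

# Upper bound on embed code resolutions in one burst.
max_resolutions = DOCUMENTS * MAX_EMBEDS_PER_DOC # 5_000_000

# Wall-clock time if calls are spread evenly over the worker threads.
def wall_clock_seconds(calls, latency_s, threads)
  calls * latency_s / threads
end

warm_best = wall_clock_seconds(DOCUMENTS, HIT_LATENCY_S.begin, SIDEKIQ_THREADS)
warm_worst = wall_clock_seconds(DOCUMENTS, HIT_LATENCY_S.end, SIDEKIQ_THREADS)
cold_worst = wall_clock_seconds(ASSUMED_UNIQUE_EMBED_CODES, MISS_LATENCY_S.end, SIDEKIQ_THREADS)

puts "max resolutions: #{max_resolutions}"
puts format("render time, warm cache: %.0f-%.0f s", warm_best, warm_worst)
puts format("cold-cache fill, worst case: about %.0f s", cold_worst)
```

Under these assumptions the time spent in the render API stays at the scale of minutes even with a cold cache, because the number of misses is bounded by the number of unique embed codes and not by the number of documents.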
| ### What stays in the gem vs moves to CBM | ||
|
|
||
| Once the API is proven and the legacy strategy is removed: | ||
|
|
||
| **Stays in the gem** (thin client): | ||
| - `ContentBlockReference` — embed code detection/parsing | ||
| - `EmbedCode`, `InternalContentPath`, `Format` — supporting parse classes | ||
| - `ApiRenderingStrategy` — HTTP client wrapping `gds-api-adapters` | ||
|
|
||
| **Moves to CBM**: | ||
| - `Renderer` and all rendering orchestration | ||
| - All ViewComponents (Contact, Pension, Tax, TimePeriod, etc.) | ||
| - All FieldPresenters | ||
| - Helpers (Govspeak, OverrideClasses) | ||
| - Engine and stylesheets | ||
|
|
||
| ### Availability dependency | ||
|
|
||
| Content Block Manager will become a runtime dependency for publishing. This is accepted per [ADR 13: Add an API for Content Block Manager][]. This ADR describes these mitigations: | ||
|
|
||
| - Precomputed cache means the API will serve from Redis, rather than by doing expensive computation on each request | ||
| - Sidekiq retry will handle transient outages automatically | ||
| - A "strategy" pattern in the `content_block_tools` gem will allow us to switch to local rendering during extended outages if needed | ||
|
|
||
| ### Future considerations | ||
|
|
||
| - **Client-level retry** (1-2 immediate attempts) could be added if monitoring shows frequent transient timeouts | ||
| causing unnecessary Sidekiq retry delays | ||
|
|
||
| - **Circuit breaker** could be added if CBM experiences sustained outages that cause Sidekiq queue buildup in Publishing API | ||
|
|
||
| - **Cache warming on deploy**: after a CBM deploy that clears Redis, a background job could re-precompute all published blocks to avoid a burst of cache misses | ||
|
|
||
| - **Long-term volume growth**: as the number of documents using content blocks grows, Sidekiq concurrency and CBM scaling may need attention. The caching strategy should remain sound, assuming that unique embed codes grow slowly relative to document count. | ||
|
|
||
| ## Related decisions | ||
|
|
||
| - [ADR 13: Add an API for Content Block Manager](0013-add-an-api-for-content-block-manager.md) | ||
|
|
||
| [whitehall-republishing]: https://github.com/alphagov/whitehall/blob/587206dbae505eb4fee4034164dbe62ab58ef6d8/app/helpers/admin/republishing_helper.rb | ||
|
|
||
| [ADR 13: Add an API for Content Block Manager]: | ||
| https://github.com/alphagov/content-block-manager/blob/9d9f52659f2c5724848795ae260ee827ce5cbdaa/docs/architecture/decisions/0013-add-an-api-for-content-block-manager.md | ||
This proposal seems to depend heavily on processing of documents in Publishing API being asynchronous. However we are moving away from this with GraphQL, where links are expanded and content blocks are embedded at the point of a user requesting the page.
Have you taken this into consideration, as it'll mean we'll need to additionally make a request to Content Block Manager when rendering content to the public?