CFODEV-1722 / CFODEV-1726: Add SignalR Redis Backplane + Worker background service#1144

Merged
Carl Sixsmith (carlsixsmith-moj) merged 25 commits into develop from feature/cloud
Jun 22, 2026

@samgibsonmoj

🔗 Related Work


📌 Summary

What does this PR do? Keep it short but clear.

Introduces the infrastructure needed to run CATS as multiple horizontally-scaled replicas behind a load balancer. Adds a Redis-backed distributed cache + SignalR backplane, a standalone Worker service that owns the Quartz scheduler, and cross-replica user-presence notifications. All of this sits behind three new feature flags that default to off, so existing single-replica deployments are completely unaffected.


🎯 Purpose / Motivation

Why does this change exist? What problem are we solving?

CATS is a Blazor Server app that currently assumes a single process. Several things break (or silently misbehave) when you run more than one replica:

  • In-process state — FusionCache and the user-presence dictionary live in one process, so each replica has its own view.
  • SignalR — without a backplane, hub messages only reach clients connected to the same replica.
  • Quartz jobs — every replica would run its own scheduler, double-firing scheduled jobs.

This PR makes CATS scale-out ready on Cloud Platform/Kubernetes while keeping the single-instance deployment path unchanged.


🧠 Approach

Key implementation details, trade-offs, or design decisions.

Everything new is opt-in via configuration. Each flag independently turns on one piece of scale-out behaviour, and when a flag is off the app falls back to the original in-process implementation. The flags are off in appsettings.json and switched on only in the Kubernetes deployment (infra/cats-deployment.yml).

1. New feature flags (Features:*)

| Flag | Default | What it does | Single-replica fallback |
| --- | --- | --- | --- |
| UseSignalRBackplane | false | Registers the Redis SignalR backplane, an IConnectionMultiplexer, the Redis-backed distributed cache/backplane for FusionCache, and RedisUsersStateContainer. | InMemoryUsersStateContainer + in-process SignalR + in-memory FusionCache. No Redis dependency. |
| UseWorkerForJobs | false | CATS stops running Quartz in-process and delegates job inspection/control to the Worker via its REST API (WorkerJobManagementService). | Quartz runs in-process in CATS (QuartzJobManagementService), exactly as today. |
| RelayUserPresenceNotifications | false | PresenceConnector subscribes to UserOnline and shows a snackbar to other users when someone in their area logs in. | Presence is still tracked, but no "user is now online" notifications are raised. |

Because each flag is read independently with a safe default, an existing single-replica environment that sets none of them behaves exactly as before — no Redis, no Worker, no extra notifications.
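
As a language-neutral illustration of the "independent flag, safe default" idea above, here is a minimal Python sketch (not the PR's C# code). The flag and service names come from the table; `read_flags`, `select_services`, and the flat `Features:Name` key layout are hypothetical helpers invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class FeatureFlags:
    # Every flag defaults to False, mirroring the "off unless configured" rule.
    use_signalr_backplane: bool = False
    use_worker_for_jobs: bool = False
    relay_user_presence_notifications: bool = False


def read_flags(config: Mapping[str, str]) -> FeatureFlags:
    """Read each flag independently; a missing key means 'off' (hypothetical helper)."""
    def flag(name: str) -> bool:
        return config.get(f"Features:{name}", "false").strip().lower() == "true"

    return FeatureFlags(
        use_signalr_backplane=flag("UseSignalRBackplane"),
        use_worker_for_jobs=flag("UseWorkerForJobs"),
        relay_user_presence_notifications=flag("RelayUserPresenceNotifications"),
    )


def select_services(flags: FeatureFlags) -> Dict[str, object]:
    """Pick the implementation each flag switches between (names from the table)."""
    return {
        "presence": "RedisUsersStateContainer"
        if flags.use_signalr_backplane else "InMemoryUsersStateContainer",
        "jobs": "WorkerJobManagementService"
        if flags.use_worker_for_jobs else "QuartzJobManagementService",
        "presence_notifications": flags.relay_user_presence_notifications,
    }
```

With an empty configuration, `select_services(read_flags({}))` selects the in-memory presence container and in-process Quartz, which is the single-replica path described above.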

2. Redis distributed cache + SignalR backplane

  • When UseSignalRBackplane is on:
    • SignalR uses AddStackExchangeRedis with channel prefix Cats so hub messages fan out across replicas.
    • FusionCache is upgraded from a pure in-memory cache to L1 (memory) + L2 (Redis distributed cache) + Redis backplane, keeping caches coherent across replicas.
    • RedisUsersStateContainer stores presence in a Redis hash (cats:presence:connections) and announces changes on a pub/sub channel (cats:presence:changed), so a login on replica A refreshes the Users grid and notifications on replica B.
  • A dedicated IConnectionMultiplexer is registered for presence pub/sub, reusing the same redis connection string the backplane validates.
  • SmartEnum serialization fix — distributed (Redis) caching forced proper serialization of SmartEnum values; cache serializer options are now isolated (CacheJsonSerializerOptions) with dedicated converter tests.
  • Redis is provisioned as an ephemeral container in Aspire and as a redis-service in Kubernetes — it's a cache/transport, not a source of truth.
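
The hash-plus-pub/sub presence pattern described in this section can be sketched roughly as follows. This is an illustrative Python model, not the PR's RedisUsersStateContainer: `FakeRedis` is an in-memory stand-in for a shared Redis instance, and the field layout (connection id mapped to user id) and the message payload are assumptions. Only the key and channel names are taken from the description.

```python
from collections import defaultdict
from typing import Callable, Dict, List

PRESENCE_HASH = "cats:presence:connections"   # key name from the PR description
PRESENCE_CHANNEL = "cats:presence:changed"    # channel name from the PR description


class FakeRedis:
    """In-memory stand-in for the shared Redis instance (assumption for the sketch)."""

    def __init__(self) -> None:
        self._hashes: Dict[str, Dict[str, str]] = defaultdict(dict)
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def hset(self, key: str, field: str, value: str) -> None:
        self._hashes[key][field] = value

    def hdel(self, key: str, field: str) -> None:
        self._hashes[key].pop(field, None)

    def hgetall(self, key: str) -> Dict[str, str]:
        return dict(self._hashes[key])

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> None:
        for handler in list(self._subscribers[channel]):
            handler(message)


class Replica:
    """One app replica: writes presence to the hash and reacts to change messages."""

    def __init__(self, name: str, redis: FakeRedis) -> None:
        self.name = name
        self.redis = redis
        self.refreshes: List[List[str]] = []  # snapshots of online users per refresh
        redis.subscribe(PRESENCE_CHANNEL, self._on_presence_changed)

    def _on_presence_changed(self, _user_id: str) -> None:
        # Re-read the shared hash so every replica sees the same online users.
        online = sorted(set(self.redis.hgetall(PRESENCE_HASH).values()))
        self.refreshes.append(online)

    def connect(self, connection_id: str, user_id: str) -> None:
        self.redis.hset(PRESENCE_HASH, connection_id, user_id)
        self.redis.publish(PRESENCE_CHANNEL, user_id)

    def disconnect(self, connection_id: str) -> None:
        user_id = self.redis.hgetall(PRESENCE_HASH).get(connection_id, "")
        self.redis.hdel(PRESENCE_HASH, connection_id)
        self.redis.publish(PRESENCE_CHANNEL, user_id)
```

In this model, a `connect` on replica A causes replica B to re-read the shared hash, which corresponds to the "login on replica A refreshes the Users grid on replica B" behaviour described above.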

3. Worker service (src/Worker)

  • New minimal-API ASP.NET Core project that hosts the Quartz scheduler out-of-process, registered via AddWorkerInfrastructure (a slimmed-down infrastructure registration with no web/Rebus-consumer/full-Identity baggage).
  • Exposes a small REST surface consumed by CATS when UseWorkerForJobs is on:
    • GET /api/jobs, POST /api/jobs/{name}/trigger|pause|resume
    • GET /api/scheduler, POST /api/scheduler/standby|start
  • CATS resolves the Worker base URL via Aspire service discovery (https+http://cats-worker) or WorkerOptions:BaseUrl for non-Aspire/Kubernetes (http://cats-worker-service:8080).
  • Running the scheduler in a single Worker means scheduled jobs fire once regardless of how many CATS UI replicas are running.
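
To make the REST surface above concrete, here is a small Python sketch that builds the method and URL for each call. The routes and the Kubernetes base URL are taken from the list above; the helper names `job_request` and `scheduler_request`, and the choice to URL-encode job names, are assumptions for illustration rather than part of WorkerJobManagementService.

```python
from typing import Optional, Tuple
from urllib.parse import quote

JOB_ACTIONS = {"trigger", "pause", "resume"}
SCHEDULER_ACTIONS = {"standby", "start"}


def job_request(base_url: str, name: Optional[str] = None,
                action: Optional[str] = None) -> Tuple[str, str]:
    """Return (HTTP method, URL) for listing jobs or acting on one job."""
    base = base_url.rstrip("/")
    if name is None and action is None:
        return ("GET", f"{base}/api/jobs")
    if name is None or action not in JOB_ACTIONS:
        raise ValueError(f"unsupported job request: name={name!r}, action={action!r}")
    # Encoding the job name is an assumption; the PR does not state how names are escaped.
    return ("POST", f"{base}/api/jobs/{quote(name, safe='')}/{action}")


def scheduler_request(base_url: str, action: Optional[str] = None) -> Tuple[str, str]:
    """Return (HTTP method, URL) for reading or controlling the scheduler."""
    base = base_url.rstrip("/")
    if action is None:
        return ("GET", f"{base}/api/scheduler")
    if action not in SCHEDULER_ACTIONS:
        raise ValueError(f"unsupported scheduler action: {action!r}")
    return ("POST", f"{base}/api/scheduler/{action}")
```

For example, `job_request("http://cats-worker-service:8080", "Sync Job", "pause")` yields a POST to `/api/jobs/Sync%20Job/pause` on the Worker (the job name here is made up).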

👀 Reviewers please focus on: the DI branching in Server.UI/DependencyInjection.cs, Infrastructure/DependencyInjection.cs (AddInfrastructure job branch + the FusionCache/backplane block), and the Worker's job-management API contract.


🔄 Changes

  • Added:
    • src/Worker/* — standalone Quartz Worker service (Program, csproj, appsettings, launchSettings).
    • WorkerJobManagementService + IJobManagementService remote implementation.
    • RedisUsersStateContainer / InMemoryUsersStateContainer split behind IUsersStateContainer.
    • Redis distributed cache + backplane wiring for FusionCache; CacheJsonSerializerOptions + SmartEnumJsonConverterFactoryTests.
    • Three feature flags: UseWorkerForJobs, UseSignalRBackplane, RelayUserPresenceNotifications.
    • Aspire Redis container + Kubernetes redis-service and cats-worker deployment.
  • Updated:
    • Server.UI/DependencyInjection.cs — conditional SignalR backplane / presence container registration.
    • Infrastructure/DependencyInjection.cs — conditional in-process vs. remote Quartz; FusionCache L2/backplane.
    • PresenceConnector.razor — flag-gated online notifications.
    • infra/cats-deployment.yml, Aspire AppHost — enable flags + provision Redis/Worker in Cloud.
    • appsettings.json — new Features entries (all default false).
  • Removed:
    • The leaky ConcurrentDictionary presence API from IUsersStateContainer.

🧪 How to Test

Step-by-step instructions to verify this works.

A. Single replica / local (flags off — regression check)

  1. Run Cats.AppHost with all three flags false (default).
  2. Confirm the app starts with no Redis required, Quartz jobs run in-process (Job Management dashboard works), and presence/online list still function within the instance.

B. Scaled out (flags on)

  1. Enable UseSignalRBackplane, UseWorkerForJobs, RelayUserPresenceNotifications and run multiple CATS replicas + Redis + the Worker (Aspire or Kubernetes).
  2. Log in as User A on replica 1 and User B on replica 2.
  3. Confirm User B sees a "{A} is now online" snackbar and the Users grid online status updates across replicas.
  4. Trigger/pause/resume a job from the CATS Job Management dashboard and confirm it is actioned in the Worker (and fires only once).

Expected result:

Single-replica behaviour is identical to main. With flags on, presence/cache stay coherent across replicas and scheduled jobs run exactly once in the Worker.


📸 Screenshots / Output (if applicable)

UI changes, logs, API responses, etc.

N/A


⚠️ Risks & Impact

  • Breaking change
  • Database change
  • Performance impact
  • Security impact

Details:

New flags default to off, so this is non-breaking for existing single-replica deployments. When the backplane is on, a redis connection string is required (fails fast with a clear message if missing). Cache values now cross a serialization boundary (Redis) — SmartEnum serialization is covered by new tests, but reviewers should sanity-check other cached types.
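
The fail-fast rule for the redis connection string could look roughly like the Python sketch below. It only illustrates the behaviour described; the function name and the error wording are hypothetical and are not copied from the PR.

```python
from typing import Mapping, Optional


def resolve_redis_connection_string(connection_strings: Mapping[str, str],
                                    backplane_enabled: bool) -> Optional[str]:
    """Return the 'redis' connection string, or None when the backplane is off."""
    if not backplane_enabled:
        # Flag off: no Redis dependency, so nothing is required or validated.
        return None
    value = (connection_strings.get("redis") or "").strip()
    if not value:
        # Hypothetical message; the point is to fail at startup, not at first use.
        raise RuntimeError(
            "Features:UseSignalRBackplane is enabled but no 'redis' "
            "connection string is configured."
        )
    return value
```

Validating at startup means a misconfigured replica never begins serving traffic with a backplane that cannot connect.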


🙋 Notes for Reviewers

Anything specific you want feedback on.

N/A

@samgibsonmoj Sam Gibson (samgibsonmoj) added the Needs Review, Feature, and Large Change labels Jun 17, 2026
@samgibsonmoj Sam Gibson (samgibsonmoj) changed the title Add SignalR Redis Backplane + Worker background service CFODEV-1722 / CFODEV-1726: Add SignalR Redis Backplane + Worker background service Jun 17, 2026
@samgibsonmoj Sam Gibson (samgibsonmoj) force-pushed the feature/cloud branch 2 times, most recently from d4a31ec to b93c085 Compare June 18, 2026 09:37
@carlsixsmith-moj Carl Sixsmith (carlsixsmith-moj) force-pushed the feature/cloud branch 3 times, most recently from 51d4fa2 to 9687083 Compare June 19, 2026 11:10

lgtm

@carlsixsmith-moj Carl Sixsmith (carlsixsmith-moj) merged commit 6fbe8d0 into develop Jun 22, 2026
1 check passed
@carlsixsmith-moj Carl Sixsmith (carlsixsmith-moj) deleted the feature/cloud branch June 22, 2026 12:57
Carl Sixsmith (carlsixsmith-moj) added a commit that referenced this pull request Jun 25, 2026
…round service (#1144)

* WIP: worker service

* Remove Worker's dependency on Identity + move notify user to command

* Configure appsettings

* Add cats worker to infra

* Publish worker container

* This is a test branch that will allow users to be notified when someone in their area logs in.

* Fix users

* Add signlr backplain

* Add k8s redis container

* Update redis connection string name

* Add Redis memory cache + supporting infrastructure

Excludes application user claims caching

* Conditionally register presence connector depending on SignalR config

* Configure backplane in Aspire and cloud deployment

* Fix SmartEnum cache serialization and isolate cache serializer options

* Configure redis as ephemeral

* Register in memory and redis user state container

* Add redis container when required in Aspire

* Add feature to relay user presence notifications

* Relay notifications in CP

* Remove delete job step from deploy and use image SHA's for migrator/seed

* Deploy migrator/seed jobs as pods

* Cleanup migrator/seed pod before next deploy

* Remove obsolete worker extension

* rename ICommand -> IExternalCommand as this now conflicts with the Mediator package

* Add PresenceHub feature gating

---------

Co-authored-by: Carl Sixsmith <carl.sixsmith1@justice.gov.uk>