| 1 | +# GitHub Copilot Instructions |
| 2 | + |
| 3 | +## Project Overview |
| 4 | + |
| 5 | +HMPPS CFO Data Management System (DMS) is a .NET 10 distributed microservices application that processes p-NOMIS (Offloc) and nDelius offender data to supply CATS (Case Assessment and Tracking System) with accurate offender records. It targets Windows Server on EC2 (deployed as Windows Services) and runs locally via .NET Aspire. |
| 6 | + |
| 7 | +## Build, Test & Run |
| 8 | + |
| 9 | +```bash |
| 10 | +# Build the solution |
| 11 | +dotnet build |
| 12 | + |
| 13 | +# Run all tests |
| 14 | +dotnet test --configuration Release |
| 15 | + |
| 16 | +# Run tests for a specific project |
| 17 | +dotnet test tests/Matching.Engine.Tests/Matching.Engine.Tests.csproj --configuration Release |
| 18 | + |
| 19 | +# Run a single test |
| 20 | +dotnet test tests/Api.Tests/Api.Tests.csproj --filter "FullyQualifiedName~SearchAsync_WithNoMatches_ReturnsNotFound" |
| 21 | + |
| 22 | +# Run locally via Aspire (recommended — starts all services + dependencies) |
| 23 | +# VS Code: F5 → Default Configuration |
| 24 | +# Visual Studio: select "Aspire.AppHost" debug config |
| 25 | + |
| 26 | +# Deploy databases to a test environment |
| 27 | +export SERVER="..." DB_USER="..." DB_PASS="..." |
| 28 | +python3 publish_db.py # deploy |
| 29 | +python3 publish_db.py --dry-run # preview only |
| 30 | + |
| 31 | +# Seed test data |
| 32 | +dotnet run --project ./src/FakeDataSeeder/FakeDataSeeder.csproj |
| 33 | +``` |
| 34 | + |
| 35 | +## Architecture |
| 36 | + |
| 37 | +### Data Pipeline |
| 38 | + |
| 39 | +``` |
| 40 | +FileSync → [Offloc.Cleaner → Offloc.Parser | Delius.Parser] → Import → DbInteractions → Blocking → Matching.Engine → API / Visualiser |
| 41 | +``` |
| 42 | + |
| 43 | +All inter-service communication is **asynchronous via RabbitMQ** using the [Rebus](https://github.com/rebus-org/Rebus) library. Each stage publishes a `*FinishedMessage` that triggers the next stage. |
| 44 | + |
| 45 | +### Services |
| 46 | + |
| 47 | +| Service | Type | Role | |
| 48 | +|---------|------|------| |
| 49 | +| `FileSync` | Worker | Monitors MinIO/S3/filesystem for incoming files | |
| 50 | +| `Offloc.Cleaner` | Worker | Cleans raw Offloc (p-NOMIS) files | |
| 51 | +| `Offloc.Parser` | Worker | Parses cleaned Offloc files into DB records | |
| 52 | +| `Delius.Parser` | Worker | Parses nDelius files into DB records | |
| 53 | +| `Import` | Worker | Coordinates staging → running picture migration | |
| 54 | +| `DbInteractions` | Worker | Executes DB staging/merge operations (runs in SQL container) | |
| 55 | +| `Blocking` | Worker | Generates candidate record pairs for matching | |
| 56 | +| `Matching.Engine` | Worker | Compares pairs (Comparator), scores (Scorer, Bayesian), clusters | |
| 57 | +| `Cleanup` | Worker | Data maintenance | |
| 58 | +| `Logging` | Worker | Centralised log aggregation | |
| 59 | +| `Meow` | Worker | CATS RabbitMQ integration (different broker config) | |
| 60 | +| `API` | ASP.NET Core | REST endpoints for downstream consumers | |
| 61 | +| `Visualiser` | ASP.NET Core | Blazor web UI for exploring offender relationships | |
| 62 | + |
| 63 | +### Databases (SQL Server) |
| 64 | + |
| 65 | +Seven separate databases: `OfflocStagingDb`, `OfflocRunningPictureDb`, `DeliusStagingDb`, `DeliusRunningPictureDb`, `MatchingDb`, `ClusterDb`, `AuditDb`. Database schemas are managed as SQL Database Projects under `src/Database/`. |
| 66 | + |
| 67 | +### Shared Libraries (`src/Libraries/`) |
| 68 | + |
| 69 | +- **`Messaging`** — RabbitMQ integration via Rebus; all message types; `Exchanges` constants; `AddDmsRabbitMQ()` extension |
| 70 | +- **`Infrastructure`** — EF Core `DbContext`s (`OfflocContext`, `DeliusContext`, `ClusteringContext`, `AuditContext`), entity models, repositories, shared DTOs |
| 71 | +- **`Matching.Core`** — `IMatcher<T, Result>` interface and concrete matchers (Jaro-Winkler, Levenshtein, Caver, Date, Postcode, Equality); `[Matcher("key")]` attribute for dynamic discovery |
| 72 | +- **`EnvironmentSetup`** — `AddDmsCoreWorkerService()` and `UseDmsSerilog()` extension methods shared by all worker services; `FileLocations` / `FilePatterns` |
| 73 | + |
| 74 | +## Key Conventions |
| 75 | + |
| 76 | +### Service Bootstrap Pattern |
| 77 | + |
| 78 | +All worker services follow the same bootstrap pattern in `Program.cs`: |
| 79 | + |
| 80 | +```csharp |
| 81 | +var builder = Host.CreateApplicationBuilder(args); |
| 82 | +builder.AddDmsCoreWorkerService(); // Serilog + Windows Service + file locations |
| 83 | +builder.Services.AddDmsRabbitMQ(builder.Configuration); |
| 84 | +// ... register additional services |
| 85 | +var app = builder.Build(); |
| 86 | +await app.RunAsync(); |
| 87 | +``` |
| 88 | + |
| 89 | +`Meow` and `API`/`Visualiser` are exceptions — they configure messaging or hosting differently. |
| 90 | + |
| 91 | +### Messaging |
| 92 | + |
| 93 | +- All messages implement `IMessage` from `Messaging.Messages` |
| 94 | +- Messages are grouped by pipeline stage: `BlockingMessages`, `DbMessages`, `ImportMessages`, `MatchingMessages`, `StagingMessages`, `MergingMessages`, `StatusMessages` |
| 95 | +- Exchange names are string constants in `Messaging.Exchanges` (lowercase: `staging`, `merging`, `database`, etc.) |
| 96 | +- RabbitMQ connection string is pulled from `ConnectionStrings:RabbitMQ` in config |
| 97 | + |
| 98 | +### Dependency Injection |
| 99 | + |
| 100 | +- Worker services use standard `Microsoft.Extensions.DependencyInjection` |
| 101 | +- `Matching.Engine` additionally uses **Autofac** (via `AutofacServiceProviderFactory`) for registering matchers and scorers dynamically via reflection |
| 102 | + |
| 103 | +### Matching Engine |
| 104 | + |
| 105 | +- `[Matcher("key")]` attribute decorates matcher classes for dynamic registration |
| 106 | +- Three hosted services run in parallel within one process: `ComparatorService`, `ScorerService`, `ClusteringService` |
| 107 | +- `MatchingQueue` is a singleton in-memory queue between comparator and scorer |
| 108 | +- Scoring uses Bayesian probability; matchers include string similarity algorithms (Jaro-Winkler, Levenshtein) and phonetic matching (Caver/Soundex) |
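The comparator-to-scorer hand-off described above can be pictured as a producer/consumer pair sharing one in-memory queue. The sketch below uses `System.Threading.Channels` as a stand-in; whether `MatchingQueue` is actually built on a channel is an assumption, and the pair type and counts are invented for illustration.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Stand-in for MatchingQueue: an unbounded in-memory channel of candidate pairs.
var queue = Channel.CreateUnbounded<(int Left, int Right)>();
var scored = 0;

// Plays the comparator's role: enqueues pairs, then signals completion.
var comparator = Task.Run(async () =>
{
    for (var i = 0; i < 5; i++)
    {
        await queue.Writer.WriteAsync((i, i + 1));
    }
    queue.Writer.Complete();
});

// Plays the scorer's role: drains the queue until the writer completes.
var scorer = Task.Run(async () =>
{
    await foreach (var pair in queue.Reader.ReadAllAsync())
    {
        scored++; // a real scorer would compute a match probability for the pair here
    }
});

await Task.WhenAll(comparator, scorer);
Console.WriteLine(scored);
```

Because both sides run in the same process, the queue needs no broker. The trade-off is that anything still queued is lost if the process stops.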
| 109 | + |
| 110 | +### API Authentication |
| 111 | + |
| 112 | +The API supports two auth schemes via a `"Smart"` policy scheme: |
| 113 | +- **JWT Bearer** (Entra ID / Microsoft Identity) — used for `dms.read`, `dms.write`, `visualiser.read`, `visualiser.write` scopes |
| 114 | +- **Legacy API Key** (`X-API-KEY` header) — for backward compatibility |
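A policy scheme of this kind usually picks the concrete scheme per request. In ASP.NET Core that decision is typically made in the `ForwardDefaultSelector` callback of `AddPolicyScheme`. The sketch below isolates only the decision as a plain function so it can run without a web host. The selection rule and the `"ApiKey"` scheme name are assumptions, not taken from the project.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical selector: forwards to the API-key scheme when an X-API-KEY header is
// present, otherwise to JWT Bearer. "ApiKey" is an assumed scheme name.
static string SelectScheme(IReadOnlyDictionary<string, string> headers) =>
    headers.ContainsKey("X-API-KEY") ? "ApiKey" : "Bearer";

// HTTP header names are case-insensitive, so the example dictionaries are too.
var withKey = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["x-api-key"] = "example-key",
};
var withToken = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["Authorization"] = "Bearer example-token",
};

Console.WriteLine(SelectScheme(withKey));
Console.WriteLine(SelectScheme(withToken));
```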
| 115 | + |
| 116 | +Swagger UI is enabled only when `IsDevelopment=true` (passed via an Aspire parameter). |
| 117 | + |
| 118 | +### Package Management |
| 119 | + |
| 120 | +All NuGet package versions are centrally managed in `Directory.Packages.props` — never specify versions in individual `.csproj` files. |
| 121 | + |
| 122 | +### Test Patterns |
| 123 | + |
| 124 | +- Framework: **xUnit**, with the EF Core InMemory provider for repository/endpoint tests |
| 125 | +- Tests use `IDisposable` to call `context.Database.EnsureDeleted()` in teardown |
| 126 | +- Each test creates a database with a unique name (`$"TestDb_{Guid.NewGuid()}"`) to avoid cross-test contamination |
| 127 | +- Integration tests for messaging use **Testcontainers** (RabbitMQ) |
| 128 | +- Arrange/Act/Assert comment blocks are used consistently |
| 129 | + |
| 130 | +### Aspire Configuration |
| 131 | + |
| 132 | +`Parameters:startCoreServices` (bool) controls whether RabbitMQ, MinIO, and all worker services are started — set to `false` to run API + Visualiser only. `Parameters:seedData` triggers `FakeDataSeeder` on startup. |
| 133 | + |
| 134 | +### Filesystem Layout & Constraints |
| 135 | + |
| 136 | +All file I/O is rooted at `DMSFilesBasePath` (read from config; `~` is expanded to the user profile). The `IFileLocations` / `FileLocations` abstraction (registered by `AddDmsFileLocations()`) exposes four derived paths: |
| 137 | + |
| 138 | +| Property | Path | |
| 139 | +|----------|------| |
| 140 | +| `deliusInput` | `{basePath}/Delius/Input/` | |
| 141 | +| `deliusOutput` | `{basePath}/Delius/Output/` | |
| 142 | +| `offlocInput` | `{basePath}/Offloc/Input/` | |
| 143 | +| `offlocOutput` | `{basePath}/Offloc/Output/` | |
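As a rough illustration of how the four paths derive from the base path, the sketch below expands `~` and appends the sub-directories from the table above. The local functions `ExpandHome` and `DerivePaths` are hypothetical and are not the project's `FileLocations` implementation. The sketch also assumes that only a leading `~` is expanded.

```csharp
using System;
using System.IO;

// Hypothetical helper: expands a leading "~" to the current user's profile directory.
static string ExpandHome(string path) =>
    path.StartsWith("~", StringComparison.Ordinal)
        ? Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.UserProfile),
            path.Substring(1).TrimStart('/', '\\'))
        : path;

// Hypothetical helper: derives the four directories listed in the table above.
static (string DeliusInput, string DeliusOutput, string OfflocInput, string OfflocOutput)
    DerivePaths(string basePath)
{
    var root = ExpandHome(basePath);
    return (
        Path.Combine(root, "Delius", "Input"),
        Path.Combine(root, "Delius", "Output"),
        Path.Combine(root, "Offloc", "Input"),
        Path.Combine(root, "Offloc", "Output"));
}

var paths = DerivePaths("~/dms-files"); // "~/dms-files" is an example value only
Console.WriteLine(paths.DeliusInput);
Console.WriteLine(paths.OfflocOutput);
```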
| 144 | + |
| 145 | +**File naming patterns** (enforced by `FileConstants` / `FilePatterns`): |
| 146 | + |
| 147 | +| File type | Regex | |
| 148 | +|-----------|-------| |
| 149 | +| Delius extract | `cfoextract_\d{1,5}_(full\|diff)_\d{14}\.txt` | |
| 150 | +| Offloc data | `C_NOMIS_OFFENDER_\d{8}_.+\.dat` | |
| 151 | +| Offloc archive | `\d{8}\.zip` | |
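To make the patterns concrete, the sketch below classifies a few example file names with `System.Text.RegularExpressions`. The patterns are copied from the table above. The `^`/`$` anchors, the `Classify` helper, and the sample file names are assumptions for illustration and do not come from `FileConstants` / `FilePatterns`.

```csharp
using System;
using System.Text.RegularExpressions;

// Patterns copied from the table above; the ^...$ anchors are an assumption of this sketch.
var deliusExtract = new Regex(@"^cfoextract_\d{1,5}_(full|diff)_\d{14}\.txt$");
var offlocData = new Regex(@"^C_NOMIS_OFFENDER_\d{8}_.+\.dat$");
var offlocArchive = new Regex(@"^\d{8}\.zip$");

// Hypothetical helper: returns a label for the first pattern that matches, or "Unknown".
string Classify(string fileName) =>
    deliusExtract.IsMatch(fileName) ? "DeliusExtract"
    : offlocData.IsMatch(fileName) ? "OfflocData"
    : offlocArchive.IsMatch(fileName) ? "OfflocArchive"
    : "Unknown";

// Sample names are invented to fit the patterns.
Console.WriteLine(Classify("cfoextract_123_full_20240131093000.txt"));
Console.WriteLine(Classify("C_NOMIS_OFFENDER_31012024_01.dat"));
Console.WriteLine(Classify("20240131.zip"));
Console.WriteLine(Classify("notes.txt"));
```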
| 152 | + |
| 153 | +**Processing flow:** |
| 154 | + |
| 155 | +1. `FileSync` downloads raw files into `*Input/` and publishes trigger messages. |
| 156 | +2. `Offloc.Cleaner` / `Delius.Parser` parse files and write pipe-delimited (`|`) output files with CRLF line endings into `*Output/{fileNameWithoutExtension}/`, one `.txt` file per entity (e.g. `Offenders.txt`, `EventDetails.txt`). |
| 157 | +3. `DbInteractions` calls the staging stored procedures (`DeliusStaging.StageDelius`, `OfflocStaging.Import`) which execute dynamic SQL `BULK INSERT` statements that read these output files **directly from disk into SQL Server**. |
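The output format in step 2 can be sketched as follows: fields joined with `|`, every line terminated with CRLF, and one folder per source file. `ToPipeDelimited`, the sample rows, and the use of the temp directory as a base path are hypothetical. The sketch also does not escape `|` characters inside field values, which the real parsers may handle differently.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: joins each row's fields with '|' and ends every line with CRLF,
// regardless of the platform's default newline.
static string ToPipeDelimited(string[][] rows)
{
    var sb = new StringBuilder();
    foreach (var fields in rows)
    {
        sb.Append(string.Join('|', fields)).Append("\r\n");
    }
    return sb.ToString();
}

// Example rows are invented for illustration only.
var rows = new[]
{
    new[] { "1", "SMITH", "1980-01-01" },
    new[] { "2", "JONES", "1975-06-30" },
};

// Mirrors the *Output/{fileNameWithoutExtension}/ layout: one folder per source file.
var sourceFile = "cfoextract_1_full_20240131093000.txt";
var outputDir = Path.Combine(Path.GetTempPath(), "Delius", "Output",
    Path.GetFileNameWithoutExtension(sourceFile));
Directory.CreateDirectory(outputDir);
File.WriteAllText(Path.Combine(outputDir, "Offenders.txt"), ToPipeDelimited(rows));
```

Writing the terminator explicitly, rather than relying on `Environment.NewLine`, keeps the files identical whether the parser runs on Windows or in a Linux container.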
| 158 | + |
| 159 | +**Critical BULK INSERT constraint:** SQL Server's `BULK INSERT` resolves file paths as seen by the SQL Server process, not by the calling service, so the parsed output files in `*Output/` **must be on a path that SQL Server can access directly**. The `RUNNING_IN_CONTAINER` config flag exists in `DbInteractionService` but is currently unused. Never move staging output files or change the path format without ensuring SQL Server can still reach them. |
| 160 | + |
| 161 | +**Matching/Clustering bulk inserts** (in `MatchingRepository` and `ClusteringRepository`) are different — they use batched parameterised Dapper `ExecuteAsync` calls (batch size 1000, concurrency 16), not `BULK INSERT`, so they have no filesystem dependency. |
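The batching shape described above can be sketched with `Enumerable.Chunk` and `Parallel.ForEachAsync`. `InsertBatchAsync` is a hypothetical stand-in for the parameterised Dapper `ExecuteAsync` call and only simulates a round trip. Reading "concurrency 16" as "at most 16 batches in flight" is also an assumption, and the repositories may be structured differently.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

const int BatchSize = 1000;      // batch size quoted above
const int MaxConcurrency = 16;   // concurrency quoted above

var written = 0;

// Hypothetical stand-in for a parameterised Dapper ExecuteAsync call on one batch.
async Task InsertBatchAsync(int[] batch, CancellationToken ct)
{
    await Task.Delay(1, ct);                     // simulates the database round trip
    Interlocked.Add(ref written, batch.Length);  // thread-safe running total
}

var records = Enumerable.Range(1, 2500).ToArray(); // example data only

// Splits the records into batches and runs up to MaxConcurrency batches at a time.
await Parallel.ForEachAsync(
    records.Chunk(BatchSize),
    new ParallelOptions { MaxDegreeOfParallelism = MaxConcurrency },
    async (batch, ct) => await InsertBatchAsync(batch, ct));

Console.WriteLine(written);
```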