-
Notifications
You must be signed in to change notification settings - Fork 579
Atomic Search Param operations #5433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 66 commits
acdf1c6
a76ad96
b9e1733
efc97f3
ce3d1af
5d806a2
ef9b530
35cd84b
0d7ca29
64fef20
3ba9312
fd27c7b
318591e
d994a32
ea6c08f
59e0acb
0a24ed5
9b74e37
67bef92
04aa622
999ac0c
c5fe031
f1e5520
c9957c7
7f912b9
5784146
b14f3cd
b00ce93
be7f304
47cc2ac
09a1c10
7cae0ad
5d83f96
a9239b7
213164c
82245bb
d0daa62
ade3cfe
2734794
46df45c
899dc01
cd0d5c9
c4c8bb8
8dda219
a3290ea
58f9b6f
0f5145b
00d7275
459adee
dddee07
baedead
4d97074
224b41f
192e009
9b6402c
0f2d317
92e7c98
a076be6
7583792
2e31e83
c47c185
6d442cc
95ddad7
9ad7a76
01ea4db
da818c1
c2033df
a1e6977
218172b
aebd351
42a1235
dd30efc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| # ADR 2603: Atomic SearchParameter CRUD Operations | ||
| Labels: [SQL](https://github.com/microsoft/fhir-server/labels/Area-SQL) | [Core](https://github.com/microsoft/fhir-server/labels/Area-Core) | [SearchParameter](https://github.com/microsoft/fhir-server/labels/Area-SearchParameter) | ||
|
|
||
| ## Context | ||
| SearchParameter create, update, and delete operations require two coordinated writes: the SearchParameter status row (`dbo.SearchParam` table) and the resource itself (`dbo.Resource` via `dbo.MergeResources`). Previously, these writes occurred in separate steps in the request pipeline, creating partial-failure windows where one could succeed while the other fails — producing orphaned or inconsistent state. | ||
|
|
||
| Composing these writes into a single atomic SQL operation introduced a new problem for transaction bundles: `dbo.MergeSearchParams` acquires an exclusive lock on `dbo.SearchParam` per entry, and that lock is held by the outer bundle transaction. When the next entry's behavior pipeline calls `GetAllSearchParameterStatus` on a separate connection, it blocks on the same table, causing a timeout. This required a deferred-flush approach for transaction bundles. | ||
|
|
||
| Additionally, SearchParameter CRUD behaviors previously performed direct in-memory cache mutations (`AddNewSearchParameters`, `DeleteSearchParameter`, etc.) during the request pipeline. This duplicated responsibility with the `SearchParameterCacheRefreshBackgroundService`, which already polls the database and applies cache updates across all instances. | ||
|
|
||
| Key considerations: | ||
| - Eliminating partial-commit windows between status and resource persistence. | ||
| - Handling the lock contention on `dbo.SearchParam` introduced by composed writes in transaction bundles. | ||
| - Simplifying cache ownership by removing direct cache mutations from the CRUD request path. | ||
| - Preserving existing behavior for non-SearchParameter resources and Cosmos DB paths. | ||
|
|
||
| ## Decision | ||
| We implement three complementary changes: | ||
|
|
||
| ### 1. Composed writes for single operations (SQL) | ||
| For individual SearchParameter CRUD, behaviors queue pending status updates in request context (`SearchParameter.PendingStatusUpdates`) instead of persisting directly. `SqlServerFhirDataStore` detects pending statuses and calls `dbo.MergeSearchParams` (which internally calls `dbo.MergeResources`) so both writes execute in one stored-procedure-owned transaction. | ||
|
|
||
| ### 2. Deferred flush for transaction bundles | ||
| For transaction bundles, per-entry resource upserts call `dbo.MergeResources` only (no SearchParam table touch), avoiding the exclusive lock. `BundleHandler` accumulates pending statuses across all entries and flushes them in a single `dbo.MergeSearchParams` call at the end of the bundle, still within the outer transaction scope. | ||
|
|
||
| ```mermaid | ||
| graph TD; | ||
| A[SearchParameter Operation] -->|Single operation| B[Behavior queues pending status in request context]; | ||
| B --> C[SqlServerFhirDataStore calls MergeSearchParams]; | ||
| C --> D[MergeSearchParams: status + MergeResources in one transaction]; | ||
| A -->|Transaction bundle| E[Per-entry: Behavior queues status, UpsertAsync calls MergeResources only]; | ||
| E --> F[BundleHandler drains pending statuses after each entry]; | ||
| F --> G[After all entries: single MergeSearchParams flush within outer transaction]; | ||
| G --> H[Transaction commits atomically]; | ||
| ``` | ||
|
|
||
| ### 3. Cache update removal from CRUD path | ||
| SearchParameter CRUD behaviors no longer perform direct in-memory cache mutations. All cache updates (`AddNewSearchParameters`, `DeleteSearchParameter`, `UpdateSearchParameterStatus`) are now solely owned by the `SearchParameterCacheRefreshBackgroundService`, which periodically polls the database via `GetAndApplySearchParameterUpdates`. This simplifies the CRUD path and ensures consistent cache convergence across distributed instances. | ||
|
|
||
| ### Scope | ||
| - **SQL Server**: Full atomic guarantees for create, update, and delete of SearchParameter resources, both single and in transaction bundles. | ||
| - **Cosmos DB**: Pending statuses are flushed after resource upsert (improved sequencing, not a single transactional unit). | ||
| - **Unchanged**: Non-SearchParameter CRUD, existing SearchParameter status lifecycle states, cache convergence model. | ||
|
|
||
| ## Status | ||
| Pending acceptance | ||
|
|
||
| ## Consequences | ||
| - **Positive Impacts:** | ||
| - Eliminates orphaned status/resource records from partial commits. | ||
| - Clarifies ownership: behaviors queue intent, data stores persist atomically, background service owns cache. | ||
| - Deferred-flush approach avoids lock contention introduced by composed writes in transaction bundles. | ||
| - Removing cache mutations from CRUD simplifies the request path and eliminates a class of cache-divergence bugs. | ||
|
|
||
| - **Potential Drawbacks:** | ||
| - Increased complexity in request context, data store, and bundle handler coordination. | ||
| - SQL schema migration required (`MergeSearchParams` expanded to accept resource TVPs). | ||
| - Eventual consistency window: cache may lag behind database until the next background refresh cycle. | ||
| - Cosmos DB path remains best-effort sequencing rather than true atomic commit. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,7 +19,9 @@ | |
| using Microsoft.Health.Fhir.Core.Features.Operations.BulkDelete.Messages; | ||
| using Microsoft.Health.Fhir.Core.Features.Persistence; | ||
| using Microsoft.Health.Fhir.Core.Features.Search; | ||
| using Microsoft.Health.Fhir.Core.Features.Search.Parameters; | ||
| using Microsoft.Health.Fhir.Core.Messages.Delete; | ||
| using Microsoft.Health.Fhir.Core.Models; | ||
| using Microsoft.Health.JobManagement; | ||
| using Newtonsoft.Json; | ||
|
|
||
|
|
@@ -31,19 +33,22 @@ public class BulkDeleteProcessingJob : IJob | |
| private readonly Func<IScoped<IDeletionService>> _deleterFactory; | ||
| private readonly RequestContextAccessor<IFhirRequestContext> _contextAccessor; | ||
| private readonly IMediator _mediator; | ||
| private readonly ISearchParameterOperations _searchParameterOperations; | ||
| private readonly Func<IScoped<ISearchService>> _searchService; | ||
| private readonly IQueueClient _queueClient; | ||
|
|
||
| public BulkDeleteProcessingJob( | ||
| Func<IScoped<IDeletionService>> deleterFactory, | ||
| RequestContextAccessor<IFhirRequestContext> contextAccessor, | ||
| IMediator mediator, | ||
| ISearchParameterOperations searchParameterOperations, | ||
| Func<IScoped<ISearchService>> searchService, | ||
| IQueueClient queueClient) | ||
| { | ||
| _deleterFactory = EnsureArg.IsNotNull(deleterFactory, nameof(deleterFactory)); | ||
| _contextAccessor = EnsureArg.IsNotNull(contextAccessor, nameof(contextAccessor)); | ||
| _mediator = EnsureArg.IsNotNull(mediator, nameof(mediator)); | ||
| _searchParameterOperations = EnsureArg.IsNotNull(searchParameterOperations, nameof(searchParameterOperations)); | ||
| _searchService = EnsureArg.IsNotNull(searchService, nameof(searchService)); | ||
| _queueClient = EnsureArg.IsNotNull(queueClient, nameof(queueClient)); | ||
| } | ||
|
|
@@ -78,6 +83,13 @@ public async Task<string> ExecuteAsync(JobInfo jobInfo, CancellationToken cancel | |
| Exception exception = null; | ||
| List<string> types = definition.Type.SplitByOrSeparator().ToList(); | ||
|
|
||
| if (types.Count > 0 | ||
| && string.Equals(types[0], KnownResourceTypes.SearchParameter, StringComparison.OrdinalIgnoreCase) | ||
| && !(definition.ExcludedResourceTypes?.Any(x => string.Equals(x, KnownResourceTypes.SearchParameter, StringComparison.OrdinalIgnoreCase)) ?? false)) | ||
| { | ||
| await _searchParameterOperations.EnsureNoActiveReindexJobAsync(cancellationToken); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There are several problems with this implementation of "running reindex" check:
Implementations that is free from the above problems is to move this check to MergeSearchParams stored procedure in the same SQL transaction that runs write, and rollback this transaction if check does not pass. For reindex orchestrator to bypass this check we can add its job id as input. MergeSearchParam is already rewritten in this PR significantly, so it is right time to add above logic and get robust processing.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change closes the BulkDelete gap and ensures this path now checks for active reindex instead of bypassing the rule entirely. Moving the enforcement further down the stack is a larger follow-up: we would need to handle Cosmos correctly and define an internal override mechanism for flows like reindex, which must be able to update search parameter status while reindex is running. I think those extra considerations warrant a separate follow-up item rather than adding that additional improvement now as the PR is already pretty large. It is at least centralized in SearchParameterOperations which clarifies the intent and creates a pattern of centralized check.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that moving down the stack is more work, but it is not that more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Feel free to create a follow-up item for this work. Below is the test for bulk delete reindex coordination. src/Microsoft.Health.Fhir.Core.UnitTests/Features/Operations/BulkDelete/BulkDeleteProcessingJobTests.cs
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I will add repair item.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not e2e, I will add an e2e test to cover this.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added this task https://microsofthealth.visualstudio.com/Health/_workitems/edit/186796/ BTW Would we absolutely need e2e tests if above task was implemented? Could not we just rely on current exception trapping? |
||
| } | ||
|
|
||
| try | ||
| { | ||
| resourcesDeleted = await deleter.Value.DeleteMultipleAsync( | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update search param workflow calls validate logic that triggers cache update, so, currently, cache update is run 2 times for an update workflow. How will proposed change work in regards of validate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently these checks are still required due to the limitations in Cosmos. I did look to removing them and handle with Sql Deadlock / OptimisticConcurrency error handling and throw a 409, but Cosmos would still require these upper-level checks. For now, it's acceptable to ensure consistency at this layer until we are able to remove the Cosmos code path and move this all to SQL layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think situation is worse than you describe :-) as any validation (does not matter with Cosmos or with SQL) requires cache being in sync. I don't think sync will go away after gen 1 is retired.
This means, we need to add work item to ensure that sync via background worker and on demand sync required by validation can coexist without cache pollution.