Improved search indices persistence to reduce chance for collection scanning

**Describe the bug**
We've used Microsoft FHIR Server's search architecture (well documented [here](https://github.com/microsoft/fhir-server/blob/main/docs/SearchArchitecture.md#Persistence)) for over 2+ years now with great success in a high traffic production environment.  Our business has grown to include over 50 Cosmos databases, many reaching the 20gb+ size range.  

I would like to share one of our findings in how to improve the search architecture slightly so as to decrease the chance for queries to cause a scan.  

**Data provider?**
CosmosDB

### Move the Parameter property ('p')
Currently, you persist each search index where the parameter/field path is part of the index entry.
**Storage**:
```
"searchIndices": [
  { "p": "related-type", "s": "http://hl7.org/fhir/observation-relationshiptypes", "c": "derived-from" },
  { "p": "_lastUpdated", "st": "2018-08-22T23:37:56.1289012+00:00", "et": "2018-08-22T23:37:56.1289012+00:00" },
]
```
**Example query**:
```
WHERE
  EXISTS (
    SELECT VALUE si FROM c.searchIndices
    WHERE si.p = "_lastUpdated" AND si.st >= "2018-08-22T23:37:56.1289012+00:00"
  )
```

If you needed to perform an equality search, Cosmos does well in these situations.  An EXISTS subquery will usually cost very little.

However, scans can potentially occur if you attempt to use a greater-than or lesser-than comparison.  At first, we didn't see any signs of this until our database size grew into the many-gb ranges.  Even then, it was difficult to narrow down the precise conditions that causes scans.  What we could see was very high RU costs and a notable high count of "Retrieved document count" in the query metrics, [which is a clear sign of scanning occurring](https://docs.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-query-performance#queries-where-retrieved-document-count-exceeds-output-document-count).

We changed how we were storing our indices slightly:

**Storage:**
```
"searchIndices": [
  "related-type": [{ "s": "http://hl7.org/fhir/observation-relationshiptypes", "c": "derived-from" }],
  "_lastUpdated": [{ "st": "2018-08-22T23:37:56.1289012+00:00", "et": "2018-08-22T23:37:56.1289012+00:00" }],
]
```
**Example query**:
```
WHERE
  EXISTS (
    SELECT VALUE si FROM c.searchIndices["_lastUpdated"]
    WHERE si.st >= "2018-08-22T23:37:56.1289012+00:00"
  )
```
This seemed to clear up the problem _dramatically_.  Note that each entry in `searchIndices.xyz` is an array so that we still support multiple search index entries at the same path in case the path included collections of data.  The improvement here is now the query has one less condition in the subquery clause, which apparently helps the cosmos engine deal better with the greater-than/less-than comparisons (although I am unsure why).


[AB#93126](https://microsofthealth.visualstudio.com/f8da5110-49b1-4e9f-9022-2f58b6124ff9/_workitems/edit/93126)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improved search indices persistence to reduce chance for collection scanning #2686

Move the Parameter property ('p')

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Improved search indices persistence to reduce chance for collection scanning #2686

Description

Move the Parameter property ('p')

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions