Description
Describe the bug
We've used Microsoft FHIR Server's search architecture (well documented here) for over 2+ years now with great success in a high traffic production environment. Our business has grown to include over 50 Cosmos databases, many reaching the 20gb+ size range.
I would like to share one of our findings in how to improve the search architecture slightly so as to decrease the chance for queries to cause a scan.
Data provider?
CosmosDB
Move the Parameter property ('p')
Currently, you persist each search index where the parameter/field path is part of the index entry.
Storage:
"searchIndices": [
{ "p": "related-type", "s": "http://hl7.org/fhir/observation-relationshiptypes", "c": "derived-from" },
{ "p": "_lastUpdated", "st": "2018-08-22T23:37:56.1289012+00:00", "et": "2018-08-22T23:37:56.1289012+00:00" },
]
Example query:
WHERE
EXISTS (
SELECT VALUE si FROM c.searchIndices
WHERE si.p = "_lastUpdated" AND si.st >= "2018-08-22T23:37:56.1289012+00:00"
)
If you needed to perform an equality search, Cosmos does well in these situations. An EXISTS subquery will usually cost very little.
However, scans can potentially occur if you attempt to use a greater-than or lesser-than comparison. At first, we didn't see any signs of this until our database size grew into the many-gb ranges. Even then, it was difficult to narrow down the precise conditions that causes scans. What we could see was very high RU costs and a notable high count of "Retrieved document count" in the query metrics, which is a clear sign of scanning occurring.
We changed how we were storing our indices slightly:
Storage:
"searchIndices": [
"related-type": [{ "s": "http://hl7.org/fhir/observation-relationshiptypes", "c": "derived-from" }],
"_lastUpdated": [{ "st": "2018-08-22T23:37:56.1289012+00:00", "et": "2018-08-22T23:37:56.1289012+00:00" }],
]
Example query:
WHERE
EXISTS (
SELECT VALUE si FROM c.searchIndices["_lastUpdated"]
WHERE si.st >= "2018-08-22T23:37:56.1289012+00:00"
)
This seemed to clear up the problem dramatically. Note that each entry in searchIndices.xyz
is an array so that we still support multiple search index entries at the same path in case the path included collections of data. The improvement here is now the query has one less condition in the subquery clause, which apparently helps the cosmos engine deal better with the greater-than/less-than comparisons (although I am unsure why).