Skip to content

Improved search indices persistence to reduce chance for collection scanning #2686

Open
@justinbhopper

Description

@justinbhopper

Describe the bug
We've used Microsoft FHIR Server's search architecture (well documented here) for over 2+ years now with great success in a high traffic production environment. Our business has grown to include over 50 Cosmos databases, many reaching the 20gb+ size range.

I would like to share one of our findings in how to improve the search architecture slightly so as to decrease the chance for queries to cause a scan.

Data provider?
CosmosDB

Move the Parameter property ('p')

Currently, you persist each search index where the parameter/field path is part of the index entry.
Storage:

"searchIndices": [
  { "p": "related-type", "s": "http://hl7.org/fhir/observation-relationshiptypes", "c": "derived-from" },
  { "p": "_lastUpdated", "st": "2018-08-22T23:37:56.1289012+00:00", "et": "2018-08-22T23:37:56.1289012+00:00" },
]

Example query:

WHERE
  EXISTS (
    SELECT VALUE si FROM c.searchIndices
    WHERE si.p = "_lastUpdated" AND si.st >= "2018-08-22T23:37:56.1289012+00:00"
  )

If you needed to perform an equality search, Cosmos does well in these situations. An EXISTS subquery will usually cost very little.

However, scans can potentially occur if you attempt to use a greater-than or lesser-than comparison. At first, we didn't see any signs of this until our database size grew into the many-gb ranges. Even then, it was difficult to narrow down the precise conditions that causes scans. What we could see was very high RU costs and a notable high count of "Retrieved document count" in the query metrics, which is a clear sign of scanning occurring.

We changed how we were storing our indices slightly:

Storage:

"searchIndices": [
  "related-type": [{ "s": "http://hl7.org/fhir/observation-relationshiptypes", "c": "derived-from" }],
  "_lastUpdated": [{ "st": "2018-08-22T23:37:56.1289012+00:00", "et": "2018-08-22T23:37:56.1289012+00:00" }],
]

Example query:

WHERE
  EXISTS (
    SELECT VALUE si FROM c.searchIndices["_lastUpdated"]
    WHERE si.st >= "2018-08-22T23:37:56.1289012+00:00"
  )

This seemed to clear up the problem dramatically. Note that each entry in searchIndices.xyz is an array so that we still support multiple search index entries at the same path in case the path included collections of data. The improvement here is now the query has one less condition in the subquery clause, which apparently helps the cosmos engine deal better with the greater-than/less-than comparisons (although I am unsure why).

AB#93126

Metadata

Metadata

Assignees

No one assigned

    Labels

    Area-CosmosDBArea related to CosmosDB storageArea-SearchArea related to search.BugBug bug bug.Enhancement-OptimizationOptimization on existing functionality.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions