feat(gms): add entity graph cache for domain, containers, group membership and glossary#17886

Open
david-leifker wants to merge 1 commit into master from feat/graph-cache-oss

Conversation

@david-leifker david-leifker commented Jun 14, 2026

Collaborator

Summary

Introduces a GMS entity graph cache — Hazelcast-backed, pre-built relationship snapshots that avoid repeated primary-storage aspect walks and graph/search scrolls on hot paths (VBAC policy expansion, search filter rewriters, GraphQL hierarchy/membership queries, and session authorization).

Four bundled graphs ship in entity-graph-cache.yaml:

| Graph | Build source | Scope | Population | Purpose |
| --- | --- | --- | --- | --- |
| domain@search | search index | FULL | SCHEDULED (600s) | Domain IsPartOf tree for VBAC, search rewriters, GraphQL |
| glossary@graph | graph scroll | PARTIAL (depth 25) | LAZY | Glossary node/term hierarchy |
| container@graph | graph scroll | PARTIAL (depth 12) | LAZY | Container nesting hierarchy |
| membership@graph | graph scroll | FULL | SCHEDULED (600s) | Corp user ↔ group/role membership edges |

Cache is enabled by default on GMS (ENTITY_GRAPH_CACHE_ENABLED); MAE/MCE consumers and datahub-upgrade register EntityGraphCache.NO_OP.

Architecture

  • Core API: EntityGraphCache interface + EntityGraphCacheService (expand, listRelated, ancestor walks, invalidation, rebuild scheduling)
  • Storage: Hazelcast IMaps for distributed snapshots + local view cache + memory pressure monitor
  • Build paths: primary, graph, or search scroll depending on graph config
  • Scope modes: FULL (whole graph) and PARTIAL (per weakly connected component, directional BFS capped by maxDepth)
  • Hierarchy client layer: BoundHierarchyAccess + HierarchyBindings / HierarchyReadSpecs — cache-first ancestor expand, ordered parents, direct children, descendant checks
  • Membership client layer: BoundMembershipAccess + MembershipBindings / MembershipReadSpecs — cache-first group/role neighbor listing with aspect and graph-scroll fallbacks
  • Invalidation: Sync writes that pass the UI/sync gate drop or patch snapshots inline (after ES indexing in preprocessEvent); async/non-gated ingest relies on scheduled/LAZY rebuild staleness windows
  • Fallbacks: On cache miss — aspect parent walk or membership aspect read, then GraphRetriever scroll; deleteDomain child guard uses primary-store verify via AspectDirectChildrenWalker (not cache-backed)
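
The cache-first read with ordered fallbacks, combined with a walk capped by maxDepth, can be sketched roughly as follows. This is an illustration only, not the PR's code: `ParentSource` and `AncestorLookup` are hypothetical names standing in for the cache, the aspect parent walk, and the GraphRetriever scroll, whose real signatures are not shown in this description.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical source of direct parents; an empty Optional means "miss, try the next source". */
interface ParentSource {
  Optional<List<String>> directParents(String urn);

  /** Convenience adapter for tests and demos: a urn absent from the map is a miss. */
  static ParentSource fromMap(Map<String, List<String>> edges) {
    return urn -> Optional.ofNullable(edges.get(urn));
  }
}

/** Hypothetical helper that consults sources in priority order (cache first, then fallbacks). */
final class AncestorLookup {
  private final List<ParentSource> sourcesInPriorityOrder;

  AncestorLookup(List<ParentSource> sourcesInPriorityOrder) {
    this.sourcesInPriorityOrder = List.copyOf(sourcesInPriorityOrder);
  }

  /** Breadth-first walk toward ancestors, climbing at most maxDepth levels above the seed. */
  Set<String> ancestors(String seedUrn, int maxDepth) {
    Set<String> visited = new LinkedHashSet<>();
    Deque<String> frontier = new ArrayDeque<>(List.of(seedUrn));
    for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
      Deque<String> next = new ArrayDeque<>();
      for (String urn : frontier) {
        for (String parent : parentsOf(urn)) {
          // The visited set also protects against cycles in the hierarchy.
          if (!parent.equals(seedUrn) && visited.add(parent)) {
            next.add(parent);
          }
        }
      }
      frontier = next;
    }
    return visited;
  }

  /** Returns the answer of the first source that has one; no source answering means no parents. */
  private List<String> parentsOf(String urn) {
    for (ParentSource source : sourcesInPriorityOrder) {
      Optional<List<String>> hit = source.directParents(urn);
      if (hit.isPresent()) {
        return hit.get();
      }
    }
    return List.of();
  }
}
```

Passing the cache-backed source first and the primary-store or scroll-backed sources after it gives the fallback order listed above; the depth cap bounds the walk the way maxDepth bounds a PARTIAL graph.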

Integrations

| Area | Change |
| --- | --- |
| VBAC / policies | DomainFieldResolverProvider, ContainerFieldResolverProvider, and GlossaryFieldResolverProvider use cache-first expand on bundled graphs |
| Search rewriters | DomainExpansionRewriter for domains.keyword filter expansion |
| GraphQL hierarchy | ParentDomainsResolver, ParentNodesResolver, ParentContainersResolver, and EntityRelationshipsResultResolver (direct child domains/glossary nodes/containers) routed through BoundHierarchyAccess; MoveDomainResolver / UpdateParentNodeResolver trigger sync invalidation |
| GraphQL membership | EntityRelationshipsResultResolver routes corp user / corp group / role membership through BoundMembershipAccess; session-user outgoing membership uses a fast path backed by SessionActorIdentity |
| Session actor identity | SessionActorIdentity retains separate corpGroups and nativeGroups sets so the session fast path labels IsMemberOfGroup vs IsMemberOfNativeGroup correctly (dual-type queries no longer mislabel native memberships) |
| Entity service | Sync invalidation hook in preprocess path |
| Operation context | EntityGraphCache exposed via RetrieverContext; ActorGroupMembershipService / GroupService populate typed group membership |
| Hazelcast | Shared bootstrap condition extended; entity graph status map wired in CacheConfig |

Configuration & docs

  • Bundled YAML: metadata-service/configuration/src/main/resources/entity-graph-cache.yaml
  • Spring properties + JSON overlay: EntityGraphCacheProperties, EntityGraphCacheConfigLoader
  • Deploy guide: docs/deploy/gms-entity-graph-cache.md
  • Env var reference added to docs/deploy/environment-vars.md

Test plan

  • ./gradlew spotlessCheck
  • ./gradlew :li-utils:test --tests com.datahub.authorization.SessionActorIdentityTest
  • ./gradlew :datahub-graphql-core:test --tests com.linkedin.datahub.graphql.resolvers.load.EntityRelationshipsResultResolverTest
  • metadata-io entity graph cache unit + Hazelcast integration tests
  • EntityGraphCacheConfigLoaderTest, factory map config tests
  • Smoke: smoke-test/tests/entity_graph_cache/test_entity_graph_cache.py (domain, glossary, container hierarchy + session-user membership idempotency)
  • Manual: GraphQL session user query with types: [IsMemberOfGroup, IsMemberOfNativeGroup] — verify relationship type per edge matches cache/scroll path

@github-actions github-actions Bot added the ingestion, docs, product, devops, and smoke_test labels Jun 14, 2026
codecov Bot commented Jun 14, 2026

❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 9099 | 1 | 9098 | 109 |
View the top 1 failed test(s) by shortest run time
tests.cypress.integration_test::test_run_cypress
Stack Traces | 334s run time
auth_session = <tests.utils.TestSessionWrapper object at 0x7fd829b86190>

    def test_run_cypress(auth_session):
        # Run with --record option only if CYPRESS_RECORD_KEY is non-empty
        record_key = env_vars.get_cypress_record_key()
        tag_arg = ""
        test_strategy = env_vars.get_test_strategy()
        if record_key:
            record_arg = " --record "
            batch_number = env_vars.get_batch_number()
            batch_count = env_vars.get_batch_count()
            if batch_count > 1:
                batch_suffix = f"-{batch_number}{batch_count}"
            else:
                batch_suffix = ""
            tag_arg = f" --tag {test_strategy}{batch_suffix}"
        else:
            record_arg = " "
    
        logger.info(f"test strategy is {test_strategy}")
        test_spec_arg = ""
        specs_str = ",".join([f"**/{f}" for f in _get_filtered_or_batched_tests()])
        test_spec_arg = f" --spec '{specs_str}' "
    
        logger.info("Running Cypress tests with command")
        node_options = "--max-old-space-size=500"
        electron_args = 'ELECTRON_EXTRA_LAUNCH_ARGS="--js-flags=\'--max-old-space-size=4096 --disable-dev-shm-usage --disable-gpu --no-sandbox"'
        command = f'{electron_args} NO_COLOR=1 NODE_OPTIONS="{node_options}" npx cypress run {record_arg} {test_spec_arg} {tag_arg}'
        logger.info(command)
        # Add --headed --spec '**/mutations/mutations.js' (change spec name)
        # in case you want to see the browser for debugging
        print_now()
        proc = subprocess.Popen(
            command,
            shell=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            cwd=f"{CYPRESS_TEST_DATA_DIR}",
            text=True,  # Use text mode for string output
            bufsize=1,  # Line buffered
        )
        assert proc.stdout is not None
        assert proc.stderr is not None
    
        # Function to read and print output from a pipe
        def read_and_print(pipe, prefix=""):
            for line in pipe:
                logger.info(f"{prefix}{line.rstrip()}")
    
        # Read and print output in real-time
    
        stdout_thread = threading.Thread(target=read_and_print, args=(proc.stdout,))
        stderr_thread = threading.Thread(
            target=read_and_print, args=(proc.stderr, "stderr: ")
        )
    
        # Set threads as daemon so they exit when the main thread exits
        stdout_thread.daemon = True
        stderr_thread.daemon = True
    
        # Start the threads
        stdout_thread.start()
        stderr_thread.start()
    
        # Wait for the process to complete
        return_code = proc.wait()
    
        # Wait for the threads to finish
        stdout_thread.join()
        stderr_thread.join()
    
        logger.info(f"return code: {return_code}")
        print_now()
>       assert return_code == 0
E       assert 1 == 0

tests/cypress/integration_test.py:363: AssertionError
View the full list of 1 ❄️ flaky test(s)
create and manage group cypress/e2e/settingsV2/v2_managing_groups.js::cypress/e2e/settingsV2/v2_managing_groups.js

Flake rate in main: 25.00% (Passed 24 times, Failed 8 times)

Stack Traces | 28.5s run time
2026-06-23T00:29:21.961Z
Timed out retrying after 10000ms: Expected to find content: 'Example Name 22119' but never did.

@david-leifker david-leifker force-pushed the feat/graph-cache-oss branch from 9605333 to 1ccc9a3 Compare June 14, 2026 15:59
@david-leifker david-leifker force-pushed the feat/graph-cache-oss branch from 1ccc9a3 to 9b7a500 Compare June 14, 2026 16:29
@david-leifker david-leifker force-pushed the feat/graph-cache-oss branch from 9b7a500 to 7c7ce45 Compare June 14, 2026 23:12
@david-leifker david-leifker force-pushed the feat/graph-cache-oss branch from 7c7ce45 to 5c8c894 Compare June 15, 2026 00:32
@david-leifker david-leifker marked this pull request as ready for review June 15, 2026 15:11
@david-leifker david-leifker force-pushed the feat/graph-cache-oss branch from 5c8c894 to 0b56cd6 Compare June 15, 2026 16:15
@maggiehays maggiehays added the needs-review label Jun 15, 2026
@david-leifker david-leifker force-pushed the feat/graph-cache-oss branch from 0b56cd6 to 3440567 Compare June 18, 2026 19:58
@david-leifker david-leifker force-pushed the feat/graph-cache-oss branch from 3440567 to b36425a Compare June 18, 2026 22:05
@david-leifker david-leifker force-pushed the feat/graph-cache-oss branch from b36425a to acb53f4 Compare June 18, 2026 22:46
@david-leifker david-leifker force-pushed the feat/graph-cache-oss branch from acb53f4 to bf48dab Compare June 22, 2026 20:23
@david-leifker david-leifker changed the title from "feat(gms): add entity graph cache for domain and glossary hierarchies" to "feat(gms): add entity graph cache for domain, containers, group membership and glossary" Jun 22, 2026
@david-leifker david-leifker force-pushed the feat/graph-cache-oss branch from bf48dab to 92b19ab Compare June 22, 2026 20:32
Introduce a Hazelcast-backed entity graph cache with bundled domain,
glossary, container, and membership graphs. Route hot GraphQL, VBAC,
and search paths through cache-first client bindings with aspect and
scroll fallbacks. Retain corp vs native group distinction in
SessionActorIdentity so session-user membership fast paths label
IsMemberOfGroup and IsMemberOfNativeGroup correctly.

Co-authored-by: Cursor <cursoragent@cursor.com>
private EntityGraphCacheClients() {}

@Nonnull
public static GraphReadResult expand(

Collaborator

Builder pattern with defaults for the method arg (e.g. ExpandRequest { default: USE_DEFINITION_MAX_DEPTH... })? Having 4 different method entry points is going to make adding any additional params to this a nightmare.
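
A minimal sketch of what such a request object could look like, so that one `expand` entry point takes a single argument and new parameters only add builder methods. Everything except the `ExpandRequest` and `USE_DEFINITION_MAX_DEPTH` names mentioned above is an assumption for illustration: the fields, the sentinel value of -1, and the hand-written builder (the PR itself uses Lombok's `@Builder.Default` elsewhere, which would likely be the shorter route).

```java
import java.util.Objects;

/** Hypothetical request object; field names and defaults are illustrative, not the PR's API. */
final class ExpandRequest {
  /** Sentinel meaning "use the max depth from the graph definition" (value assumed here). */
  static final int USE_DEFINITION_MAX_DEPTH = -1;

  private final String graphId;
  private final String seedUrn;
  private final int maxDepth;
  private final boolean includeSeed;

  private ExpandRequest(Builder b) {
    this.graphId = b.graphId;
    this.seedUrn = b.seedUrn;
    this.maxDepth = b.maxDepth;
    this.includeSeed = b.includeSeed;
  }

  /** Required arguments go through the factory; everything else has a default. */
  static Builder builder(String graphId, String seedUrn) {
    return new Builder(graphId, seedUrn);
  }

  String graphId() { return graphId; }
  String seedUrn() { return seedUrn; }
  int maxDepth() { return maxDepth; }
  boolean includeSeed() { return includeSeed; }

  static final class Builder {
    private final String graphId;
    private final String seedUrn;
    private int maxDepth = USE_DEFINITION_MAX_DEPTH; // default when the caller says nothing
    private boolean includeSeed = false;

    private Builder(String graphId, String seedUrn) {
      this.graphId = Objects.requireNonNull(graphId, "graphId");
      this.seedUrn = Objects.requireNonNull(seedUrn, "seedUrn");
    }

    Builder maxDepth(int maxDepth) {
      this.maxDepth = maxDepth;
      return this;
    }

    Builder includeSeed(boolean includeSeed) {
      this.includeSeed = includeSeed;
      return this;
    }

    ExpandRequest build() {
      return new ExpandRequest(this);
    }
  }
}
```

A caller that only cares about the defaults would write `ExpandRequest.builder("domain@search", seedUrn).build()`, and one that needs a cap would chain `.maxDepth(3)` before `build()`.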

@maggiehays maggiehays added pending-submitter-response and removed needs-review labels Jun 23, 2026
@Nonnull
private static Optional<KnownEntityGraph> knownGraphForPolicyField(
@Nonnull String policyFieldType) {
if ("DOMAIN".equals(policyFieldType)) {

Collaborator

nit: prefer constants

@Nonnull MembershipReadSpec spec) {
String entityType = UrnUtils.getUrn(seedUrn).getEntityType();
return switch (entityType) {
case "corpuser" -> new ScrollConfig(

Collaborator

nit: should use entity name constants


@Nonnull @Builder.Default Set<String> roleRelationshipTypes = Set.of("IsMemberOfRole");

@Nonnull @Builder.Default Set<String> scrollUserEntityTypes = Set.of("corpuser");

Collaborator

nit: same

EntityGraphSnapshot.builder()
.graphId(snapshot.getGraphId())
.cacheKey(key)
.generation((existing == null ? 0L : existing.getGeneration()) + 1L)

Collaborator

Should there be a version gate for the generation similar to conditional writes? I think a race condition exists for surgical removes where you can wind up with an inconsistent cache.

node1: getSnapshot(a,b,c) node2: getSnapshot(a,b,c)
node1: surgicalRemove(a) node2: surgicalRemove(b)
node1: publish(b,c)->gen1
node2: publish(a,c)->gen2
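
One way to gate on the generation, sketched under assumptions: the write succeeds only if the stored snapshot is still the one that was read, and otherwise the change is re-applied to the fresh value. `Snapshot` and `VersionedSnapshotStore` are hypothetical stand-ins, and a `ConcurrentHashMap` is used here in place of the distributed map. Hazelcast's IMap implements ConcurrentMap, so a three-argument `replace` exists there as well; whether it fits the PR's snapshot layout is not something this sketch establishes.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.UnaryOperator;

/** Hypothetical immutable snapshot: a generation counter plus the member URNs. */
record Snapshot(long generation, Set<String> members) {
  Snapshot {
    members = Set.copyOf(members); // defensive, immutable copy
  }

  /** Returns a new snapshot without the given urn, one generation later. */
  Snapshot without(String urn) {
    Set<String> remaining = new HashSet<>(members);
    remaining.remove(urn);
    return new Snapshot(generation + 1, remaining);
  }
}

/** Hypothetical store that only publishes when nobody else published in between. */
final class VersionedSnapshotStore {
  private final ConcurrentMap<String, Snapshot> map = new ConcurrentHashMap<>();

  void put(String key, Snapshot snapshot) {
    map.put(key, snapshot);
  }

  Snapshot get(String key) {
    return map.get(key);
  }

  /**
   * Applies the change with compare-and-replace, retrying on a lost race.
   * Returns false when there is no snapshot for the key.
   */
  boolean update(String key, UnaryOperator<Snapshot> change) {
    while (true) {
      Snapshot current = map.get(key);
      if (current == null) {
        return false;
      }
      Snapshot next = change.apply(current);
      // Succeeds only if the stored value still equals what we read.
      if (map.replace(key, current, next)) {
        return true;
      }
    }
  }
}
```

With this shape, the interleaving above ends with both removals applied: whichever node publishes second fails its first `replace`, re-reads the other node's result, and removes its own urn from that.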

}

if (urnsWithKeyAspect.contains(urn)) {
String createKey = urn + "|" + aspectName;

Collaborator

Probably better to use a true separator character that can't exist in an urn, but I think this is mostly safe for the current entities it's used for since they get generated as uuid format.
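
An alternative that sidesteps the separator question entirely is a typed composite key. The sketch below is hypothetical (`CreateKey` is not in the PR) and uses contrived values to show that the joined string is ambiguous whenever either part may contain the separator, while the typed key is not.

```java
import java.util.Objects;

/** Hypothetical typed key replacing string concatenation; equality is per component. */
record CreateKey(String urn, String aspectName) {
  CreateKey {
    Objects.requireNonNull(urn, "urn");
    Objects.requireNonNull(aspectName, "aspectName");
  }

  /** The concatenated form from the snippet under review, kept only for comparison. */
  String joined() {
    return urn + "|" + aspectName;
  }
}
```

As a record, `CreateKey` gets `equals` and `hashCode` from its components, so it can be used directly as a `Set` or `Map` key.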


@Nonnull
private DirectedMultigraph<String, DirectedEdge> forwardGraph() {
if (forwardGraph == null) {

Collaborator

Not a huge deal, but this is a race condition that could be avoided by synchronizing the graph build, which is a non-trivial amount of work and has a lot of entry points.
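
A minimal sketch of one way to do that, assuming the expensive build can be expressed as a `Supplier`: double-checked locking on a volatile field, so the build runs at most once and later reads skip the lock. `LazyValue` is a hypothetical helper, not something in the PR, and it assumes the supplier never returns null.

```java
import java.util.function.Supplier;

/** Hypothetical thread-safe lazy holder; the supplier must not return null. */
final class LazyValue<T> {
  private final Supplier<T> builder;
  private volatile T value; // volatile so a fully built value is safely published

  LazyValue(Supplier<T> builder) {
    this.builder = builder;
  }

  T get() {
    T local = value; // single volatile read on the fast path
    if (local == null) {
      synchronized (this) {
        local = value; // re-check: another thread may have built it while we waited
        if (local == null) {
          local = builder.get();
          value = local;
        }
      }
    }
    return local;
  }
}
```

Applied to the snippet above, `forwardGraph()` would return `lazyForwardGraph.get()`, with the graph construction moved into the supplier.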

@RyanHolstien RyanHolstien left a comment

Collaborator

There are some nitpicky things and one legitimate question around cache inconsistency, but overall approving.

Labels

devops: PR or Issue related to DataHub backend & deployment
docs: Issues and Improvements to docs
ingestion: PR or Issue related to the ingestion of metadata
pending-submitter-response: Issue/request has been reviewed but requires a response from the submitter
product: PR or Issue related to the DataHub UI/UX
smoke_test: Contains changes related to smoke tests

3 participants