Skip to content

Conversation

@dinesh-verma-datahub
Copy link
Contributor

Problem

PowerBI ingestion was emitting CorpUserKeyClass (just username) instead of CorpUserInfoClass, causing existing user profiles from LDAP/SCIM/Okta to be overwritten with incomplete data.

Solution

  • Emit CorpUserInfoClass with full profile (displayName, email, active)
  • Add overwrite_existing_users config (default: false) to protect existing users
  • Add customProperties for traceability (powerbi_user_id, powerbi_graph_id, etc.)
  • Mark non-human principals (Apps, Service Principals) as active=false

Changes

  • config.py: Added overwrite_existing_users with validation
  • powerbi.py: Fixed to_datahub_user() to emit CorpUserInfoClass + skip logic
  • powerbi_pre.md: Added User Ownership Configuration documentation
  • test_powerbi_user_creation.py: 26 comprehensive unit tests
  • golden_*.json: Regenerated 16 golden files

…users

## Problem
PowerBI ingestion was emitting CorpUserKeyClass (just username) instead of
CorpUserInfoClass, causing existing user profiles from LDAP/SCIM/Okta to be
overwritten with incomplete data.

## Solution
- Emit CorpUserInfoClass with full profile (displayName, email, active)
- Add overwrite_existing_users config (default: false) to protect existing users
- Add customProperties for traceability (powerbi_user_id, powerbi_graph_id, etc.)
- Mark non-human principals (Apps, Service Principals) as active=false

## Changes
- config.py: Added overwrite_existing_users with validation
- powerbi.py: Fixed to_datahub_user() to emit CorpUserInfoClass + skip logic
- powerbi_pre.md: Added User Ownership Configuration documentation
- test_powerbi_user_creation.py: 26 comprehensive unit tests
- golden_*.json: Regenerated 16 golden files
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Dec 22, 2025
@github-actions
Copy link
Contributor

Linear: ING-1312

@codecov
Copy link

codecov bot commented Dec 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Dec 22, 2025
- Fix import sorting in test file (ruff I001)
- Fix markdown formatting (prettier)
- Add troubleshooting section to docs
- Add caching strategy comment
- Clarify graph access requirements in config
- Add breaking change entry to updating-datahub.md for user overwrite behavior
- Add YAML example to Known Limitations workaround section in powerbi_pre.md
- Strengthen test for _check_user_exists to ensure 100% coverage
The relative path to powerbi_pre.md doesn't resolve in docs-website build context.
Replaced with plain text reference.
config:
ownership:
create_corp_user: true
overwrite_existing_users: true # Temporarily enable to refresh all users
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I have stateful ingestion enabled and have one run with this option enabled and then second run without it, will it accidentally delete the users ingested by the previous run?

Comment on lines +243 to +245
- **Graph access requirement:** `overwrite_existing_users=false` requires DataHub graph access to
check if users exist. This works automatically when using the DataHub REST sink. File-based sinks
(e.g., writing to JSON files) don't have graph access.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment - there is an optional config datahub_api (set on the same level as sink or source). It should be possible to provide the graph connection via this parameter (automatically) if file sink is used.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 22, 2025
### Troubleshooting
**Warning: "Graph unavailable - creating all users"**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider throwing an error here and stopping the execution, instead of a warning.

)
return []

# Log at DEBUG level for data quality monitoring (INFO is too noisy)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not needed, please remove it

user_key = CorpUserKeyClass(username=user.id)

user_key_mcp = self.new_mcp(
# Check if we should skip this user
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not needed, please remove it

logger.debug(f"Mapping user {user.displayName}(id={user.id}) to DataHub user")

# Create an URN for user
# Build user URN
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not needed, please remove it

Comment on lines +125 to +126
assert isinstance(mcps[0].aspect, CorpUserInfoClass)
assert not isinstance(mcps[0].aspect, CorpUserKeyClass)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 125th implies line 126th. There is no reason to test both.

mock_reporter,
mock_dataplatform_resolver,
):
"""graphId=None → not in customProperties."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically speaking it is an empty string ("", not None).

Comment on lines +571 to +572
mock_config.ownership.use_powerbi_email = True
mock_config.ownership.remove_email_suffix = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and some other, repeating this pattern) unit test would be clearer if we just specified the config explicitly.
Setting it like that is not a big problem, but in my opinion specifying config explicitly as a proper object, with values set there explicitly, is easier to read (and in general it happens in many places in our code base).

debug_calls = [str(call) for call in mock_logger.debug.call_args_list]
assert any("has no email" in str(call) for call in debug_calls)

def test_to_datahub_user_logs_debug_for_missing_display_name(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it makes much sense to unit test whether a log has been emitted.



class TestShouldSkipUserCoverage:
"""Additional tests to ensure 100% coverage of _should_skip_user method."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel at some moment this file begins to massively over-test. Is there any specific case previous tests did not cover?

Comment on lines +171 to +203
def _check_user_exists(self, user_urn: str) -> bool:
"""
Check if a user already exists in DataHub, with caching.
Returns False if:
- User doesn't exist
- Graph is unavailable
- API call fails (with warning logged)
"""
if user_urn in self._user_exists_cache:
return self._user_exists_cache[user_urn]

if not self.__ctx.graph:
return False # Handled by caller

try:
result = self.__ctx.graph.get_entities(
entity_name="corpuser",
urns=[user_urn],
aspects=["corpUserInfo"],
with_system_metadata=False,
)
exists = user_urn in result and "corpUserInfo" in result[user_urn]
self._user_exists_cache[user_urn] = exists
return exists
except Exception as e:
# Log WARNING - API failure could lead to unintended overwrites
logger.warning(
f"Could not check if user {user_urn} exists (graph API error): {e}. "
f"Proceeding with user creation."
)
self._user_exists_cache[user_urn] = False
return False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major comment - It is good the code caches results of a single call. But it still means that if there are 10.000 user we are going to make 10.000 calls to the API. Why don't we pre-populate this cache by scrolling through users using batch retrieval API?

Copy link
Collaborator

@skrydal skrydal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate your approach being very complete, with documentation and a lot of unit tests. Please see my comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-response Issue/request has been reviewed but requires a response from the submitter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants