-
Notifications
You must be signed in to change notification settings - Fork 3.3k
fix(powerbi): emit CorpUserInfoClass instead of CorpUserKeyClass for users #15748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…users ## Problem PowerBI ingestion was emitting CorpUserKeyClass (just username) instead of CorpUserInfoClass, causing existing user profiles from LDAP/SCIM/Okta to be overwritten with incomplete data. ## Solution - Emit CorpUserInfoClass with full profile (displayName, email, active) - Add overwrite_existing_users config (default: false) to protect existing users - Add customProperties for traceability (powerbi_user_id, powerbi_graph_id, etc.) - Mark non-human principals (Apps, Service Principals) as active=false ## Changes - config.py: Added overwrite_existing_users with validation - powerbi.py: Fixed to_datahub_user() to emit CorpUserInfoClass + skip logic - powerbi_pre.md: Added User Ownership Configuration documentation - test_powerbi_user_creation.py: 26 comprehensive unit tests - golden_*.json: Regenerated 16 golden files
|
Linear: ING-1312 |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
- Fix import sorting in test file (ruff I001) - Fix markdown formatting (prettier) - Add troubleshooting section to docs - Add caching strategy comment - Clarify graph access requirements in config
- Add breaking change entry to updating-datahub.md for user overwrite behavior - Add YAML example to Known Limitations workaround section in powerbi_pre.md - Strengthen test for _check_user_exists to ensure 100% coverage
The relative path to powerbi_pre.md doesn't resolve in docs-website build context. Replaced with plain text reference.
| config: | ||
| ownership: | ||
| create_corp_user: true | ||
| overwrite_existing_users: true # Temporarily enable to refresh all users |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if I have stateful ingestion enabled and have one run with this option enabled and then second run without it, will it accidentally delete the users ingested by the previous run?
| - **Graph access requirement:** `overwrite_existing_users=false` requires DataHub graph access to | ||
| check if users exist. This works automatically when using the DataHub REST sink. File-based sinks | ||
| (e.g., writing to JSON files) don't have graph access. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment - there is an optional config datahub_api (set on the same level as sink or source). It should be possible to provide the graph connection via this parameter (automatically) if file sink is used.
| ### Troubleshooting | ||
| **Warning: "Graph unavailable - creating all users"** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider throwing an error here and stopping the execution, instead of a warning.
| ) | ||
| return [] | ||
|
|
||
| # Log at DEBUG level for data quality monitoring (INFO is too noisy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not needed, please remove it
| user_key = CorpUserKeyClass(username=user.id) | ||
|
|
||
| user_key_mcp = self.new_mcp( | ||
| # Check if we should skip this user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not needed, please remove it
| logger.debug(f"Mapping user {user.displayName}(id={user.id}) to DataHub user") | ||
|
|
||
| # Create an URN for user | ||
| # Build user URN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not needed, please remove it
| assert isinstance(mcps[0].aspect, CorpUserInfoClass) | ||
| assert not isinstance(mcps[0].aspect, CorpUserKeyClass) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 125th implies line 126th. There is no reason to test both.
| mock_reporter, | ||
| mock_dataplatform_resolver, | ||
| ): | ||
| """graphId=None → not in customProperties.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically speaking it is an empty string ("", not None).
| mock_config.ownership.use_powerbi_email = True | ||
| mock_config.ownership.remove_email_suffix = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and some other, repeating this pattern) unit test would be clearer if we just specified the config explicitly.
Setting it like that is not a big problem, but in my opinion specifying config explicitly as a proper object, with values set there explicitly, is easier to read (and in general it happens in many places in our code base).
| debug_calls = [str(call) for call in mock_logger.debug.call_args_list] | ||
| assert any("has no email" in str(call) for call in debug_calls) | ||
|
|
||
| def test_to_datahub_user_logs_debug_for_missing_display_name( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure it makes much sense to unit test whether a log has been emitted.
|
|
||
|
|
||
| class TestShouldSkipUserCoverage: | ||
| """Additional tests to ensure 100% coverage of _should_skip_user method.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel at some moment this file begins to massively over-test. Is there any specific case previous tests did not cover?
| def _check_user_exists(self, user_urn: str) -> bool: | ||
| """ | ||
| Check if a user already exists in DataHub, with caching. | ||
| Returns False if: | ||
| - User doesn't exist | ||
| - Graph is unavailable | ||
| - API call fails (with warning logged) | ||
| """ | ||
| if user_urn in self._user_exists_cache: | ||
| return self._user_exists_cache[user_urn] | ||
|
|
||
| if not self.__ctx.graph: | ||
| return False # Handled by caller | ||
|
|
||
| try: | ||
| result = self.__ctx.graph.get_entities( | ||
| entity_name="corpuser", | ||
| urns=[user_urn], | ||
| aspects=["corpUserInfo"], | ||
| with_system_metadata=False, | ||
| ) | ||
| exists = user_urn in result and "corpUserInfo" in result[user_urn] | ||
| self._user_exists_cache[user_urn] = exists | ||
| return exists | ||
| except Exception as e: | ||
| # Log WARNING - API failure could lead to unintended overwrites | ||
| logger.warning( | ||
| f"Could not check if user {user_urn} exists (graph API error): {e}. " | ||
| f"Proceeding with user creation." | ||
| ) | ||
| self._user_exists_cache[user_urn] = False | ||
| return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Major comment - It is good the code caches results of a single call. But it still means that if there are 10.000 user we are going to make 10.000 calls to the API. Why don't we pre-populate this cache by scrolling through users using batch retrieval API?
skrydal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate your approach being very complete, with documentation and a lot of unit tests. Please see my comments.
Problem
PowerBI ingestion was emitting CorpUserKeyClass (just username) instead of CorpUserInfoClass, causing existing user profiles from LDAP/SCIM/Okta to be overwritten with incomplete data.
Solution
Changes