Conversation

@jominjohny
Contributor

Changes

LLM-based primary key (PK) detector

Linked issues

#484

Resolves #..

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests

@jominjohny jominjohny requested a review from a team as a code owner August 25, 2025 05:39
@jominjohny jominjohny requested review from grusin-db and removed request for a team August 25, 2025 05:39
Contributor

@mwojtyczka mwojtyczka left a comment

Thanks for the PR. We should clarify the scope. The primary purpose of PK detection is to support the compare_datasets check in cases where users don't know the primary key columns to compare on. There should also be a way to call it as a standalone method, and the profiler seems to be a good place for that. So a new method that can be called from the profiler should be added, e.g. detect_primary_keys_with_llm. If we want to generate a uniqueness check from the profiler, it should suggest the existing is_unique check function. Yes, we can add this as another profile and use it for rules generation.
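As a rough illustration of the flow described in this comment, here is a minimal sketch in plain Python (no Spark, no LLM) that checks whether candidate key columns are free of duplicates and, if so, suggests a uniqueness rule. The helper names `find_duplicate_keys` and `suggest_uniqueness_check` are hypothetical, and the rule dictionary layout is an assumption modeled on DQX's declarative check metadata; it is not taken from this PR.

```python
from collections import Counter
from typing import Any, Optional, Sequence


def find_duplicate_keys(
    rows: Sequence[dict[str, Any]], key_columns: Sequence[str]
) -> list[tuple]:
    """Return the key tuples that occur more than once in `rows`."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def suggest_uniqueness_check(
    rows: Sequence[dict[str, Any]], key_columns: Sequence[str]
) -> Optional[dict[str, Any]]:
    """Suggest an is_unique rule only if the candidate key has no duplicates.

    The returned layout is an assumed rule shape, not the PR's actual output.
    """
    if find_duplicate_keys(rows, key_columns):
        return None  # candidate key is not unique, so suggest nothing
    return {
        "criticality": "error",
        "check": {
            "function": "is_unique",
            "arguments": {"columns": list(key_columns)},
        },
    }


sample = [
    {"id": 1, "region": "eu"},
    {"id": 2, "region": "eu"},
    {"id": 1, "region": "us"},
]
print(suggest_uniqueness_check(sample, ["id"]))            # None: id repeats
print(suggest_uniqueness_check(sample, ["id", "region"]))  # rule for the composite key
```

In a real implementation the duplicate check would presumably run as a Spark aggregation over the table rather than over in-memory rows.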

@mwojtyczka mwojtyczka requested a review from Copilot August 29, 2025 10:28
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds LLM-based primary key detection capabilities to the DQX data quality framework. The functionality is completely optional and only activates when explicitly requested by users.

Key changes:

  • Implements intelligent primary key detection using Large Language Models via DSPy and Databricks Model Serving
  • Adds comprehensive configuration options for LLM-based detection with graceful fallback when dependencies are unavailable
  • Integrates seamlessly with existing profiling workflow while maintaining backward compatibility
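The "graceful fallback when dependencies are unavailable" mentioned above could look roughly like the following sketch. It uses only the standard library to test whether an optional module can be imported before the LLM path is selected. The function names and the three mode strings are hypothetical; the PR's actual lazy-import handling in profiler.py may differ.

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def resolve_detection_mode(use_llm: bool, llm_module: str = "dspy") -> str:
    """Pick a detection mode; the LLM path is used only when explicitly requested.

    Returns "disabled", "fallback" (requested but dependency missing), or "llm".
    """
    if not use_llm:
        return "disabled"
    if not is_module_available(llm_module):
        logger.warning(
            "LLM-based PK detection requested but '%s' is not installed; skipping.",
            llm_module,
        )
        return "fallback"
    return "llm"


print(resolve_detection_mode(use_llm=False))
```

Checking with `find_spec` avoids paying the import cost (or raising `ImportError`) for users who never enable the feature.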

Reviewed Changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| src/databricks/labs/dqx/llm/pk_identifier.py | Core LLM detection engine with table metadata analysis and duplicate validation |
| src/databricks/labs/dqx/profiler/profiler.py | Enhanced profiler with LLM detection methods and lazy import handling |
| src/databricks/labs/dqx/profiler/generator.py | Added primary key rule generation with LLM-specific metadata |
| src/databricks/labs/dqx/profiler/runner.py | Updated runner to support table-based profiling with PK detection |
| src/databricks/labs/dqx/config.py | Added LLM configuration fields to ProfilerConfig |
| src/databricks/labs/dqx/check_funcs.py | Implemented is_primary_key validation function |
| tests/unit/test_llm_based_pk_identifier.py | Comprehensive unit tests with graceful dependency handling |
| tests/integration/test_pk_detection_integration.py | End-to-end integration tests for the complete workflow |
| src/databricks/labs/dqx/llm/demo.py | Usage demonstration showing optional LLM activation |
| src/databricks/labs/dqx/llm/README.md | Detailed documentation with examples and best practices |


@github-actions

github-actions bot commented Sep 5, 2025

❌ 406/407 passed, 3 flaky, 1 failed, 3 skipped, 4h41m36s total

❌ test_e2e_workflow: databricks.sdk.errors.platform.Unknown: finalize: Run failed with error message (10m2.183s)
databricks.sdk.errors.platform.Unknown: finalize: Run failed with error message
 Execution context '7605794709678231298' in cluster '1031-092004-dkcglt0l' is not found.
[gw8] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
09:19 INFO [databricks.labs.dqx.installer.install] Please answer a couple of questions to provide TEST_SCHEMA DQX run configuration. The configuration can also be updated manually after the installation.
09:19 INFO [databricks.labs.dqx.installer.install] DQX will be installed in the TEST_SCHEMA location: '/Users/<your_user>/.dqx'
09:19 INFO [databricks.labs.dqx.installer.install] Installing DQX v0.9.4+5720251031091948
09:19 INFO [databricks.labs.dqx.installer.dashboard_installer] Creating dashboards...
09:19 INFO [databricks.labs.dqx.installer.dashboard_installer] Reading dashboard assets from /home/runner/work/dqx/dqx/src/databricks/labs/dqx/queries/quality/dashboard...
09:19 INFO [databricks.labs.dqx.installer.dashboard_installer] Using 'main.dqx_test.output_table' output table as the source table for the dashboard...
09:19 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
09:19 WARNING [databricks.labs.lsql.dashboards] Parsing unsupported field in dashboard.yml: tiles.00_2_dq_error_types.hidden
09:19 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
09:19 INFO [databricks.labs.dqx.installer.dashboard_installer] Installing 'DQX_Quality_Dashboard' dashboard in '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.OOeb/dashboards'
09:19 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=profiler
09:19 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=quality-checker
09:19 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=e2e
09:19 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=742432001854428
09:19 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=742432001854428
09:19 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=742432001854428
09:19 INFO [databricks.labs.dqx.installer.install] Installation completed successfully!
09:20 INFO [databricks.labs.dqx.installer.workflow_installer] Started e2e workflow: https://DATABRICKS_HOST#job/742432001854428/runs/259322235792692
09:29 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- REMOTE LOGS --------------
09:29 INFO [databricks.labs.dqx:prepare] DQX v0.9.4+5720251031091948 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.OOeb/logs/e2e/run-259322235792692-0/prepare.log
09:29 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:prepare] End-to-end: prepare start for run config: TEST_SCHEMA
09:29 INFO [databricks.labs.dqx:finalize] DQX v0.9.4+5720251031091948 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.OOeb/logs/e2e/run-259322235792692-0/finalize.log
09:29 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:finalize] End-to-end: finalize complete for run config: TEST_SCHEMA
09:29 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:finalize] For more details please check the run logs of the profiler and quality checker jobs.
09:29 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
09:29 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.4+5720251031091948 from https://DATABRICKS_HOST
09:29 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=1120589382487332, as it is no longer needed
09:29 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=108012810876194, as it is no longer needed
09:29 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=742432001854428, as it is no longer needed
09:29 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete

Flaky tests:

  • 🤪 test_profiler_serverless (10.005s)
  • 🤪 test_load_checks_from_table_saved_from_dict_with_unresolved_for_each_column (2.655s)
  • 🤪 test_e2e_workflow_for_patterns_exclude_patterns (14m6.584s)

Running from acceptance #3023

@codecov

codecov bot commented Oct 29, 2025

Codecov Report

❌ Patch coverage is 33.80952% with 417 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.05%. Comparing base (a06e918) to head (ca82ba0).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/databricks/labs/dqx/llm/pk_identifier.py | 35.53% | 254 Missing ⚠️ |
| src/databricks/labs/dqx/check_funcs.py | 32.22% | 61 Missing ⚠️ |
| src/databricks/labs/dqx/profiler/profiler.py | 11.59% | 61 Missing ⚠️ |
| src/databricks/labs/dqx/engine.py | 12.50% | 21 Missing ⚠️ |
| src/databricks/labs/dqx/profiler/generator.py | 16.66% | 10 Missing ⚠️ |
| ...atabricks/labs/dqx/installer/workflow_installer.py | 0.00% | 2 Missing ⚠️ |
| src/databricks/labs/dqx/io.py | 33.33% | 2 Missing ⚠️ |
| ...rc/databricks/labs/dqx/profiler/profiler_runner.py | 0.00% | 2 Missing ⚠️ |
| src/databricks/labs/dqx/rule.py | 85.71% | 2 Missing ⚠️ |
| src/databricks/labs/dqx/llm/__init__.py | 75.00% | 1 Missing ⚠️ |
... and 1 more

❗ There is a different number of reports uploaded between BASE (a06e918) and HEAD (ca82ba0). Click for more details.

HEAD has 7 uploads less than BASE
| Flag | BASE (a06e918) | HEAD (ca82ba0) |
| --- | --- | --- |
| | 8 | 1 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #543       +/-   ##
===========================================
- Coverage   89.77%   50.05%   -39.72%     
===========================================
  Files          56       56               
  Lines        4966     5270      +304     
===========================================
- Hits         4458     2638     -1820     
- Misses        508     2632     +2124     


Contributor

@mwojtyczka mwojtyczka left a comment

LGTM

Contributor

@mwojtyczka mwojtyczka left a comment

revert

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 60 out of 65 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (6)

src/databricks/labs/dqx/llm/pk_identifier.py:1

  • SQL injection vulnerability in table name parameter. The table parameter is directly interpolated into SQL query without validation or sanitization. Use parameterized queries or validate table name format.

src/databricks/labs/dqx/llm/pk_identifier.py:1

  • SQL injection vulnerability in dynamic query construction. Both table and pk_columns are used in string formatting without proper validation. While column names are backtick-quoted, the table name is not escaped. Validate inputs or use Spark SQL's parameterized query capabilities.

src/databricks/labs/dqx/llm/pk_identifier.py:1

  • SQL injection vulnerability in table name. The table parameter is directly interpolated into the SQL query without sanitization.

src/databricks/labs/dqx/llm/pk_identifier.py:1

  • SQL injection vulnerability in table name parameter used in query construction.

src/databricks/labs/dqx/llm/pk_identifier.py:1

  • SQL injection vulnerability in table name parameter used in query construction.

src/databricks/labs/dqx/profiler/profiler.py:1

  • Overly broad exception handling silently ignores all cleanup errors. Consider logging the error or catching only expected exceptions like AnalysisException to avoid masking unexpected issues.
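One possible way to address the suppressed comments above is sketched below: validate and backtick-quote each part of the table name and each column before building the duplicate-check query, and restrict cleanup error handling to an explicit set of expected exceptions while logging them. All function names here are hypothetical and the query shape is an assumption; this is not the code in pk_identifier.py or profiler.py.

```python
import logging
import re
from typing import Callable, Sequence

logger = logging.getLogger(__name__)

# Conservative allow-list: letters, digits, underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_identifier(name: str) -> str:
    """Validate a single identifier and wrap it in backticks."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Invalid identifier: {name!r}")
    return f"`{name}`"


def quote_table_name(table: str) -> str:
    """Validate and quote a [catalog.][schema.]table name part by part."""
    parts = table.split(".")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Expected [catalog.][schema.]table, got: {table!r}")
    return ".".join(quote_identifier(part) for part in parts)


def build_duplicate_query(table: str, pk_columns: Sequence[str]) -> str:
    """Build a query that lists key values occurring more than once."""
    if not pk_columns:
        raise ValueError("pk_columns must not be empty")
    cols = ", ".join(quote_identifier(col) for col in pk_columns)
    return (
        f"SELECT {cols}, COUNT(*) AS cnt FROM {quote_table_name(table)} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1"
    )


def run_cleanup(
    action: Callable[[], None], expected: tuple[type[Exception], ...]
) -> bool:
    """Run a cleanup step, logging only the expected failures.

    Any exception type not listed in `expected` propagates to the caller.
    """
    try:
        action()
    except expected as exc:
        logger.warning("Cleanup failed: %s", exc)
        return False
    return True


print(build_duplicate_query("main.dqx_test.output_table", ["id"]))
```

The allow-list is deliberately strict, so legitimate names containing hyphens or spaces would be rejected; a production version might instead escape embedded backticks. With PySpark, `expected` could be `(AnalysisException,)` as the review comment suggests.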


3 participants