-
Notifications
You must be signed in to change notification settings - Fork 69
llm based pk detector added #543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. We should clarify the scope. The primary purpose of PK detection is to use it in compare_datasets check, for cases where user don't know pk keys for comparison. There should be a way to call this as a standalone method as well. Profiler seems to be a good place. So a new method that can be called from the profiler should be added, e.g. detect_primary_keys_with_llm. If we want to generate uniqueness check from the profiler, then it should suggest existing is_unique check func. Yes, we can add this as as another profile, and use it for rules generation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds LLM-based primary key detection capabilities to the DQX data quality framework. The functionality is completely optional and only activates when explicitly requested by users.
Key changes:
- Implements intelligent primary key detection using Large Language Models via DSPy and Databricks Model Serving
- Adds comprehensive configuration options for LLM-based detection with graceful fallback when dependencies are unavailable
- Integrates seamlessly with existing profiling workflow while maintaining backward compatibility
Reviewed Changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/databricks/labs/dqx/llm/pk_identifier.py |
Core LLM detection engine with table metadata analysis and duplicate validation |
src/databricks/labs/dqx/profiler/profiler.py |
Enhanced profiler with LLM detection methods and lazy import handling |
src/databricks/labs/dqx/profiler/generator.py |
Added primary key rule generation with LLM-specific metadata |
src/databricks/labs/dqx/profiler/runner.py |
Updated runner to support table-based profiling with PK detection |
src/databricks/labs/dqx/config.py |
Added LLM configuration fields to ProfilerConfig |
src/databricks/labs/dqx/check_funcs.py |
Implemented is_primary_key validation function |
tests/unit/test_llm_based_pk_identifier.py |
Comprehensive unit tests with graceful dependency handling |
tests/integration/test_pk_detection_integration.py |
End-to-end integration tests for the complete workflow |
src/databricks/labs/dqx/llm/demo.py |
Usage demonstration showing optional LLM activation |
src/databricks/labs/dqx/llm/README.md |
Detailed documentation with examples and best practices |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
❌ 406/407 passed, 3 flaky, 1 failed, 3 skipped, 4h41m36s total ❌ test_e2e_workflow: databricks.sdk.errors.platform.Unknown: finalize: Run failed with error message (10m2.183s)Flaky tests:
Running from acceptance #3023 |
fa4784a to
2b04529
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #543 +/- ##
===========================================
- Coverage 89.77% 50.05% -39.72%
===========================================
Files 56 56
Lines 4966 5270 +304
===========================================
- Hits 4458 2638 -1820
- Misses 508 2632 +2124 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 60 out of 65 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (6)
src/databricks/labs/dqx/llm/pk_identifier.py:1
- SQL injection vulnerability in table name parameter. The
tableparameter is directly interpolated into SQL query without validation or sanitization. Use parameterized queries or validate table name format.
src/databricks/labs/dqx/llm/pk_identifier.py:1 - SQL injection vulnerability in dynamic query construction. Both
tableandpk_columnsare used in string formatting without proper validation. While column names are backtick-quoted, the table name is not escaped. Validate inputs or use Spark SQL's parameterized query capabilities.
src/databricks/labs/dqx/llm/pk_identifier.py:1 - SQL injection vulnerability in table name. The
tableparameter is directly interpolated into the SQL query without sanitization.
src/databricks/labs/dqx/llm/pk_identifier.py:1 - SQL injection vulnerability in table name parameter used in query construction.
src/databricks/labs/dqx/llm/pk_identifier.py:1 - SQL injection vulnerability in table name parameter used in query construction.
src/databricks/labs/dqx/profiler/profiler.py:1 - Overly broad exception handling silently ignores all cleanup errors. Consider logging the error or catching only expected exceptions like
AnalysisExceptionto avoid masking unexpected issues.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Changes
LLM based Pk detector
Linked issues
#484
Resolves #..
Tests