Rdf ingestion mvp #15741
base: master
- Add RDF ingestion source for glossary terms, domains, and relationships
- Streamlined architecture: extractors return DataHub AST directly
- Removed unnecessary abstraction layers (RDF AST, converters where not needed)
- Support for SKOS, OWL, and other RDF vocabularies
- Comprehensive test coverage with 128 passing tests
- UI integration for RDF source configuration

- Remove build_relationship_mcps() method from GlossaryTermMCPBuilder
- Update tests to use RelationshipMCPBuilder directly
- Clean separation: glossary_term handles terms, relationship handles relationships
- Refactor _generate_workunits_from_ast to reduce complexity

…ance
- Enhance error handling in RDFSource to provide actionable messages for missing files, malformed RDF, and invalid formats.
- Implement unit tests to verify error handling behavior and ensure graceful degradation.
- Update glossary term URN generation to use dot notation for hierarchical paths.
- Improve logging for large file processing and ensure consistent URN formats across glossary nodes and terms.
- Refactor methods to yield MCPs for memory efficiency during processing.

…uidance
- Added helper text for various RDF source fields including source, format, extensions, recursive processing, environment, and dialect.
- These enhancements aim to provide clearer instructions and examples for users configuring RDF ingestion settings.

- Introduced new RDF platform entry in capability_summary.json with detailed capabilities including deletion detection, tags, ownership, lineage, data profiling, domains, descriptions, and platform instance support.
- Each capability includes a description and support status to enhance clarity for users configuring RDF ingestion.

- Changed the warning filter to ignore specific SQLAlchemy warnings.
- Added a new dependency for RDF support in the setup configuration.

…ters
- Introduced a new documentation file for the RDF ingestion source.
- Updated type hints across various classes to use `Optional` for context parameters, enhancing code clarity and type safety.
- Adjusted method signatures in `EntityExtractor`, `EntityConverter`, `EntityMCPBuilder`, and related classes to reflect these changes.

- Included "rdf" in both base development and full test development requirements in setup.py to ensure proper support for RDF ingestion.

This reverts commit 48a0118.

- Added unit tests for duplicate term definition handling, ensuring correct extraction behavior for same URIs and properties.
- Implemented comprehensive validation tests for RDF source configuration, covering required fields, type checks, and value constraints.
- Introduced connection testing unit tests to verify functionality and error handling for various scenarios, including file existence and RDF format validation.
- Developed edge case tests to handle scenarios like empty files, circular relationships, and special characters in paths.
- Enhanced error handling tests to ensure actionable feedback for file not found, invalid format, and unsupported extensions.

- Updated test_duplicate_handling.py to use URIRef and Literal for RDF terms, enhancing readability and maintainability.
- Modified test_rdf_config.py to specify type for config_dict, improving type safety.
- Adjusted test_rdf_source_errors.py to refine error assertion logic for better clarity in failure messages.

- Updated test_rdf_config.py to clarify format and environment validation logic, ensuring accurate assertions for invalid formats and environments.
- Refined test cases to reflect changes in default values and type coercion for recursive configurations.
- Improved test_rdf_connection.py and test_rdf_edge_cases.py by adding namespace prefixes for better RDF structure clarity.
- Enhanced error handling in test_rdf_source_errors.py to ensure actionable feedback for various failure scenarios.

- Introduced a new RDFSourceConfig class for managing RDF ingestion configurations, including options for source paths, formats, and stateful ingestion.
- Updated the GlossaryTermMCPBuilder to dynamically set the termSource based on the presence of the term's source, ensuring accurate metadata representation for RDF terms.
- Improved documentation within the code to clarify the behavior and expectations for term source handling in RDF ingestion.

- Added a new entry for the Airflow data platform in the sources.json file, including its URN, name, and display name.
- This update enhances the ingestion capabilities by incorporating Airflow as a supported source.

- Added support for loading RDF data from local and remote zip files.
- Implemented functionality to handle web folder URLs with HTML directory listings.
- Updated documentation examples to include zip file usage.
- Added unit tests for zip file and web folder loading, including edge cases for error handling and security checks.

… be regenerated by CI)
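
The zip-loading commit above mentions security checks without showing them. A minimal sketch of what such a check could look like follows, assuming the archive is read in memory and that entries with absolute paths or `..` segments are skipped. The names `iter_rdf_members`, `is_safe_member`, `RDF_EXTENSIONS`, and `MAX_MEMBER_BYTES` are hypothetical and are not taken from the PR.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Iterator, Tuple

# Assumed values for illustration; the PR's actual extension list and limits are not shown.
RDF_EXTENSIONS = (".ttl", ".rdf", ".owl", ".jsonld", ".n3", ".nt")
MAX_MEMBER_BYTES = 50 * 1024 * 1024


def is_safe_member(name: str) -> bool:
    """Reject archive entries that could escape the archive root (zip-slip)."""
    path = PurePosixPath(name)
    return not path.is_absolute() and ".." not in path.parts


def iter_rdf_members(zip_bytes: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, content) for each safe RDF-looking file in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if not is_safe_member(info.filename):
                continue  # skip entries such as "../evil.ttl"
            if not info.filename.lower().endswith(RDF_EXTENSIONS):
                continue  # skip non-RDF files
            if info.file_size > MAX_MEMBER_BYTES:
                continue  # skip oversized members
            # Read directly from the archive; nothing is extracted to disk.
            yield info.filename, archive.read(info)
```

Reading members in memory rather than extracting them is one way to sidestep path traversal entirely; the name check is kept here as a second guard in case the names are later used as file paths.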
Linear: ING-1308

Meticulous spotted 0 visual differences across 979 screens tested, after evaluating ~8 hours of user flows against this PR. Last updated for commit 83de506.

Bundle Report: changes will increase total bundle size by 3.34kB (0.01%), which is within the configured threshold. Affected bundle: datahub-react-web-esm.
Summary
This PR introduces a new RDF ingestion source for DataHub, enabling ingestion of RDF/OWL ontologies (Turtle, RDF/XML, JSON-LD, N3, N-Triples) with a focus on business glossaries. The source extracts glossary terms, term hierarchies, and relationships from RDF files using standard vocabularies like SKOS, OWL, and RDFS.
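
To make the mapping concrete, here is a minimal sketch of extracting glossary terms from SKOS/OWL triples. It is not the PR's extractor: triples are plain string tuples rather than a parsed graph so the example has no dependencies, and the `Term` class, the `extract_terms` function, and the use of `skos:prefLabel` as the term name are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


@dataclass
class Term:
    """Hypothetical stand-in for a glossary term; not a DataHub class."""
    uri: str
    name: str = ""
    definition: str = ""
    related: List[str] = field(default_factory=list)


def extract_terms(triples: Iterable[Triple]) -> Dict[str, Term]:
    triples = list(triples)
    # Pass 1: every skos:Concept or owl:Class subject becomes a term.
    terms = {
        s: Term(uri=s)
        for s, p, o in triples
        if p == RDF_TYPE and o in (SKOS + "Concept", OWL_CLASS)
    }
    # Pass 2: attach labels, definitions, and hierarchy links to known terms.
    for s, p, o in triples:
        term = terms.get(s)
        if term is None:
            continue
        if p == SKOS + "prefLabel":
            term.name = o
        elif p == SKOS + "definition":
            term.definition = o  # skos:definition wins over rdfs:comment
        elif p == RDFS_COMMENT and not term.definition:
            term.definition = o
        elif p in (SKOS + "broader", SKOS + "narrower"):
            term.related.append(o)
    return terms
```

Two passes keep the logic independent of triple order: a label or definition that appears before the `rdf:type` statement is still attached to its term.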
What's New
Core Features
- … (`type: rdf`) - Native DataHub plugin for RDF/OWL ontologies
- … `skos:Concept` and `owl:Class` to DataHub GlossaryTerms
- … `skos:broader` and `skos:narrower` relationships as `isRelatedTerms`
- … `stateful_ingestion` config
- … `platform_instance` config
Architecture
- … `test_connection()` for connection validation
Capabilities
The source supports the following DataHub capabilities:
- … `skos:broader` and `skos:narrower`
- … `stateful_ingestion.enabled: true`
- … `platform_instance` config
- … (`skos:definition` or `rdfs:comment`)
Testing
Test Coverage
- … (`export_only`, `skip_export`)
Test Files
- `tests/unit/rdf/` - Unit tests for individual components
- `tests/integration/rdf/` - Integration tests with golden file validation
Documentation
User Documentation
- `docs/sources/rdf/rdf.md` - Comprehensive user guide (489 lines)
Recipe Examples
- `docs/sources/rdf/rdf_recipe.yml` - Example recipes for basic and stateful ingestion
Integration Test Documentation
- `tests/integration/rdf/README.md` - Detailed guide for running integration tests
Configuration Example
source:
  type: rdf
  config:
    source: ./glossary.ttl
    format: turtle
    environment: PROD
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    export_only:
      - glossary
Files Changed
Technical Notes
Security & Performance
Code Quality
New Files
- `src/datahub/ingestion/source/rdf/ingestion/rdf_source.py` - Main source implementation
- `src/datahub/ingestion/source/rdf/core/rdf_loader.py` - RDF loading utilities with security
- `src/datahub/ingestion/source/rdf/core/urn_generator.py` - URN generation with encoding
- `src/datahub/ingestion/source/rdf/entities/base.py` - Base interfaces for entity processing
- `src/datahub/ingestion/source/rdf/entities/registry.py` - Thread-safe entity registry
- `docs/sources/rdf/rdf.md` - User documentation
- `docs/sources/rdf/rdf_recipe.yml` - Recipe examples
- `tests/integration/rdf/test_rdf_source.py` - Integration tests
- `tests/unit/rdf/` - Unit tests (multiple files)
Modified Files
- `setup.py` - Added RDF source to entry points (line 862)
Breaking Changes
None - This is a new feature addition with no breaking changes to existing functionality.
Support Status
The RDF source is marked as INCUBATING (`SupportStatus.INCUBATING`), indicating it's ready for community adoption but may have minor version changes in future releases based on feedback.
Checklist
- … `setup.py`
- … (`@platform_name`, `@config_class`, `@support_status`)
- `test_connection()` implemented
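
For readers unfamiliar with what a connection check for a file-based source involves, the sketch below shows one plausible shape: validate the format, then confirm the source path exists and is not an empty file. It does not use DataHub's own report classes; `ConnectionCheck`, `check_connection`, and the `ASSUMED_FORMATS` set are hypothetical, and only the `turtle` format value is taken from the recipe in this PR.

```python
from dataclasses import dataclass
from pathlib import Path

# Assumed format identifiers; only "turtle" appears in the PR's example recipe.
ASSUMED_FORMATS = {"turtle", "xml", "json-ld", "n3", "nt"}


@dataclass
class ConnectionCheck:
    """Hypothetical result type: success flag plus an actionable message."""
    ok: bool
    message: str


def check_connection(source: str, rdf_format: str) -> ConnectionCheck:
    if rdf_format not in ASSUMED_FORMATS:
        return ConnectionCheck(
            False,
            f"Unsupported format '{rdf_format}'; expected one of {sorted(ASSUMED_FORMATS)}",
        )
    path = Path(source)
    if not path.exists():
        return ConnectionCheck(False, f"Source not found: {source}")
    if path.is_file() and path.stat().st_size == 0:
        return ConnectionCheck(False, f"Source file is empty: {source}")
    return ConnectionCheck(True, "Source is reachable")
```

Returning a message alongside the flag mirrors the PR's stated goal of actionable feedback for missing files and invalid formats.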