Deep Dive: The Ontology Engine

Role: The Schema Authority, Validation Layer, and Metamodel Governor.

1. Executive Summary & Philosophy

The Ontology Engine is the "Constitutional Court" of ModelKG. It defines the unbreachable laws of physics for the data universe. It exists to solve the fundamental paradox of Graph Databases: Flexibility vs. Integrity.

In a raw Graph Database (like Neo4j), the schema is optional and effectively "schema-on-read": unless you declare constraints, the database accepts whatever you write. If you write a node with age: "thirty" (string) today and age: 30 (int) tomorrow, the database does not object. Your application, however, fails when it tries to calculate the average age.

ModelKG flips this model to Schema-on-Write. We assert that for a Knowledge Graph to be a "System of Record" rather than a "Dumpster", we must continuously enforce semantic consistency before persistence.

      TRADITIONAL (Schema-on-Read)             MODELKG (Schema-on-Write)
      +------------------------+              +------------------------+
      | Any Data (Garbage In)  |              | Any Data               |
      +-----------+------------+              +-----------+------------+
                  |                                       |
                  v                                       v
      +-----------+------------+              +-----------+------------+
      |       DATABASE         |              |    ONTOLOGY ENGINE     |
      |   {age: "thirty"}      |              |   Checks: isInt(age)?  |
      |   {age: 30}            |              +-----------+------------+
      +-----------+------------+                          |
                  |                             (Only Valid Data Passes)
                  v                                       |
      +-----------+------------+                          v
      |   APPLICATION CRASH    |              +-----------+------------+
      |   avg(age) = Error     |              |   CLEAN GRAPH DB       |
      +------------------------+              +------------------------+
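As a minimal sketch of the schema-on-write gate in the diagram, the following rejects a payload before it reaches storage. The schema dictionary, the function names, and the in-memory list standing in for the graph database are all assumptions made for illustration, not ModelKG's actual API.

```python
from typing import Any

# Hypothetical schema: property name -> required Python type.
PERSON_SCHEMA: dict[str, type] = {"name": str, "age": int}


def validate_payload(payload: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of violations; an empty list means the payload may be persisted."""
    errors = []
    for field, expected in schema.items():
        if field not in payload:
            errors.append(f"missing property: {field}")
            continue
        value = payload[field]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        wrong_type = not isinstance(value, expected)
        bool_as_int = expected is int and isinstance(value, bool)
        if wrong_type or bool_as_int:
            errors.append(f"{field}: expected {expected.__name__}")
    return errors


def write_node(store: list[dict], payload: dict[str, Any]) -> None:
    """Persist only if validation passes (schema-on-write)."""
    errors = validate_payload(payload, PERSON_SCHEMA)
    if errors:
        raise ValueError("; ".join(errors))
    store.append(payload)
```

With this gate in place, write_node(store, {"name": "Bob", "age": "thirty"}) raises ValueError and the store is left unchanged.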

1.1. The Strategic Value of Explicit Ontology

  1. Cognitive Offloading: No developer can hold the entire entity relationship model of a large enterprise in their head. The Ontology is living documentation. You don't ask a senior engineer "Can a Server connect to a Business Process?"; you ask the Ontology API.
  2. Generic Tooling: Because the schema is introspectable, we can build generic UI components (Generic Table, Generic Form, Generic Explorer) that adapt automatically. A frontend form component can query the definition of Incident and automatically render a date-picker for the occurred_at field without custom code.
  3. Governance & Compliance: By defining "Data Classification" at the schema level (e.g., via a Confidential trait), we ensure that PII rules are applied automatically to every new instance of that type.

2. Structural Mechanics: The Metamodel Primitive

The Ontology is composed of three core primitives: Concepts, Relationships, and Traits.

2.1. Concepts (The Nouns)

A Concept represents a class of things in the universe. It is analogous to a Class in OOP or a Table in SQL, but more flexible.

Inheritance & Polymorphism

ModelKG supports multiple inheritance (Mixins) and hierarchical inheritance.

           [ Concept: Asset ]
           (Abstract: True)
           (Props: asset_id, cost)
               ^       ^
               |       | (Inherits)
               |       |
      +--------+       +---------+
      |                          |
[ Concept: Server ]       [ Concept: Laptop ]
(Props: cpu, ram)         (Props: battery_level)
      ^
      |
[ Concept: DB_Server ]
(Props: storage_type)
  • Hierarchical: DB_Server inherits from Server, and Server inherits from Asset.
    • Reasoning: This allows for Liskov Substitution. A query for MATCH (n:Asset) also returns DB_Server nodes, because every DB_Server is an Asset.
  • Abstract Concepts: Concepts can be marked abstract: true. You cannot create a node that is just an Asset. It must be a concrete implementation like Laptop.
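One way to picture the hierarchy above is to resolve each concept to the full list of labels a node would carry, and to refuse instantiation of abstract concepts. This is a sketch under assumptions: the Concept dataclass, the REGISTRY, and the idea that ancestors are stored as extra labels are illustrative choices, not a description of how ModelKG actually persists inheritance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    name: str
    parents: tuple = ()   # names of parent concepts (multiple parents allowed)
    abstract: bool = False


# Hypothetical registry mirroring the diagram above.
REGISTRY = {
    c.name: c
    for c in (
        Concept("Asset", abstract=True),
        Concept("Server", parents=("Asset",)),
        Concept("Laptop", parents=("Asset",)),
        Concept("DB_Server", parents=("Server",)),
    )
}


def labels_for(concept_name: str) -> list:
    """Return the concept plus all ancestors, nearest first, without duplicates."""
    ordered, queue = [], [concept_name]
    while queue:
        name = queue.pop(0)
        if name not in ordered:
            ordered.append(name)
            queue.extend(REGISTRY[name].parents)
    return ordered


def instantiate(concept_name: str) -> list:
    """Reject abstract concepts; otherwise return the labels a new node would carry."""
    if REGISTRY[concept_name].abstract:
        raise TypeError(f"{concept_name} is abstract and cannot be instantiated")
    return labels_for(concept_name)
```

Because a DB_Server node would carry the Asset label under this scheme, a query on Asset matches it without any special casing.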

Property System

Properties are not just Key-Values. They are strictly typed descriptors.

  • Primitive Types: string, integer, float, boolean, date, datetime.
  • Complex Types: json (for unstructured payloads), point (geospatial).
  • Enums: Restricted lists of values (e.g., Status: ['Draft', 'Active', 'Archived']).
    • Why? Stringly-typed status fields are a source of constant bugs (e.g., In Progress vs in-progress vs In_Progress). Enums enforce normalization.
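The normalization argument can be illustrated with Python's standard enum module. The Status values come from the example above; the parse_status helper is a hypothetical name used only for this sketch.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "Draft"
    ACTIVE = "Active"
    ARCHIVED = "Archived"


def parse_status(raw: str) -> Status:
    """Accept only the exact declared values; anything else is rejected."""
    try:
        return Status(raw)
    except ValueError:
        allowed = [s.value for s in Status]
        raise ValueError(f"invalid status {raw!r}; allowed: {allowed}") from None
```

Here parse_status("Active") succeeds, while variants such as "active" or "ACTIVE" are rejected instead of silently creating a second spelling of the same state.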

2.2. Relationships (The Verbs)

Relationships are first-class citizens. In SQL, a relationship is often a hidden "Foreign Key" or a "Join Table". In ModelKG, it is a tangible entity.

[ Source: Person ] --( Relationship: EMPLOYED_BY )--> [ Target: Company ]
                         |
                         | Properties:
                         +-- start_date: Date
                         +-- role: String
                         +-- salary_grade: Int

Directed Semantics

All relationships in ModelKG are directed (Source -> Target). However, the Ontology can declare a relationship as "Semantically Symmetric" (e.g., PEER_OF). Even if stored as A->B, the query engine knows to treat it as bidirectional for logic purposes.
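A rough sketch of how a query layer might honor the symmetric flag: edges are stored once in a single direction, and traversal follows them in reverse only for types declared symmetric. The SYMMETRIC table, the EDGES list, and the neighbors function are hypothetical stand-ins for the real ontology and query engine.

```python
# Hypothetical ontology flags: relationship type -> is it semantically symmetric?
SYMMETRIC = {"PEER_OF": True, "EMPLOYED_BY": False}

# Edges are stored once, always as (source, type, target).
EDGES = [("alice", "PEER_OF", "bob"), ("alice", "EMPLOYED_BY", "acme")]


def neighbors(node: str, rel_type: str) -> set:
    """Follow edges outward; for symmetric types, also follow them in reverse."""
    found = set()
    for source, kind, target in EDGES:
        if kind != rel_type:
            continue
        if source == node:
            found.add(target)
        elif target == node and SYMMETRIC.get(rel_type, False):
            found.add(source)
    return found
```

Asking for bob's PEER_OF neighbors returns alice even though the edge is stored as alice -> bob, whereas acme has no outgoing EMPLOYED_BY neighbors.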

2.3. Traits (The Adjectives / Interfaces)

Traits are the superpower of the Ontology Engine. They allow us to standardize behavior across unrelated domains.

The "Auditable" Trait

Every impactful node should track its history. Instead of defining created_by on 50 different concepts, we define it once in the Auditable trait (Interface) and mix it in.

  • Impact: The system automatically validates that user context is present when modifying any node implementing this trait.
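The impact described above could look roughly like the following. The TRAITS mapping, the apply_write function, and the updated_by / updated_at field names are assumptions for illustration; the source names only created_by.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical mapping of concept -> traits it implements.
TRAITS = {"Incident": {"Auditable"}, "Tag": set()}


def apply_write(concept: str, props: dict, user: Optional[str] = None) -> dict:
    """Stamp audit fields for Auditable concepts; refuse writes without user context."""
    result = dict(props)
    if "Auditable" in TRAITS.get(concept, set()):
        if not user:
            raise PermissionError(f"{concept} is Auditable: user context is required")
        result["updated_by"] = user  # hypothetical field name
        result["updated_at"] = datetime.now(timezone.utc).isoformat()
    return result
```

The check lives in one place, so a new concept gains the behavior simply by listing the trait.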

The "Lifecycle" Trait

Many objects move through states. The Lifecycle trait enforces a status property and creates a "State Machine" constraint.

  • Logic: It prevents illegal transitions. e.g., you cannot go from Draft to Archived without passing through Active.
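The state-machine constraint can be sketched as a transition table. The three states come from the Status example earlier in this document; the TRANSITIONS table and the transition function are illustrative assumptions.

```python
# Allowed transitions; Draft -> Archived is deliberately absent.
TRANSITIONS = {
    "Draft": {"Active"},
    "Active": {"Archived"},
    "Archived": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new state if the move is legal, otherwise raise."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```

A node can therefore reach Archived only by going Draft -> Active -> Archived.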

3. Behavioral Semantics: Constraints & Logic

The structure defines what data can exist. The Constraints define what data must (or must not) exist.

3.1. Cardinality

We enforce strict cardinality rules on relationships.

(1..1) REQUIRED ONE
[ Task ] ---------------> [ Project ]
(Every task MUST belong to a project)

(0..1) OPTIONAL ONE
[ User ] ---------------> [ Manager ]
(A user might be the CEO, having no manager)

(0..*) MANY
[ Project ] ------------> [ Document ]
(A project can have zero or unlimited documents)

Why this matters: In many systems, "Orphan Nodes" (data disconnected from the main graph) trigger silent failures in reports. A cardinality with a minimum of one (such as 1..1 above) prevents orphans at the root: a write that would leave a node without its required relationship is rejected.
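The three cardinalities listed above reduce to a (minimum, maximum) pair checked against the number of relationships a node has. The Cardinality class and constant names below are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cardinality:
    minimum: int
    maximum: Optional[int]  # None means unbounded ("*")

    def allows(self, count: int) -> bool:
        """True if a node with `count` relationships of this type is valid."""
        if count < self.minimum:
            return False
        return self.maximum is None or count <= self.maximum


REQUIRED_ONE = Cardinality(1, 1)     # (1..1) Task -> Project
OPTIONAL_ONE = Cardinality(0, 1)     # (0..1) User -> Manager
MANY = Cardinality(0, None)          # (0..*) Project -> Document
```

A Task with zero Project links fails REQUIRED_ONE.allows(0), which is the check that stops an orphan from being written.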

3.2. Topological Constraints

Graph theory allows for dangerous structures like Cycles.

  • DAG Enforcement: For dependencies (e.g., Project Tasks, Software Dependencies), the Ontology can enforce a Directed Acyclic Graph (DAG) constraint.
  • Mechanism: Before adding edge A->B, the engine runs a lightweight "Shortest Path" (reachability) check in the opposite direction, from B to A. If such a path already exists, adding A->B would close a loop, so the write is rejected.
CYCLE PREVENTION LOGIC:

Existing: [ A ] ---> [ B ] ---> [ C ]

Action: User tries to add [ C ] ---> [ A ]

Check:  Does Path(A -> ... -> C) exist? -> YES

Result: BLOCK WRITE (Cycle Detected)
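The check above can be sketched with a breadth-first search over an adjacency map. This swaps in plain BFS reachability for the engine's "Shortest Path" query; the function names and the dictionary-based graph are assumptions for illustration.

```python
from collections import deque


def path_exists(edges: dict, start: str, goal: str) -> bool:
    """Breadth-first search along directed edges from start to goal."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in edges.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def add_edge(edges: dict, source: str, target: str) -> None:
    """Add source -> target unless target can already reach source."""
    if path_exists(edges, target, source):
        raise ValueError(f"cycle detected: {target} already reaches {source}")
    edges.setdefault(source, set()).add(target)
```

With A -> B -> C in place, add_edge(edges, "C", "A") is rejected because A already reaches C, while add_edge(edges, "A", "C") is accepted because it only adds a shortcut.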

3.3. Uniqueness & Identity

Neo4j supports generic uniqueness constraints; ModelKG layers Semantic Identity on top of them.

  • Global ID: UUIDv4 used for system referencing.
  • Natural Key: The human-readable identifier (e.g., hostname for a Server, email for a Person).
  • Scope: The Ontology defines scope. Project.name might only need to be unique within a Department, whereas User.email must be globally unique.
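Scoped uniqueness can be modeled by including the scope value in the uniqueness key. In this sketch the UNIQUE_SCOPE table, the UniquenessIndex class, and the department property name are hypothetical; a real deployment would back this with database constraints rather than an in-memory set.

```python
# Hypothetical scope rules: (concept, property) -> property that defines the scope,
# or None for global uniqueness.
UNIQUE_SCOPE = {("Project", "name"): "department", ("User", "email"): None}


class UniquenessIndex:
    def __init__(self) -> None:
        self._seen = set()

    def claim(self, concept: str, prop: str, node: dict) -> None:
        """Register a natural key, raising if it collides within its scope."""
        scope_prop = UNIQUE_SCOPE[(concept, prop)]
        scope_value = node[scope_prop] if scope_prop else None
        key = (concept, prop, scope_value, node[prop])
        if key in self._seen:
            raise ValueError(f"duplicate {concept}.{prop}: {node[prop]!r}")
        self._seen.add(key)
```

Two projects named "Apollo" are accepted when they belong to different departments, but a second User with the same email is rejected regardless of any other property.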

4. The Validation Pipeline Architecture

How do we perform these checks without slowing the system to a crawl?

4.1. The "Hot Schema" Cache

We do not query Postgres for every write. The Ontology Engine publishes the compiled schema to Redis and local memory in the Graph Core.

  • Versioning: Each schema bundle has a hash. If the Graph Core sees a new hash, it hot-reloads the schema validation rules.
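A minimal sketch of hash-based hot reloading follows. The choice of SHA-256 over canonical JSON and the SchemaCache class are assumptions; the source states only that each bundle has a hash and that a new hash triggers a reload.

```python
import hashlib
import json


def schema_hash(schema: dict) -> str:
    """Hash a canonical JSON encoding so key order does not change the result."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SchemaCache:
    """In-process copy of the compiled schema, reloaded only when the hash changes."""

    def __init__(self) -> None:
        self.current_hash = None
        self.schema = {}
        self.reloads = 0

    def offer(self, schema: dict) -> bool:
        """Return True if the offered bundle replaced the cached one."""
        new_hash = schema_hash(schema)
        if new_hash == self.current_hash:
            return False
        self.current_hash, self.schema = new_hash, schema
        self.reloads += 1
        return True
```

Canonicalizing before hashing means that re-publishing the same schema with keys in a different order does not trigger a needless reload.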

4.2. Validation Steps

  1. Syntactic Validation (Fast):
    • Input: JSON payload.
    • Check: Do fields match types? (Integers are integers).
    • Location: In-Memory (Pydantic models generated from Ontology).
  2. Structural Validation (Medium):
    • Input: Source/Target IDs.
    • Check: Is Source actually a Person? Is Target actually a Project?
    • Location: Neo4j Index Lookups (very fast).
  3. Semantic Validation (Slow/Complex):
    • Input: The Graph Topology.
    • Check: Cycle detection, Max Depth limits.
    • Location: Graph Algorithm execution. Note: These are only run for specific Relationship Types flagged as complex_constraint: true.
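The three steps above can be sketched as an ordered pipeline that runs cheap checks first, stops at the first failure, and runs the expensive stage only when the relationship type is flagged. The check functions are stand-ins written in plain Python (no Pydantic or Neo4j calls), and the request fields such as hours and closes_cycle are hypothetical.

```python
def check_types(req: dict) -> list:
    """Stage 1 (in memory): do fields match their declared types?"""
    return [] if isinstance(req.get("hours"), int) else ["hours must be an integer"]


def check_endpoints(req: dict) -> list:
    """Stage 2 (stand-in for an index lookup): are source and target the right concepts?"""
    ok = req.get("source_type") == "Person" and req.get("target_type") == "Project"
    return [] if ok else ["expected Person -> Project"]


def check_topology(req: dict) -> list:
    """Stage 3 (stand-in for a graph algorithm such as cycle detection)."""
    return ["would create a cycle"] if req.get("closes_cycle") else []


def validate_write(req: dict, complex_constraint: bool) -> list:
    """Run cheap stages first, stop at the first failure, gate the slow stage on the flag."""
    stages = [("syntactic", check_types), ("structural", check_endpoints)]
    if complex_constraint:
        stages.append(("semantic", check_topology))
    for name, check in stages:
        errors = check(req)
        if errors:
            return [f"{name}: {e}" for e in errors]
    return []
```

Ordering the stages by cost means a malformed payload is rejected in memory and never reaches the database or a graph traversal.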

5. Schema Evolution & Migration

In SQL, ALTER TABLE is a nightmare. In a graph, schema change is more nuanced.

5.1. The "Deprecation" Workflow

We rarely delete fields instantly. We use a 3-phase lifecycle for Schema Changes:

  1. Active: Property is required and enforced.
  2. Deprecated: Property is optional, but logs a warning if used. Frontend hides it.
  3. End-of-Life: Property is rejected.
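The three phases translate into a small per-property check. The Phase enum and check_property function are hypothetical names, and Python's warnings module stands in for whatever logging the engine actually uses.

```python
import warnings
from enum import Enum


class Phase(Enum):
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    END_OF_LIFE = "end_of_life"


def check_property(name: str, phase: Phase, payload: dict) -> None:
    """Enforce the lifecycle phase of one property against an incoming payload."""
    present = name in payload
    if phase is Phase.ACTIVE and not present:
        raise ValueError(f"{name} is required")
    if phase is Phase.DEPRECATED and present:
        # Still accepted, but the caller is told to stop sending it.
        warnings.warn(f"{name} is deprecated", DeprecationWarning, stacklevel=2)
    if phase is Phase.END_OF_LIFE and present:
        raise ValueError(f"{name} has reached end-of-life and is rejected")
```

Moving a property from one phase to the next is then a metadata change rather than a code change in every client.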

5.2. Data Patching

When a constraint is tightened (e.g., Status becomes required), some existing data may no longer be valid.

  • The Ontology Engine generates a Compliance Report: "500 nodes missing 'status'."
  • It does not allow the Schema Upgrade until the Action Executor runs a Migration Job (e.g., "Set default status = 'Active'").
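The report-then-migrate gate can be sketched as three small functions. All names here are hypothetical, and a list of dictionaries stands in for the graph; the real Action Executor and Migration Job are not shown in the source.

```python
def compliance_report(nodes: list, required: str) -> list:
    """Return ids of nodes that would violate the tightened constraint."""
    return [n["id"] for n in nodes if n.get(required) is None]


def upgrade_schema(nodes: list, required: str) -> str:
    """Refuse the upgrade while any node is non-compliant."""
    violations = compliance_report(nodes, required)
    if violations:
        raise RuntimeError(f"{len(violations)} nodes missing {required!r}")
    return "upgraded"


def migrate_default(nodes: list, prop: str, default: str) -> int:
    """Hypothetical migration job: fill the default where the property is absent."""
    patched = 0
    for n in nodes:
        if n.get(prop) is None:
            n[prop] = default
            patched += 1
    return patched
```

The upgrade call fails until the migration has run, after which the same call succeeds with no special override.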

6. Real-World Use Case Scenarios

Use Case A: Enterprise Access Management (RBAC)

  • Problem: "Role Explosion" - thousands of AD groups.
  • Constraint: Segregation of Duties (SoD). A user must not reach the same resource through two conflicting roles.
      /--[Role:Approver]--\
[User]                     [Resource:PaymentSystem]
      \--[Role:Requester]-/

LOGIC: Ontology Trait "Segregated" on Roles checks path distinctness.
RESULT: User cannot hold both roles for the same system.
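A simplified sketch of the SoD rule: a grant is denied when the new role conflicts with a role the user already holds on the same resource. The ROLE_RESOURCE and CONFLICTS tables and the Viewer/Wiki entries are invented for illustration; the real "Segregated" trait is described as checking path distinctness in the graph.

```python
# Hypothetical data: role -> resource it grants access to, plus conflicting role pairs.
ROLE_RESOURCE = {
    "Approver": "PaymentSystem",
    "Requester": "PaymentSystem",
    "Viewer": "Wiki",
}
CONFLICTS = {frozenset({"Approver", "Requester"})}


def can_grant(held: set, new_role: str) -> bool:
    """Deny the grant if it conflicts with a held role on the same resource."""
    for role in held:
        same_resource = ROLE_RESOURCE[role] == ROLE_RESOURCE[new_role]
        if same_resource and frozenset({role, new_role}) in CONFLICTS:
            return False
    return True
```

A user holding Approver is refused Requester on PaymentSystem, but can still be granted Viewer on an unrelated resource.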

Use Case B: The "HearthOS" (Contextual Intelligence)

  • Problem: Context is vague.
  • Ontology Solution:
    • Concepts: Context (e.g., "Deep Work").
    • Rule: Task --BEST_FOR--> Context.
    • Inference: When the user's phone state is "Driving", the system queries MATCH (t:Task)-[:BEST_FOR]->(:Context {name: 'Commuting'}).
[ Phone State: Driving ] ---> [ Context: Commuting ]
                                     ^
                                     | (Filter)
[ Task: "Call Mom" ] ----------------+
[ Task: "Code" ]     (Excluded: requires 'Deep Work')
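The filtering shown in the diagram amounts to a two-step lookup: map the phone state to a Context, then keep the tasks linked to that Context by BEST_FOR. The dictionaries and the suggest_tasks function are illustrative stand-ins for the Cypher query above.

```python
# Hypothetical in-memory stand-ins for graph data.
STATE_TO_CONTEXT = {"Driving": "Commuting"}
BEST_FOR = {"Call Mom": {"Commuting"}, "Code": {"Deep Work"}}


def suggest_tasks(phone_state: str) -> list:
    """Return tasks whose BEST_FOR contexts include the one implied by the phone state."""
    context = STATE_TO_CONTEXT.get(phone_state)
    if context is None:
        return []
    return sorted(task for task, contexts in BEST_FOR.items() if context in contexts)
```

For the state "Driving" this yields only "Call Mom"; "Code" is excluded because it is linked to "Deep Work".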

7. Comparison with Standards (OWL/SHACL)

Why build a custom engine instead of using W3C standards like OWL or SHACL?

  1. Complexity: OWL is designed for "Open World" reasoning (inferring truth). We operate in a "Closed World" (validation constraints). We don't want to guess data; we want to enforce it.
  2. Performance: SHACL validation can be computationally expensive to run on every transaction. Our lightweight JSON-schema-based approach keeps validation at O(1) relative to graph size for roughly 90% of operations.
  3. Developer Experience: Asking a modern Full-Stack developer to write RDF/XML is a non-starter. JSON definitions are native to the TypeScript/Python ecosystem.

8. Conclusion

The Ontology Engine is not just a "checker". It is the Blueprint of Reality for the organization. By investing effort in defining the Ontology, we move complexity out of the Application Code (millions of if statements) and into the Metadata layer, where it is visible, manageable, and enforceable.