Skip to content

Latest commit

 

History

History
329 lines (220 loc) · 21.9 KB

File metadata and controls

329 lines (220 loc) · 21.9 KB

Critical Analysis: Ontology Atomization Plan

Executive Summary

The atomization plan proposes decomposing monolithic Turtle ontology files into per-entity JSON files with PostgreSQL indexing and auto-generated Turtle. While it solves real pain points around merge conflicts and query performance, the plan introduces substantial new complexity, abandons standard ontology tooling interoperability, and may create worse problems at the file-system and Git layers than the ones it solves. This analysis examines the trade-offs and explores alternative approaches.


1. Strengths of the Plan

1.1 Meaningful Git Diffs

The single strongest argument for atomization. A monolithic 50K-concept Turtle file produces unreadable diffs when a single annotation changes. Per-entity files make git log --follow classes/Fraud_e5f6g7h8.json possible, giving each concept a traceable edit history.

1.2 Elimination of Merge Conflicts

With one file, two users editing different classes on different branches will produce a textual merge conflict even though the changes are semantically independent. Per-entity files make this structurally impossible (unless both users edit the same entity).

1.3 PostgreSQL-Backed Queries

Replacing in-memory RDFLib graph traversal with indexed PostgreSQL queries is sound engineering for scale. The current architecture loads the entire graph into memory for every tree/search/detail operation. At 50K concepts this becomes a real bottleneck — both in RAM and in parse time.

1.4 Parallel Editing

The per-entity write path (JsonFileManager.write_entity) enables fine-grained locking. Two users can save different entities concurrently without contention on a single file.


2. Weaknesses and Risks

2.1 Abandoning Turtle as Source of Truth

The most consequential decision in the plan. Turtle (and RDF/OWL generally) is the lingua franca of the semantic web. By making JSON the source of truth and Turtle a "generated artifact," the plan:

  • Breaks round-trip fidelity. The custom_axioms field (raw Turtle snippets embedded in JSON) is an escape hatch that admits the JSON schema cannot represent all of OWL. Any axiom not covered by the structured fields — property chains with intermediate variables, complex SWRL rules, GCI axioms (General Class Inclusion axioms that don't "belong" to a single named class), punning, annotation assertions on axioms, etc. — gets dumped into an unstructured string. The plan even sets a monitoring alert for when custom_axioms exceeds 10% of entities, acknowledging this is a known gap.

  • Creates a lossy intermediate format. The Turtle → JSON migration (Phase 7) must parse every construct in an arbitrary OWL ontology and map it to the fixed JSON schema. Anything it cannot map goes to custom_axioms. But custom_axioms is per-entity — what about axioms that span multiple entities or belong to no named entity? GCI axioms like SubClassOf(ObjectSomeValuesFrom(:hasPart :Wing) :Bird) don't naturally attach to either :Bird or :Wing. The plan does not address where these go.

  • Duplicates representation. Every entity now exists in three places: a JSON file, a PostgreSQL row, and a line range in ontology.ttl. Any bug in synchronization creates silent data divergence. The plan calls for regenerating Turtle "after every entity save," which at 50K entities means the entire ontology must be reassembled and serialized on every single edit.

  • Loses ecosystem compatibility. Standard tools (Protege, ROBOT, OWL API, SPARQL endpoints, TopBraid, RDFLib itself) all operate on standard serializations. A JSON-per-entity format is proprietary to OntoKit. Users who want to run SPARQL queries, use ROBOT for CI/CD validation, or load their ontology into a triple store must use the generated Turtle — which is now a second-class citizen that may drift.

2.2 Git Performance with 50,000+ Files

The plan underestimates the Git performance implications of tens of thousands of small files:

  • git status and git diff: Git must stat every file in the working tree (or, for bare repos, walk every tree object). With 50K+ files, tree-walking operations become measurably slow. Git's index.version = 4 helps with index size but does not eliminate the O(n) tree walk.

  • git clone: A fresh clone must reconstruct 50K+ files. Even with packfiles, the checkout phase is I/O-bound on file creation. This is a known pain point for monorepos (see: Microsoft's VFSForGit, which exists specifically because Git does not handle millions of files well).

  • Pack file bloat: Each JSON file is a separate blob. Git's delta compression works best on similar blobs — 50K distinct JSON files with different schemas (classes vs. properties vs. individuals) will compress poorly compared to a single large Turtle file where Git can delta-compress successive versions efficiently.

  • Bare repo tree construction: The plan's BareRepository.write_files() must build a new tree object containing 50K+ entries on every commit. pygit2's TreeBuilder constructs trees in memory; at 50K entries per directory level, this is non-trivial. The nested directory structure (root classes as subdirectories) helps but introduces its own complexity — renaming a class's parent requires moving its file between directories, creating a rename + content change in a single commit.

  • GitHub/GitLab rendering: If projects are synced to GitHub (the plan mentions GitHub App integration), repository browsing becomes unusable. GitHub's file browser paginates at ~1000 files per directory. Searching, blame, and PR file lists for 50K+ changed files (e.g., a migration commit) will time out or be truncated.

2.3 Turtle Regeneration is an O(n) Bottleneck

The plan states Turtle is regenerated "after every entity save." For a 50K-concept ontology:

  1. Read all 50K JSON files from Git (or PostgreSQL).
  2. Assemble a complete RDF graph.
  3. Serialize to Turtle.
  4. Commit the result.

This is the same monolithic operation the plan was trying to avoid, now running on every single edit. Even if the Turtle file is treated as a cache, the regeneration cost is prohibitive for interactive editing. The plan acknowledges this with a "Turtle generation duration" monitoring metric but offers no solution.

Alternatives within the plan's framework:

  • Lazy regeneration (generate on export/download, not on every save) — but then the Turtle in the repo is stale, defeating the purpose of having it there.
  • Incremental Turtle patching — extremely fragile and essentially impossible to do correctly with Turtle's prefix-dependent serialization.
  • Background async generation — but then the Turtle file lags behind the JSON files, creating a consistency window where the two disagree.

2.4 The JSON Schema Cannot Fully Represent OWL

OWL 2 is a complex language. The plan's ClassExpression discriminated union covers 10 types, but OWL 2 has significantly more constructs:

  • Datatypes: DataSomeValuesFrom, DataAllValuesFrom, DataHasValue, DatatypeRestriction (facets like xsd:minInclusive)
  • Keys: owl:hasKey
  • Annotation axioms: Annotations on annotations (OWL 2 allows annotating any axiom)
  • SWRL rules: Common in biomedical ontologies
  • Negative property assertions: owl:NegativePropertyAssertion
  • Enumerated classes: owl:oneOf with complex individuals
  • Self-restriction: owl:hasSelf

Each missing construct forces content into custom_axioms, where it loses structure, queryability, and form-based editing. The JSON schema would need continuous expansion, and every expansion requires migration of existing JSON files.

2.5 Big-Bang Migration Risk

The plan calls for migrating all existing projects at once ("big-bang"). This means:

  • No fallback if the migration introduces bugs — all projects are affected simultaneously.
  • The migration commit itself will be enormous (50K+ file additions in a single commit). Git operations on that commit (diff, blame, log) will be slow permanently.
  • Any data loss in the Turtle → JSON conversion is global and may not be detected until a user notices missing axioms weeks later.

2.6 File Naming Fragility

The naming scheme <Label>_<8-char-IRI-hash>.json has edge cases:

  • Label changes require file renames. If a class is relabeled, the file must be renamed (or the name becomes misleading). Renames in Git lose history unless git log --follow is used, and even then, detection is heuristic.
  • Non-ASCII labels produce awkward filenames. An ontology with Japanese or Arabic labels will have unreadable directory listings.
  • Label collisions within the same directory are possible (two classes with the same label but different IRIs). The 8-char hash suffix mitigates this but doesn't eliminate the confusion.
  • Hash collisions: 8 hex characters = 32 bits = ~4 billion values. At 50K entities the birthday paradox gives a ~0.03% collision probability, which is non-negligible over many projects.

3. Consequences for Linting

The current linter (OntologyLinter) operates on a complete rdflib.Graph. Atomization has significant implications:

3.1 Cross-Entity Rules Require Full Graph Assembly

Many lint rules are inherently global — they require knowledge of the full ontology, not just a single entity:

Rule Why it needs the full graph
undefined-parent Must check if the referenced parent class exists anywhere in the ontology
circular-hierarchy DFS cycle detection over the entire rdfs:subClassOf graph
duplicate-label Must compare labels across all entities
domain-violation Must know the full class hierarchy to check domain membership
range-violation Must know the full class hierarchy to check range membership
disjoint-violation Must know all owl:disjointWith and owl:AllDisjointClasses axioms
cardinality-violation Must count property assertions across all individuals
inverse-property-inconsistency Must check both directions of an owl:inverseOf pair

Only 4 of the 15 rules (missing-label, missing-comment, empty-label, missing-english-label) can operate on a single entity in isolation.

This means linting still requires loading the entire ontology — either by reassembling all JSON files into an RDF graph (defeating the performance benefit) or by running lint queries against PostgreSQL (requiring a complete reimplementation of the linter against a relational schema rather than an RDF graph API).

3.2 PostgreSQL-Based Linting Is a Full Rewrite

If linting moves to PostgreSQL, every rule must be reimplemented as SQL queries rather than RDFLib graph traversal. The current linter is ~400 lines of clean Python using graph.triples(), graph.subjects(), etc. A SQL-based linter would be:

  • More complex (JOIN-heavy queries for hierarchy traversal, JSONB path queries for class expressions)
  • Harder to test (requires database fixtures rather than in-memory graphs)
  • Less extensible (adding a new rule means writing SQL, not graph pattern matching)

3.3 custom_axioms Are Invisible to the Linter

Axioms stored as raw Turtle strings in custom_axioms cannot be linted by structured rules. If a custom_axiom contains a rdfs:subClassOf with a typo in the parent IRI, the undefined-parent rule won't catch it unless the linter also parses those Turtle snippets — at which point you're back to needing an RDF parser and losing the benefits of structured JSON.

3.4 Incremental Linting Becomes Harder, Not Easier

One might expect per-entity files to enable incremental linting (only re-lint changed entities). But because most rules are cross-entity, a change to one class can invalidate lint results for unrelated classes (e.g., adding a new class removes an orphan-class warning from another class that was previously parentless). Reliable incremental linting requires dependency tracking that is itself as complex as a full lint pass.


4. Alternative: Modular Ontologies (OWL Imports)

Rather than atomizing into proprietary JSON files, consider leveraging OWL's built-in modularity mechanism: owl:imports.

4.1 How It Works

A large ontology is split into multiple smaller Turtle files, each a self-contained OWL ontology that imports its dependencies:

project-repo.git/
  main.ttl                    # Root ontology, imports all modules
  modules/
    criminal-law.ttl          # owl:imports main.ttl (for shared terms)
    civil-law.ttl
    jurisdictions.ttl
    properties.ttl             # Shared object/data properties
  generated/
    merged.ttl                 # Full merged ontology (generated artifact)

Each module file:

@prefix : <http://example.org/ontology/criminal-law#> .
@prefix main: <http://example.org/ontology#> .

<http://example.org/ontology/criminal-law>
    a owl:Ontology ;
    owl:imports <http://example.org/ontology/properties> .

:Fraud a owl:Class ;
    rdfs:subClassOf main:Crime ;
    rdfs:label "Fraud"@en .

4.2 Advantages Over Atomization

Criterion Atomization (JSON) Modular (OWL Imports)
Source of truth format Proprietary JSON Standard Turtle/OWL
Tool compatibility OntoKit only Protege, ROBOT, OWL API, any RDF tool
Merge conflicts Eliminated (1 entity = 1 file) Greatly reduced (1 module = tens to hundreds of entities)
Git performance 50K+ files (problematic) Tens to low hundreds of files (fine)
Linting Requires full rewrite or full graph assembly Per-module linting possible for intra-module rules; full graph for cross-module
custom_axioms escape hatch Required Not needed — all OWL is representable
Round-trip fidelity Lossy for complex OWL Lossless
GitHub browsability Unusable at scale Natural and readable
Migration complexity Turtle → JSON conversion with data loss risk Partitioning an existing graph into subgraphs (well-understood operation)
Learning curve Developers must learn proprietary JSON schema Developers work with standard Turtle

4.3 How Module Boundaries Are Drawn

Modules can be defined by:

  • Top-level class hierarchy: Each root class and its descendants form a module (similar to the plan's root-class subdirectories).
  • Domain or topic: Group related concepts regardless of hierarchy.
  • User-defined: Let project owners create modules and assign classes to them through the UI.

A manifest.json (or the root main.ttl ontology) tracks module membership and import relationships.

4.4 Addressing the Original Pain Points

Pain Point How Modular Ontologies Help
Massive Git diffs Diffs are scoped to the module file that changed. A 500-concept module produces readable diffs.
Merge conflicts Two users editing different modules = no conflict. Same module = possible conflict, but within a much smaller file.
Slow load/save Only the affected module is loaded/saved. The merged file is regenerated lazily or on demand.
Disk I/O Modules are small files; parallel I/O is natural.
Query performance Same PostgreSQL index approach works — index is populated from module files instead of a monolith.

4.5 Limitations

  • Module boundaries must be decided (by users or heuristics). Badly drawn boundaries bring back the merge conflict problem.
  • Cross-module references are common (a class in criminal-law.ttl subclasses something in main.ttl). These require import declarations and careful prefix management.
  • Moving a class between modules requires editing two files (source and destination), which is a merge conflict risk if both modules are being edited.
  • At extreme scale (50K+ concepts with flat hierarchies), modules may still be large unless hierarchical partitioning is deep.

5. Alternative: Hybrid Approach

A third option combines the best elements:

5.1 Keep Turtle as Source of Truth, Add PostgreSQL Index

The plan's Phase 2 (PostgreSQL index tables) and Phase 4 (query optimization) are valuable regardless of storage format. The index can be populated by parsing Turtle files — either a monolith or modules — without changing the source of truth.

5.2 Modular Turtle for Git Benefits

Split large ontologies into module files for manageable diffs and reduced merge conflicts, as described in Section 4.

5.3 Deterministic Serialization Per Module

The existing serialize_deterministic() function (using to_isomorphic()) already produces stable Turtle output. Apply it per-module to minimize spurious diffs from serialization ordering.

5.4 Fine-Grained Editing via SPARQL Update or Patch Operations

Instead of rewriting the entire file on every edit, use graph-level patch operations:

  • Parse the module into an in-memory graph.
  • Apply the user's edit (add/remove/modify triples).
  • Re-serialize only the affected module.

This gives per-entity edit granularity without per-entity files.

5.5 Architecture

project-repo.git/
  ontology.ttl              # Root ontology with imports
  modules/
    module-1.ttl            # ~100-500 concepts each
    module-2.ttl
    ...

PostgreSQL:
  ontology_classes           # Indexed from Turtle, same as plan's Phase 2
  ontology_properties
  ontology_individuals
  ontology_modules           # Tracks which entity belongs to which module

On save (single entity edit):
  1. Identify which module contains the entity
  2. Load that module's graph (~500 concepts, fast)
  3. Apply the edit
  4. Re-serialize the module (deterministic)
  5. Commit the single changed module file
  6. Update PostgreSQL index for the changed entity

On lint:
  1. Load all modules into a merged graph (or use PostgreSQL for structural rules)
  2. Run existing linter unchanged

6. Comparison Matrix

Factor Monolith (Status Quo) Atomization (Plan) Modular Turtle Hybrid
Merge conflicts Severe Eliminated Greatly reduced Greatly reduced
Git diff readability Poor Excellent (per-entity) Good (per-module) Good (per-module)
Git performance at 50K Fine (1 file) Poor (50K files) Fine (tens of files) Fine (tens of files)
GitHub browsability Fine Unusable Natural Natural
Query performance Poor (in-memory) Excellent (PG) Poor without PG Excellent (PG)
OWL compatibility Full Partial (custom_axioms) Full Full
Tool ecosystem Full OntoKit only Full Full
Linter impact None Major rewrite Minor changes None
Migration risk N/A High (big-bang, lossy) Medium (graph partitioning) Low (additive)
Implementation effort N/A Very high (8 phases) Medium Medium
Per-entity Git history No Yes No (per-module) No (per-module)
Concurrent edit safety Poor Excellent Good Good

7. Recommendations

7.1 Do Not Abandon Turtle as Source of Truth

The cost of a proprietary JSON intermediate format — in ecosystem compatibility, OWL expressiveness, linter complexity, and ongoing maintenance of the JSON schema — outweighs the benefit of per-entity file diffs. The custom_axioms escape hatch is a red flag: it signals that the JSON schema is fundamentally incomplete and will remain so.

7.2 Adopt the PostgreSQL Index (Plan Phase 2) Independently

This is the plan's most unambiguously valuable component. It eliminates the in-memory RDFLib bottleneck for tree/search/detail queries and can be implemented regardless of file storage strategy. Build it first; it delivers immediate performance benefits.

7.3 Implement Modular Ontologies for Git Scalability

Use owl:imports to partition large ontologies into manageable module files. This provides 80% of the merge conflict and diff readability benefits at 20% of the implementation cost, without sacrificing OWL compatibility.

7.4 If Per-Entity History Is Truly Required, Consider a Middle Ground

If the ability to run git log on a single entity's history is a hard requirement (not just a nice-to-have), consider:

  • Named graphs in TriG format: Each entity's triples in a named graph within a single file — gives per-entity structure while remaining standard RDF.
  • Per-entity Turtle files (not JSON): Same directory structure as the plan, but each file is valid Turtle. This preserves OWL compatibility while gaining per-entity Git history. The tradeoff is the same Git performance concern (50K+ files), but without the JSON schema limitations.
  • Git notes or structured commit messages: Tag commits with the IRIs of affected entities, enabling entity-level history queries without per-entity files.

7.5 Roll Out Incrementally, Not Big-Bang

Whatever approach is chosen, support both formats concurrently during migration. The plan's storage_format column on Project is a good idea — use it to enable per-project opt-in migration with rollback capability.


8. Conclusion

The atomization plan correctly identifies real scaling problems — monolithic Turtle files create merge conflicts, unreadable diffs, and performance bottlenecks. However, its proposed solution introduces new problems that may be worse: Git performance degradation with tens of thousands of files, loss of OWL expressiveness via an incomplete JSON schema, full linter rewrite, Turtle regeneration bottleneck, and ecosystem incompatibility.

The PostgreSQL indexing component should be adopted immediately as it is independently valuable. For file-level concerns, modular ontologies using standard owl:imports deliver most of the benefits at a fraction of the cost and risk. If the plan proceeds as-is, the custom_axioms escape hatch will grow into a maintenance burden, the Turtle regeneration will become a performance cliff, and the 50K-file Git repositories will create new categories of operational pain.

The strongest version of this architecture is the hybrid: modular Turtle files for Git sanity, PostgreSQL indexes for query performance, and the existing linter running unchanged against assembled graphs. It's less ambitious than full atomization but far more likely to succeed at scale.