The atomization plan proposes decomposing monolithic Turtle ontology files into per-entity JSON files with PostgreSQL indexing and auto-generated Turtle. While it solves real pain points around merge conflicts and query performance, the plan introduces substantial new complexity, abandons standard ontology tooling interoperability, and may create worse problems at the file-system and Git layers than the ones it solves. This analysis examines the trade-offs and explores alternative approaches.
This is the single strongest argument for atomization. A monolithic 50K-concept Turtle file produces unreadable diffs when a single annotation changes. Per-entity files make `git log --follow classes/Fraud_e5f6g7h8.json` possible, giving each concept a traceable edit history.
With one file, two users editing different classes on different branches will produce a textual merge conflict even though the changes are semantically independent. Per-entity files make this structurally impossible (unless both users edit the same entity).
Replacing in-memory RDFLib graph traversal with indexed PostgreSQL queries is sound engineering for scale. The current architecture loads the entire graph into memory for every tree/search/detail operation. At 50K concepts this becomes a real bottleneck — both in RAM and in parse time.
The per-entity write path (JsonFileManager.write_entity) enables fine-grained locking. Two users can save different entities concurrently without contention on a single file.
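The plan's `JsonFileManager.write_entity` is not specified here, but the locking idea can be sketched with a per-path lock registry (every name beyond `JsonFileManager` and `write_entity` is illustrative, not the plan's actual design):

```python
import json
import threading
from collections import defaultdict
from pathlib import Path

class JsonFileManager:
    """Illustrative sketch: one lock per entity path, so concurrent
    saves to different entities never contend on a shared file lock."""

    def __init__(self, root: Path):
        self.root = root
        self._locks = defaultdict(threading.Lock)  # rel_path -> lock
        self._registry_lock = threading.Lock()     # guards the registry

    def _lock_for(self, rel_path: str) -> threading.Lock:
        with self._registry_lock:
            return self._locks[rel_path]

    def write_entity(self, rel_path: str, entity: dict) -> None:
        path = self.root / rel_path
        with self._lock_for(rel_path):  # only blocks writers of THIS entity
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(entity, indent=2, sort_keys=True))
```

With a monolithic file, the equivalent lock would serialize every save in the project; here only writers of the same entity ever wait.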
The most consequential decision in the plan. Turtle (and RDF/OWL generally) is the lingua franca of the semantic web. By making JSON the source of truth and Turtle a "generated artifact," the plan:
- **Breaks round-trip fidelity.** The `custom_axioms` field (raw Turtle snippets embedded in JSON) is an escape hatch that admits the JSON schema cannot represent all of OWL. Any axiom not covered by the structured fields — property chains with intermediate variables, complex SWRL rules, GCI axioms (General Class Inclusion axioms that don't "belong" to a single named class), punning, annotation assertions on axioms, etc. — gets dumped into an unstructured string. The plan even sets a monitoring alert for when `custom_axioms` exceeds 10% of entities, acknowledging this is a known gap.
- **Creates a lossy intermediate format.** The Turtle → JSON migration (Phase 7) must parse every construct in an arbitrary OWL ontology and map it to the fixed JSON schema. Anything it cannot map goes to `custom_axioms`. But `custom_axioms` is per-entity — what about axioms that span multiple entities or belong to no named entity? GCI axioms like `SubClassOf(ObjectSomeValuesFrom(:hasPart :Wing) :Bird)` don't naturally attach to either `:Bird` or `:Wing`. The plan does not address where these go.
- **Duplicates representation.** Every entity now exists in three places: a JSON file, a PostgreSQL row, and a line range in `ontology.ttl`. Any bug in synchronization creates silent data divergence. The plan calls for regenerating Turtle "after every entity save," which at 50K entities means the entire ontology must be reassembled and serialized on every single edit.
- **Loses ecosystem compatibility.** Standard tools (Protege, ROBOT, OWL API, SPARQL endpoints, TopBraid, RDFLib itself) all operate on standard serializations. A JSON-per-entity format is proprietary to OntoKit. Users who want to run SPARQL queries, use ROBOT for CI/CD validation, or load their ontology into a triple store must use the generated Turtle — which is now a second-class citizen that may drift.
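To make the escape hatch concrete, here is a hypothetical per-entity record (field names are invented for illustration, not the plan's actual schema). An `owl:hasKey` axiom has no structured field, so it survives only as an opaque string:

```python
import json

# Hypothetical per-entity JSON record (illustrative field names only).
fraud_entity = {
    "iri": "http://example.org/ontology#Fraud",
    "type": "class",
    "labels": {"en": "Fraud"},
    "subclass_of": ["http://example.org/ontology#Crime"],
    # Anything the structured fields cannot express degrades to a raw
    # Turtle string: unqueryable, unlintable, invisible to form editors.
    "custom_axioms": [":Fraud owl:hasKey (:caseNumber) ."],
}

# The record round-trips as JSON, but the axiom inside custom_axioms
# is just text as far as the schema is concerned.
assert json.loads(json.dumps(fraud_entity)) == fraud_entity
```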
The plan underestimates the Git performance implications of tens of thousands of small files:
- **`git status` and `git diff`:** Git must stat every file in the working tree (or, for bare repos, walk every tree object). With 50K+ files, tree-walking operations become measurably slow. Git's `index.version = 4` helps with index size but does not eliminate the O(n) tree walk.
- **`git clone`:** A fresh clone must reconstruct 50K+ files. Even with packfiles, the checkout phase is I/O-bound on file creation. This is a known pain point for monorepos (see: Microsoft's VFSForGit, which exists specifically because Git does not handle millions of files well).
- **Pack file bloat:** Each JSON file is a separate blob. Git's delta compression works best on similar blobs — 50K distinct JSON files with different schemas (classes vs. properties vs. individuals) will compress poorly compared to a single large Turtle file where Git can delta-compress successive versions efficiently.
- **Bare repo tree construction:** The plan's `BareRepository.write_files()` must build a new tree object containing 50K+ entries on every commit. pygit2's `TreeBuilder` constructs trees in memory; at 50K entries per directory level, this is non-trivial. The nested directory structure (root classes as subdirectories) helps but introduces its own complexity — renaming a class's parent requires moving its file between directories, creating a rename + content change in a single commit.
- **GitHub/GitLab rendering:** If projects are synced to GitHub (the plan mentions GitHub App integration), repository browsing becomes unusable. GitHub's file browser paginates at ~1000 files per directory. Searching, blame, and PR file lists for 50K+ changed files (e.g., a migration commit) will time out or be truncated.
The plan states Turtle is regenerated "after every entity save." For a 50K-concept ontology:
1. Read all 50K JSON files from Git (or PostgreSQL).
2. Assemble a complete RDF graph.
3. Serialize to Turtle.
4. Commit the result.
This is the same monolithic operation the plan was trying to avoid, now running on every single edit. Even if the Turtle file is treated as a cache, the regeneration cost is prohibitive for interactive editing. The plan acknowledges this with a "Turtle generation duration" monitoring metric but offers no solution.
Alternatives within the plan's framework:
- Lazy regeneration (generate on export/download, not on every save) — but then the Turtle in the repo is stale, defeating the purpose of having it there.
- Incremental Turtle patching — extremely fragile and essentially impossible to do correctly with Turtle's prefix-dependent serialization.
- Background async generation — but then the Turtle file lags behind the JSON files, creating a consistency window where the two disagree.
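The lazy-regeneration option, for instance, amounts to a dirty flag. A minimal sketch (class and method names are hypothetical) shows both the saving on the write path and the staleness window the alternatives above trade against:

```python
class TurtleCache:
    """Illustrative lazy-regeneration sketch: mark the merged Turtle
    stale on every entity save, rebuild only when someone exports.
    Between a save and the next export, any Turtle in the repo is stale."""

    def __init__(self, regenerate):
        self._regenerate = regenerate  # callable: () -> str (full Turtle)
        self._dirty = True
        self._cached = None

    def on_entity_saved(self):
        self._dirty = True  # cheap: no 50K-file reassembly on the write path

    def export(self) -> str:
        if self._dirty:
            self._cached = self._regenerate()  # full cost paid here only
            self._dirty = False
        return self._cached
```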
OWL 2 is a complex language. The plan's `ClassExpression` discriminated union covers 10 types, but OWL 2 has significantly more constructs:
- **Datatypes:** `DataSomeValuesFrom`, `DataAllValuesFrom`, `DataHasValue`, `DatatypeRestriction` (facets like `xsd:minInclusive`)
- **Keys:** `owl:hasKey`
- **Annotation axioms:** annotations on annotations (OWL 2 allows annotating any axiom)
- **SWRL rules:** common in biomedical ontologies
- **Negative property assertions:** `owl:NegativePropertyAssertion`
- **Enumerated classes:** `owl:oneOf` with complex individuals
- **Self-restriction:** `owl:hasSelf`
Each missing construct forces content into `custom_axioms`, where it loses structure, queryability, and form-based editing. The JSON schema would need continuous expansion, and every expansion requires migration of existing JSON files.
The plan calls for migrating all existing projects at once ("big-bang"). This means:
- No fallback if the migration introduces bugs — all projects are affected simultaneously.
- The migration commit itself will be enormous (50K+ file additions in a single commit). Git operations that touch that commit (diff, blame, log) will be permanently slow.
- Any data loss in the Turtle → JSON conversion is global and may not be detected until a user notices missing axioms weeks later.
The naming scheme `<Label>_<8-char-IRI-hash>.json` has edge cases:
- **Label changes require file renames.** If a class is relabeled, the file must be renamed (or the name becomes misleading). Renames in Git lose history unless `git log --follow` is used, and even then, detection is heuristic.
- **Non-ASCII labels produce awkward filenames.** An ontology with Japanese or Arabic labels will have unreadable directory listings.
- **Label collisions within the same directory are possible** (two classes with the same label but different IRIs). The 8-char hash suffix mitigates this but doesn't eliminate the confusion.
- **Hash collisions:** 8 hex characters = 32 bits = ~4 billion values. At 50K entities the birthday bound gives a collision probability of roughly 25% — not a fraction of a percent — so a collision within a single large project is a live risk, and across many projects close to certain.
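The birthday arithmetic is easy to check. A short calculation using the standard approximation p ≈ 1 − e^(−n²/2N) (the hash function here is an illustrative choice, not necessarily the plan's) puts the risk at 50K entities around 25%:

```python
import hashlib
import math

def short_hash(iri: str, length: int = 8) -> str:
    """First 8 hex chars of a stable digest -- the suffix scheme at issue."""
    return hashlib.sha256(iri.encode()).hexdigest()[:length]

def collision_probability(n_entities: int, hex_chars: int = 8) -> float:
    """Birthday approximation: p ~ 1 - exp(-n^2 / (2 * N))."""
    space = 16 ** hex_chars  # 8 hex chars -> 2**32 distinct values
    return 1 - math.exp(-(n_entities ** 2) / (2 * space))

# For n = 50,000 over a 32-bit space, p comes out near 0.25.
p = collision_probability(50_000)
```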
The current linter (`OntologyLinter`) operates on a complete `rdflib.Graph`. Atomization has significant implications:
Many lint rules are inherently global — they require knowledge of the full ontology, not just a single entity:
| Rule | Why it needs the full graph |
|---|---|
| `undefined-parent` | Must check if the referenced parent class exists anywhere in the ontology |
| `circular-hierarchy` | DFS cycle detection over the entire `rdfs:subClassOf` graph |
| `duplicate-label` | Must compare labels across all entities |
| `domain-violation` | Must know the full class hierarchy to check domain membership |
| `range-violation` | Must know the full class hierarchy to check range membership |
| `disjoint-violation` | Must know all `owl:disjointWith` and `owl:AllDisjointClasses` axioms |
| `cardinality-violation` | Must count property assertions across all individuals |
| `inverse-property-inconsistency` | Must check both directions of an `owl:inverseOf` pair |
Only 4 of the 15 rules (`missing-label`, `missing-comment`, `empty-label`, `missing-english-label`) can operate on a single entity in isolation.
This means linting still requires loading the entire ontology — either by reassembling all JSON files into an RDF graph (defeating the performance benefit) or by running lint queries against PostgreSQL (requiring a complete reimplementation of the linter against a relational schema rather than an RDF graph API).
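To see why, consider `circular-hierarchy`. A sketch over a plain edge list (standing in for the RDF graph; not the actual `OntologyLinter` code) only finds a cycle once every entity's `rdfs:subClassOf` edges are merged — no single entity's file contains the evidence on its own:

```python
from collections import defaultdict

def find_hierarchy_cycles(subclass_edges):
    """Detect cycles in rdfs:subClassOf edges via depth-first search.
    Needs the *complete* edge set: a cycle spanning several entities is
    invisible to any single-entity lint pass."""
    graph = defaultdict(list)
    for child, parent in subclass_edges:
        graph[child].append(parent)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = defaultdict(int)
    cycles = []

    def dfs(node, path):
        color[node] = GRAY
        for parent in graph[node]:
            if color[parent] == GRAY:                      # back edge
                cycles.append(path[path.index(parent):] + [parent])
            elif color[parent] == WHITE:
                dfs(parent, path + [parent])
        color[node] = BLACK

    for node in list(graph):
        if color[node] == WHITE:
            dfs(node, [node])
    return cycles
```

Split `:A -> :B` and `:C -> :A` across two entity files and neither file alone is wrong; only the merged edge set reveals the cycle.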
If linting moves to PostgreSQL, every rule must be reimplemented as SQL queries rather than RDFLib graph traversal. The current linter is ~400 lines of clean Python using `graph.triples()`, `graph.subjects()`, etc. A SQL-based linter would be:
- More complex (JOIN-heavy queries for hierarchy traversal, JSONB path queries for class expressions)
- Harder to test (requires database fixtures rather than in-memory graphs)
- Less extensible (adding a new rule means writing SQL, not graph pattern matching)
Axioms stored as raw Turtle strings in `custom_axioms` cannot be linted by structured rules. If a custom axiom contains an `rdfs:subClassOf` with a typo in the parent IRI, the `undefined-parent` rule won't catch it unless the linter also parses those Turtle snippets — at which point you're back to needing an RDF parser and losing the benefits of structured JSON.
One might expect per-entity files to enable incremental linting (only re-lint changed entities). But because most rules are cross-entity, a change to one class can invalidate lint results for unrelated classes (e.g., adding a new class removes an orphan-class warning from another class that was previously parentless). Reliable incremental linting requires dependency tracking that is itself as complex as a full lint pass.
Rather than atomizing into proprietary JSON files, consider leveraging OWL's built-in modularity mechanism: `owl:imports`.
A large ontology is split into multiple smaller Turtle files, each a self-contained OWL ontology that imports its dependencies:
```
project-repo.git/
    main.ttl              # Root ontology, imports all modules
    modules/
        criminal-law.ttl  # owl:imports main.ttl (for shared terms)
        civil-law.ttl
        jurisdictions.ttl
        properties.ttl    # Shared object/data properties
    generated/
        merged.ttl        # Full merged ontology (generated artifact)
```
Each module file is a self-contained OWL ontology:

```turtle
@prefix : <http://example.org/ontology/criminal-law#> .
@prefix main: <http://example.org/ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://example.org/ontology/criminal-law>
    a owl:Ontology ;
    owl:imports <http://example.org/ontology/properties> .

:Fraud a owl:Class ;
    rdfs:subClassOf main:Crime ;
    rdfs:label "Fraud"@en .
```

| Criterion | Atomization (JSON) | Modular (OWL Imports) |
|---|---|---|
| Source of truth format | Proprietary JSON | Standard Turtle/OWL |
| Tool compatibility | OntoKit only | Protege, ROBOT, OWL API, any RDF tool |
| Merge conflicts | Eliminated (1 entity = 1 file) | Greatly reduced (1 module = tens to hundreds of entities) |
| Git performance | 50K+ files (problematic) | Tens to low hundreds of files (fine) |
| Linting | Requires full rewrite or full graph assembly | Per-module linting possible for intra-module rules; full graph for cross-module |
| `custom_axioms` escape hatch | Required | Not needed — all OWL is representable |
| Round-trip fidelity | Lossy for complex OWL | Lossless |
| GitHub browsability | Unusable at scale | Natural and readable |
| Migration complexity | Turtle → JSON conversion with data loss risk | Partitioning an existing graph into subgraphs (well-understood operation) |
| Learning curve | Developers must learn proprietary JSON schema | Developers work with standard Turtle |
Modules can be defined by:
- Top-level class hierarchy: Each root class and its descendants form a module (similar to the plan's root-class subdirectories).
- Domain or topic: Group related concepts regardless of hierarchy.
- User-defined: Let project owners create modules and assign classes to them through the UI.
A `manifest.json` (or the root `main.ttl` ontology) tracks module membership and import relationships.
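The first heuristic — partition by top-level hierarchy — can be sketched as a traversal that assigns each class to the module of the root it descends from (function name and the "shared" bucket are illustrative choices, not part of the plan):

```python
from collections import defaultdict, deque

def partition_by_root(subclass_edges):
    """Assign each class to the module of the root class it descends
    from; classes reachable from several roots go to a shared module."""
    children = defaultdict(list)
    has_parent = set()
    nodes = set()
    for child, parent in subclass_edges:
        children[parent].append(child)
        has_parent.add(child)
        nodes.update((child, parent))

    assignment = {}
    for root in sorted(nodes - has_parent):   # roots: classes with no parent
        queue = deque([root])
        while queue:
            node = queue.popleft()
            if node in assignment:
                if assignment[node] != root:
                    assignment[node] = "shared"  # descends from two roots
                continue
            assignment[node] = root
            queue.extend(children[node])
    return assignment
```

Classes under two roots land in a shared module, which is exactly where badly drawn boundaries (discussed below) start to bite.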
| Pain Point | How Modular Ontologies Help |
|---|---|
| Massive Git diffs | Diffs are scoped to the module file that changed. A 500-concept module produces readable diffs. |
| Merge conflicts | Two users editing different modules = no conflict. Same module = possible conflict, but within a much smaller file. |
| Slow load/save | Only the affected module is loaded/saved. The merged file is regenerated lazily or on demand. |
| Disk I/O | Modules are small files; parallel I/O is natural. |
| Query performance | Same PostgreSQL index approach works — index is populated from module files instead of a monolith. |
- Module boundaries must be decided (by users or heuristics). Badly drawn boundaries bring back the merge conflict problem.
- Cross-module references are common (a class in `criminal-law.ttl` subclasses something in `main.ttl`). These require import declarations and careful prefix management.
- Moving a class between modules requires editing two files (source and destination), which is a merge conflict risk if both modules are being edited.
- At extreme scale (50K+ concepts with flat hierarchies), modules may still be large unless hierarchical partitioning is deep.
A third option combines the best elements:
The plan's Phase 2 (PostgreSQL index tables) and Phase 4 (query optimization) are valuable regardless of storage format. The index can be populated by parsing Turtle files — either a monolith or modules — without changing the source of truth.
Split large ontologies into module files for manageable diffs and reduced merge conflicts, as described in Section 4.
The existing `serialize_deterministic()` function (using `to_isomorphic()`) already produces stable Turtle output. Apply it per-module to minimize spurious diffs from serialization ordering.
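The ordering idea behind it can be shown without RDFLib: serialize triples in a canonical sort order so an unchanged graph always produces byte-identical text. This toy emitter ignores prefixes and real Turtle syntax rules; it only demonstrates the determinism property:

```python
def serialize_sorted(triples):
    """Toy deterministic emitter: sorting triples before writing means
    the same graph always yields identical bytes, so Git diffs contain
    only real changes, never serialization-order noise."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in sorted(set(triples)))
```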
Instead of rewriting the entire file on every edit, use graph-level patch operations:
- Parse the module into an in-memory graph.
- Apply the user's edit (add/remove/modify triples).
- Re-serialize only the affected module.
This gives per-entity edit granularity without per-entity files.
```
project-repo.git/
    ontology.ttl      # Root ontology with imports
    modules/
        module-1.ttl  # ~100-500 concepts each
        module-2.ttl
        ...
```

PostgreSQL tables:

```
ontology_classes      # Indexed from Turtle, same as plan's Phase 2
ontology_properties
ontology_individuals
ontology_modules      # Tracks which entity belongs to which module
```
On save (single entity edit):
1. Identify which module contains the entity
2. Load that module's graph (~500 concepts, fast)
3. Apply the edit
4. Re-serialize the module (deterministic)
5. Commit the single changed module file
6. Update PostgreSQL index for the changed entity
On lint:
1. Load all modules into a merged graph (or use PostgreSQL for structural rules)
2. Run existing linter unchanged
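The save path can be sketched end to end with stand-in callables for each component (all names are illustrative; triples are plain tuples and the "graph" is a set):

```python
from dataclasses import dataclass, field

@dataclass
class Edit:
    """A user's change to one entity, as triple additions and removals."""
    added: set = field(default_factory=set)
    removed: set = field(default_factory=set)

def save_entity_edit(entity_iri, edit, module_of, load_module,
                     serialize, commit, update_index):
    module = module_of[entity_iri]               # 1. find the entity's module
    graph = load_module(module)                  # 2. parse ~500 concepts, not 50K
    graph = (graph - edit.removed) | edit.added  # 3. apply the edit
    commit(module, serialize(graph))             # 4+5. deterministic text, one file
    update_index(entity_iri, graph)              # 6. refresh the PostgreSQL rows
```

The key property: steps 2-5 touch one small module file, so the cost of a save is bounded by module size, not ontology size.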
| Factor | Monolith (Status Quo) | Atomization (Plan) | Modular Turtle | Hybrid |
|---|---|---|---|---|
| Merge conflicts | Severe | Eliminated | Greatly reduced | Greatly reduced |
| Git diff readability | Poor | Excellent (per-entity) | Good (per-module) | Good (per-module) |
| Git performance at 50K | Fine (1 file) | Poor (50K files) | Fine (tens of files) | Fine (tens of files) |
| GitHub browsability | Fine | Unusable | Natural | Natural |
| Query performance | Poor (in-memory) | Excellent (PG) | Poor without PG | Excellent (PG) |
| OWL compatibility | Full | Partial (`custom_axioms`) | Full | Full |
| Tool ecosystem | Full | OntoKit only | Full | Full |
| Linter impact | None | Major rewrite | Minor changes | None |
| Migration risk | N/A | High (big-bang, lossy) | Medium (graph partitioning) | Low (additive) |
| Implementation effort | N/A | Very high (8 phases) | Medium | Medium |
| Per-entity Git history | No | Yes | No (per-module) | No (per-module) |
| Concurrent edit safety | Poor | Excellent | Good | Good |
The cost of a proprietary JSON intermediate format — in ecosystem compatibility, OWL expressiveness, linter complexity, and ongoing maintenance of the JSON schema — outweighs the benefit of per-entity file diffs. The `custom_axioms` escape hatch is a red flag: it signals that the JSON schema is fundamentally incomplete and will remain so.
This is the plan's most unambiguously valuable component. It eliminates the in-memory RDFLib bottleneck for tree/search/detail queries and can be implemented regardless of file storage strategy. Build it first; it delivers immediate performance benefits.
Use `owl:imports` to partition large ontologies into manageable module files. This provides 80% of the merge conflict and diff readability benefits at 20% of the implementation cost, without sacrificing OWL compatibility.
If the ability to run `git log` on a single entity's history is a hard requirement (not just a nice-to-have), consider:
- Named graphs in TriG format: Each entity's triples in a named graph within a single file — gives per-entity structure while remaining standard RDF.
- Per-entity Turtle files (not JSON): Same directory structure as the plan, but each file is valid Turtle. This preserves OWL compatibility while gaining per-entity Git history. The tradeoff is the same Git performance concern (50K+ files), but without the JSON schema limitations.
- Git notes or structured commit messages: Tag commits with the IRIs of affected entities, enabling entity-level history queries without per-entity files.
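The named-graph option might look like the following TriG sketch (IRIs illustrative): each entity gets its own graph block inside one standard RDF file, so a diff touching one entity stays within one block.

```trig
@prefix : <http://example.org/ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# One named graph per entity; still standard RDF, parseable by any
# TriG-aware tool, and diffable per graph block.
:Fraud-graph {
    :Fraud a owl:Class ;
        rdfs:subClassOf :Crime ;
        rdfs:label "Fraud"@en .
}
```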
Whatever approach is chosen, support both formats concurrently during migration. The plan's `storage_format` column on `Project` is a good idea — use it to enable per-project opt-in migration with rollback capability.
The atomization plan correctly identifies real scaling problems — monolithic Turtle files create merge conflicts, unreadable diffs, and performance bottlenecks. However, its proposed solution introduces new problems that may be worse: Git performance degradation with tens of thousands of files, loss of OWL expressiveness via an incomplete JSON schema, full linter rewrite, Turtle regeneration bottleneck, and ecosystem incompatibility.
The PostgreSQL indexing component should be adopted immediately as it is independently valuable. For file-level concerns, modular ontologies using standard `owl:imports` deliver most of the benefits at a fraction of the cost and risk. If the plan proceeds as-is, the `custom_axioms` escape hatch will grow into a maintenance burden, the Turtle regeneration will become a performance cliff, and the 50K-file Git repositories will create new categories of operational pain.
The strongest version of this architecture is the hybrid: modular Turtle files for Git sanity, PostgreSQL indexes for query performance, and the existing linter running unchanged against assembled graphs. It's less ambitious than full atomization but far more likely to succeed at scale.