feat: export/import workspace knowledge in .tgx bundles (#877 Phase 2) #1024
Open
sunnyadn wants to merge 2 commits into
Groundwork for Phase 2 of trustgraph-ai#877 (knowledge export). Hand-rolled N-Triples term encoding: rdflib's term.n3() emits Turtle-style forms (numeric shorthand, unescaped newlines) that are invalid in line-oriented N-Quads, so literals are escaped per the ECHAR grammar and IRIs validated for representability. Round-trip tests parse the output back with rdflib's nquads parser and compare term-for-term.
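To make the encoding concern concrete, here is a minimal sketch of N-Triples-style term encoding, assuming plain Python strings as input. The names `encode_literal`, `encode_iri`, and `encode_quad` are hypothetical and not taken from the PR; the escape set follows the ECHAR production, and the IRI check rejects the characters that the IRIREF production excludes instead of escaping them.

```python
# Hypothetical helper names; a sketch of the idea, not the PR's code.

# ECHAR escapes from the N-Triples grammar. Backslash, double quote, LF and CR
# must be escaped inside a quoted literal; the others are escaped for safety.
_ECHAR = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
    "\b": "\\b",
    "\f": "\\f",
}

# Characters that IRIREF forbids between the angle brackets.
_IRI_FORBIDDEN = set('<>"{}|^`\\')


def encode_literal(value, datatype=None, language=None):
    """Encode a literal so it always fits on a single N-Quads line."""
    body = "".join(_ECHAR.get(ch, ch) for ch in value)
    if language:
        return f'"{body}"@{language}'
    if datatype:
        return f'"{body}"^^{encode_iri(datatype)}'
    return f'"{body}"'


def encode_iri(iri):
    """Wrap an IRI in angle brackets, rejecting unrepresentable characters."""
    if any(ch in _IRI_FORBIDDEN or ord(ch) <= 0x20 for ch in iri):
        raise ValueError(f"IRI not representable in N-Quads: {iri!r}")
    return f"<{iri}>"


def encode_quad(subject, predicate, object_term, graph):
    """Join already-encoded terms into one newline-terminated quad."""
    return (
        f"{encode_iri(subject)} {encode_iri(predicate)} "
        f"{object_term} {encode_iri(graph)} .\n"
    )


line = encode_quad(
    "http://example.org/s",
    "http://example.org/p",
    encode_literal('multi\nline "text"'),
    "urn:trustgraph:collection:default",
)
```

Numeric values keep their quoted lexical form plus a datatype IRI, which avoids the Turtle numeric shorthand the commit message describes.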
Contributor License Agreement ✅ All contributors have signed the CLA. Thank you!
…i#877) Phase 2 of the workspace bundle commands: tg-export-workspace now includes the workspace's knowledge by default — per-collection knowledge-graph triples as N-Quads (the collection names the graph, streamed through a tempfile so memory stays flat regardless of knowledge-base size) and the document library (metadata plus content, fetched one document at a time). --config-only skips knowledge on both sides; --triples-limit bounds very large graphs; -f/--flow-id selects the flow the triples services run through. tg-import-workspace streams triples back through the bulk import per collection and recreates library documents (children after parents). Knowledge import is additive, unlike config's skip-existing semantics. Embedding vectors are not carried in bundles: --process re-runs imported documents through the flow, which regenerates extraction output and embeddings; --process-collection targets it. Round-trip covered by unit tests over real archives: export with a mocked Api, re-import, and assert the bulk triples stream and library add calls reproduce the original values (including datatyped literals via the N-Quads path).
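The tempfile step matters because a tar member header records the member's size before its content, so a stream of unknown length has to be spooled somewhere first. A minimal sketch using only the standard library follows; `add_streamed_member` is a hypothetical name, and the gzip compression used here is an assumption, since the bundle's actual compression is not stated.

```python
import io
import tarfile
import tempfile


def add_streamed_member(tar, arcname, lines):
    """Spool text lines to a temp file, then add it to the tar as one member.

    Memory use stays bounded by the size of a single line, not by the
    size of the whole stream. Returns the number of lines written.
    """
    count = 0
    with tempfile.TemporaryFile() as tmp:
        for line in lines:
            tmp.write(line.encode("utf-8"))
            count += 1
        size = tmp.tell()      # total bytes spooled, needed for the header
        tmp.seek(0)
        info = tarfile.TarInfo(name=arcname)
        info.size = size
        tar.addfile(info, tmp)  # copies exactly info.size bytes from tmp
    return count


# Usage: write one collection's quads into an in-memory archive.
buf = io.BytesIO()
quads = (
    f"<urn:s{i}> <urn:p> \"v{i}\" <urn:trustgraph:collection:default> .\n"
    for i in range(3)
)
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    written = add_streamed_member(tar, "knowledge/default/triples.nq", quads)
```

The generator is consumed lazily, so the full set of quads never has to exist in memory at once.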
9bd6d18 to
36a74d8
Phase 2 of #877, on top of the merged Phase 1 (#1019): .tgx bundles now carry the workspace's knowledge alongside its configuration.
Export
tg-export-workspace now also writes, per collection, a streaming N-Quads dump knowledge/<collection>/triples.nq (graph = urn:trustgraph:collection:<c>, via the flow's streaming triples query), and the library's documents under knowledge/library/ (per-document metadata JSON + content, fetched one document at a time). Each collection's quads go through a tempfile into the tar, so bundle size doesn't drive memory.
Collection discovery is registry-based, and collections created implicitly by raw triple loads (e.g. tg-load-knowledge) are queryable but not listed, so they'd silently drop out of a backup. The enumeration is therefore printed on every export, and --collection (repeatable) names extras explicitly. --config-only skips knowledge; -f/--flow-id selects the flow; --triples-limit caps truly huge graphs. The manifest records contents.knowledge and a summary (collections, per-collection triple counts, document count).
N-Quads term encoding is hand-rolled to the N-Triples grammar rather than using rdflib's term.n3(), which emits Turtle-style forms (numeric shorthand, unescaped newlines) invalid in line-oriented N-Quads; round-trip tests parse the output back with rdflib's nquads parser. RDF-star quoted triples have no standard N-Quads encoding and are skipped with a reported count.
Import
tg-import-workspace streams triples per collection through the bulk websocket import and registers each restored collection in collection management (the bulk path doesn't auto-register the way document processing does, and an unregistered collection would drop out of the restored workspace's next export). Library documents are recreated (parents before children, chunked upload handled by the client) with the same semantics as config: existing documents skip by default, --overwrite replaces (remove + add, since the library API has no in-place content update). Triples are additive (the store dedupes identical statements). --dry-run covers knowledge.
Embeddings are intentionally not shipped in bundles. Vectors are only meaningful for the exact model that produced them, and the default skip-existing import deliberately leaves an established target's embeddings configuration untouched, so in the cross-deployment cases (migration, starter kits) imported vectors could silently mismatch the target's model. Re-deriving via --process (with --process-collection, since bundles can't record a document's original processing collection) is always semantically correct. For the same-stack restore case, where the model provably matches (the bundle carries the processor:embeddings config to check against), an opt-in --with-embeddings reusing the existing embeddings msgpack shapes would be a sensible follow-up; happy to add it if you'd find it useful.
Phase-1's guard that refused knowledge-carrying bundles is replaced by the actual import path.
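The parents-before-children ordering for library documents can be illustrated with a short sketch. It assumes each document's metadata is a dict with hypothetical `id` and `parent_id` keys; the bundle's real field names are not given in this description, and `order_parents_first` is an illustrative name only.

```python
def order_parents_first(docs):
    """Return docs ordered so every parent precedes its children.

    Assumes hypothetical 'id' and 'parent_id' keys. A parent that is not
    part of the input is ignored, so its child is treated as a root.
    """
    by_id = {doc["id"]: doc for doc in docs}
    ordered = []
    placed = set()

    def place(doc, trail):
        if doc["id"] in placed:
            return
        if doc["id"] in trail:
            raise ValueError(f"parent cycle at document {doc['id']!r}")
        parent = by_id.get(doc.get("parent_id"))
        if parent is not None:
            # Emit the parent (and its ancestors) before this document.
            place(parent, trail | {doc["id"]})
        placed.add(doc["id"])
        ordered.append(doc)

    for doc in docs:
        place(doc, frozenset())
    return ordered


docs = [
    {"id": "page-1", "parent_id": "report"},
    {"id": "chunk-1", "parent_id": "page-1"},
    {"id": "report", "parent_id": None},
]
ordered_ids = [doc["id"] for doc in order_parents_first(docs)]
```

Recreating documents in this order means a child is never added before the parent it refers to exists on the target.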
Testing
Unit: tests/unit/test_cli green (105 passed): N-Quads round-trip (hostile-content escaping, malformed/RDF-star skips), bundle layout, --config-only both ways, archive round-trip, document skip/overwrite, collection discovery/registration, dry-run.
Live: verified end-to-end against a local docker-compose deployment (2.6.7: pulsar, cassandra + triple-store, qdrant, garage, embeddings-hf). Seeded two collections (one deliberately left unregistered) including a literal with newlines, quotes, CJK and emoji, plus a library document; exported; then wiped the stack (down -v) and restored from the bundle alone. Both collections' triples byte-faithful, document content identical, collections registered so the restored workspace re-exports losslessly; import onto an already-populated workspace skips/dedupes as described, and --overwrite replaces cleanly.