
feat: export/import workspace knowledge in .tgx bundles (#877 Phase 2) #1024

Open
sunnyadn wants to merge 2 commits into trustgraph-ai:release/v2.6 from sunnyadn:feat/workspace-knowledge-export

@sunnyadn sunnyadn commented Jul 3, 2026

Contributor

Phase 2 of #877, on top of the merged Phase 1 (#1019): .tgx bundles now carry the workspace's knowledge alongside its configuration.

Export

tg-export-workspace now also writes the workspace's knowledge. Per collection, it writes a streaming N-Quads dump to knowledge/<collection>/triples.nq (graph = urn:trustgraph:collection:<c>, read via the flow's streaming triples query). The library's documents go under knowledge/library/ (per-document metadata JSON + content, fetched one document at a time). Each collection's quads go through a tempfile into the tar, so bundle size doesn't drive memory. Collection discovery is registry-based. Collections created implicitly by raw triple loads (e.g. tg-load-knowledge) are queryable but not listed, so they'd silently drop out of a backup; to guard against that, the enumeration is printed on every export and --collection (repeatable) names extras explicitly. --config-only skips knowledge; -f/--flow-id selects the flow; --triples-limit caps truly huge graphs. The manifest records contents.knowledge and a summary (collections, per-collection triple counts, document count).
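For reviewers, a minimal sketch of the tempfile-into-tar pattern described above. A tar header must state the member's size before its data, which is why a stream of unknown length is spooled to disk first. The helper name `add_streamed_member`, the `fake_quads` generator, and the gzip-compressed tar container are assumptions for illustration, not the PR's actual code.

```python
import io
import tarfile
import tempfile
from typing import Iterable


def add_streamed_member(tar: tarfile.TarFile, name: str, lines: Iterable[str]) -> int:
    """Spool lines to a temp file, then copy that file into the tar as `name`.

    Only one line is held in memory at a time; the temp file supplies the
    byte size that the tar header needs up front. Returns the line count.
    """
    count = 0
    with tempfile.TemporaryFile() as tmp:
        for line in lines:
            tmp.write(line.encode("utf-8") + b"\n")
            count += 1
        size = tmp.tell()          # total bytes written so far
        tmp.seek(0)                # rewind so tarfile reads from the start
        info = tarfile.TarInfo(name=name)
        info.size = size
        tar.addfile(info, tmp)     # copies exactly `size` bytes from tmp
    return count


def fake_quads():
    # Hypothetical stand-in for the flow's streaming triples query.
    for i in range(3):
        yield f"<urn:s{i}> <urn:p> <urn:o> <urn:trustgraph:collection:default> ."


buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:  # container format is assumed
    written = add_streamed_member(tar, "knowledge/default/triples.nq", fake_quads())
```

The returned count is the kind of value a manifest summary could record per collection.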

N-Quads term encoding is hand-rolled to the N-Triples grammar rather than using rdflib's term.n3(), which emits Turtle-style forms (numeric shorthand, unescaped newlines) invalid in line-oriented N-Quads; round-trip tests parse the output back with rdflib's nquads parser. RDF-star quoted triples have no standard N-Quads encoding and are skipped with a reported count.
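As an illustration of what encoding to the N-Triples grammar involves, here is a minimal sketch. It escapes literals with ECHAR sequences and rejects IRIs containing characters the IRIREF production forbids. The function names are hypothetical and this is not the PR's implementation; for brevity it omits UCHAR escaping and blank nodes.

```python
import re
from typing import Optional

# ECHAR escapes from the N-Triples grammar. Quote, backslash, LF and CR
# must be escaped; tab, backspace and form feed are escaped for readability.
_ECHAR = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
    "\b": "\\b",
    "\f": "\\f",
}

# Characters the IRIREF production does not allow unescaped.
_IRI_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def encode_iri(iri: str) -> str:
    """Wrap an IRI in angle brackets, refusing unrepresentable ones."""
    if _IRI_FORBIDDEN.search(iri):
        raise ValueError(f"IRI not representable in N-Quads: {iri!r}")
    return f"<{iri}>"


def encode_literal(value: str, datatype: Optional[str] = None,
                   lang: Optional[str] = None) -> str:
    """Encode a literal on a single line, with optional language or datatype."""
    escaped = "".join(_ECHAR.get(ch, ch) for ch in value)
    out = f'"{escaped}"'
    if lang:
        return f"{out}@{lang}"
    if datatype:
        return f"{out}^^{encode_iri(datatype)}"
    return out


def encode_quad(s: str, p: str, o_encoded: str, graph: str) -> str:
    """Build one N-Quads line; `o_encoded` is an already-encoded object term."""
    return f"{encode_iri(s)} {encode_iri(p)} {o_encoded} {encode_iri(graph)} ."


line = encode_quad(
    "urn:s", "urn:p",
    encode_literal('line one\nsays "hi" \\ 你好'),
    "urn:trustgraph:collection:default",
)
```

Numeric values keep their full lexical form plus datatype, e.g. `encode_literal("42", datatype="http://www.w3.org/2001/XMLSchema#integer")`, rather than a Turtle-style bare `42`.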

Import

tg-import-workspace streams triples per collection through the bulk websocket import and registers each restored collection in collection management. The bulk path doesn't auto-register the way document processing does, and an unregistered collection would drop out of the restored workspace's next export. Library documents are recreated (parents before children, chunked upload handled by the client) with the same semantics as config: existing documents are skipped by default, and --overwrite replaces them (remove + add, since the library API has no in-place content update). Triples are additive (the store dedupes identical statements). --dry-run covers knowledge.
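The document semantics above can be sketched as follows. This is a minimal illustration under stated assumptions: the `id` and `parent_id` fields, the `add`/`remove` callables, and the returned counters are hypothetical stand-ins, not the library client's real API.

```python
from typing import Callable, Dict, List, Set


def parents_first(docs: List[dict]) -> List[dict]:
    """Order documents so each parent precedes its children."""
    by_id = {d["id"]: d for d in docs}
    ordered: List[dict] = []
    seen: Set[str] = set()

    def visit(doc: dict) -> None:
        if doc["id"] in seen:
            return
        seen.add(doc["id"])  # mark first so a malformed cycle cannot recurse forever
        parent = by_id.get(doc.get("parent_id"))
        if parent is not None:
            visit(parent)
        ordered.append(doc)

    for d in docs:
        visit(d)
    return ordered


def restore_documents(docs: List[dict], existing_ids: Set[str],
                      add: Callable[[dict], None],
                      remove: Callable[[str], None],
                      overwrite: bool = False,
                      dry_run: bool = False) -> Dict[str, int]:
    """Skip existing documents by default; with overwrite, remove then add."""
    stats = {"added": 0, "replaced": 0, "skipped": 0}
    for doc in parents_first(docs):
        exists = doc["id"] in existing_ids
        if exists and not overwrite:
            stats["skipped"] += 1
            continue
        if not dry_run:
            if exists:
                remove(doc["id"])  # no in-place update, so replace = remove + add
            add(doc)
        stats["replaced" if exists else "added"] += 1
    return stats


calls: List[tuple] = []
bundle_docs = [{"id": "child", "parent_id": "root"}, {"id": "root"}]
stats = restore_documents(
    bundle_docs, existing_ids={"root"},
    add=lambda d: calls.append(("add", d["id"])),
    remove=lambda i: calls.append(("remove", i)),
    overwrite=True,
)
```

With `overwrite=True` the existing `root` is removed and re-added before `child` is added; with the default, `root` is skipped and only `child` is added. In a dry run the counters are still reported but neither callable is invoked.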

Embeddings are intentionally not shipped in bundles. Vectors are only meaningful for the exact model that produced them, and the default skip-existing import deliberately leaves an established target's embeddings configuration untouched — so in the cross-deployment cases (migration, starter kits) imported vectors could silently mismatch the target's model. Re-deriving via --process (with --process-collection, since bundles can't record a document's original processing collection) is always semantically correct. For the same-stack restore case, where the model provably matches (the bundle carries the processor:embeddings config to check against), an opt-in --with-embeddings reusing the existing embeddings msgpack shapes would be a sensible follow-up — happy to add it if you'd find it useful.

Phase-1's guard that refused knowledge-carrying bundles is replaced by the actual import path.

Testing

Unit: tests/unit/test_cli green (105 passed) — N-Quads round-trip (hostile-content escaping, malformed/RDF-star skips), bundle layout, --config-only both ways, archive round-trip, document skip/overwrite, collection discovery/registration, dry-run.

Live: verified end-to-end against a local docker-compose deployment (2.6.7: pulsar, cassandra + triple-store, qdrant, garage, embeddings-hf) — seeded two collections (one deliberately left unregistered) including a literal with newlines, quotes, CJK and emoji, plus a library document; exported; then wiped the stack (down -v) and restored from the bundle alone. Both collections' triples byte-faithful, document content identical, collections registered so the restored workspace re-exports losslessly; import onto an already-populated workspace skips/dedupes as described, and --overwrite replaces cleanly.

Groundwork for Phase 2 of trustgraph-ai#877 (knowledge export). Hand-rolled
N-Triples term encoding: rdflib's term.n3() emits Turtle-style forms
(numeric shorthand, unescaped newlines) that are invalid in
line-oriented N-Quads, so literals are escaped per the ECHAR grammar
and IRIs validated for representability. Round-trip tests parse the
output back with rdflib's nquads parser and compare term-for-term.

github-actions Bot commented Jul 3, 2026


Contributor License Agreement ✅

All contributors have signed the CLA. Thank you!

…i#877)

Phase 2 of the workspace bundle commands: tg-export-workspace now
includes the workspace's knowledge by default — per-collection
knowledge-graph triples as N-Quads (the collection names the graph,
streamed through a tempfile so memory stays flat regardless of
knowledge-base size) and the document library (metadata plus content,
fetched one document at a time). --config-only skips knowledge on both
sides; --triples-limit bounds very large graphs; -f/--flow-id selects
the flow the triples services run through.

tg-import-workspace streams triples back through the bulk import per
collection and recreates library documents (children after parents).
Knowledge import is additive, unlike config's skip-existing semantics.
Embedding vectors are not carried in bundles: --process re-runs
imported documents through the flow, which regenerates extraction
output and embeddings; --process-collection targets it.

Round-trip covered by unit tests over real archives: export with a
mocked Api, re-import, and assert the bulk triples stream and library
add calls reproduce the original values (including datatyped literals
via the N-Quads path).
@sunnyadn sunnyadn force-pushed the feat/workspace-knowledge-export branch from 9bd6d18 to 36a74d8 Compare July 4, 2026 01:46