Implement per-page tagging and JSON writer enhancements by martipath · Pull Request #68 · AmadeusITGroup/docs2vecs

martipath · 2026-03-03T08:55:56Z

Per-chunk change detection (JSONWriterSkill) — New checksum_path + skip_downstream_if_unchanged params. SHA-256 checksums are stored per document_id; unchanged chunks are stripped so downstream embedding/indexing is skipped for them.
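The checksum gate described above can be sketched roughly as follows. This is a minimal illustration and not the JSONWriterSkill code: the `filter_changed` helper, the `document_id -> content` mapping, and the JSON layout of the checksum file are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def sha256_text(text: str) -> str:
    """Return the hex SHA-256 digest of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_changed(chunks: dict, checksum_path: Path) -> dict:
    """Keep only chunks whose content differs from the stored checksum.

    `chunks` maps document_id -> content (hypothetical shape for illustration).
    """
    # Load checksums from the previous run, if any were stored.
    stored = json.loads(checksum_path.read_text()) if checksum_path.exists() else {}
    current = {doc_id: sha256_text(content) for doc_id, content in chunks.items()}
    changed = {
        doc_id: chunks[doc_id]
        for doc_id, digest in current.items()
        if stored.get(doc_id) != digest
    }
    # Persist the latest checksums so the next run can compare against them.
    checksum_path.write_text(json.dumps({**stored, **current}, indent=2))
    return changed
```

On a first run every chunk counts as changed; on a second run with identical content the result is empty, so nothing would be passed on to embedding or indexing.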

Stable document_id (ConfluenceFAQSplitter) — Chunk ID now hashed from question text only (was full Q+A), so the ID is stable when only the answer changes — enabling clean Azure AI Search upserts.
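To illustrate why hashing the question alone keeps the ID stable, here is a small sketch. The function names and the choice of SHA-256 for the ID are assumptions; the PR only states that the ID is hashed from the question text instead of the full Q+A.

```python
import hashlib


def legacy_document_id(question: str, answer: str) -> str:
    """Hypothetical old behaviour: the ID depends on question and answer."""
    return hashlib.sha256((question + answer).encode("utf-8")).hexdigest()


def stable_document_id(question: str) -> str:
    """Hypothetical new behaviour: the ID depends on the question only."""
    return hashlib.sha256(question.encode("utf-8")).hexdigest()


question = "How do I reset my password?"
old_answer = "Open the portal and click Reset."
new_answer = "Open the portal, click Reset, then confirm by email."

# Editing the answer changes the legacy ID but leaves the stable ID untouched.
legacy_ids_differ = legacy_document_id(question, old_answer) != legacy_document_id(question, new_answer)
stable_id = stable_document_id(question)
```

With a stable ID, an edited answer overwrites the existing search document on upsert instead of creating a second entry under a new key.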

Per-page tags (ScrollWordExporter) — page_ids/page_urls entries changed from bare strings to dicts ({id/url, tag?}), with per-page tag falling back to a top-level tag param.
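The fallback rule for per-page tags might look like the sketch below. `resolve_page_tags` is a hypothetical helper, not part of ScrollWordExporter; only the entry shape ({id/url, tag?}) and the fallback to a top-level tag come from the description above.

```python
from typing import Optional


def resolve_page_tags(entries: list, default_tag: Optional[str] = None, key: str = "id") -> dict:
    """Map each page id (or url) to its own tag, else the top-level tag."""
    return {entry[key]: entry.get("tag", default_tag) for entry in entries}


page_ids = [
    {"id": "123", "tag": "hr"},  # per-page tag wins
    {"id": "456"},               # no tag: falls back to the top-level tag
]
tags_by_page = resolve_page_tags(page_ids, default_tag="general")
```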

Per-item tags (TeamsQnALoaderSkill) — Each Q&A object in the JSON can now include a "tag" field to override the skill-level default.
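As a rough sketch of the per-item override, assuming the input is a JSON array of objects: the "question" and "answer" keys and the `load_qna_items` helper are placeholders for illustration, and only the optional "tag" field is taken from the description.

```python
import json
from typing import Optional


def load_qna_items(raw_json: str, default_tag: Optional[str] = None) -> list:
    """Parse Q&A objects and fill in the skill-level tag where none is set."""
    items = json.loads(raw_json)
    return [{**item, "tag": item.get("tag", default_tag)} for item in items]


raw = """[
  {"question": "Q1", "answer": "A1", "tag": "billing"},
  {"question": "Q2", "answer": "A2"}
]"""
items = load_qna_items(raw, default_tag="support")
```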

Config schema + docs — config_schema.yaml updated to match; indexer-skills.md adds a new use-case example (#6), a Writer Skills section, and updated YAML snippets throughout.

- scrollwordexporter: add per-URL tag overrides via page_url_tags/default_tag
- json_writer: add SHA-256 checksum gate to skip unchanged downstream processing
- teams_qna_loader: support per-item tag override from JSON
- config_schema: add default_tag, page_url_tags, checksum_path, skip_downstream_if_unchanged, batch_size (existing parameter from vector skill)
- docs: update indexer-skills.md accordingly

- json-writer: checksum per chunk (document_id → sha256(content)), not whole pipeline
- confluence-faq-splitter: document_id from question only, stable across answer edits
- indexer-skills.md: updated to reflect above points

dpomian · 2026-03-04T19:18:25Z

+                # Remove unchanged chunks from this document
+                if unchanged_chunks:


There's no need for the if statement. You can write doc.chunks -= unchanged_chunks; if unchanged_chunks is empty, it is a no-op.
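The suggestion holds if doc.chunks is a set, which the `-=` operator implies but the excerpt does not confirm. A minimal sketch under that assumption, using a hypothetical `strip_unchanged` helper:

```python
def strip_unchanged(chunks: set, unchanged: set) -> set:
    """Remove unchanged chunks in place; an empty `unchanged` changes nothing."""
    chunks -= unchanged  # in-place set difference, no guard needed
    return chunks


all_chunks = {"chunk-a", "chunk-b", "chunk-c"}
strip_unchanged(all_chunks, set())        # empty set: no-op
after_noop = set(all_chunks)
strip_unchanged(all_chunks, {"chunk-b"})  # removes only the listed chunk
```

If doc.chunks were a list, `-=` with a set would raise a TypeError, so the guard-free form only applies to set-like containers.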

martipath force-pushed the tagging branch from 41b3aae to ce14fc9 Compare March 3, 2026 11:18

dpomian approved these changes Mar 4, 2026

dpomian merged commit fe504e4 into AmadeusITGroup:main Mar 4, 2026
9 checks passed

dpomian merged 2 commits into AmadeusITGroup:main from martipath:tagging
