Implement per-page tagging and JSON writer enhancements #68

Merged: dpomian merged 2 commits into AmadeusITGroup:main from martipath:tagging on Mar 4, 2026

@martipath (Contributor) commented Mar 3, 2026

Per-chunk change detection (JSONWriterSkill) — New checksum_path + skip_downstream_if_unchanged params. SHA-256 checksums are stored per document_id; unchanged chunks are stripped so downstream embedding/indexing is skipped for them.
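
A minimal sketch of the per-chunk gate described above, not the actual JSONWriterSkill code. The function name `filter_changed`, the JSON layout of the checksum file, and the `dict` of `document_id` to content are all assumptions made for illustration; only the idea (one SHA-256 per `document_id`, unchanged chunks dropped) comes from this PR.

```python
import hashlib
import json
from pathlib import Path


def sha256_text(text: str) -> str:
    """Hex SHA-256 digest of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_changed(chunks: dict[str, str], checksum_path: Path) -> dict[str, str]:
    """Return only the chunks whose checksum differs from the stored one.

    `chunks` maps document_id -> content (hypothetical shape).
    The checksum file is assumed to be a flat JSON object: {document_id: sha256}.
    """
    stored = json.loads(checksum_path.read_text()) if checksum_path.exists() else {}
    changed: dict[str, str] = {}
    for document_id, content in chunks.items():
        digest = sha256_text(content)
        if stored.get(document_id) != digest:
            changed[document_id] = content   # new or modified chunk
            stored[document_id] = digest     # remember the new checksum
    checksum_path.write_text(json.dumps(stored, indent=2, sort_keys=True))
    return changed
```

On a second run with identical content the function returns an empty dict, which is the case where downstream embedding and indexing would have nothing to do.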

Stable document_id (ConfluenceFAQSplitter) — Chunk ID now hashed from question text only (was full Q+A), so the ID is stable when only the answer changes — enabling clean Azure AI Search upserts.
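
To illustrate why hashing the question alone keeps the ID stable, here is a small sketch. The `FAQChunk` class, the choice of SHA-256 for the ID, and the whitespace/case normalisation are assumptions for the example; the PR only states that the ID is derived from the question text.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FAQChunk:
    """Hypothetical stand-in for a chunk produced by the FAQ splitter."""
    question: str
    answer: str

    @property
    def document_id(self) -> str:
        # Assumption: collapse whitespace and lowercase before hashing.
        normalized = " ".join(self.question.split()).lower()
        # Only the question feeds the hash, so editing the answer keeps the ID.
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


old = FAQChunk("How do I reset my password?", "Use the portal.")
new = FAQChunk("How do I reset my password?", "Use the self-service portal.")
same_id = old.document_id == new.document_id  # True: the answer is not hashed
```

With a stable key, an updated answer overwrites the existing search document instead of creating a second one next to a stale copy.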

Per-page tags (ScrollWordExporter): page_ids/page_urls entries changed from bare strings to dicts ({id/url, tag?}), with the per-page tag falling back to a top-level tag param.
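
The fallback rule can be sketched as follows. `resolve_page_tags` is a hypothetical helper, not part of ScrollWordExporter, and the string page IDs are made up; the dict keys `id`, `url`, and `tag` follow the description above.

```python
from typing import Optional


def resolve_page_tags(
    entries: list[dict], key: str, default_tag: Optional[str] = None
) -> list[tuple[str, Optional[str]]]:
    """Pair each page id/url with its effective tag.

    `key` is "id" or "url". An entry's own "tag" wins;
    otherwise the top-level tag is used (possibly None).
    """
    return [(entry[key], entry.get("tag", default_tag)) for entry in entries]


pages = [{"id": "123", "tag": "hr"}, {"id": "456"}]
resolved = resolve_page_tags(pages, "id", default_tag="general")
# resolved == [("123", "hr"), ("456", "general")]
```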

Per-item tags (TeamsQnALoaderSkill) — Each Q&A object in the JSON can now include a "tag" field to override the skill-level default.
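
A rough sketch of the per-item override, assuming the input JSON is a list of objects with `question` and `answer` fields; only the optional `"tag"` field is taken from this PR, and `load_qna` is a hypothetical name.

```python
import json
from typing import Optional


def load_qna(raw_json: str, default_tag: Optional[str] = None) -> list[dict]:
    """Parse Q&A items and fill in the skill-level tag where an item has none."""
    items = json.loads(raw_json)
    return [{**item, "tag": item.get("tag", default_tag)} for item in items]


raw = """[
  {"question": "Q1", "answer": "A1", "tag": "billing"},
  {"question": "Q2", "answer": "A2"}
]"""
qna_tags = [item["tag"] for item in load_qna(raw, default_tag="support")]
# qna_tags == ["billing", "support"]
```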

Config schema + docs: config_schema.yaml updated to match; indexer-skills.md adds a new use-case example (#6), a Writer Skills section, and updated YAML snippets throughout.

- scrollwordexporter: add per-URL tag overrides via page_url_tags/default_tag
- json_writer: add SHA-256 checksum gate to skip unchanged downstream processing
- teams_qna_loader: support per-item tag override from JSON
- config_schema: add default_tag, page_url_tags, checksum_path,
  skip_downstream_if_unchanged, batch_size (existing parameter from vector skill)
- docs: update indexer-skills.md accordingly
- json-writer: checksum per chunk (document_id → sha256(content)), not whole pipeline
- confluence-faq-splitter: document_id from question only, stable across answer edits
- indexer-skills.md: updated to reflect above points
Comment on lines +134 to +135
# Remove unchanged chunks from this document
if unchanged_chunks:

The if statement isn't needed. You can write doc.chunks -= unchanged_chunks; if unchanged_chunks is empty, it is a no-op.
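
A quick check of that claim, assuming doc.chunks behaves like a Python set (the actual type is not shown in this excerpt). `drop_unchanged` is a hypothetical helper used only to demonstrate the operator.

```python
def drop_unchanged(chunks: set[str], unchanged: set[str]) -> set[str]:
    """In-place set difference; subtracting an empty set changes nothing."""
    chunks -= unchanged
    return chunks


kept_all = drop_unchanged({"faq-1", "faq-2"}, set())        # nothing removed
kept_some = drop_unchanged({"faq-1", "faq-2"}, {"faq-1"})   # faq-1 removed
```

If doc.chunks were a list instead, `-=` would raise a TypeError and a filtering comprehension would be needed, so the suggestion depends on the container type.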

@dpomian dpomian merged commit fe504e4 into AmadeusITGroup:main Mar 4, 2026
9 checks passed