Add HF dataset release pipeline and hash-aware upload flow #19

Merged
Am1n3e merged 15 commits into main from add-hf-dataset on Feb 7, 2026
Am1n3e commented on Feb 7, 2026

Summary

  • add dev.release.build-hf-dataset to build full/hard parquet artifacts, validate split invariants, render README from Jinja2 template, and stamp version.json
  • add dev.release.upload-hf-dataset with --dry-run, strict version checks, and HF tag create/verify
  • implement dataset fingerprinting (dataset_hash) and hash-aware upload mode:
    • full upload when the dataset changed
    • metadata-only upload when the release is code-only
  • add HF dataset card template at assets/hf_dataset/README.md.jinja2
  • add dev deps: datasets, huggingface_hub, jinja2
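
The fingerprinting and hash-aware upload decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from dev/release_tasks.py: the function names, the choice of SHA-256, and the idea of hashing file names plus raw bytes are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def dataset_hash(paths):
    """Return a SHA-256 fingerprint over file names and contents.

    Hypothetical helper: the PR does not show how dataset_hash is computed.
    """
    digest = hashlib.sha256()
    # Sort by file name so the fingerprint does not depend on argument order.
    for path in sorted((Path(p) for p in paths), key=lambda p: p.name):
        name = path.name.encode("utf-8")
        data = path.read_bytes()
        # Length-prefix each field so adjacent fields cannot run together.
        digest.update(len(name).to_bytes(8, "big") + name)
        digest.update(len(data).to_bytes(8, "big") + data)
    return digest.hexdigest()


def choose_upload_mode(local_hash, published_hash):
    """Pick "full" when the data changed or nothing has been published yet."""
    if published_hash is None or local_hash != published_hash:
        return "full"
    return "metadata-only"
```

Under these assumptions, a release that only touches code leaves the fingerprint unchanged, so `choose_upload_mode` returns "metadata-only" and the parquet files would not need to be re-uploaded.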

Copilot AI left a comment

Pull request overview

Adds a Hugging Face (HF) dataset release pipeline (build + upload) with dataset fingerprinting to support “full” vs “metadata-only” uploads, and switches package versioning to VCS tags via hatch-vcs. Also consolidates/updates CI workflows and the shared setup composite action to support caching and consistent dependency installation.

Changes:

  • Implement dev.release.build-hf-dataset and dev.release.upload-hf-dataset, including split validation, README rendering, version.json stamping, HF tagging, and hash-aware upload mode.
  • Move project versioning to hatch-vcs (dynamic version from tags) and update Docker/CI to handle builds where .git metadata may be missing.
  • Update GitHub workflows (permissions, caching, docker publish version propagation) and consolidate env docker tests into a single workflow.
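
As a rough illustration of the split validation and version.json stamping mentioned in the first change, the sketch below checks two plausible invariants and renders a stamp. The invariants (unique ids, hard split contained in full split), the `task_id` key, and the stamp fields are assumptions for the example; the PR does not spell out which invariants or fields are actually used.

```python
import json


def validate_splits(full_rows, hard_rows, id_key="task_id"):
    """Raise ValueError if the assumed split invariants do not hold."""
    full_ids = [row[id_key] for row in full_rows]
    hard_ids = [row[id_key] for row in hard_rows]
    # Assumed invariant 1: ids are unique within each split.
    if len(set(full_ids)) != len(full_ids):
        raise ValueError("duplicate ids in full split")
    if len(set(hard_ids)) != len(hard_ids):
        raise ValueError("duplicate ids in hard split")
    # Assumed invariant 2: every hard row also appears in the full split.
    missing = set(hard_ids) - set(full_ids)
    if missing:
        raise ValueError(f"hard ids missing from full split: {sorted(missing)}")


def render_version_stamp(version, dataset_hash):
    """Return version.json content as text (hypothetical field names)."""
    stamp = {"version": version, "dataset_hash": dataset_hash}
    return json.dumps(stamp, indent=2, sort_keys=True) + "\n"
```

Failing fast on a broken invariant before any artifact is written keeps a bad build from ever reaching the upload step.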

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| uv.lock | Adds dev dependency lock entries for HF dataset tooling (datasets, huggingface-hub, jinja2, etc.) and updates resolution markers. |
| pyproject.toml | Switches to dynamic versioning with hatch-vcs; adds dev deps needed for HF dataset pipeline. |
| dev/release_tasks.py | Adds HF dataset build/upload tasks, dataset hashing, README templating, and stricter release/tag checks; updates release/tag tasks to accept explicit versions. |
| assets/hf_dataset/README.md.jinja2 | Adds HF dataset card template rendered during dataset builds. |
| Dockerfile | Adds version “pretend” env vars/ARGs to support VCS-derived versioning without .git in Docker build context. |
| .github/workflows/test.yml | Updates workflow permissions (incl. actions: write) to support caching behavior used by setup steps. |
| .github/workflows/test-envs-docker.yml | Adds consolidated workflow to test docker environments across sites with matrix + caching. |
| .github/workflows/test-env-ctrl.yml | Updates permissions and adds timeout. |
| .github/workflows/test-docker-wikipedia.yml | Removes older per-site docker workflow (replaced by consolidated matrix workflow). |
| .github/workflows/test-docker-shopping.yml | Removes older per-site docker workflow (replaced by consolidated matrix workflow). |
| .github/workflows/test-docker-shopping-admin.yml | Removes older per-site docker workflow (replaced by consolidated matrix workflow). |
| .github/workflows/test-docker-reddit.yml | Removes older per-site docker workflow (replaced by consolidated matrix workflow). |
| .github/workflows/test-docker-map.yml.disabled | Removes disabled per-site docker workflow (superseded by consolidated workflow). |
| .github/workflows/test-docker-gitlab.yml | Removes older per-site docker workflow (replaced by consolidated matrix workflow). |
| .github/workflows/release.yml | Changes release flow to accept an explicit version and rely on tag-based versioning; ensures full git history for tag/version resolution. |
| .github/workflows/publish-pypi.yml | Uses shared setup action with full git history to support hatch-vcs version resolution during builds. |
| .github/workflows/lint.yml | Updates workflow permissions to support caching behavior used by setup steps. |
| .github/workflows/docs-publish.yml | Uses shared setup action and adds permissions needed for caching behavior. |
| .github/workflows/docker-publish.yml | Determines build version for docker builds and passes it into Docker build args for VCS-derived versioning fallback. |
| .github/workflows/copilot-setup-steps.yml | Updates permissions to support caching behavior used by setup steps. |
| .github/actions/setup/action.yml | Adds configurable uv sync args and optional git identity configuration; keeps uv caching enabled. |
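
The Dockerfile and docker-publish rows describe a fallback for builds that lack .git metadata. A minimal sketch of that idea is below, assuming the "pretend" variable is `SETUPTOOLS_SCM_PRETEND_VERSION` (the override read by setuptools-scm, which hatch-vcs builds on); the summary above does not name the variable, and `resolve_build_version` is a hypothetical helper, not part of the PR.

```python
import os
import subprocess


def resolve_build_version(env=None, cwd="."):
    """Return a version string, or None if none can be determined.

    Hypothetical helper: prefers an explicit pretend version, then falls
    back to the most recent git tag reachable from HEAD.
    """
    env = os.environ if env is None else env
    # An explicit override wins, e.g. when passed in as a Docker build ARG.
    pretend = env.get("SETUPTOOLS_SCM_PRETEND_VERSION", "").strip()
    if pretend:
        return pretend
    try:
        result = subprocess.run(
            ["git", "describe", "--tags", "--abbrev=0"],
            cwd=cwd,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or there is no repository or tag here.
        return None
    return result.stdout.strip() or None
```

In a workflow, the value resolved on the runner (where full git history is available) could be passed into the image build so that the package still gets a version inside a context without .git.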

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 18 changed files in this pull request and generated 6 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 4 comments.


Am1n3e merged commit a6a58b6 into main on Feb 7, 2026
3 checks passed
Am1n3e deleted the add-hf-dataset branch on February 7, 2026 at 20:22
2 participants