Add HF dataset release pipeline and hash-aware upload flow #19
de1e4ee to 52e1a60
Pull request overview
Adds a Hugging Face (HF) dataset release pipeline (build + upload) with dataset fingerprinting to support “full” vs “metadata-only” uploads, and switches package versioning to VCS tags via hatch-vcs. Also consolidates/updates CI workflows and the shared setup composite action to support caching and consistent dependency installation.
Changes:
- Implement `dev.release.build-hf-dataset` and `dev.release.upload-hf-dataset`, including split validation, README rendering, `version.json` stamping, HF tagging, and hash-aware upload mode.
- Move project versioning to `hatch-vcs` (dynamic version from tags) and update Docker/CI to handle builds where `.git` metadata may be missing.
- Update GitHub workflows (permissions, caching, docker publish version propagation) and consolidate env docker tests into a single workflow.
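The hash-aware upload mode could work roughly as in the sketch below. This is an illustration only, not the code in `dev/release_tasks.py`: the function names `dataset_hash` and `choose_upload_mode`, the choice of SHA-256, and the rule "re-upload data only when the fingerprint changes" are all assumptions based on the "full" vs "metadata-only" wording above.

```python
import hashlib
from pathlib import Path
from typing import Optional, Sequence


def dataset_hash(paths: Sequence[Path]) -> str:
    """Fingerprint data files by name and content (hypothetical helper).

    Files are sorted by name so the result does not depend on input order.
    """
    digest = hashlib.sha256()
    for path in sorted(paths, key=lambda p: p.name):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")  # separator between name and content hash
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


def choose_upload_mode(local_hash: str, remote_hash: Optional[str]) -> str:
    """Pick an upload mode (assumed rule, not confirmed by the PR).

    "full" when nothing is published yet or the data changed,
    "metadata-only" when the published fingerprint matches the local one.
    """
    if remote_hash is None or remote_hash != local_hash:
        return "full"
    return "metadata-only"
```

With a rule like this, a release that only touches the README or `version.json` would skip re-uploading the parquet files.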
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `uv.lock` | Adds dev dependency lock entries for HF dataset tooling (datasets, huggingface-hub, jinja2, etc.) and updates resolution markers. |
| `pyproject.toml` | Switches to dynamic versioning with hatch-vcs; adds dev deps needed for HF dataset pipeline. |
| `dev/release_tasks.py` | Adds HF dataset build/upload tasks, dataset hashing, README templating, and stricter release/tag checks; updates release/tag tasks to accept explicit versions. |
| `assets/hf_dataset/README.md.jinja2` | Adds HF dataset card template rendered during dataset builds. |
| `Dockerfile` | Adds version “pretend” env vars/ARGs to support VCS-derived versioning without `.git` in Docker build context. |
| `.github/workflows/test.yml` | Updates workflow permissions (incl. `actions: write`) to support caching behavior used by setup steps. |
| `.github/workflows/test-envs-docker.yml` | Adds consolidated workflow to test docker environments across sites with matrix + caching. |
| `.github/workflows/test-env-ctrl.yml` | Updates permissions and adds timeout. |
| `.github/workflows/test-docker-wikipedia.yml` | Removes older per-site docker workflow (replaced by consolidated matrix workflow). |
| `.github/workflows/test-docker-shopping.yml` | Removes older per-site docker workflow (replaced by consolidated matrix workflow). |
| `.github/workflows/test-docker-shopping-admin.yml` | Removes older per-site docker workflow (replaced by consolidated matrix workflow). |
| `.github/workflows/test-docker-reddit.yml` | Removes older per-site docker workflow (replaced by consolidated matrix workflow). |
| `.github/workflows/test-docker-map.yml.disabled` | Removes disabled per-site docker workflow (superseded by consolidated workflow). |
| `.github/workflows/test-docker-gitlab.yml` | Removes older per-site docker workflow (replaced by consolidated matrix workflow). |
| `.github/workflows/release.yml` | Changes release flow to accept an explicit version and rely on tag-based versioning; ensures full git history for tag/version resolution. |
| `.github/workflows/publish-pypi.yml` | Uses shared setup action with full git history to support hatch-vcs version resolution during builds. |
| `.github/workflows/lint.yml` | Updates workflow permissions to support caching behavior used by setup steps. |
| `.github/workflows/docs-publish.yml` | Uses shared setup action and adds permissions needed for caching behavior. |
| `.github/workflows/docker-publish.yml` | Determines build version for docker builds and passes it into Docker build args for VCS-derived versioning fallback. |
| `.github/workflows/copilot-setup-steps.yml` | Updates permissions to support caching behavior used by setup steps. |
| `.github/actions/setup/action.yml` | Adds configurable uv sync args and optional git identity configuration; keeps uv caching enabled. |
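For the versioning fallback described in the `Dockerfile` and `docker-publish.yml` rows, a minimal sketch of the idea follows. It assumes the "pretend" variable is `SETUPTOOLS_SCM_PRETEND_VERSION`, which setuptools-scm (the library hatch-vcs builds on) honors; the helper name `build_env` and its parameters are hypothetical and not taken from this PR.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# Environment variable read by setuptools-scm, which hatch-vcs delegates to.
PRETEND_VAR = "SETUPTOOLS_SCM_PRETEND_VERSION"


def build_env(
    repo_root: Path,
    version: Optional[str],
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return an environment for a package build (hypothetical helper).

    If `.git` exists, the version is left to be derived from tags.
    Otherwise an explicit version must be supplied and is passed through
    the pretend-version variable.
    """
    env = dict(os.environ if base_env is None else base_env)
    if (repo_root / ".git").exists():
        return env  # tags are available; no override needed
    if not version:
        raise ValueError("no .git metadata and no explicit version supplied")
    env[PRETEND_VAR] = version
    return env
```

Failing loudly when neither source of a version is available avoids silently publishing an image with a placeholder version.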
Pull request overview
Copilot reviewed 16 out of 18 changed files in this pull request and generated 6 comments.
Pull request overview
Copilot reviewed 14 out of 16 changed files in this pull request and generated 4 comments.
Summary
- Add `dev.release.build-hf-dataset` to build `full`/`hard` parquet artifacts, validate split invariants, render README from Jinja2 template, and stamp `version.json`
- Add `dev.release.upload-hf-dataset` with `--dry-run`, strict version checks, and HF tag create/verify
- Add dataset fingerprinting (`dataset_hash`) and hash-aware upload mode
- Add dataset card template `assets/hf_dataset/README.md.jinja2`
- Add dev dependencies `datasets`, `huggingface_hub`, `jinja2`
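The split invariants are not spelled out in this description. As a sketch of what such a check might look like, the example below assumes two invariants that are guesses rather than facts from the PR: ids are unique within each split, and every `hard` id also appears in `full`. The function name `validate_splits` is hypothetical.

```python
from typing import List, Sequence


def validate_splits(full_ids: Sequence[str], hard_ids: Sequence[str]) -> List[str]:
    """Return invariant violations; an empty list means the splits look consistent.

    Assumed invariants (not confirmed by the PR):
    - the full split is non-empty
    - ids are unique within each split
    - hard is a subset of full
    """
    errors: List[str] = []
    if not full_ids:
        errors.append("full split is empty")
    for name, ids in (("full", full_ids), ("hard", hard_ids)):
        if len(set(ids)) != len(ids):
            errors.append(f"{name} split contains duplicate ids")
    missing = sorted(set(hard_ids) - set(full_ids))
    if missing:
        errors.append(f"hard ids missing from full: {missing}")
    return errors
```

Collecting all violations instead of raising on the first one lets a build task report every problem in a single run.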