feat(ci): add HPC-grade CI/CD workflows, GPU hardware validation, and benchmark regression gate #6
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| { | ||
| "benchmarks": [] | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| name: Benchmark Regression | ||
|
|
||
| on: | ||
| pull_request: | ||
| branches: [main, master] | ||
| schedule: | ||
| - cron: "30 2 * * *" | ||
| workflow_dispatch: | ||
|
|
||
| jobs: | ||
| benchmark-cpu: | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - name: Checkout | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Set up Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: "3.11" | ||
| cache: "pip" | ||
|
|
||
| - name: Install dependencies | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install -e . | ||
| pip install pytest pytest-benchmark | ||
|
|
||
| - name: Run benchmark suite | ||
| run: pytest tests -m "benchmark" --benchmark-json benchmark.json -q | ||
|
|
||
| - name: Enforce regression threshold | ||
| run: python scripts/ci/check_benchmark_regression.py benchmark.json 0.05 | ||
|
|
||
| - name: Upload benchmark report | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: benchmark-report | ||
| path: benchmark.json | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| name: Docs | ||
|
|
||
| on: | ||
| pull_request: | ||
| paths: | ||
| - "docs/**" | ||
| - "README.md" | ||
| - ".github/workflows/docs.yml" | ||
| push: | ||
| branches: [main, master] | ||
| paths: | ||
| - "docs/**" | ||
| - "README.md" | ||
| workflow_dispatch: | ||
|
|
||
| jobs: | ||
| docs-build: | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - name: Checkout | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Set up Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: "3.11" | ||
|
|
||
| - name: Install docs dependencies | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install mkdocs mkdocs-material | ||
|
|
||
| - name: Build docs | ||
| run: mkdocs build --strict |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,66 @@ | ||
| name: GPU Hardware Validation | ||
|
|
||
| on: | ||
| workflow_dispatch: | ||
| schedule: | ||
| - cron: "0 1 * * 0" | ||
|
|
||
| jobs: | ||
| gpu-smoke: | ||
| name: GPU smoke (${{ matrix.runner_label }}) | ||
| runs-on: [self-hosted, linux, x64, gpu, "${{ matrix.runner_label }}"] | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| include: | ||
| - runner_label: a100 | ||
| cuda_version: "11.8" | ||
| gpu_arch: sm_80 | ||
| - runner_label: h100 | ||
| cuda_version: "12.1" | ||
| gpu_arch: sm_90 | ||
| - runner_label: rtx4090 | ||
| cuda_version: "12.4" | ||
| gpu_arch: sm_89 | ||
|
|
||
| env: | ||
| EXPECTED_CUDA: ${{ matrix.cuda_version }} | ||
| EXPECTED_ARCH: ${{ matrix.gpu_arch }} | ||
|
|
||
| steps: | ||
| - name: Checkout | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Set up Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: "3.11" | ||
|
|
||
| - name: Install dependencies | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install -e . | ||
| pip install pytest | ||
|
|
||
| - name: Verify hardware runtime | ||
| run: | | ||
| nvidia-smi | ||
| python - <<'PY' | ||
| import os | ||
| print(f"Expected CUDA: {os.environ['EXPECTED_CUDA']}") | ||
| print(f"Expected arch: {os.environ['EXPECTED_ARCH']}") | ||
| PY | ||
|
|
||
| - name: GPU integration and numerical tests | ||
| run: | | ||
| pytest tests -m "gpu or numerical" -q | ||
|
This GPU validation step filters on … |
||
|
|
||
| - name: Performance regression check | ||
| run: | | ||
| pytest tests -m "benchmark" --benchmark-json benchmark-${{ matrix.runner_label }}.json -q | ||
|
|
||
| - name: Upload benchmark artifact | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: benchmark-${{ matrix.runner_label }} | ||
| path: benchmark-${{ matrix.runner_label }}.json | ||
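The "Verify hardware runtime" step above prints `EXPECTED_CUDA` and `EXPECTED_ARCH` but does not compare them with what the runner actually provides. A minimal sketch of such a comparison follows. The helper names `arch_to_capability` and `runtime_mismatches` are hypothetical and not part of this PR, and the sketch assumes arch labels of the plain `sm_XY` form used in the matrix.

```python
def arch_to_capability(arch: str) -> tuple[int, int]:
    """Convert an arch label such as 'sm_89' into (major, minor) = (8, 9)."""
    digits = arch.removeprefix("sm_")
    if digits == arch or len(digits) < 2 or not digits.isdigit():
        raise ValueError(f"unrecognized GPU arch label: {arch!r}")
    # The last digit is the minor version; everything before it is the major.
    return int(digits[:-1]), int(digits[-1])


def runtime_mismatches(expected_cuda, expected_arch, actual_cuda, actual_capability):
    """Return human-readable mismatch messages; an empty list means all good."""
    problems = []
    # Compare only major.minor so that '12.1' matches '12.1' or '12.1.1'.
    if actual_cuda is None or actual_cuda.split(".")[:2] != expected_cuda.split(".")[:2]:
        problems.append(f"CUDA: expected {expected_cuda}, found {actual_cuda}")
    if tuple(actual_capability) != arch_to_capability(expected_arch):
        major, minor = actual_capability
        problems.append(f"arch: expected {expected_arch}, found sm_{major}{minor}")
    return problems
```

On a runner where PyTorch is installed (an assumption; the workflow only shows `pip install -e .` and `pytest`), the actual values could come from `torch.version.cuda`, which reports the CUDA version the PyTorch build targets, and `torch.cuda.get_device_capability()`. The step would then fail when the returned list is non-empty.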
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| name: HPC Matrix | ||
|
|
||
| on: | ||
| pull_request: | ||
| branches: [main, master] | ||
| push: | ||
| branches: [main, master] | ||
| schedule: | ||
| - cron: "0 3 * * *" | ||
| workflow_dispatch: | ||
|
|
||
| jobs: | ||
| matrix-cpu: | ||
| name: CPU matrix (py${{ matrix.python-version }} / torch${{ matrix.torch-version }}) | ||
| runs-on: ubuntu-latest | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| python-version: ["3.9", "3.10", "3.11", "3.12"] | ||
| torch-version: ["2.2.2", "2.3.1"] | ||
|
|
||
| steps: | ||
| - name: Checkout | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Set up Python ${{ matrix.python-version }} | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: ${{ matrix.python-version }} | ||
| cache: "pip" | ||
|
|
||
| - name: Install dependencies | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install -e . | ||
| pip install "torch==${{ matrix.torch-version }}" --index-url https://download.pytorch.org/whl/cpu | ||
| pip install pytest pytest-cov mypy ruff | ||
|
|
||
| - name: Lint and type checks | ||
| run: | | ||
| ruff check . | ||
| ruff format --check . | ||
| mypy kernels implementations | ||
|
|
||
| - name: Unit tests | ||
| run: pytest -m "not gpu" --cov=kernels --cov=implementations --cov-fail-under=85 | ||
|
|
||
| - name: Integration tests | ||
| run: pytest tests -m "integration and not gpu" -q | ||
|
The integration step uses … |
||
|
|
||
| cuda-compat: | ||
| name: CUDA compatibility (containerized) | ||
| runs-on: ubuntu-latest | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| cuda-tag: ["11.8.0-runtime-ubuntu22.04", "12.1.1-runtime-ubuntu22.04"] | ||
|
|
||
| container: | ||
| image: nvidia/cuda:${{ matrix.cuda-tag }} | ||
|
|
||
| steps: | ||
| - name: Checkout | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Install Python and tooling | ||
| run: | | ||
| apt-get update | ||
| apt-get install -y python3 python3-pip | ||
| python3 -m pip install --upgrade pip | ||
| pip install -e . | ||
| pip install pytest | ||
|
|
||
| - name: Import and fallback smoke check | ||
| run: | | ||
| python3 -c "import kernels; print('kernels import ok')" | ||
| pytest tests -m "fallback" -q | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,117 @@ | ||
| # HPC CI/CD Architecture for KERNELS | ||
|
|
||
| This document defines a production-grade CI/CD design for compute-kernel style | ||
| repositories where correctness, determinism, and hardware compatibility are first- | ||
| class release gates. | ||
|
|
||
| ## Pipeline Layers | ||
|
|
||
| ```text | ||
| Commit | ||
| -> Static Analysis | ||
| -> Deterministic Build Matrix | ||
| -> CPU + GPU Test Matrix | ||
| -> Security / Supply Chain | ||
| -> Numerical Validation | ||
| -> Performance Regression Validation | ||
| -> Artifact Packaging + Attestation | ||
| -> Release + Registry Publish | ||
| -> Production Telemetry Hooks | ||
| ``` | ||
|
|
||
| ## Workflow Set | ||
|
|
||
| | Workflow | Trigger | Purpose | | ||
| |----------|---------|---------| | ||
| | `ci.yml` | PR + push | Fast quality checks, unit tests, package build | | ||
| | `hpc-matrix.yml` | PR + push + nightly | Expanded matrix with PyTorch/CUDA compatibility checks | | ||
| | `gpu-hardware.yml` | nightly + manual + release candidate | Hardware validation on self-hosted GPU runners | | ||
| | `security.yml` | PR + push + weekly | SAST, dependency audit, secret scan | | ||
| | `benchmark.yml` | PR + nightly | Performance baselines and regression thresholds | | ||
| | `release.yml` | tags | Build, verify, and publish release artifacts | | ||
| | `docs.yml` | docs changes + release | Build and publish documentation | | ||
|
|
||
| ## Deterministic Build Strategy | ||
|
|
||
| Determinism is enforced by: | ||
|
|
||
| 1. Pinning lock files and toolchain versions. | ||
| 2. Building wheels in isolated environments (`python -m build`). | ||
| 3. Verifying metadata (`twine check`). | ||
| 4. Capturing SBOM and provenance attestations. | ||
|
|
||
| ## Build Matrix Recommendation | ||
|
|
||
| | Axis | Values | | ||
| |------|--------| | ||
| | Python | 3.9, 3.10, 3.11, 3.12 | | ||
| | PyTorch | Supported minor versions | | ||
| | CUDA | 11.8, 12.1, 12.4 | | ||
| | GPU arch | sm_80, sm_86, sm_89, sm_90 | | ||
| | OS | ubuntu-latest, self-hosted GPU Linux | | ||
|
|
||
| ## Validation Gates | ||
|
|
||
| ### 1. Static Quality Gate | ||
|
|
||
| - `ruff check .` | ||
| - `ruff format --check .` | ||
| - `mypy kernels implementations` | ||
|
|
||
| ### 2. Correctness Gate | ||
|
|
||
| - Unit tests with coverage threshold. | ||
| - Integration tests for runtime loading and fallback behavior. | ||
| - Numerical equivalence checks against reference implementations. | ||
|
|
||
| ### 3. Hardware Gate | ||
|
|
||
| - Run kernel smoke tests on A100 / H100 / RTX-class runners. | ||
| - Enforce CPU fallback tests in every PR. | ||
|
|
||
| ### 4. Security Gate | ||
|
|
||
| - CodeQL | ||
| - `pip-audit` | ||
| - `safety` | ||
| - `gitleaks` | ||
| - Optional Semgrep policy pack | ||
|
|
||
| ### 5. Performance Gate | ||
|
|
||
| - Pytest benchmark suite with historical comparison. | ||
| - Fail CI if median latency regresses beyond threshold (default 5%). | ||
|
|
||
| ## Release Controls | ||
|
|
||
| Release jobs should execute only after all mandatory checks pass: | ||
|
|
||
| - Lint / type / tests | ||
| - Security scans | ||
| - Benchmark regression check | ||
| - Docs build | ||
|
|
||
| On tag: | ||
|
|
||
| 1. Build source + wheel distributions. | ||
| 2. Generate SBOM (`syft`) and vulnerability report (`grype` or `trivy`). | ||
| 3. Publish to PyPI. | ||
| 4. Publish container image. | ||
| 5. Attach benchmark + security artifacts to GitHub release. | ||
|
|
||
| ## Rollback Policy | ||
|
|
||
| If release validation fails for correctness or benchmark thresholds: | ||
|
|
||
| - Mark release candidate as failed. | ||
| - Prevent publication jobs from running. | ||
| - Emit structured summary and incident artifact. | ||
|
|
||
| ## Scaling Guidance | ||
|
|
||
| For large OSS adoption: | ||
|
|
||
| - Split fast and slow workflows; protect PR latency. | ||
| - Use distributed/self-hosted GPU pools by architecture label. | ||
| - Cache Python deps, build layers, and benchmark baselines. | ||
| - Nightly deep validation for expensive fuzz + hardware tests. |
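The Performance Gate in the document above (fail if median latency regresses beyond a 5% default threshold) can be sketched as follows. This is not the PR's `scripts/ci/check_benchmark_regression.py`, which is not shown in this diff. The function names are hypothetical, and the sketch assumes both files use the pytest-benchmark JSON layout, with a top-level `benchmarks` list whose entries carry `fullname` and `stats.median`.

```python
import json
from pathlib import Path


def load_medians(path):
    """Map each benchmark's full name to its median runtime in seconds."""
    data = json.loads(Path(path).read_text())
    return {b["fullname"]: b["stats"]["median"] for b in data.get("benchmarks", [])}


def find_regressions(baseline, current, threshold=0.05):
    """Return (name, relative_change) for benchmarks slower than the threshold.

    `baseline` and `current` are dicts of name -> median seconds. Benchmarks
    missing from either side are ignored rather than treated as failures.
    """
    regressions = []
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None or base <= 0:
            continue
        change = (cur - base) / base
        if change > threshold:
            regressions.append((name, change))
    return regressions
```

With an empty baseline such as the `{"benchmarks": []}` file added in this PR, `find_regressions` returns an empty list, so a gate built this way would pass trivially until a real baseline is recorded.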
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| site_name: KERNELS | ||
| site_description: Deterministic control planes and CI/CD governance documentation. | ||
| repo_url: https://github.com/ayais12210-hub/kernels | ||
|
|
||
| theme: | ||
| name: material | ||
|
|
||
| nav: | ||
| - Home: README.md | ||
| - Architecture: ARCHITECTURE.md | ||
| - Threat Model: THREAT_MODEL.md | ||
| - Pipelines: | ||
| - Phases: pipelines/PHASES.md | ||
| - Gates: pipelines/GATES.md | ||
| - HPC CI/CD Architecture: pipelines/HPC_CICD_ARCHITECTURE.md |
`pytest -m` only executes tests matching the mark expression (`pytest --help`), and this repository currently has no `benchmark`-marked tests under `tests/`, so this command deselects everything and exits with code 5 (no tests collected). In this workflow context (PR and nightly), that causes the benchmark job to fail before the regression checker can run.
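One possible way to avoid that failure while no `benchmark`-marked tests exist is to treat exit code 5 as a skip rather than an error. A minimal sketch, assuming a hypothetical wrapper (`run_benchmarks`, not part of this PR) that the workflow step would call instead of the bare `pytest` command:

```python
import subprocess
import sys

# pytest's documented exit code for "no tests were collected".
NO_TESTS_COLLECTED = 5


def normalize_exit_code(code: int) -> int:
    """Map 'no tests collected' to success; pass every other code through."""
    return 0 if code == NO_TESTS_COLLECTED else code


def run_benchmarks(pytest_args: list[str]) -> int:
    """Run pytest in a subprocess and return the normalized exit code."""
    result = subprocess.run([sys.executable, "-m", "pytest", *pytest_args])
    if result.returncode == NO_TESTS_COLLECTED:
        print("No benchmark-marked tests collected; skipping regression gate.")
    return normalize_exit_code(result.returncode)
```

A step could then exit with `run_benchmarks(["tests", "-m", "benchmark", "--benchmark-json", "benchmark.json", "-q"])`. Real test failures (exit code 1) still fail the job. The downstream regression step would also need to tolerate a missing or empty report, which is not shown here.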