Merged
3 changes: 3 additions & 0 deletions .ci/benchmarks/baseline.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
{
"benchmarks": []
}
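The committed baseline starts empty (`{"benchmarks": []}`), so the checker has nothing to compare against until a baseline is recorded. A sketch of how a baseline-aware checker might index it — the schema (`name`/`median` keys) is an assumption, since the script itself is not shown in this diff:

```python
import json
import tempfile

def load_baseline(path: str) -> dict:
    """Map benchmark name -> baseline median latency (schema assumed)."""
    with open(path) as f:
        data = json.load(f)
    return {b["name"]: b["median"] for b in data.get("benchmarks", [])}

# Simulate the committed starting baseline: {"benchmarks": []}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"benchmarks": []}, f)
    path = f.name

# With no recorded entries, there are no thresholds to enforce yet.
baseline = load_baseline(path)
```

An empty result would typically make the regression gate a no-op on the first run, after which CI can persist the measured medians as the new baseline.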
39 changes: 39 additions & 0 deletions .github/workflows/benchmark.yml
@@ -0,0 +1,39 @@
name: Benchmark Regression

on:
pull_request:
branches: [main, master]
schedule:
- cron: "30 2 * * *"
workflow_dispatch:

jobs:
benchmark-cpu:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: "pip"

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e .
pip install pytest pytest-benchmark

- name: Run benchmark suite
run: pytest tests -m "benchmark" --benchmark-json benchmark.json -q


P1: Run benchmark suite without an empty marker filter

pytest -m only executes tests matching the mark expression (pytest --help), and this repository currently has no benchmark-marked tests under tests/, so this command deselects everything and exits with code 5 (No tests collected). In this workflow context (PR and nightly), that causes the benchmark job to fail before the regression checker can run.
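As the comment notes, `-m "benchmark"` selects nothing until at least one test carries that mark. A minimal sketch of a marked test — the file name and body are hypothetical, and the marker would also need registering (e.g. under `markers` in pytest config) to avoid unknown-mark warnings:

```python
# tests/test_benchmark_smoke.py -- illustrative only; the body is a stand-in
import time

import pytest

@pytest.mark.benchmark  # marker must also be registered in pytest config
def test_kernel_latency_smoke():
    start = time.perf_counter()
    sum(range(100_000))  # placeholder for the kernel invocation under test
    elapsed = time.perf_counter() - start
    assert elapsed < 1.0  # generous smoke bound, not a regression gate
```

With one such test present, `pytest tests -m "benchmark"` collects it instead of exiting with code 5.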



- name: Enforce regression threshold
run: python scripts/ci/check_benchmark_regression.py benchmark.json 0.05

- name: Upload benchmark report
uses: actions/upload-artifact@v4
with:
name: benchmark-report
path: benchmark.json
34 changes: 34 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,34 @@
name: Docs

on:
pull_request:
paths:
- "docs/**"
- "README.md"
- ".github/workflows/docs.yml"
push:
branches: [main, master]
paths:
- "docs/**"
- "README.md"
workflow_dispatch:

jobs:
docs-build:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"

- name: Install docs dependencies
run: |
python -m pip install --upgrade pip
pip install mkdocs mkdocs-material

- name: Build docs
run: mkdocs build --strict
66 changes: 66 additions & 0 deletions .github/workflows/gpu-hardware.yml
@@ -0,0 +1,66 @@
name: GPU Hardware Validation

on:
workflow_dispatch:
schedule:
- cron: "0 1 * * 0"

jobs:
gpu-smoke:
name: GPU smoke (${{ matrix.runner_label }})
runs-on: [self-hosted, linux, x64, gpu, "${{ matrix.runner_label }}"]
strategy:
fail-fast: false
matrix:
include:
- runner_label: a100
cuda_version: "11.8"
gpu_arch: sm_80
- runner_label: h100
cuda_version: "12.1"
gpu_arch: sm_90
- runner_label: rtx4090
cuda_version: "12.4"
gpu_arch: sm_89

env:
EXPECTED_CUDA: ${{ matrix.cuda_version }}
EXPECTED_ARCH: ${{ matrix.gpu_arch }}

steps:
- name: Checkout
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e .
pip install pytest

- name: Verify hardware runtime
run: |
nvidia-smi
python - <<'PY'
import os
print(f"Expected CUDA: {os.environ['EXPECTED_CUDA']}")
print(f"Expected arch: {os.environ['EXPECTED_ARCH']}")
PY

- name: GPU integration and numerical tests
run: |
pytest tests -m "gpu or numerical" -q


P1: Fix GPU test selection to avoid no-tests-collected failures

This GPU validation step filters on gpu/numerical markers, but there are no such marks in the present test suite, so pytest exits 5 after deselecting everything. On scheduled/manual self-hosted runs, the workflow will fail at test selection rather than validating hardware behavior.



- name: Performance regression check
run: |
pytest tests -m "benchmark" --benchmark-json benchmark-${{ matrix.runner_label }}.json -q

- name: Upload benchmark artifact
uses: actions/upload-artifact@v4
with:
name: benchmark-${{ matrix.runner_label }}
path: benchmark-${{ matrix.runner_label }}.json
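The "Verify hardware runtime" step above only echoes the expected CUDA version and architecture; it never compares them to what `nvidia-smi` reports. A sketch of an actual check, assuming the standard `nvidia-smi` banner format (the sample string below is illustrative):

```python
import re

def cuda_version_from_smi(smi_output: str) -> str:
    # The nvidia-smi banner contains e.g. "CUDA Version: 12.1"
    match = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
    if match is None:
        raise RuntimeError("could not find CUDA version in nvidia-smi output")
    return match.group(1)

# In CI this string would come from:
#   subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
sample = "| NVIDIA-SMI 535.104.05  Driver Version: 535.104.05  CUDA Version: 12.1 |"
```

The heredoc could then fail the job when `cuda_version_from_smi(...)` does not match `EXPECTED_CUDA`, turning the step into a real gate rather than a log line.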
77 changes: 77 additions & 0 deletions .github/workflows/hpc-matrix.yml
@@ -0,0 +1,77 @@
name: HPC Matrix

on:
pull_request:
branches: [main, master]
push:
branches: [main, master]
schedule:
- cron: "0 3 * * *"
workflow_dispatch:

jobs:
matrix-cpu:
name: CPU matrix (py${{ matrix.python-version }} / torch${{ matrix.torch-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
torch-version: ["2.2.2", "2.3.1"]

steps:
- name: Checkout
uses: actions/checkout@v4

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: "pip"

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e .
pip install "torch==${{ matrix.torch-version }}" --index-url https://download.pytorch.org/whl/cpu
pip install pytest pytest-cov mypy ruff

- name: Lint and type checks
run: |
ruff check .
ruff format --check .
mypy kernels implementations

- name: Unit tests
run: pytest -m "not gpu" --cov=kernels --cov=implementations --cov-fail-under=85

- name: Integration tests
run: pytest tests -m "integration and not gpu" -q


P1: Remove HPC marker filters that select zero tests

The integration step uses -m "integration and not gpu", but there are no matching marks in the current tests/ tree, so pytest deselects all tests and returns exit code 5; the same issue also affects the fallback step in this workflow. As written, both matrix jobs can fail even when the codebase is otherwise healthy.



cuda-compat:
name: CUDA compatibility (containerized)
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
cuda-tag: ["11.8.0-runtime-ubuntu22.04", "12.1.1-runtime-ubuntu22.04"]

container:
image: nvidia/cuda:${{ matrix.cuda-tag }}

steps:
- name: Checkout
uses: actions/checkout@v4

- name: Install Python and tooling
run: |
apt-get update
apt-get install -y python3 python3-pip
python3 -m pip install --upgrade pip
pip install -e .
pip install pytest

- name: Import and fallback smoke check
run: |
python3 -c "import kernels; print('kernels import ok')"
pytest tests -m "fallback" -q
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,10 @@ The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

### Added

- Added a production-grade HPC CI/CD architecture reference document covering deterministic build strategy, GPU matrix validation, release gating, and scaling guidance.
- Added new GitHub Actions workflows for expanded HPC matrix validation, dedicated GPU hardware testing, benchmark regression gating, and docs build verification.
- Added a benchmark regression check utility script with baseline support to enforce latency performance thresholds in CI.
- Added MkDocs configuration to enable strict documentation build checks in CI.
- Added a scheduled dependency health workflow that validates installation integrity, runs `pip check`, audits vulnerabilities with `pip-audit`, and verifies package imports across Python 3.9-3.12.
- Added a weekly dependency canary workflow that upgrades core quality tooling to latest versions and runs lint, type checks, tests, and package build validation.
- Expanded CI workflow to run a Python 3.9-3.12 matrix with linting, format checks, type checking, security scanning, coverage-enforced tests, smoke verification, and package build validation.
@@ -27,6 +31,10 @@ The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

### Added

- Added a production-grade HPC CI/CD architecture reference document covering deterministic build strategy, GPU matrix validation, release gating, and scaling guidance.
- Added new GitHub Actions workflows for expanded HPC matrix validation, dedicated GPU hardware testing, benchmark regression gating, and docs build verification.
- Added a benchmark regression check utility script with baseline support to enforce latency performance thresholds in CI.
- Added MkDocs configuration to enable strict documentation build checks in CI.
- Initial kernel implementation with deterministic state machine
- Core types: KernelState, KernelRequest, KernelReceipt, Decision, ReceiptStatus
- Append-only audit ledger with hash-chained entries
1 change: 1 addition & 0 deletions README.md
@@ -150,6 +150,7 @@ python -m kernels --help
|----------|---------|
| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | Component boundaries and data flows |
| [docs/THREAT_MODEL.md](docs/THREAT_MODEL.md) | Adversary model + mitigations |
| [docs/pipelines/HPC_CICD_ARCHITECTURE.md](docs/pipelines/HPC_CICD_ARCHITECTURE.md) | Production-grade CI/CD architecture for kernel projects |
| [docs/FAQ.md](docs/FAQ.md) | Usage clarifications |

---
1 change: 1 addition & 0 deletions docs/README.md
@@ -8,6 +8,7 @@ Documentation for understanding and integrating KERNELS.
|----------|---------|
| [ARCHITECTURE.md](ARCHITECTURE.md) | Component boundaries and data flows |
| [THREAT_MODEL.md](THREAT_MODEL.md) | Adversary model + mitigations |
| [pipelines/HPC_CICD_ARCHITECTURE.md](pipelines/HPC_CICD_ARCHITECTURE.md) | Advanced CI/CD design for deterministic kernel infrastructure |
| [FAQ.md](FAQ.md) | Usage clarifications and non-goals |

## Getting Started
117 changes: 117 additions & 0 deletions docs/pipelines/HPC_CICD_ARCHITECTURE.md
@@ -0,0 +1,117 @@
# HPC CI/CD Architecture for KERNELS

This document defines a production-grade CI/CD design for compute-kernel style repositories where correctness, determinism, and hardware compatibility are first-class release gates.

## Pipeline Layers

```text
Commit
-> Static Analysis
-> Deterministic Build Matrix
-> CPU + GPU Test Matrix
-> Security / Supply Chain
-> Numerical Validation
-> Performance Regression Validation
-> Artifact Packaging + Attestation
-> Release + Registry Publish
-> Production Telemetry Hooks
```

## Workflow Set

| Workflow | Trigger | Purpose |
|----------|---------|---------|
| `ci.yml` | PR + push | Fast quality checks, unit tests, package build |
| `hpc-matrix.yml` | PR + push + nightly | Expanded matrix with PyTorch/CUDA compatibility checks |
| `gpu-hardware.yml` | nightly + manual + release candidate | Hardware validation on self-hosted GPU runners |
| `security.yml` | PR + push + weekly | SAST, dependency audit, secret scan |
| `benchmark.yml` | PR + nightly | Performance baselines and regression thresholds |
| `release.yml` | tags | Build, verify, and publish release artifacts |
| `docs.yml` | docs changes + release | Build and publish documentation |

## Deterministic Build Strategy

Determinism is enforced by:

1. Pinning lock files and toolchain versions.
2. Building wheels in isolated environments (`python -m build`).
3. Verifying metadata (`twine check`).
4. Capturing SBOM and provenance attestations.

## Build Matrix Recommendation

| Axis | Values |
|------|--------|
| Python | 3.9, 3.10, 3.11, 3.12 |
| PyTorch | Supported minor versions |
| CUDA | 11.8, 12.1, 12.4 |
| GPU arch | sm_80, sm_86, sm_89, sm_90 |
| OS | ubuntu-latest, self-hosted GPU Linux |

## Validation Gates

### 1. Static Quality Gate

- `ruff check .`
- `ruff format --check .`
- `mypy kernels implementations`

### 2. Correctness Gate

- Unit tests with coverage threshold.
- Integration tests for runtime loading and fallback behavior.
- Numerical equivalence checks against reference implementations.
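A numerical-equivalence check of this kind is typically a tolerance comparison between the optimized kernel and a trusted reference. A minimal sketch, with hypothetical function names standing in for the project's kernels:

```python
import numpy as np

def softmax_f32(x: np.ndarray) -> np.ndarray:
    # Stand-in for an optimized low-precision kernel under test
    x32 = x.astype(np.float32)
    e = np.exp(x32 - x32.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_reference(x: np.ndarray) -> np.ndarray:
    # Slow but trusted float64 reference implementation
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32))
np.testing.assert_allclose(
    softmax_f32(x), softmax_reference(x), rtol=1e-5, atol=1e-6
)
```

Choosing tolerances per dtype (looser for float32/float16 paths) keeps the gate strict without flaking on legitimate rounding differences.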

### 3. Hardware Gate

- Run kernel smoke tests on A100 / H100 / RTX-class runners.
- Enforce CPU fallback tests in every PR.

### 4. Security Gate

- CodeQL
- `pip-audit`
- `safety`
- `gitleaks`
- Optional Semgrep policy pack

### 5. Performance Gate

- Pytest benchmark suite with historical comparison.
- Fail CI if median latency regresses beyond threshold (default 5%).
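The regression gate invoked in the workflows as `python scripts/ci/check_benchmark_regression.py benchmark.json 0.05` is not included in this diff; the core comparison it implies (current median latency vs. a stored baseline, 5% tolerance) can be sketched as:

```python
def exceeds_threshold(current_median: float,
                      baseline_median: float,
                      threshold: float = 0.05) -> bool:
    # True when the current median latency regressed beyond the allowed
    # fraction of the baseline (e.g. threshold=0.05 means 5% slower fails).
    if baseline_median <= 0:
        return False  # no meaningful baseline to compare against
    return (current_median - baseline_median) / baseline_median > threshold
```

CI would call this per benchmark entry and exit non-zero if any check fails, which is what makes the step a blocking gate rather than a report.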

## Release Controls

Release jobs should execute only after all mandatory checks pass:

- Lint / type / tests
- Security scans
- Benchmark regression check
- Docs build

On tag:

1. Build source + wheel distributions.
2. Generate SBOM (`syft`) and vulnerability report (`grype` or `trivy`).
3. Publish to PyPI.
4. Publish container image.
5. Attach benchmark + security artifacts to GitHub release.

## Rollback Policy

If release validation fails for correctness or benchmark thresholds:

- Mark release candidate as failed.
- Prevent publication jobs from running.
- Emit structured summary and incident artifact.

## Scaling Guidance

For large OSS adoption:

- Split fast and slow workflows; protect PR latency.
- Use distributed/self-hosted GPU pools by architecture label.
- Cache Python deps, build layers, and benchmark baselines.
- Nightly deep validation for expensive fuzz + hardware tests.
15 changes: 15 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,15 @@
site_name: KERNELS
site_description: Deterministic control planes and CI/CD governance documentation.
repo_url: https://github.com/ayais12210-hub/kernels

theme:
name: material

nav:
- Home: README.md
- Architecture: ARCHITECTURE.md
- Threat Model: THREAT_MODEL.md
- Pipelines:
- Phases: pipelines/PHASES.md
- Gates: pipelines/GATES.md
- HPC CI/CD Architecture: pipelines/HPC_CICD_ARCHITECTURE.md