Skip to content

Latest commit

 

History

History
203 lines (148 loc) · 5.11 KB

File metadata and controls

203 lines (148 loc) · 5.11 KB

HPC CI/CD Architecture for KERNELS

This document defines a production-grade CI/CD topology for compute-kernel style repositories where determinism, security, hardware compatibility, and release repeatability are first-class gates.

1) CI/CD Topology

Developer
  -> Pull Request / Push
  -> Layer 1: Pre-commit Validation
  -> Layer 2: Static Analysis
  -> Layer 3: Dependency & Supply Chain
  -> Layer 4: Build Matrix
  -> Layer 5: Kernel Compilation
  -> Layer 6: Test Matrix
  -> Layer 7: Performance
  -> Layer 8: Security
  -> Layer 9: Packaging
  -> Layer 10: Release

2) Ten-Layer Workflow Catalog (44 Pipelines)

Layer 1 — Pre-commit Validation

  1. lint-python
  2. lint-cpp
  3. format-check
  4. import-order-check
  5. precommit-hooks
  6. commit-message-lint
  7. secrets-scan

Primary tools: ruff, black, clang-format, pre-commit, commitlint, gitleaks.

Layer 2 — Static Code Analysis

  1. mypy-type-check
  2. pylint-analysis
  3. code-complexity-check
  4. dead-code-detection
  5. docstring-coverage
  6. codeql-analysis

Primary tools: mypy, pylint, radon, vulture, codeql.

Layer 3 — Dependency & Supply Chain

  1. dependency-vulnerability-scan
  2. license-compliance
  3. dependency-update-check
  4. sbom-generation

Primary tools: pip-audit, safety, syft, dependabot.

Layer 4 — Build Matrix

  1. build-linux
  2. build-macos
  3. build-python-3.9
  4. build-python-3.10
  5. build-python-3.11
  6. build-cuda-11
  7. build-cuda-12

Layer 5 — Kernel Compilation

  1. compile-cuda-kernels
  2. compile-cpu-kernels
  3. compile-ptx
  4. compile-avx-optimizations

Layer 6 — Testing

  1. unit-tests
  2. integration-tests
  3. api-contract-tests
  4. gpu-runtime-tests
  5. numerical-accuracy-tests
  6. memory-safety-tests

Primary tools: pytest, cuda-memcheck, ASAN, valgrind.

Layer 7 — Performance

  1. benchmark-kernels
  2. performance-regression-detection
  3. latency-analysis
  4. memory-bandwidth-tests

Primary tools: pytest-benchmark, nvprof/nsight, Airspeed Velocity.

Layer 8 — Security

  1. container-vulnerability-scan
  2. binary-vulnerability-scan
  3. fuzz-testing

Primary tools: trivy, grype, libFuzzer, hypothesis.

Layer 9 — Packaging

  1. build-python-wheels
  2. build-docker-images

Layer 10 — Release

  1. publish-release

Primary outputs: signed PyPI wheels, signed OCI images, compiled kernels, benchmark reports, SBOM + provenance attestations.

3) Repository Layout Recommendation

repo/
├── kernels/
├── tests/
│   ├── unit/
│   ├── integration/
│   ├── gpu/
│   └── benchmarks/
├── benchmarks/
├── docker/
├── scripts/
├── .github/workflows/
└── pyproject.toml

4) KERNELS Workflow Mapping

The following files in this repository implement core parts of the architecture:

Existing workflow Layer coverage Notes
.github/workflows/ci.yml 1, 2, 4, 6 Fast lint/type/test/build gate
.github/workflows/hpc-matrix.yml 4, 5, 6 Expanded Python/Torch/CUDA compatibility
.github/workflows/gpu-hardware.yml 5, 6, 7 Hardware validation on self-hosted GPU runners
.github/workflows/security.yml 1, 2, 3, 8 CodeQL, dependency review, secret scanning
.github/workflows/benchmark.yml 7 Regression checks using benchmark artifacts
.github/workflows/dependency-health.yml 3 Dependency graph and vulnerability health
.github/workflows/dependency-canary.yml 3, 4 Canary runs against latest dependencies
.github/workflows/release.yml 9, 10 Build + publish releases
.github/workflows/docs.yml release support Documentation validation and publication

5) Performance Regression Gate

Use artifact-backed baseline comparison:

  1. Store a baseline benchmark artifact from main.
  2. Run PR benchmarks on the same hardware class.
  3. Compare with a fixed threshold (default: 5% slowdown).
  4. Fail PR when threshold is exceeded.

6) Hardware Validation Strategy

Recommended dedicated GPU runner pools:

  • Self-hosted RTX 4090
  • Self-hosted A100
  • Self-hosted H100

Alternative cloud-backed runner pools:

  • RunPod
  • Lambda Labs
  • GCP GPU nodes

7) Scaling Guidance for Large Projects

  • Cache aggressively: Python deps, build outputs, ccache, benchmark baselines.
  • Shard expensive tests across runners.
  • Keep PR checks fast; move deep validation to nightly/release candidates.
  • Use reusable workflows to avoid duplication across matrix dimensions.

8) Security Hardening Controls

  • Signed commits + signed tags.
  • SBOM generation per build.
  • Artifact signing with Sigstore/Cosign.
  • SLSA provenance attestations for release bundles.

9) Coverage Policy

Recommended minimum gate:

pytest --cov=kernels --cov=implementations --cov-fail-under=85

Raise the threshold for critical packages over time as flaky suites are removed.

10) Advanced Optional Pipelines

For kernel-heavy production systems, add:

  • Kernel fuzzing
  • PTX verification
  • ABI compatibility tests
  • GPU driver compatibility matrix
  • Binary reproducibility checks