Fix/ci hardening #48
Pull request overview
This PR hardens CI/release workflows and improves robustness of the Candle KV-cache implementations by replacing panic-based validation with fallible APIs, plus a small CUDA kernel maintainability refactor and additional safety guards.
Changes:
- Harden the GitHub Actions workflows (pin action SHAs, reduce default permissions, add `cargo audit`, pin `rustqual`) and add a Dependabot config.
- Make `PqoCache::new`, `TqCache::new`, and `compute_qjl_signs` return `Result` instead of panicking, and update tests accordingly (the API shape is sketched below).
- Add a CUDA head-dim guard in `GpuPrecomputed::new` and deduplicate CUDA bit-packing logic via shared helpers.
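To make the panic-to-`Result` change concrete, here is a minimal sketch of the fallible-constructor pattern. `CacheConfig`, the `String` error type, and the specific validation rule are illustrative assumptions, not the crate's actual API; the real `PqoCache::new` takes its own config type and returns the crate's error type.

```rust
struct CacheConfig {
    head_dim: usize,
}

struct PqoCache {
    config: CacheConfig,
}

impl PqoCache {
    /// After the change: fallible, so callers decide how to handle a bad
    /// config instead of the constructor panicking on an assertion.
    fn new(config: CacheConfig) -> Result<Self, String> {
        if config.head_dim == 0 {
            return Err("head_dim must be non-zero".to_string());
        }
        Ok(Self { config })
    }
}

fn main() {
    // Call sites (including the updated tests) now unwrap explicitly
    // or propagate the error with `?`.
    let cache = PqoCache::new(CacheConfig { head_dim: 128 }).expect("valid config");
    assert_eq!(cache.config.head_dim, 128);
    assert!(PqoCache::new(CacheConfig { head_dim: 0 }).is_err());
}
```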
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| tests/cache_type_correctness.rs | Update tests to use fallible cache constructors (Result-returning new). |
| tests/cache_storage_tests.rs | Update storage roundtrip test to handle fallible PqoCache::new. |
| tests/cache_pqo_tests.rs | Update PQO tests (CPU/GPU) to handle fallible PqoCache::new. |
| src/qjl.rs | Change compute_qjl_signs to return Result on dimension mismatch; propagate at call sites. |
| src/cache/tq.rs | Change TqCache::new to return Result; adapt QJL sign computation to new error type. |
| src/cache/precomputed/mod.rs | Add guard against large head_dim to prevent CUDA shared memory overflow. |
| src/cache/pqo.rs | Change PqoCache::new to return Result instead of panicking. |
| src/cache/cuda/kernels/tq_quant_kernel.cu | Replace duplicated packing logic with shared helper functions. |
| src/cache/cuda/kernels/tq_common.h | Add atomic byte-OR helper and pack helpers for 2/3/4-bit packing (see the sketch after this table). |
| CHANGELOG.md | Document CI hardening, API changes, head-dim guard, and CUDA helper refactor. |
| .github/workflows/release.yml | Harden/pin actions, set permissions, add audit step, pin rustqual. |
| .github/workflows/ci.yml | Harden/pin actions, set permissions, add audit step, pin rustqual. |
| .github/dependabot.yml | Add weekly Dependabot updates for Cargo and GitHub Actions. |
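A byte-level atomic-OR helper like the one in `tq_common.h` is typically needed because CUDA's `atomicOr` works on word-sized integers, so OR-ing one packed byte means shifting a byte-wide mask into the aligned 32-bit word. The sketch below mirrors that technique in Rust purely for illustration; `or_byte_atomic` and the 2-bit packing demo are hypothetical, not the CUDA source.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// OR `value` into byte `byte_idx` of a packed buffer by masking it into
/// the containing 32-bit word -- the same trick a CUDA byte-wise atomicOr
/// helper uses, since hardware atomics operate on whole words.
fn or_byte_atomic(words: &[AtomicU32], byte_idx: usize, value: u8) {
    let word_idx = byte_idx / 4;
    let shift = (byte_idx % 4) * 8; // byte lane within the word
    words[word_idx].fetch_or(u32::from(value) << shift, Ordering::Relaxed);
}

fn main() {
    // Pack four 2-bit codes into byte 0: concurrent writers OR their code
    // into place, so the OR must be atomic to avoid clobbering neighbours.
    let packed: Vec<AtomicU32> = (0..2).map(|_| AtomicU32::new(0)).collect();
    for (i, code) in [0b01u8, 0b10, 0b11, 0b00].into_iter().enumerate() {
        or_byte_atomic(&packed, 0, code << (i * 2));
    }
    assert_eq!(packed[0].load(Ordering::Relaxed) & 0xFF, 0b00_11_10_01);
}
```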
The guard as added in `src/cache/precomputed/mod.rs`:

```rust
/// Maximum head dimension supported by the current CUDA kernel launch
/// configuration. The kernels launch with `head_dim` threads per block,
/// so this is bounded by the device's max threads-per-block limit.
const CUDA_MAX_THREADS_PER_BLOCK: usize = 1024;
if device.is_cuda() && config.head_dim > CUDA_MAX_THREADS_PER_BLOCK {
    return Err(super::cache_err(format!(
        "head_dim {} exceeds the CUDA kernel thread-block limit \
         ({CUDA_MAX_THREADS_PER_BLOCK}); launching a kernel with \
         head_dim threads per block would exceed the device maximum.",
        config.head_dim
    )));
}
```
The new CUDA `head_dim > 1024` guard doesn't match the actual constraints of the CUDA fast paths. The fused attention kernel has compile-time limits (`TQ_MAX_QBLOCKS=8`, `TQ_NUM_WARPS=8`) and effectively requires `num_qblocks = head_dim / 32 <= 8` (i.e., `head_dim <= 256`); a larger `head_dim` will make `warps_per_token` exceed the available warps and can lead to incorrect behavior or out-of-bounds accesses. Consider tightening this validation (and/or gating the fused CUDA decode path to supported `head_dim`/`num_qblocks` values and falling back to CPU dequant decode when unsupported).
Suggested change:

```diff
-/// Maximum head dimension supported by the current CUDA kernel launch
-/// configuration. The kernels launch with `head_dim` threads per block,
-/// so this is bounded by the device's max threads-per-block limit.
-const CUDA_MAX_THREADS_PER_BLOCK: usize = 1024;
-if device.is_cuda() && config.head_dim > CUDA_MAX_THREADS_PER_BLOCK {
-    return Err(super::cache_err(format!(
-        "head_dim {} exceeds the CUDA kernel thread-block limit \
-         ({CUDA_MAX_THREADS_PER_BLOCK}); launching a kernel with \
-         head_dim threads per block would exceed the device maximum.",
-        config.head_dim
+/// Maximum number of quantization blocks supported by the fused CUDA
+/// fast path. The kernel is compiled with a fixed qblock/warp layout,
+/// so larger heads are unsupported even if they fit within the device's
+/// generic threads-per-block limit.
+const CUDA_MAX_QBLOCKS: usize = 8;
+const CUDA_MAX_HEAD_DIM: usize = QUANT_BLOCK_SIZE * CUDA_MAX_QBLOCKS;
+if device.is_cuda() && config.head_dim > CUDA_MAX_HEAD_DIM {
+    return Err(super::cache_err(format!(
+        "head_dim {} exceeds the CUDA fused-kernel limit (max {} = \
+         {} qblocks * block size {}); larger heads are unsupported on \
+         the CUDA fast path.",
+        config.head_dim,
+        CUDA_MAX_HEAD_DIM,
+        CUDA_MAX_QBLOCKS,
+        QUANT_BLOCK_SIZE
```
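Beyond tightening the constructor check, the reviewer's fallback suggestion could look roughly like the dispatch below. `supports_fused_cuda` is a hypothetical helper, the constants mirror the limits named in the comment (`TQ_MAX_QBLOCKS = 8`, 32-wide qblocks), and the real decode paths in `src/cache/precomputed/mod.rs` may differ.

```rust
/// Lanes per quantization block (the `/ 32` in the reviewer's comment).
const QUANT_BLOCK_SIZE: usize = 32;
/// Compile-time qblock limit of the fused kernel (TQ_MAX_QBLOCKS).
const CUDA_MAX_QBLOCKS: usize = 8;

/// Hypothetical gate: the fused CUDA fast path only handles heads whose
/// qblock count fits the kernel's fixed warp layout.
fn supports_fused_cuda(head_dim: usize) -> bool {
    head_dim % QUANT_BLOCK_SIZE == 0 && head_dim / QUANT_BLOCK_SIZE <= CUDA_MAX_QBLOCKS
}

fn main() {
    for head_dim in [64, 128, 256, 512] {
        // 256 (= 8 qblocks * 32) is the largest fused-path head_dim;
        // 512 would need 16 qblocks and must take the CPU dequant route.
        if supports_fused_cuda(head_dim) {
            println!("head_dim {head_dim}: fused CUDA decode");
        } else {
            println!("head_dim {head_dim}: CPU dequant fallback");
        }
    }
}
```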