Upstream merge Dec 22, 2025 by lukekim · Pull Request #9 · spiceai/candle

lukekim · 2025-11-01T22:02:01Z

No description provided.

* Improve reduce perf and add contiguous impl
* Improve arg reduce and add contiguous impl
* Improve softmax kernel. 33%-39% higher thrpt
* fmt
* Fixed all bugs. Improved code quality. Added tests.
* Stash for debugging
* Stash for debugging 2
* Fixing argmax bug and improve performance

  Co-authored-by: Christopher Fleetwood <45471420+FL33TW00D@users.noreply.github.com>
* Fix test and add is_valid_simgroup_reduce_type trait
* Online softmax. Improved threadgroup reduce. Tidying up a bit.
* Remove redundant threadgroup_barrier from arg reduce
* Mostly tidying up. Some improvements
* Simplify indexed struct
* tidying
* Reuse operation operator instead of passing it in as a parameter
* Fix how operators are applied to indexed<vec<T,N>>
* Vectorized load. Scalar block reduce. Hitting max throughput for f32 reduce.
* Vectorized load for online softmax. Involves a reinterpret_cast of src which may be suboptimal.
* Metal as_type casting vec<bfloat, N> -> vec<float, N/2> for simd and fast math
* Use constant for input instead of const device. Fix strided reduce.
* Use contiguous reduce in tests
* Rename finalize -> to_scalar
* Support integer types max/min (switch with trait-inferred impl later)
* Was worried I was skipping work -> shuffling the 1D test cases
* Add build.rs to avoid metal kernel jit compile overhead
* Improve build. Extract utils
* Compile metal kernels for both macos and ios
* Fixed over xmas and then forgot about it
* Add calculate_reduce_threads util
* Remove old reduce.metal
* Improve f16/bf16 softmax precision by accumulating in f32
* Remove build.rs (for now)
* Move softmax bench to candle-nn
* Remove redundant thread calc util fn
* Use uint over ushort for indices etc
* Use fast exp in MDReduceOp
* Remove nested metal define for softmax
* Fix some clippy lint.

---------

Co-authored-by: Christopher Fleetwood <45471420+FL33TW00D@users.noreply.github.com>
Co-authored-by: Laurent <laurent.mazare@gmail.com>
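For readers unfamiliar with the "online softmax" mentioned in this commit, the following is a minimal CPU-side sketch in C++ of the general technique: the running maximum and the running sum of exponentials are updated in a single pass, rescaling the sum whenever the maximum grows. It is not the Metal kernel from this commit. The function name `online_softmax` is hypothetical, and the sketch assumes finite `float` inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; assumes all inputs are finite.
std::vector<float> online_softmax(const std::vector<float>& x) {
    if (x.empty()) return {};

    // Single pass: track the running max and the sum of exp(x[i] - running_max).
    float running_max = x[0];
    float running_sum = 1.0f;  // exp(x[0] - x[0])
    for (std::size_t i = 1; i < x.size(); ++i) {
        const float new_max = std::max(running_max, x[i]);
        // Rescale the old sum to the new max, then add the current term.
        running_sum = running_sum * std::exp(running_max - new_max)
                    + std::exp(x[i] - new_max);
        running_max = new_max;
    }

    // Second loop only normalizes; the statistics are already known.
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - running_max) / running_sum;
    }
    return out;
}
```

Subtracting the running maximum before exponentiating keeps large inputs such as `1000.0f` from overflowing, which is the usual motivation for this formulation.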

* Add deepseek v2
* Fix
* Remove unused
* Add kv cache
* Remove from cargo.toml
* Fix dtype selection logic
* Fix unnecessary u32->f32->gather->u32
* Remove fromstr impl
* Use local scopes for some clarity
* Typo
* Repeat k_pe
* Chain calls to remove mut
* Actually, remove all muts
* Update readme

* added new language pairs to marian-mt
* lint
* separated python code for converting tokenizers into its own file and added a requirements.txt for dependencies, updated instructions in readme and included python version
* Cleanup.

---------

Co-authored-by: Laurent <laurent.mazare@gmail.com>

…ngface#3200)

* fix(cuda): fix integer reduction initialization

  Replace hardcoded INFINITY/-INFINITY values with type-safe template functions for reduction initialization. Using floating-point infinity values with integer types causes undefined behavior and crashes on newer GPU architectures like Blackwell.

  The new template specializations use appropriate numeric_limits values for integer types while preserving the original behavior for floating-point types.
* fix(cuda): replace limits import with cuda std equivalents

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
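The idea behind this fix can be illustrated with a small host-side C++ sketch. It is not the CUDA code from the commit: the names `max_reduce_init`, `min_reduce_init`, `reduce_max`, and `reduce_min` are hypothetical, and the sketch branches on `std::numeric_limits<T>::has_infinity` rather than using the template specializations the commit message describes. What it shows is the principle: the starting value of a min/max reduction is chosen per type, so integer types never receive a converted floating-point infinity.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Starting value for a max-reduction: no value of type T compares below it.
// Floating-point types keep -infinity; integer types use lowest().
template <typename T>
constexpr T max_reduce_init() {
    return std::numeric_limits<T>::has_infinity
               ? -std::numeric_limits<T>::infinity()
               : std::numeric_limits<T>::lowest();
}

// Starting value for a min-reduction: no value of type T compares above it.
template <typename T>
constexpr T min_reduce_init() {
    return std::numeric_limits<T>::has_infinity
               ? std::numeric_limits<T>::infinity()
               : std::numeric_limits<T>::max();
}

// Sequential reductions that use the type-aware starting values.
template <typename T>
T reduce_max(const T* data, std::size_t n) {
    T acc = max_reduce_init<T>();
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] > acc) acc = data[i];
    }
    return acc;
}

template <typename T>
T reduce_min(const T* data, std::size_t n) {
    T acc = min_reduce_init<T>();
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] < acc) acc = data[i];
    }
    return acc;
}
```

In standard C++, converting a floating-point infinity to an integer type is undefined behavior, which is consistent with the crashes the commit message reports for integer reductions.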

* Add dummy i32/i16/f6e2m3/f6e3m2/f4/f8e8m0 dtypes
* Metal fixes
* Fix candle-onnx build
* Apply review comments
* Residual fixes
* Apply review comments
* Apply review comments
* Revert some things
* Free more space

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>

* Merge with fork

  Co-authored-by Guoqing Bao <topon@outlook.com>
* Update sdpa
* Fix flash attn bf16 case
* Metal fixes
* Add metal methods
* Add new_private_buffer
* Fix metal tests
* Format
* Apply review comments
* Update CI (huggingface#3194)
* Update CI
* I have no clue what was going on with this maturin file, but I don't like it
* update cuda container options
* Add compute cap to cuda wf
* Fix rust toolchain call
* update cuda ci runner and bindgen_cuda
* Add initial support for imatrix quantization (huggingface#3193)
* add clear kv cache to quantized qwen3 weights (huggingface#3189)
* Fix metal bug
* Apply review comments
* Fix merge
* Add lld installation and test steps for Linux (huggingface#3213)

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Co-authored-by: anonenity <anonentity.mail@gmail.com>
Co-authored-by: Nicolas PASCAL <344493+haricot@users.noreply.github.com>

* Add ability to toggle fast math mode in metal. Chose how to apply based on os version.
* Move available macro and friends to utils
* Isolate #[allow(deprecated)] to the actually deprecated method
* doc
* Use objc2::available macro instead

DougAnderson444 and others added 30 commits February 1, 2025 23:05

fix: place ug dep behind not wasm32 flag (huggingface#2760)

0af3e42

* place `ug` behind not wasm32 attr so that wasm32 can compile
* mv `ug` to conditional target dep assuming every non-wasm32 user wants this

add dynamic position encoding to Siglip (huggingface#2770)

2423d63

* add dynamic position encoding
* remove debug messages

update to cudarc to v0.13.5 to support cuda 12.8 (huggingface#2771)

3ddd20a

Co-authored-by: Michael McCulloch <michael.james.mcculloch@fastmail.com>

Bump the crate version to 0.8.3 (huggingface#2772)

fd7f724

* update to cudarc to v0.13.5 to support cuda 12.8
* Bump the crate version.

---------

Co-authored-by: Michael McCulloch <michael.james.mcculloch@fastmail.com>

Refactor From<Tuple> implementations by using macros, add tests (hugg…

ac9cdbd

…ingface#2762)

Avoid some clippy lints on 1.85. (huggingface#2778)

9e8bf70

* Avoid some clippy lints on 1.85.
* Upload artifacts v4.

Make sorted_nodes pub function (huggingface#2780)

26c1692

phi-4-mini (huggingface#2790)

add3a71

Allow ModernBert to be used to generate embeddings. (huggingface#2791)

37db86f

Add ModernBert sentence classifier (huggingface#2796)

e4ffb85

Parse the json config for siglip models. (huggingface#2800)

e286cf7

* Parse the json config for siglip models.
* Bump the tokenizers dependency.
* Add a v2 model.
* Support more v2 models.

Gemma 3 initial setup (text only). (huggingface#2802)

111edbc

* Gemma 3 initial setup (text only).
* Use the rotating kv cache for the sliding window.

upgrade half library to fix rand (huggingface#2806)

c930ab7

fix lints

Bump the crate version to 0.8.4. (huggingface#2808)

468d1d5

Add Gemma 3 1b IT to Gemma examples (huggingface#2809)

cbf5fc8

- Updates the Gemma example to include Gemma 3 1b instruction tuned.

Allow for growing the default KV cache when needed. (huggingface#2810)

3afb049

Fix for whisper example. rand::distribution is now rand::distr (huggi…

0b24f7f

…ngface#2811)

Pickle decoder fix and Long1 opcode addition. (huggingface#2824)

67b85f7

* Pickle decoder changes: added Long1 opcode, fixed tensor offset calculation
* Apply rustfmt.

---------

Co-authored-by: Laurent <laurent.mazare@gmail.com>

fix: candle-flash-attn linux and msvc build (huggingface#2829)

f3d4729

* fix: candle-flash-attn linux and msvc build
* Missing newline at eof.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>

fixed rand imports for whisper-microphone example (huggingface#2834)

10853b8

fixed rand import for mnist-training (huggingface#2833)

0d40970

Fix reinforcement learning example (huggingface#2837)

cb02b38

Fix CIFAR10 dataset types and dimension ordering (huggingface#2845)

59c2619

Added DeepseekR1 Qwen7B variant to quantized-qwen2-instruct example (h…

ba47329

…uggingface#2843)

* quantized deepseek qwen generating tokens
* removed is_deepseek from Args and replaced prompt if statement with pattern matching

Added Deepseekr1 Llama8b variant to quantized example (huggingface#2842)

6429609

* added deepseekr1 llama8b variant to quantized example
* lint

Add flip to tensor (huggingface#2855)

9541467

* Add `flip` to `tensor`
* Move the tests to the proper places.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>

add as_cuda_slice_mut to CudaStorage and CudaDType (huggingface#2859)

b4daa03

anonenity and others added 27 commits November 18, 2025 11:12

add clear kv cache to quantized qwen3 weights (huggingface#3189)

eb651c8

fix typo preventing usage on mac (huggingface#3201)

3390caa

fix for huggingface#3203 (huggingface#3204)

9ca71de

* make qwen3 vl config public

Add lld installation and test steps for Linux (huggingface#3213)

b801ef6

.gitignore: add .zed to ignored editor configs (huggingface#3218)

2ac3fe0

chore(dep): bump cudarc to 0.18.1 (huggingface#3219)

c39d5f0

Hotfix: Bump float8 to 0.5.0 (huggingface#3223)

08d7b64

Update pyo3 (huggingface#3202)

9ede204

* Initial pyo3 update
* pyo3 onnx update

[Metal] unary and affine improvements (huggingface#3230)

3d3cc49

* Update unary.metal
* update metal unary tests
* Remove metal tiled unary kernels (now automated)
* Optimize metal affine
* Optimize metal powf
* Optimize metal elu

[Metal] binary improvements (huggingface#3231)

72238a7

fix(metal): add missing softcapping field to AttnParams struct (huggi…

d91be02

…ngface#3233)

Format sdpa (huggingface#3235)

2a797ea

Fix metal argmax (huggingface#3238)

d23664f

[Metal] further improve unary and binary (huggingface#3239)

73fd9c3

* Improve performance of contiguous unary/binary kernels
* Improved strided binary performance, especially when only one of the tensors is strided.

[Metal] cast improvements (huggingface#3241)

e33d776

[Metal] Improve ternary further (huggingface#3242)

4b46187

Bump candle version to 0.9.2-alpha.2 (huggingface#3248)

8839457

add candle flash attention 3 copyright markers (huggingface#3256)

689d255

WIP

590d2ad

Merge remote-tracking branch 'origin/main' into lukim/upgrade-candle

4d42f63

Formatting

beeadf9

Fixes

0710642

Formatting

ceec35c

lukekim changed the title from ~~Upstream merge Nov 1, 2025~~ to Upstream merge Dec 22, 2025 on Dec 23, 2025

phillipleblanc approved these changes Dec 23, 2025

lukekim closed this Jan 12, 2026

lukekim wants to merge 315 commits into spiceai from lukim/upgrade-candle

20 participants