Upstream merge Dec 22, 2025 #9
Closed
lukekim wants to merge 315 commits into
* place `ug` behind not wasm32 attr so that wasm32 can compile * mv `ug` to conditional target dep assuming every non-wasm32 user wants this
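For readers unfamiliar with target gating, the following is a minimal sketch of the pattern this commit describes. It is not the actual candle code: `backend_name` is a hypothetical function standing in for anything that depends on `ug`. On the manifest side, Cargo supports a `[target.'cfg(not(target_arch = "wasm32"))'.dependencies]` table for the conditional dependency.

```rust
// Hypothetical stand-in for code that needs a native-only dependency.
// This variant is compiled on every target except wasm32.
#[cfg(not(target_arch = "wasm32"))]
pub fn backend_name() -> &'static str {
    "native"
}

// Fallback compiled only for wasm32, so the crate still builds there
// without pulling in the native-only dependency.
#[cfg(target_arch = "wasm32")]
pub fn backend_name() -> &'static str {
    "wasm32-fallback"
}
```

Exactly one of the two definitions exists in any given build, so callers can use `backend_name()` without their own `cfg` checks.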
* Improve reduce perf and add contiguous impl * Improve arg reduce and add contiguous impl * Improve softmax kernel. 33%-39% higher thrpt * fmt * Fixed all bugs. Improved code quality. Added tests. * Stash for debugging * Stash for debugging 2 * Fixing argmax bug and improve performance Co-authored-by: Christopher Fleetwood <45471420+FL33TW00D@users.noreply.github.com> * Fix test and add is_valid_simgroup_reduce_type trait * Online softmax. Improved threadgroup reduce. Tidying up a bit. * Remove redundant threadgroup_barrier from arg reduce * Mostly tidying up. Some improvements * Simplify indexed struct * tidying * Reuse operation operator instead of passing it in as a parameter * Fix how operators are applied to indexed<vec<T,N>> * Vectorized load. Scalar block reduce. Hitting max throughput for f32 reduce. * Vectorized load for online softmax. Involves a reinterpret_cast of src which may be suboptimal. * Metal as_type casting vec<bfloat, N> -> vec<float, N/2> for simd and fast math * Use constant for input instead of const device. Fix strided reduce. * Use contiguous reduce in tests * Rename finalize -> to_scalar * Support integer types max/min (switch with trait-inferred impl later) * Was worried I was skipping work -> shuffling the 1D test cases * Add build.rs to avoid metal kernel jit compile overhead * Improve build. Extract utils * Compile metal kernels for both macos and ios * Fixed over xmas and then forgot about it * Add calculate_reduce_threads util * Remove old reduce.metal * Improve f16/bf16 softmax precision by accumulating in f32 * Remove build.rs (for now) * Move softmax bench to candle-nn * Remove redundant thread calc util fn * Use uint over ushort for indices etc * Use fast exp in MDReduceOp * Remove nested metal define for softmax * Fix some clippy lint. --------- Co-authored-by: Christopher Fleetwood <45471420+FL33TW00D@users.noreply.github.com> Co-authored-by: Laurent <laurent.mazare@gmail.com>
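The "online softmax" and "accumulating in f32" items above can be illustrated on the CPU. This is a rough sketch of the general online softmax formulation, not the Metal kernel from this commit; `online_softmax` is a name chosen here for illustration, and inputs are assumed to be finite.

```rust
/// Softmax computed with the "online" formulation: a single pass keeps a
/// running maximum and a running sum, rescaling the sum whenever the
/// maximum grows. Accumulation is done in f32. Assumes finite inputs.
pub fn online_softmax(xs: &[f32]) -> Vec<f32> {
    if xs.is_empty() {
        return Vec::new();
    }
    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0f32;
    for &x in xs {
        let new_max = running_max.max(x);
        // Rescale the sum accumulated so far to the new maximum,
        // then add the contribution of the current element.
        running_sum = running_sum * (running_max - new_max).exp() + (x - new_max).exp();
        running_max = new_max;
    }
    // Second pass only normalizes; max and sum are already known.
    xs.iter()
        .map(|&x| (x - running_max).exp() / running_sum)
        .collect()
}
```

Compared with the classic three-pass version (max, sum, normalize), this reads the input one fewer time, which is the kind of saving that matters for a memory-bound kernel.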
* add dynamic position encoding * remove debug messages
Co-authored-by: Michael McCulloch <michael.james.mcculloch@fastmail.com>
* update cudarc to v0.13.5 to support cuda 12.8 * Bump the crate version. --------- Co-authored-by: Michael McCulloch <michael.james.mcculloch@fastmail.com>
* Add deepseek v2 * Fix * Remove unused * Add kv cache * Remove from cargo.toml * Fix dtype selection logic * Fix unnecessary u32->f32->gather->u32 * Remove fromstr impl * Use local scopes for some clarity * Typo * Repeat k_pe * Chain calls to remove mut * Actually, remove all muts * Update readme
* Avoid some clippy lints on 1.85. * Upload artifacts v4.
* Parse the json config for siglip models. * Bump the tokenizers dependency. * Add a v2 model. * Support more v2 models.
* Gemma 3 initial setup (text only). * Use the rotating kv cache for the sliding window.
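As a rough illustration of why a rotating cache suits a sliding window, here is a minimal ring-buffer sketch. It is not candle's rotating KV cache: `RotatingCache` and its methods are hypothetical names, and it stores generic items rather than tensors.

```rust
/// Keeps only the most recent `window` items by overwriting the oldest slot.
pub struct RotatingCache<T: Clone> {
    buf: Vec<T>,
    window: usize,
    /// Total number of items ever appended.
    next: usize,
}

impl<T: Clone> RotatingCache<T> {
    pub fn new(window: usize) -> Self {
        assert!(window > 0, "window must be positive");
        Self { buf: Vec::with_capacity(window), window, next: 0 }
    }

    /// Appends an item, overwriting the oldest one once the window is full.
    pub fn append(&mut self, item: T) {
        if self.buf.len() < self.window {
            self.buf.push(item);
        } else {
            let slot = self.next % self.window;
            self.buf[slot] = item;
        }
        self.next += 1;
    }

    /// Returns the retained items, oldest first.
    pub fn in_order(&self) -> Vec<T> {
        if self.buf.len() < self.window {
            return self.buf.clone();
        }
        // Once full, the oldest item sits at the next slot to be overwritten.
        let start = self.next % self.window;
        let mut out = Vec::with_capacity(self.window);
        out.extend_from_slice(&self.buf[start..]);
        out.extend_from_slice(&self.buf[..start]);
        out
    }
}
```

Memory stays bounded by the window size regardless of sequence length, which is the property sliding-window attention relies on.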
- Updates the Gemma example to include Gemma 3 1b instruction tuned.
* Pickle decoder changes: added Long1 opcode, fixed tensor offset calculation * Apply rustfmt. --------- Co-authored-by: Laurent <laurent.mazare@gmail.com>
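For context on the `Long1` opcode: in the pickle protocol, `LONG1` is followed by a one-byte length `n` and then `n` bytes holding a little-endian two's-complement integer. The sketch below decodes that payload under the assumption that the value fits in an `i64`; `decode_long1` is a hypothetical helper, not the function added to candle's pickle decoder.

```rust
/// Decodes the payload that follows a pickle LONG1 opcode byte.
/// `input` starts at the length byte. Returns the value and the number
/// of bytes consumed, or None if the input is truncated or the integer
/// is wider than 8 bytes (an assumption of this sketch).
pub fn decode_long1(input: &[u8]) -> Option<(i64, usize)> {
    let (&len, rest) = input.split_first()?;
    let n = len as usize;
    if n > 8 || rest.len() < n {
        return None;
    }
    if n == 0 {
        // A zero-length payload encodes the integer 0.
        return Some((0, 1));
    }
    let payload = &rest[..n];
    // Sign-extend: pad with 0xff when the top bit of the last byte is set.
    let fill: u8 = if payload[n - 1] & 0x80 != 0 { 0xff } else { 0x00 };
    let mut bytes = [fill; 8];
    bytes[..n].copy_from_slice(payload);
    Some((i64::from_le_bytes(bytes), 1 + n))
}
```

Note that `255` needs two payload bytes (`ff 00`), because a single `ff` byte is read as `-1`.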
* fix: candle-flash-attn linux and msvc build * Missing newline at eof. --------- Co-authored-by: laurent <laurent.mazare@gmail.com>
…uggingface#2843) * quantized deepseek qwen generating tokens * removed is_deepseek from Args and replaced prompt if statement with pattern matching
* added deepseekr1 llama8b variant to quantized example * lint
* added new language pairs to marian-mt * lint * separated python code for converting tokenizers into its own file and added a requirements.txt for dependencies, updated instructions in readme and included python version * Cleanup. --------- Co-authored-by: Laurent <laurent.mazare@gmail.com>
…ngface#3200) * fix(cuda): fix integer reduction initialization Replace hardcoded INFINITY/-INFINITY values with type-safe template functions for reduction initialization. Using floating-point infinity values with integer types causes undefined behavior and crashes on newer GPU architectures like Blackwell. The new template specializations use appropriate numeric_limits values for integer types while preserving the original behavior for floating-point types. * fix(cuda): replace limits import with cuda std equivalents --------- Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
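The fix above lives in CUDA templates, but the underlying idea (each element type supplies its own reduction identity instead of borrowing floating-point infinity) can be sketched in Rust. This is only an analogy under that assumption: `ReduceInit`, `reduce_max`, and `reduce_min` are hypothetical names, not part of the commit.

```rust
/// Per-type starting values for min/max reductions.
pub trait ReduceInit: Copy + PartialOrd {
    /// Starting value for a max reduction (the smallest representable value).
    const MAX_INIT: Self;
    /// Starting value for a min reduction (the largest representable value).
    const MIN_INIT: Self;
}

// Floats keep the original behavior: +/- infinity.
impl ReduceInit for f32 {
    const MAX_INIT: Self = f32::NEG_INFINITY;
    const MIN_INIT: Self = f32::INFINITY;
}

// Integers use their own numeric limits; they have no infinity.
impl ReduceInit for u32 {
    const MAX_INIT: Self = u32::MIN;
    const MIN_INIT: Self = u32::MAX;
}

impl ReduceInit for i64 {
    const MAX_INIT: Self = i64::MIN;
    const MIN_INIT: Self = i64::MAX;
}

pub fn reduce_max<T: ReduceInit>(xs: &[T]) -> T {
    xs.iter().fold(T::MAX_INIT, |acc, &x| if x > acc { x } else { acc })
}

pub fn reduce_min<T: ReduceInit>(xs: &[T]) -> T {
    xs.iter().fold(T::MIN_INIT, |acc, &x| if x < acc { x } else { acc })
}
```

Because the starting value is chosen by the type, no float-to-integer conversion of infinity ever happens, which is the conversion the commit message identifies as undefined behavior.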
* make qwen3 vl config public
* Add dummy i32/i16/f6e2m3/f6e3m2/f4/f8e8m0 dtypes * Metal fixes * Fix candle-onnx build * Apply review comments * Residual fixes * Apply review comments * Apply review comments * Revert some things * Free more space --------- Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* Merge with fork Co-authored-by Guoqing Bao <topon@outlook.com> * Update sdpa * Fix flash attn bf16 case * Metal fixes * Add metal methods * Add new_private_buffer * Fix metal tests * Format * Apply review comments * Update CI (huggingface#3194) * Update CI * I have no clue what was going on with this maturin file, but I don't like it * update cuda container options * Add compute cap to cuda wf * Fix rust toolchain call * update cuda ci runner and bindgen_cuda * Add initial support for imatrix quantization (huggingface#3193) * add clear kv cache to quantized qwen3 weights (huggingface#3189) * Fix metal bug * Apply review comments * Fix merge * Add lld installation and test steps for Linux (huggingface#3213) --------- Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com> Co-authored-by: anonenity <anonentity.mail@gmail.com> Co-authored-by: Nicolas PASCAL <344493+haricot@users.noreply.github.com>
* Add ability to toggle fast math mode in metal. Choose how to apply based on os version. * Move available macro and friends to utils * Isolate #[allow(deprecated)] to the actually deprecated method * doc * Use objc2::available macro instead
* Initial pyo3 update * pyo3 onnx update
* Update unary.metal * update metal unary tests * Remove metal tiled unary kernels (now automated) * Optimize metal affine * Optimize metal powf * Optimize metal elu
* Improve performance of contiguous unary/binary kernels * Improved strided binary performance, especially when only one of the tensors is strided.
phillipleblanc approved these changes Dec 23, 2025
No description provided.