Highlights of this release are initial support for llama.cpp and an improved troubleshooting mechanism.
For more details, see our progress reports from the last three quarters:
https://vosen.github.io/ZLUDA/blog/zluda-update-q1-2025/
https://vosen.github.io/ZLUDA/blog/zluda-update-q2-2025/
https://vosen.github.io/ZLUDA/blog/zluda-update-q3-2025/
Detailed changelog
- Update README by @vosen in #315
 - Fix test zluda_dump by @JoelleJS in #316
 - feat: enable LTO and codegen-units = 1 optimization by @zamazan4ik in #318
 - fix: missing inherits in a release-lto profile by @zamazan4ik in #319
 - Improve build system by @vosen in #329
 - LLVM unit tests by @JoelleJS in #324
 - Implement mode tracking for AMD GPU by @vosen in #342
 - Implement mul24 by @JoelleJS in #351
 - Explicitly fail compilation on ROCm 6.4 by @vosen in #361
 - Create infrastructure for performance libraries by @vosen in #363
 - Fix ROCm 6.4 failures by @vosen in #364
 - Work around broken AMD Adrenalin 25.5.1 driver by @vosen in #366
 - Redo logging to better log dark API and performance libraries by @vosen in #372
 - Fix mad.wide, replace external CUDA library in tests with our own by @vosen in #376
 - Implement cuGetProcAddress and cuGetProcAddress_v2 by @zluda-violet in #377
 - Implement runtime_callback_hooks_fn2 by @zluda-violet in #380
 - Implement cuModuleGetLoadingMode by @zluda-violet in #381
 - Implement cudart_interface_fn2 by @zluda-violet in #382
 - Add automated builds by @vosen in #358
 - Handle new attributes in `cuDeviceGetAttribute` by @zluda-violet in #383
 - Implement `runtime_callback_hooks_fn6` by @zluda-violet in #386
 - Add fp saturation, fix various bugs in cvt instruction exposed by ptx_tests by @vosen in #379
 - Use `integrity_check` implementation by @zluda-violet in #387
 - Implement `cuLibraryLoadData` by @zluda-violet in #388
 - Fix bug in get_payload by @zluda-violet in #389
 - Remove trailing zeroes from end of ptx by @zluda-violet in #390
 - Error instead of infinite loop in `derive_parser!` by @zluda-violet in #391
 - Bump dependencies by @vosen in #392
 - Check LLVM IR for `test_ptx!` with no input/output by @zluda-violet in #394
 - Unified fatbin versions behind a single iterator. by @aiwhskruht in #398
 - Make `derive_parser` work with all optional arguments by @zluda-violet in #397
 - Read test files at runtime for development ergonomics by @zluda-violet in #395
 - Fix floating point min/max by @vosen in #399
 - Add warp-wide tests by @zluda-violet in #400
 - Add support for `bar.red.and.pred` by @zluda-violet in #402
 - Run unit tests on every commit by @vosen in #401
 - Add initialized check to protect zluda from calls during shutdown by @aiwhskruht in #404
 - Implement more CUDA driver API to enable simple cuda-samples by @aiwhskruht in #405
 - [WIP] Start working on PhysX 32bit by @vosen in #374
 - Update README.md by @zluda-violet in #407
 - Add support for multiple return arguments by @zluda-violet in #406
 - Enable sccache in Rust builds, publish prerelease builds by @vosen in #408
 - Remove duplicate call to linker by @zluda-violet in #410
 - More descriptive message for unknown symbol by @zluda-violet in #411
 - Handle `WARP_SZ` by @zluda-violet in #412
 - Fix typo in README.md(self-contained) by @brlin-tw in #413
 - Fix grammar errors in README.md by @brlin-tw in #414
 - Allow messages for error_todo by @zluda-violet in #415
 - Only allow (.b32, .pred) for multiple return by @zluda-violet in #417
 - Apply rounding mode to fp div by @vosen in #416
 - Add support for `shfl.sync.MODE.b32` by @zluda-violet in #409
 - Fix typo in README.md (either) by @brlin-tw in #419
 - Improve error recovery by @vosen in #418
 - Add parser support for hyphenated IDs in arguments by @zluda-violet in #425
 - Remove unnecessary unsafe block by @zluda-violet in #426
 - Implement `nanosleep.u32` by @zluda-violet in #421
 - Remove `Type::Pointer` by @zluda-violet in #428
 - Emit correct alignment for loads and stores by @zluda-violet in #429
 - Assorted instruction fixes by @vosen in #423
 - Add support for `cp.async` by @zluda-violet in #427
 - Add cuCtxCreate_v2 and cuCtxDestroy_v2 by @vosen in #430
 - Set newly created context as current by @vosen in #431
 - Add nightly tests by @vosen in #433
 - Silence unused variable warnings by @zluda-violet in #434
 - Rename `cuda_base` to `cuda_macros` by @zluda-violet in #435
 - Use implicit FromCuda for library::get_module by @zluda-violet in #439
 - Try to make ZLUDA more robust on Windows by @vosen in #442
 - Add support for `rocblas` to `zluda_bindgen` by @zluda-violet in #440
 - Fix Windows linkage by @zluda-violet in #445
 - Update rocm_setup_build.sh by @zluda-violet in #446
 - Convert CUDA performance lib statuses to Rust result types by @zluda-violet in #444
 - Format file by @zluda-violet in #450
 - Use `normalize_fn` for performance libraries by @zluda-violet in #449
 - Check Rust formatting on pull requests by @zluda-violet in #451
 - Always use Unix line endings by @zluda-violet in #453
 - Move `FromCuda` and `ZludaObject` into a common crate by @zluda-violet in #452
 - Implement `cuModuleGetGlobal_v2` by @zluda-violet in #454
 - Rename zluda_dump to zluda_trace by @vosen in #456
 - Use `FromCuda` in `zluda_blas` by @zluda-violet in #455
 - Implement cublas functions needed for llm.c by @zluda-violet in #457
 - Force loading ZLUDA through LD_PRELOAD by @vosen in #447
 - Fix nightly tests trigger by @vosen in #458
 - Fix nightly run by @vosen in #459
 - Update nightly ROCm version to 6.3.4 by @vosen in #460
 - Install curl correctly in nightly tests by @vosen in #461
 - In nightly runs, make sure cargo is in $PATH by @vosen in #462
 - Implement `shf` instruction by @zluda-violet in #463
 - Fix version in nightly trigger by @vosen in #464
 - More descriptive syntax errors by @zluda-violet in #466
 - Add tracing to custom parsers by @zluda-violet in #469
 - More tracing for custom parsers by @zluda-violet in #471
 - Implement kernel cache by @vosen in #465
 - zoc (ZLUDA offline compiler) by @JoelleJS in #344
 - Add test for unrecognized statement error with vector braces by @zluda-violet in #472
 - Fix how full-precision fp32 sqrt and div are handled by @vosen in #467
 - Add more instructions, tighten generated assembly by @vosen in #475
 - Add nvml tracing by @vosen in #476
 - Add pass test mechanism for insert_implicit_conversions by @zluda-violet in #477
 - Add test for conversion from .f16x2 to .b32 by @zluda-violet in #479
 - Add more host-side functionality by @vosen in #480
 - Add more NVML and cuBLAS coverage by @vosen in #481
 - Some fixes to BLASLt by @vosen in #482
 - Add support for fp8 to `cvt` by @zluda-violet in #468
 - Implement vote instruction and add support for %laneid by @vosen in #484
 - Do a better job in zluda_trace when saving opaque ELF binaries by @vosen in #486
 - Api traits test code by @aiwhskruht in #487
 - Update broken tests by @zluda-violet in #489
 - Reorganize driver host tests, fix bugs around pointer host code by @vosen in #492
 - Add fake ptxas binary by @vosen in #491
 - Rework the documentation and landing page by @vosen in #474
 - Update quick start link in README.md by @vosen in #493
 - Add issue form for zluda_trace logs by @zluda-violet in #432
 - Add ZLUDA version input to issue template by @vosen in #494
 - Implement `fma.rn.fn.bf16x2` by @zluda-violet in #496
 - Use `Vec<RegOrImmediate>` as const/global variable initializer by @zluda-violet in #490
 - Fix const_ident.ll by @zluda-violet in #497
 - Support immediates in vector operands by @zluda-violet in #488
 - Remove accidentally committed file by @zluda-violet in #499
 - Progress compilation despite parsing errors by @vosen in #495
 - Implement redux.sync for u32 and s32 by @zluda-violet in #500
 - Add support for cvt_rn_bf16x2_f32 by @zluda-violet in #501
 - Make blame ignore formatting commit by @zluda-violet in #502
 - Implement ldmatrix by @zluda-violet in #503
 - Random fixes by @vosen in #504
 - Fix min.ftz.nan.f16 for ROCm 6.3.4 by @zluda-violet in #506
 - Update devcontainer by @vosen in #507
 - Use LD_AUDIT instead of LD_PRELOAD by @vosen in #508
 - Implement cuStreamCreate by @zluda-violet in #511
 - Create bindings for hipblasLt by @zluda-violet in #510
 - Add support for cuBLASLt functions used by llm.c by @zluda-violet in #512
 - Fix devcontainer by @zluda-violet in #514
 - More compiler fixes by @vosen in #509
 - Disable virtual memory management by @zluda-violet in #515
 - Support lists of variables to be declared by @zluda-violet in #516
 - Fix cuCtxPopCurrent by @vosen in #519
 - Handle PrmtSlow by @zluda-violet in #518
 - Add atomic loads and stores by @vosen in #526
 - No-op implementation of vprintf by @zluda-violet in #527
 - Remove unnecessary logging of cuLibraryGetModule by @vosen in #529
 - Implement cuGraphExecUpdate_v2 by @zluda-violet in #528
 - Add noop nvmlDeviceGetComputeRunningProcesses, fix nvmlDeviceGetHandleByPciBusId_v2 by @vosen in #531
 
New Contributors
- @zamazan4ik made their first contribution in #318
 - @zluda-violet made their first contribution in #377
 - @aiwhskruht made their first contribution in #398
 - @brlin-tw made their first contribution in #413
 
Full Changelog: v4...v5