mozilla-ai
diff --git a/.llamafile_plugin/.claude-plugin/marketplace.json b/.llamafile_plugin/.claude-plugin/marketplace.json
Lines changed: 1 addition & 1 deletion
diff --git a/.llamafile_plugin/.claude-plugin/plugin.json b/.llamafile_plugin/.claude-plugin/plugin.json
Lines changed: 1 addition & 1 deletion
diff --git a/docs/skills/llamafile/SKILL.md b/docs/skills/llamafile/SKILL.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/skills/llamafile/update_llamacpp.md b/docs/skills/llamafile/update_llamacpp.md
Lines changed: 14 additions & 1 deletion
diff --git a/llama.cpp b/llama.cpp
diff --git a/llama.cpp.patches/README.md b/llama.cpp.patches/README.md
Lines changed: 40 additions & 1 deletion
diff --git a/llama.cpp.patches/apply-patches.sh b/llama.cpp.patches/apply-patches.sh
Lines changed: 3 additions & 0 deletions
diff --git a/llama.cpp.patches/fetch-ui-assets.sh b/llama.cpp.patches/fetch-ui-assets.sh
Lines changed: 135 additions & 0 deletions
@@ -9,7 +9,7 @@
     {
       "name": "llamafile",
       "description": "Build guidance and commands for the llamafile project",
-      "version": "0.1.1",
+      "version": "0.1.2",
       "author": {
         "name": "Mozilla AI",
         "email": "davide@mozilla.ai"
 
@@ -1,5 +1,5 @@
 {
   "name": "llamafile",
-  "version": "0.1.1",
+  "version": "0.1.2",
   "description": "Build guidance and commands for the llamafile project"
 }
@@ -1,7 +1,7 @@
 ---
 name: llamafile
 description: This skill should be used when the user asks to "build llamafile", "rebuild llamafile", "run llamafile", "run llamafile tests", "debug llamafile", "set up llamafile", "update patches", "fix patch conflict", "update llama.cpp", "pull latest llama.cpp", "sync upstream llama.cpp", "reset submodules", "write a test for llamafile", "how does llamafile work", "llamafile architecture", or needs guidance on the llamafile build system, patch workflow, submodule integration, cosmocc toolchain, or development practices.
-version: 0.1.2
+version: 0.1.3
 ---
 
 # Llamafile Development Guide
 
@@ -4,7 +4,18 @@ llamafile relies on llama.cpp for many of its functionalities. Keeping it up-to-
 with the latest version upstream is generally a good practice, as it brings both
 bugfixes and support for recent models and features.
 
-This document describes the steps to keep llamafile updated with upstream.
+This document describes the steps to keep llamafile updated with upstream. At a high level,
+a llama.cpp update consists of the following tasks:
+
+1. Updating the submodule
+2. Verifying and updating the current patches (important: never create patches manually,
+   but run `../tools/generate-patches.sh --output-dir ../llama.cpp.patches` from the
+   llama.cpp directory)
+3. Updating build dependencies, to make sure new/changed deps are taken into account
+4. Updating llamafile code: llamafile is built on top of llama.cpp and you want to
+   make sure its code still works with the new, updated submodule. NOTE that this also
+   includes GPU acceleration libs (cuda, rocm, vulkan .sh and .bat scripts in the
+   `llamafile/` directory)
 
 ## Step 1: Update the submodule
 
@@ -78,6 +89,8 @@ target.
 - Check if the llamafile code that calls llama.cpp server/main needs updates
 - Review `llamafile/` for any API changes in llama.cpp that need to be reflected
 - Pay attention to changes in `llama.cpp/include/` for API modifications
+- Also check the GPU acceleration library code (CUDA, ROCm, and Vulkan `.sh` and `.bat`
+  scripts in the `llamafile/` directory)
 
 At the end of this step, you should be able to build all targets in this repo,
 i.e. the following verification step should return a successful result
 
@@ -52,6 +52,7 @@ The `GGML_CALL` macro (defined as `__attribute__((__ms_abi__))` when `GGML_MULTI
 | `ggml_src_ggml-cuda_ggml-cuda.cu.patch` | Adds `GGML_CALL` to all CUDA backend callback implementations (60+ functions); also adds `free_struct` and TinyBLAS BF16 guard (see below) |
 | `ggml_src_ggml-metal_ggml-metal.cpp.patch` | Adds `GGML_CALL` to all Metal backend callback implementations (62 functions); also adds `free_struct` (see below) |
 | `ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch` | Adds `GGML_CALL` to all Vulkan backend callback implementations; also adds `free_struct` and a heap memory underflow fix (see below) |
+| `ggml_src_ggml-backend-meta.cpp.patch` | Adds `GGML_CALL` to all meta-device, meta-buffer-type, meta-buffer, and meta-backend callback implementations (the meta backend aggregates several simple backends behind one interface, so its callbacks are reached through the same function-pointer structs) |
 
 ### Cross-Module Memory Management
 
@@ -97,6 +98,28 @@ Llamafile uses TinyBLAS as a lightweight replacement for cuBLAS, enabling GPU su
 | `ggml_src_ggml-cuda_common.cuh.patch` | Disables BF16 MMA when using TinyBLAS (TinyBLAS would incorrectly interpret BF16 as FP16) |
 | `ggml_src_ggml-cuda_ggml-cuda.cu.patch` | Disables BF16 in `ggml_cuda_op_mul_mat_cublas` when using TinyBLAS |
 
+### Optional IQ-Quant Exclusion (CUDA)
+
+The IQ ("importance") quantization formats (`IQ1_S`, `IQ2_XXS`/`XS`/`S`, `IQ3_S`/`XXS`, `IQ4_NL`/`XS`) pull in a large amount of CUDA template instantiation that inflates compile time and binary size. These patches gate every IQ code path behind `#ifndef GGML_CUDA_NO_IQ_QUANTS`, so a build can compile them out by defining `GGML_CUDA_NO_IQ_QUANTS`. When the macro is undefined (the default), behavior is unchanged.
+
+| Patch | Description |
+|-------|-------------|
+| `ggml_src_ggml-cuda_convert.cu.patch` | Guards IQ dequantization cases in `ggml_get_to_fp16_cuda` and `ggml_get_to_fp32_cuda` |
+| `ggml_src_ggml-cuda_cpy.cu.patch` | Guards the `f32 → IQ4_NL` copy helper and its dispatch case |
+| `ggml_src_ggml-cuda_mmq.cu.patch` | Guards IQ cases in `ggml_cuda_mul_mat_q_switch_type` and in the `ggml_cuda_should_use_mmq` support/heuristic switches |
+| `ggml_src_ggml-cuda_mmq.cuh.patch` | Guards the `extern DECL_MMQ_CASE(...)` declarations for IQ types |
+| `ggml_src_ggml-cuda_mmvq.cu.patch` | Guards IQ cases in `get_vec_dot_q_cuda` and `get_vdr_mmvq` |
+
+### CPU Performance Optimizations (llamafile #975)
+
+These patches restore llamafile's optimized CPU kernels (TinyBLAS matmul, AVX-512 flash-attention helpers) on top of upstream's CPU backend, and tune CPU-only defaults. The hooks call into symbols exported from `llamafile/sgemm.cpp` and are compiled only when `GGML_USE_LLAMAFILE` is defined.
+
+| Patch | Description |
+|-------|-------------|
+| `ggml_src_ggml-cpu_ggml-cpu.c.patch` | Routes MoE matmul (`ggml_compute_forward_mul_mat_id`) through `llamafile_mixmul` / `llamafile_mixmul_iqk`, mirroring the dense-matmul `llamafile_sgemm` hook; reserves work-buffer space for the MoE kernel in `ggml_graph_plan` via `llamafile_mixmul_needs` |
+| `ggml_src_ggml-cpu_ops.cpp.patch` | Routes flash-attention inner loops through llamafile's AVX-512 helpers (`llamafile_fa_vec_dot_f16`, `llamafile_fa_fp16_to_fp32_row`, `llamafile_fa_simd_gemm`) in both the one-chunk and tiled FA paths; also accumulates VKQ in f32 on CPUs lacking native f16 FMA (avoiding costly f16↔f32 round-trips per KV step) |
+| `src_llama-context.cpp.patch` | Defaults `-fa auto` to **off** on CPU-only setups (no GPU devices), since the CPU flash-attention path is slower than the non-FA path on x86; users can still force `-fa on` for memory savings on long contexts |
+
 ### Llamafile File Handling
 
 These patches integrate llamafile's file handling APIs for loading models from bundled zip archives and `.llamafile` containers.
@@ -111,14 +134,30 @@ These patches integrate llamafile's file handling APIs for loading models from b
 
 | Patch | Description |
 |-------|-------------|
-| `tools_server_server.cpp.patch` | Renames `main()` to `server_main()` with `on_ready`/`on_shutdown_available` callbacks for combined TUI+server mode; adds Metal/GPU backend trigger before `common_init()`; adds Cosmopolitan-specific standalone `main()` with `cosmo_args`, verbose flag handling, and GPU pre-initialization; handles `LLAMAFILE_TUI` exit to avoid Metal cleanup crashes |
+| `tools_server_server.cpp.patch` | Renames upstream's `llama_server()` to `server_main()` and adds `on_ready`/`on_shutdown_available` callbacks for combined TUI+server mode; adds Metal/GPU backend trigger before `common_init()`; adds Cosmopolitan-specific standalone `main()` with `cosmo_args`, verbose flag handling, and GPU pre-initialization; handles `LLAMAFILE_TUI` exit to avoid Metal cleanup crashes |
+
+The web UI moved upstream from prebuilt `tools/server/public/*` assets to
+a Svelte project under `tools/ui/`, embedded at CMake time via
+`tools/ui/embed.cpp`. cosmocc has no JS toolchain, so `apply-patches.sh`
+(run by `make setup`) downloads the prebuilt Svelte bundle from the
+`ggml-org/llama-ui` Hugging Face bucket into `llama.cpp/tools/ui/dist/`.
+At build time, `llama.cpp/BUILD.mk` compiles `tools/ui/embed.cpp` with
+`cosmoc++` (its APE output runs on the host) and runs it against the
+downloaded assets to emit `o/$(MODE)/llama.cpp/tools/ui/ui.{cpp,h}`,
+which is then compiled like any other C++ source and linked into
+`llama-server` and `llamafile`. If the download fails (offline, version
+not yet on HF) `apply-patches.sh` warns and continues with `dist/`
+empty — `embed.cpp` then emits a no-op `llama_ui_find_asset`, and
+`server-http.cpp` skips UI route registration via its
+`LLAMA_UI_HAS_ASSETS` guard so the API still works.
 
 ### Bug Fixes
 
 | Patch | Description |
 |-------|-------------|
 | `ggml_src_ggml-backend-reg.cpp.patch` | Suppresses debug log noise for non-existent backend search paths (irrelevant for llamafile's DSO loading approach) |
 | `ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch` | Fixes unsigned integer underflow in `ggml_backend_vk_get_device_memory` where Vulkan's `heapUsage` can exceed `heapBudget` (clamps to zero instead of wrapping) |
+| `src_models_t5.cpp.patch` | Forward-declares the `graph<false>`/`graph<true>` explicit specializations before `build_arch_graph` so clang's `-std=gnu++23` doesn't reject them as specializations after implicit instantiation |
 
 ## Creating New Patches
 
 
@@ -37,6 +37,9 @@ for patch_file in "$PATCHES_DIR"/*.patch; do
     fi
 done
 
+# Fetch the prebuilt web UI assets (see fetch-ui-assets.sh for details).
+"$SCRIPT_DIR/fetch-ui-assets.sh"
+
 echo ""
 echo "Patches applied successfully!"
 echo "Note: These changes are not committed to the submodule."
 
@@ -0,0 +1,135 @@
+#!/bin/bash
+# Fetch the prebuilt llama.cpp web UI assets from Hugging Face.
+#
+# Upstream's tools/ui/scripts/ui-assets.cmake pulls the Svelte build outputs
+# from the ggml-org/llama-ui HF bucket. We do the same here so the cosmocc
+# build never has to run a JS toolchain. If the fetch fails (no network,
+# version not yet published, HF down) we leave tools/ui/dist empty; BUILD.mk's
+# embed step then generates a no-asset ui.cpp and the server still works, just
+# without the web UI.
+#
+# Run by apply-patches.sh (i.e. `make setup`); also safe to run standalone to
+# re-fetch the UI without re-applying patches.
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+LLAMA_DIR="$SCRIPT_DIR/../llama.cpp"
+
+HF_BUCKET="${LLAMAFILE_UI_HF_BUCKET:-llama-ui}"
+HF_BASE="https://huggingface.co/buckets/ggml-org/${HF_BUCKET}/resolve"
+HF_TREE_API="https://huggingface.co/api/buckets/ggml-org/${HF_BUCKET}/tree"
+UI_DIST="$LLAMA_DIR/tools/ui/dist"
+UI_ASSETS=(bundle.css bundle.js index.html loading.html)
+
+# Echo the highest bNNNN tag in the bucket whose build number is <= $1, or
+# nothing. The tree API returns directories in ascending order, 100 per page,
+# with a `Link: <...>; rel="next"` header for the following page.
+pick_ui_tag() {
+    local cur="$1"
+    local url="${HF_TREE_API}?limit=100&recursive=false"
+    local hdrs best="" n saw_newer guard=0 body
+    hdrs="$(mktemp)"
+    while [ -n "$url" ] && [ "$guard" -lt 100 ]; do
+        guard=$((guard + 1))
+        body="$(curl -fsSL --max-time 30 -D "$hdrs" "$url" 2>/dev/null)" || break
+        saw_newer=0
+        for n in $(printf '%s' "$body" | grep -oE '"path":"b[0-9]+"' \
+                | grep -oE '[0-9]+'); do
+            if [ "$n" -le "$cur" ]; then
+                if [ -z "$best" ] || [ "$n" -gt "$best" ]; then
+                    best="$n"
+                fi
+            else
+                saw_newer=1
+            fi
+        done
+        # Once a page holds a tag newer than us, all later pages are newer too,
+        # so there is nothing better to find.
+        if [ "$saw_newer" -eq 1 ]; then
+            break
+        fi
+        url="$(grep -i '^link:' "$hdrs" \
+            | grep -oE '<[^>]+>; *rel="next"' | grep -oE 'https?://[^>]+' | head -1)"
+    done
+    rm -f "$hdrs"
+    [ -z "$best" ] || printf 'b%s' "$best"  # always returns 0, so `set -e` callers survive "no match"
+}
+
+echo ""
+echo "Fetching prebuilt web UI assets from Hugging Face..."
+
+# Pick the version to download. Upstream's CMake just tries the exact build
+# number and then "latest", but for llamafile that's fragile: the exact tag for
+# our pinned commit is often not published yet, and "latest" can be built
+# against a newer backend than the llama.cpp we've pinned. Instead we enumerate
+# the tags actually present in the bucket and pick the newest one that is <= our
+# build number. "latest" is kept only as a last-resort fallback (enumeration
+# failed, e.g. offline / API change, or our commit predates every published
+# tag) so the build can still get *some* UI rather than none.
+UI_CUR_BUILD="$(cd "$LLAMA_DIR" && git describe --tags --always 2>/dev/null \
+    | grep -oE '^b[0-9]+' | grep -oE '[0-9]+' || true)"
+
+UI_CANDIDATES=()
+if [ -n "$UI_CUR_BUILD" ]; then
+    echo "  resolving newest UI tag <= b$UI_CUR_BUILD ..."
+    UI_BEST_TAG="$(pick_ui_tag "$UI_CUR_BUILD")"
+    if [ -n "$UI_BEST_TAG" ]; then
+        echo "  selected UI tag $UI_BEST_TAG"
+        UI_CANDIDATES+=("$UI_BEST_TAG")
+    else
+        echo "  no UI tag <= b$UI_CUR_BUILD found in bucket; will try 'latest'"
+    fi
+fi
+UI_CANDIDATES+=("latest")
+
+mkdir -p "$UI_DIST"
+ui_ok=false
+for v in "${UI_CANDIDATES[@]}"; do
+    echo "  trying $HF_BASE/$v ..."
+    fail=false
+    for asset in "${UI_ASSETS[@]}" checksums.txt; do
+        if ! curl -fsSL --max-time 60 -o "$UI_DIST/$asset" \
+                "$HF_BASE/$v/$asset?download=true"; then
+            fail=true
+            break
+        fi
+    done
+    if $fail; then
+        continue
+    fi
+
+    # Best-effort sha256 verification against checksums.txt (one "<hash>  <name>"
+    # line per asset). Skip if shasum/sha256sum isn't around.
+    if command -v shasum >/dev/null 2>&1; then
+        sha_cmd="shasum -a 256"
+    elif command -v sha256sum >/dev/null 2>&1; then
+        sha_cmd="sha256sum"
+    else
+        sha_cmd=""
+    fi
+    if [ -n "$sha_cmd" ] && [ -f "$UI_DIST/checksums.txt" ]; then
+        bad=false
+        for asset in "${UI_ASSETS[@]}"; do
+            want=$(awk -v a="$asset" '$2 == a { print $1 }' "$UI_DIST/checksums.txt")
+            got=$($sha_cmd "$UI_DIST/$asset" | awk '{print $1}')
+            if [ -z "$want" ] || [ "$want" != "$got" ]; then
+                echo "  checksum mismatch for $asset (want=$want got=$got)"
+                bad=true
+                break
+            fi
+        done
+        if $bad; then
+            continue
+        fi
+    fi
+
+    echo "  fetched UI assets from $v"
+    ui_ok=true
+    break
+done
+
+if ! $ui_ok; then
+    echo "  warning: could not download UI assets; server will build without the web UI"
+    rm -f "$UI_DIST"/*
+fi
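
Note on the tag selection in `fetch-ui-assets.sh` (illustrative only, not part of the patch): the rule "newest published tag whose build number is <= the pinned build" can be tried offline with a small sketch. The helper names `pick_le` and `tags` and the sample listing are hypothetical; the only thing taken from the script is the `"path":"bNNNN"` shape it greps for. The sketch reads candidates from stdin instead of paging through the tree API.

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): print the highest build number on stdin
# that does not exceed $1, prefixed with "b"; print nothing if none qualifies.
pick_le() {
    local cur="$1" best="" n
    while read -r n; do
        # Skip blank or non-numeric lines.
        case "$n" in ''|*[!0-9]*) continue ;; esac
        if [ "$n" -le "$cur" ] && { [ -z "$best" ] || [ "$n" -gt "$best" ]; }; then
            best="$n"
        fi
    done
    # A plain `if` keeps the exit status at 0 even when nothing matched,
    # which matters when the caller runs under `set -e`.
    if [ -n "$best" ]; then
        printf 'b%s\n' "$best"
    fi
}

# Made-up listing, shaped like the "path":"bNNNN" entries the script extracts.
sample='[{"path":"b5010"},{"path":"b5020"},{"path":"b5100"}]'
tags() {
    printf '%s' "$sample" | grep -oE '"path":"b[0-9]+"' | grep -oE '[0-9]+'
}

tags | pick_le 5050                           # prints b5020
tags | pick_le 5000; echo "exit status: $?"   # prints only "exit status: 0"
```

The second call shows the "no match" case: the output is empty and the status is still 0, so a caller can fall through to the `latest` candidate.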