Skip to content

WeightsCacheManager: per-cache-path instances (#20337)#20337

Open
doggeral wants to merge 1 commit into
pytorch:mainfrom
doggeral:export-D108431510
Open

WeightsCacheManager: per-cache-path instances (#20337)#20337
doggeral wants to merge 1 commit into
pytorch:mainfrom
doggeral:export-D108431510

Conversation

@doggeral

@doggeral doggeral commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary:

Split XNNWeightsCache out of its de-facto singleton lifetime (one instance per XnnpackBackend, shared across every PTE that opted into the file-backed cache) into a per-cache-file-path instance dispensed by a new XNNWeightsCacheManager. Mirrors the XNNWorkspaceManager PerModel pattern: same path → same shared instance; different paths → independent instances; empty path → one shared heap-only instance so XNNPACK's in-memory name dedup still works across PTEs.

The singleton design was forced today because XnnpackBackendOptions::weights_cache_ is a by-value member and XnnpackBackend itself is a namespace-scope global (XNNPACKBackend.cpp:246). For PTEs that genuinely share a cache file the singleton's per-entry refcounting works, but two PTEs with different packed-cache paths hit the same XNNWeightsCache instance, so the second PTE's initialize_for_runtime calls ftruncate(0) on the file under the first PTE's still-live mmap regions — every subsequent access SIGBUSes (P2369924970 traces this). The clue the prior fix added — the warning in XNNWeightsCache.h:147-151 "the path MUST be unique per XNNWeightsCache instance" — is enforced here at the manager level rather than relying on each caller to honor it.

Design:

XNNWeightsCacheManager (new, mirrors XNNWorkspaceManager):

  • std::unordered_map<std::string, std::weak_ptr<XNNWeightsCache>> caches_ keyed by absolute cache file path.
  • get_or_create(path) looks up the path under a single meta_mutex_; if a live weak_ptr exists, returns the shared instance, otherwise constructs a new one, calls set_packed_cache_path(path) BEFORE registering it, and stores a weak_ptr.
  • meta_mutex_ is held only during the map op — never across any call into XNNWeightsCache, so different-path callers proceed in parallel after the brief window.
  • save_all() snapshots live shared_ptrs under meta_mutex_, then iterates outside the meta lock and acquires each instance's own mutex around save_packed_index().
  • Empty path uses a separate empty_path_cache_ (weak_ptr) + empty_path_mutex_: all heap-only callers (NGTTS sub-runners, FLLM classifier, PLLM methods when mmap MC is off) share one instance so XNNPACK's in-memory look_up_or_insert name dedup catches duplicate weights across PTEs and across methods within a PTE. Without this sharing, every XnnpackBackend::init allocated its own packed copy of every weight, regressing heap-only memory by ~500 MB on LoRA-multimethod PLLM (paste P2380809516: app_phys peak 1731 MB with per-instance vs ~1260 MB with the prior process-singleton). The hazard the per-path keying motivates — non-opt-in PTE inheriting an opt-in PTE's path and writing into its mmap file — never applies to empty-path callers because they hold no path / no fd / no mmap regions, so sharing among empty-path callers carries no isolation cost.
  • Expired weak_ptr entries are erased opportunistically on the next get_or_create / save_all for that path. Stale entries from never-revisited paths linger; cost is one string + weak_ptr per dead entry. Acceptable per XNNWorkspaceManager precedent.

XNNWeightsCache:

  • New std::mutex instance_mutex_ member + mutex() accessor. The class has no internal synchronization; callers are responsible for holding mutex() around every method invocation, INCLUDING the XNNPACK callback paths (look_up, reserve_space, look_up_or_insert) that fire during xnn_create_runtime.
  • set_packed_cache_path is documented as call-once-before-publish: production callers go through the manager, which sets the path before installing the shared_ptr in the map, so no other thread can observe the instance yet. Tests that construct the class directly must respect this contract.

XnnpackBackendOptions:

  • Replaced XNNWeightsCache weights_cache_ + std::mutex weights_cache_mutex_ with XNNWeightsCacheManager weights_cache_manager_.
  • New get_or_create_weights_cache(path) thin wrapper around the manager.
  • save_weights_cache_locked() now walks every live cache via the manager's save_all().
  • packed_cache_path_ keeps a small path_mutex_ to serialize the set_option(packed_cache_path_option_key)init() read; this is just transport for the option value, the path's authoritative home is per-instance inside each cache.

XNNPACKBackend:

  • init: pulls the path from per-PTE runtime_spec (see "Per-PTE caller signal" below), asks the manager for the shared XNNWeightsCache, locks its mutex() for the entire init→compileModel sequence, then publishes the shared_ptr into the executor via the new XNNExecutor::set_weights_cache. Same-path PTEs serialize on the same instance mutex; different-path PTEs hold different mutexes and proceed in parallel — the singleton design forced full serialization here.
  • execute: lock the per-executor cache's mutex (if any) instead of the global one. Concurrent execute on independent caches now runs in parallel.
  • destroy: lock the per-executor cache's mutex, call delete_packed_data. The local shared_ptr keeps the instance alive across delete_packed_data even if dropping it from the executor was the last outside reference.

XNNExecutor:

  • New std::shared_ptr<XNNWeightsCache> weights_cache_ member, set once after compileModel. Forward-declared (rather than including XNNWeightsCache.h) to keep the transitive pte_data_map.h dependency out of the executor's public header — preserves the existing xnnexecutor_test build dep set.

Per-PTE caller signal (the FLLM / NGTTS isolation guarantee):

The manager's per-path dedup is necessary but not sufficient — if a non-opt-in PTE inherits an opt-in PTE's globally-set path, the manager hands it the same shared instance and the non-opt-in PTE's reserve_space writes into the opt-in model's mmap file. Investigation of the three on-device loaders confirms the concrete risk: cria PLLM pushes packed_cache_path_option_key globally (runner_interface.h:365-373), but NGTTS sub-runners (AcousticRunner / HfMimiRunner / SemanticLmRunner at executorch/examples/models/fb/llama4/runner/*.cpp) bypass cria entirely and never push a path, and the cria FLLM classifier path skips the push when FactoryMetaData::useMmapPackedWeights = false (CriaHost.cpp:220-224). All three loaders run in the same process and share one XnnpackBackend global (XNNPACKBackend.cpp:246).

Two complementary changes lock this down:

  1. XNNPACKBackend::init no longer reads options_.get_packed_cache_path() (shared backend-singleton state). It reads the path strictly from BackendInitContext::get_runtime_spec<const char*>(packed_cache_path_option_key) — the only per-PTE signal that proves THIS PTE explicitly opted in. If runtime_spec carries no path, cache_path is empty and the manager hands the shared empty_path_cache_ (per the empty-path branch above). Non-opt-in PTEs are guaranteed isolated from the mmap-path file regardless of what the global path happens to hold; they still dedupe against one another in the shared empty-path cache.

  2. cria runner_interface.h::loadModel() no longer pushes XNNPACK options globally via executorch::runtime::set_option. It now builds a BackendOptions<3> carrying path / weight_cache_option_key / workspace_sharing_mode_option_key, wraps it in a LoadBackendOptionsMap, and passes that map to every Module::load_method call (primary, multimethod loop, YOCO prefill/decode). The BackendOptions and map both live on the loadModel stack frame, which extends through every load_method call — Span lifetime requirements satisfied. Per-PTE options propagate into the backend's BackendInitContext::runtime_spec via Method::init's LoadBackendOptionsMap path (method.cpp:957-963). Non-opt-in cria PTEs and non-cria loaders (NGTTS, direct-Module) simply don't pass a map → empty runtime_spec → init forces empty path → shared heap-only instance with dedup.

Lock hierarchy (updated):

  • weights_cache_manager_.meta_mutex_ (leaf — only during path-keyed map ops, never held across calls into instances)
  • weights_cache_manager_.empty_path_mutex_ (leaf — only during empty-path weak_ptr lookup/store)
  • XNNWeightsCache::instance_mutex_ (one per cache)
  • workspace_meta_mutex_
  • workspace_mutex_ (owned by executor)

Race-condition / corner-case coverage:

  • Same-path concurrent get_or_create: serialize on meta_mutex_, both return the same shared instance.
  • Different-path concurrent get_or_create: parallel after the brief meta_mutex_ window.
  • Mid-load contention: same-path callers serialize on the instance mutex around initialize_for_runtime.
  • Cross-PTE clobbering (the original bug): impossible — each path owns its own instance.
  • Cross-process same-path: existing flock(LOCK_EX|LOCK_NB) defense untouched.
  • Cache file deleted on disk: existing mmap stays valid (unix unlink semantics); manager doesn't track disk state.
  • Process shutdown mid-save: executor-held shared_ptr outlives the manager map; instance destruction follows the executor's normal teardown.
  • XNNPACK seed mismatch / cache format bump: existing per-entry seed reject + v1-trailer reject paths untouched.
  • Empty path: shared via empty_path_cache_ weak_ptr; recreated when all shared_ptrs drop; never collides with any mmap-path instance.
  • Concurrent same-cache execute + destroy: serialize on the instance mutex.
  • Stale global path inherited by non-opt-in PTE: prevented by the runtime_spec-only path read in init.

Mirrored to fbcode/executorch/backends/xnnpack/runtime/. The cria change lives only under xplat/cria/ (no fbcode mirror).

Test plan

Built fbsource//xplat/executorch/backends/xnnpack:xnnpack_backend on linux, Apple, Android, and fbcode//executorch/backends/xnnpack:xnnpack_backend on linux — all green. Built downstream consumers to verify the API change is binary-compatible: fbsource//xplat/cria/core:cria{Apple,Android}, fbsource//xplat/sgr/ml_service/modules/llm:lib_sgr_llmApple, fbsource//xplat/assistant/oacr/trims/modules/ondevice_modules:mwa_ondevice_moduleApple — all green.

buck2 test \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache_manager \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache \
  fbcode//executorch/backends/xnnpack/test:test_workspace_manager \
  fbcode//executorch/backends/xnnpack/test:xnnexecutor_test
→ Pass 38. Fail 0. Build failure 0.

The new test_xnn_weights_cache_manager exercises 13 hazard cases the manager handles: SamePathReturnsSameInstance, DifferentPathsReturnDifferentInstances, EmptyPathSharedAcrossCallers, EmptyPathRecreatedAfterAllRefsDrop, EmptyPathDoesNotShareWithMmapPath, ExpiredEntryDoesNotLeak, ExpiredEntryRecreatedOnNextCall, ConcurrentSamePathSameInstance (16-thread fan-in), ConcurrentDifferentPathsIndependent (8-thread fan-out), SaveAllNoLiveInstancesIsOk, SaveAllWalksLiveCaches, SaveAllSkipsExpiredEntries, NonEmptyPathRegistersInMap.

Cria runner tests (fbsource//xplat/cria/core/runner/tests/...): Pass 938, Fail 0. The 18 Fatal entries reported by buck2 (PrefillReturnsLogits, PrefillMapsParams, PrefillStringPrompt, etc.) reproduce identically on this commit's parent (605126226e, no cria change, no init guard) with the same OpenMP/MKL/ASan SEGV stack — kmp_basic_flag_native::done_check__kmp_hyper_barrier_release triggered by mkl_blas_sgemm_omp_driver_v1 racing with pthread_create from pthreadpool_create_v2. These are pre-existing flakes in the asan-ubsan platform configuration, not caused by either the manager refactor or the runtime_spec migration.

arc lint -a clean across all 22 changed/added files (11 xplat + 11 fbcode mirrors; cria is xplat-only).

Differential Revision: D108431510

@doggeral doggeral requested a review from digantdesai as a code owner June 17, 2026 17:12
@pytorch-bot

pytorch-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20337

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Unrelated Failure

As of commit 57736af with merge base 574bfca (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 17, 2026
@meta-codesync

meta-codesync Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@doggeral has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108431510.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@doggeral doggeral requested a review from GregoryComer June 17, 2026 17:14
@meta-codesync meta-codesync Bot changed the title WeightsCacheManager: per-cache-path instances WeightsCacheManager: per-cache-path instances (#20337) Jun 18, 2026
@doggeral doggeral force-pushed the export-D108431510 branch from d90c4cb to e3dfec7 Compare June 18, 2026 00:59
doggeral added a commit to doggeral/executorch that referenced this pull request Jun 18, 2026
Summary:

Split `XNNWeightsCache` out of its de-facto singleton lifetime (one instance per `XnnpackBackend`, shared across every PTE that opted into the file-backed cache) into a per-cache-file-path instance dispensed by a new `XNNWeightsCacheManager`. Mirrors the `XNNWorkspaceManager` PerModel pattern: same path → same shared instance; different paths → independent instances; empty path → one shared heap-only instance so XNNPACK's in-memory name dedup still works across PTEs.

The singleton design was forced today because `XnnpackBackendOptions::weights_cache_` is a by-value member and `XnnpackBackend` itself is a namespace-scope global (`XNNPACKBackend.cpp:246`). For PTEs that genuinely share a cache file the singleton's per-entry refcounting works, but two PTEs with **different** packed-cache paths hit the same `XNNWeightsCache` instance, so the second PTE's `initialize_for_runtime` calls `ftruncate(0)` on the file under the first PTE's still-live mmap regions — every subsequent access SIGBUSes (P2369924970 traces this). The clue the prior fix added — the warning in `XNNWeightsCache.h:147-151` "the path MUST be unique per `XNNWeightsCache` instance" — is enforced here at the manager level rather than relying on each caller to honor it.

Design:

`XNNWeightsCacheManager` (new, mirrors `XNNWorkspaceManager`):
- `std::unordered_map<std::string, std::weak_ptr<XNNWeightsCache>> caches_` keyed by absolute cache file path.
- `get_or_create(path)` looks up the path under a single `meta_mutex_`; if a live `weak_ptr` exists, returns the shared instance, otherwise constructs a new one, calls `set_packed_cache_path(path)` BEFORE registering it, and stores a `weak_ptr`.
- `meta_mutex_` is held only during the map op — never across any call into `XNNWeightsCache`, so different-path callers proceed in parallel after the brief window.
- `save_all()` snapshots live shared_ptrs under `meta_mutex_`, then iterates outside the meta lock and acquires each instance's own mutex around `save_packed_index()`.
- Empty path uses a separate `empty_path_cache_` (weak_ptr) + `empty_path_mutex_`: all heap-only callers (NGTTS sub-runners, FLLM classifier, PLLM methods when mmap MC is off) share one instance so XNNPACK's in-memory `look_up_or_insert` name dedup catches duplicate weights across PTEs and across methods within a PTE. Without this sharing, every `XnnpackBackend::init` allocated its own packed copy of every weight, regressing heap-only memory by ~500 MB on LoRA-multimethod PLLM (paste P2380809516: app_phys peak 1731 MB with per-instance vs ~1260 MB with the prior process-singleton). The hazard the per-path keying motivates — non-opt-in PTE inheriting an opt-in PTE's path and writing into its mmap file — never applies to empty-path callers because they hold no path / no fd / no mmap regions, so sharing among empty-path callers carries no isolation cost.
- Expired `weak_ptr` entries are erased opportunistically on the next `get_or_create` / `save_all` for that path. Stale entries from never-revisited paths linger; cost is one string + weak_ptr per dead entry. Acceptable per `XNNWorkspaceManager` precedent.

`XNNWeightsCache`:
- New `std::mutex instance_mutex_` member + `mutex()` accessor. The class has no internal synchronization; callers are responsible for holding `mutex()` around every method invocation, INCLUDING the XNNPACK callback paths (`look_up`, `reserve_space`, `look_up_or_insert`) that fire during `xnn_create_runtime`.
- `set_packed_cache_path` is documented as call-once-before-publish: production callers go through the manager, which sets the path before installing the `shared_ptr` in the map, so no other thread can observe the instance yet. Tests that construct the class directly must respect this contract.

`XnnpackBackendOptions`:
- Replaced `XNNWeightsCache weights_cache_` + `std::mutex weights_cache_mutex_` with `XNNWeightsCacheManager weights_cache_manager_`.
- New `get_or_create_weights_cache(path)` thin wrapper around the manager.
- `save_weights_cache_locked()` now walks every live cache via the manager's `save_all()`.
- `packed_cache_path_` keeps a small `path_mutex_` to serialize the `set_option(packed_cache_path_option_key)` → `init()` read; this is just transport for the option value, the path's authoritative home is per-instance inside each cache.

`XNNPACKBackend`:
- `init`: pulls the path from per-PTE `runtime_spec` (see "Per-PTE caller signal" below), asks the manager for the shared `XNNWeightsCache`, locks its `mutex()` for the entire init→compileModel sequence, then publishes the `shared_ptr` into the executor via the new `XNNExecutor::set_weights_cache`. Same-path PTEs serialize on the same instance mutex; different-path PTEs hold different mutexes and proceed in parallel — the singleton design forced full serialization here.
- `execute`: lock the per-executor cache's mutex (if any) instead of the global one. Concurrent execute on independent caches now runs in parallel.
- `destroy`: lock the per-executor cache's mutex, call `delete_packed_data`. The local `shared_ptr` keeps the instance alive across `delete_packed_data` even if dropping it from the executor was the last outside reference.

`XNNExecutor`:
- New `std::shared_ptr<XNNWeightsCache> weights_cache_` member, set once after `compileModel`. Forward-declared (rather than including `XNNWeightsCache.h`) to keep the transitive `pte_data_map.h` dependency out of the executor's public header — preserves the existing `xnnexecutor_test` build dep set.

Per-PTE caller signal (the FLLM / NGTTS isolation guarantee):

The manager's per-path dedup is necessary but not sufficient — if a non-opt-in PTE inherits an opt-in PTE's globally-set path, the manager hands it the same shared instance and the non-opt-in PTE's `reserve_space` writes into the opt-in model's mmap file. Investigation of the three on-device loaders confirms the concrete risk: cria PLLM pushes `packed_cache_path_option_key` globally (`runner_interface.h:365-373`), but NGTTS sub-runners (`AcousticRunner` / `HfMimiRunner` / `SemanticLmRunner` at `executorch/examples/models/fb/llama4/runner/*.cpp`) bypass cria entirely and never push a path, and the cria FLLM classifier path skips the push when `FactoryMetaData::useMmapPackedWeights = false` (`CriaHost.cpp:220-224`). All three loaders run in the same process and share one `XnnpackBackend` global (`XNNPACKBackend.cpp:246`).

Two complementary changes lock this down:

1. `XNNPACKBackend::init` no longer reads `options_.get_packed_cache_path()` (shared backend-singleton state). It reads the path strictly from `BackendInitContext::get_runtime_spec<const char*>(packed_cache_path_option_key)` — the only per-PTE signal that proves THIS PTE explicitly opted in. If `runtime_spec` carries no path, `cache_path` is empty and the manager hands the shared `empty_path_cache_` (per the empty-path branch above). Non-opt-in PTEs are guaranteed isolated from the mmap-path file regardless of what the global path happens to hold; they still dedupe against one another in the shared empty-path cache.

2. cria `runner_interface.h::loadModel()` no longer pushes XNNPACK options globally via `executorch::runtime::set_option`. It now builds a `BackendOptions<3>` carrying path / `weight_cache_option_key` / `workspace_sharing_mode_option_key`, wraps it in a `LoadBackendOptionsMap`, and passes that map to every `Module::load_method` call (primary, multimethod loop, YOCO prefill/decode). The `BackendOptions` and map both live on the `loadModel` stack frame, which extends through every `load_method` call — Span lifetime requirements satisfied. Per-PTE options propagate into the backend's `BackendInitContext::runtime_spec` via `Method::init`'s `LoadBackendOptionsMap` path (`method.cpp:957-963`). Non-opt-in cria PTEs and non-cria loaders (NGTTS, direct-Module) simply don't pass a map → empty runtime_spec → init forces empty path → shared heap-only instance with dedup.

Lock hierarchy (updated):
- `weights_cache_manager_.meta_mutex_` (leaf — only during path-keyed map ops, never held across calls into instances)
- `weights_cache_manager_.empty_path_mutex_` (leaf — only during empty-path weak_ptr lookup/store)
- `XNNWeightsCache::instance_mutex_` (one per cache)
- `workspace_meta_mutex_`
- `workspace_mutex_` (owned by executor)

Race-condition / corner-case coverage:
- Same-path concurrent `get_or_create`: serialize on `meta_mutex_`, both return the same shared instance.
- Different-path concurrent `get_or_create`: parallel after the brief `meta_mutex_` window.
- Mid-load contention: same-path callers serialize on the instance mutex around `initialize_for_runtime`.
- Cross-PTE clobbering (the original bug): impossible — each path owns its own instance.
- Cross-process same-path: existing `flock(LOCK_EX|LOCK_NB)` defense untouched.
- Cache file deleted on disk: existing mmap stays valid (unix unlink semantics); manager doesn't track disk state.
- Process shutdown mid-save: executor-held `shared_ptr` outlives the manager map; instance destruction follows the executor's normal teardown.
- XNNPACK seed mismatch / cache format bump: existing per-entry seed reject + v1-trailer reject paths untouched.
- Empty path: shared via `empty_path_cache_` weak_ptr; recreated when all shared_ptrs drop; never collides with any mmap-path instance.
- Concurrent same-cache execute + destroy: serialize on the instance mutex.
- Stale global path inherited by non-opt-in PTE: prevented by the runtime_spec-only path read in init.

Mirrored to `fbcode/executorch/backends/xnnpack/runtime/`. The cria change lives only under `xplat/cria/` (no fbcode mirror).

### Test plan
Built `fbsource//xplat/executorch/backends/xnnpack:xnnpack_backend` on linux, Apple, Android, and `fbcode//executorch/backends/xnnpack:xnnpack_backend` on linux — all green. Built downstream consumers to verify the API change is binary-compatible: `fbsource//xplat/cria/core:cria{Apple,Android}`, `fbsource//xplat/sgr/ml_service/modules/llm:lib_sgr_llmApple`, `fbsource//xplat/assistant/oacr/trims/modules/ondevice_modules:mwa_ondevice_moduleApple` — all green.

```
buck2 test \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache_manager \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache \
  fbcode//executorch/backends/xnnpack/test:test_workspace_manager \
  fbcode//executorch/backends/xnnpack/test:xnnexecutor_test
→ Pass 38. Fail 0. Build failure 0.
```

The new `test_xnn_weights_cache_manager` exercises 13 hazard cases the manager handles: `SamePathReturnsSameInstance`, `DifferentPathsReturnDifferentInstances`, `EmptyPathSharedAcrossCallers`, `EmptyPathRecreatedAfterAllRefsDrop`, `EmptyPathDoesNotShareWithMmapPath`, `ExpiredEntryDoesNotLeak`, `ExpiredEntryRecreatedOnNextCall`, `ConcurrentSamePathSameInstance` (16-thread fan-in), `ConcurrentDifferentPathsIndependent` (8-thread fan-out), `SaveAllNoLiveInstancesIsOk`, `SaveAllWalksLiveCaches`, `SaveAllSkipsExpiredEntries`, `NonEmptyPathRegistersInMap`.

Cria runner tests (`fbsource//xplat/cria/core/runner/tests/...`): Pass 938, Fail 0. The 18 `Fatal` entries reported by buck2 (`PrefillReturnsLogits`, `PrefillMapsParams`, `PrefillStringPrompt`, etc.) reproduce identically on this commit's parent (605126226e, no cria change, no init guard) with the same OpenMP/MKL/ASan SEGV stack — `kmp_basic_flag_native::done_check` → `__kmp_hyper_barrier_release` triggered by `mkl_blas_sgemm_omp_driver_v1` racing with `pthread_create` from `pthreadpool_create_v2`. These are pre-existing flakes in the asan-ubsan platform configuration, not caused by either the manager refactor or the runtime_spec migration.

`arc lint -a` clean across all 22 changed/added files (11 xplat + 11 fbcode mirrors; cria is xplat-only).

Differential Revision: D108431510
doggeral added a commit to doggeral/executorch that referenced this pull request Jun 18, 2026
Summary:

Split `XNNWeightsCache` out of its de-facto singleton lifetime (one instance per `XnnpackBackend`, shared across every PTE that opted into the file-backed cache) into a per-cache-file-path instance dispensed by a new `XNNWeightsCacheManager`. Mirrors the `XNNWorkspaceManager` PerModel pattern: same path → same shared instance; different paths → independent instances; empty path → one shared heap-only instance so XNNPACK's in-memory name dedup still works across PTEs.

The singleton design was forced today because `XnnpackBackendOptions::weights_cache_` is a by-value member and `XnnpackBackend` itself is a namespace-scope global (`XNNPACKBackend.cpp:246`). For PTEs that genuinely share a cache file the singleton's per-entry refcounting works, but two PTEs with **different** packed-cache paths hit the same `XNNWeightsCache` instance, so the second PTE's `initialize_for_runtime` calls `ftruncate(0)` on the file under the first PTE's still-live mmap regions — every subsequent access SIGBUSes (P2369924970 traces this). The clue the prior fix added — the warning in `XNNWeightsCache.h:147-151` "the path MUST be unique per `XNNWeightsCache` instance" — is enforced here at the manager level rather than relying on each caller to honor it.

Design:

`XNNWeightsCacheManager` (new, mirrors `XNNWorkspaceManager`):
- `std::unordered_map<std::string, std::weak_ptr<XNNWeightsCache>> caches_` keyed by absolute cache file path.
- `get_or_create(path)` looks up the path under a single `meta_mutex_`; if a live `weak_ptr` exists, returns the shared instance, otherwise constructs a new one, calls `set_packed_cache_path(path)` BEFORE registering it, and stores a `weak_ptr`.
- `meta_mutex_` is held only during the map op — never across any call into `XNNWeightsCache`, so different-path callers proceed in parallel after the brief window.
- `save_all()` snapshots live shared_ptrs under `meta_mutex_`, then iterates outside the meta lock and acquires each instance's own mutex around `save_packed_index()`.
- Empty path uses a separate `empty_path_cache_` (weak_ptr) + `empty_path_mutex_`: all heap-only callers (NGTTS sub-runners, FLLM classifier, PLLM methods when mmap MC is off) share one instance so XNNPACK's in-memory `look_up_or_insert` name dedup catches duplicate weights across PTEs and across methods within a PTE. Without this sharing, every `XnnpackBackend::init` allocated its own packed copy of every weight, regressing heap-only memory by ~500 MB on LoRA-multimethod PLLM (paste P2380809516: app_phys peak 1731 MB with per-instance vs ~1260 MB with the prior process-singleton). The hazard the per-path keying motivates — non-opt-in PTE inheriting an opt-in PTE's path and writing into its mmap file — never applies to empty-path callers because they hold no path / no fd / no mmap regions, so sharing among empty-path callers carries no isolation cost.
- Expired `weak_ptr` entries are erased opportunistically on the next `get_or_create` / `save_all` for that path. Stale entries from never-revisited paths linger; cost is one string + weak_ptr per dead entry. Acceptable per `XNNWorkspaceManager` precedent.

`XNNWeightsCache`:
- New `std::mutex instance_mutex_` member + `mutex()` accessor. The class has no internal synchronization; callers are responsible for holding `mutex()` around every method invocation, INCLUDING the XNNPACK callback paths (`look_up`, `reserve_space`, `look_up_or_insert`) that fire during `xnn_create_runtime`.
- `set_packed_cache_path` is documented as call-once-before-publish: production callers go through the manager, which sets the path before installing the `shared_ptr` in the map, so no other thread can observe the instance yet. Tests that construct the class directly must respect this contract.

`XnnpackBackendOptions`:
- Replaced `XNNWeightsCache weights_cache_` + `std::mutex weights_cache_mutex_` with `XNNWeightsCacheManager weights_cache_manager_`.
- New `get_or_create_weights_cache(path)` thin wrapper around the manager.
- `save_weights_cache_locked()` now walks every live cache via the manager's `save_all()`.
- `packed_cache_path_` keeps a small `path_mutex_` to serialize the `set_option(packed_cache_path_option_key)` → `init()` read; this is just transport for the option value, the path's authoritative home is per-instance inside each cache.

`XNNPACKBackend`:
- `init`: pulls the path from per-PTE `runtime_spec` (see "Per-PTE caller signal" below), asks the manager for the shared `XNNWeightsCache`, locks its `mutex()` for the entire init→compileModel sequence, then publishes the `shared_ptr` into the executor via the new `XNNExecutor::set_weights_cache`. Same-path PTEs serialize on the same instance mutex; different-path PTEs hold different mutexes and proceed in parallel — the singleton design forced full serialization here.
- `execute`: lock the per-executor cache's mutex (if any) instead of the global one. Concurrent execute on independent caches now runs in parallel.
- `destroy`: lock the per-executor cache's mutex, call `delete_packed_data`. The local `shared_ptr` keeps the instance alive across `delete_packed_data` even if dropping it from the executor was the last outside reference.

`XNNExecutor`:
- New `std::shared_ptr<XNNWeightsCache> weights_cache_` member, set once after `compileModel`. Forward-declared (rather than including `XNNWeightsCache.h`) to keep the transitive `pte_data_map.h` dependency out of the executor's public header — preserves the existing `xnnexecutor_test` build dep set.

Per-PTE caller signal (the FLLM / NGTTS isolation guarantee):

The manager's per-path dedup is necessary but not sufficient — if a non-opt-in PTE inherits an opt-in PTE's globally-set path, the manager hands it the same shared instance and the non-opt-in PTE's `reserve_space` writes into the opt-in model's mmap file. Investigation of the three on-device loaders confirms the concrete risk: cria PLLM pushes `packed_cache_path_option_key` globally (`runner_interface.h:365-373`), but NGTTS sub-runners (`AcousticRunner` / `HfMimiRunner` / `SemanticLmRunner` at `executorch/examples/models/fb/llama4/runner/*.cpp`) bypass cria entirely and never push a path, and the cria FLLM classifier path skips the push when `FactoryMetaData::useMmapPackedWeights = false` (`CriaHost.cpp:220-224`). All three loaders run in the same process and share one `XnnpackBackend` global (`XNNPACKBackend.cpp:246`).

Two complementary changes lock this down:

1. `XNNPACKBackend::init` no longer reads `options_.get_packed_cache_path()` (shared backend-singleton state). It reads the path strictly from `BackendInitContext::get_runtime_spec<const char*>(packed_cache_path_option_key)` — the only per-PTE signal that proves THIS PTE explicitly opted in. If `runtime_spec` carries no path, `cache_path` is empty and the manager hands the shared `empty_path_cache_` (per the empty-path branch above). Non-opt-in PTEs are guaranteed isolated from the mmap-path file regardless of what the global path happens to hold; they still dedupe against one another in the shared empty-path cache.

2. cria `runner_interface.h::loadModel()` no longer pushes XNNPACK options globally via `executorch::runtime::set_option`. It now builds a `BackendOptions<3>` carrying path / `weight_cache_option_key` / `workspace_sharing_mode_option_key`, wraps it in a `LoadBackendOptionsMap`, and passes that map to every `Module::load_method` call (primary, multimethod loop, YOCO prefill/decode). The `BackendOptions` and map both live on the `loadModel` stack frame, which extends through every `load_method` call — Span lifetime requirements satisfied. Per-PTE options propagate into the backend's `BackendInitContext::runtime_spec` via `Method::init`'s `LoadBackendOptionsMap` path (`method.cpp:957-963`). Non-opt-in cria PTEs and non-cria loaders (NGTTS, direct-Module) simply don't pass a map → empty runtime_spec → init forces empty path → shared heap-only instance with dedup.

Lock hierarchy (updated):
- `weights_cache_manager_.meta_mutex_` (leaf — only during path-keyed map ops, never held across calls into instances)
- `weights_cache_manager_.empty_path_mutex_` (leaf — only during empty-path weak_ptr lookup/store)
- `XNNWeightsCache::instance_mutex_` (one per cache)
- `workspace_meta_mutex_`
- `workspace_mutex_` (owned by executor)

Race-condition / corner-case coverage:
- Same-path concurrent `get_or_create`: serialize on `meta_mutex_`, both return the same shared instance.
- Different-path concurrent `get_or_create`: parallel after the brief `meta_mutex_` window.
- Mid-load contention: same-path callers serialize on the instance mutex around `initialize_for_runtime`.
- Cross-PTE clobbering (the original bug): impossible — each path owns its own instance.
- Cross-process same-path: existing `flock(LOCK_EX|LOCK_NB)` defense untouched.
- Cache file deleted on disk: existing mmap stays valid (unix unlink semantics); manager doesn't track disk state.
- Process shutdown mid-save: executor-held `shared_ptr` outlives the manager map; instance destruction follows the executor's normal teardown.
- XNNPACK seed mismatch / cache format bump: existing per-entry seed reject + v1-trailer reject paths untouched.
- Empty path: shared via `empty_path_cache_` weak_ptr; recreated when all shared_ptrs drop; never collides with any mmap-path instance.
- Concurrent same-cache execute + destroy: serialize on the instance mutex.
- Stale global path inherited by non-opt-in PTE: prevented by the runtime_spec-only path read in init.

Mirrored to `fbcode/executorch/backends/xnnpack/runtime/`. The cria change lives only under `xplat/cria/` (no fbcode mirror).

### Test plan
Built `fbsource//xplat/executorch/backends/xnnpack:xnnpack_backend` on linux, Apple, Android, and `fbcode//executorch/backends/xnnpack:xnnpack_backend` on linux — all green. Built downstream consumers to verify the API change is binary-compatible: `fbsource//xplat/cria/core:cria{Apple,Android}`, `fbsource//xplat/sgr/ml_service/modules/llm:lib_sgr_llmApple`, `fbsource//xplat/assistant/oacr/trims/modules/ondevice_modules:mwa_ondevice_moduleApple` — all green.

```
buck2 test \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache_manager \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache \
  fbcode//executorch/backends/xnnpack/test:test_workspace_manager \
  fbcode//executorch/backends/xnnpack/test:xnnexecutor_test
→ Pass 38. Fail 0. Build failure 0.
```

The new `test_xnn_weights_cache_manager` exercises 13 hazard cases the manager handles: `SamePathReturnsSameInstance`, `DifferentPathsReturnDifferentInstances`, `EmptyPathSharedAcrossCallers`, `EmptyPathRecreatedAfterAllRefsDrop`, `EmptyPathDoesNotShareWithMmapPath`, `ExpiredEntryDoesNotLeak`, `ExpiredEntryRecreatedOnNextCall`, `ConcurrentSamePathSameInstance` (16-thread fan-in), `ConcurrentDifferentPathsIndependent` (8-thread fan-out), `SaveAllNoLiveInstancesIsOk`, `SaveAllWalksLiveCaches`, `SaveAllSkipsExpiredEntries`, `NonEmptyPathRegistersInMap`.

Cria runner tests (`fbsource//xplat/cria/core/runner/tests/...`): Pass 938, Fail 0. The 18 `Fatal` entries reported by buck2 (`PrefillReturnsLogits`, `PrefillMapsParams`, `PrefillStringPrompt`, etc.) reproduce identically on this commit's parent (605126226e, no cria change, no init guard) with the same OpenMP/MKL/ASan SEGV stack — `kmp_basic_flag_native::done_check` → `__kmp_hyper_barrier_release` triggered by `mkl_blas_sgemm_omp_driver_v1` racing with `pthread_create` from `pthreadpool_create_v2`. These are pre-existing flakes in the asan-ubsan platform configuration, not caused by either the manager refactor or the runtime_spec migration.

`arc lint -a` clean across all 22 changed/added files (11 xplat + 11 fbcode mirrors; cria is xplat-only).

Differential Revision: D108431510
@doggeral doggeral force-pushed the export-D108431510 branch from e3dfec7 to 685283c Compare June 18, 2026 01:00
doggeral added a commit to doggeral/executorch that referenced this pull request Jun 18, 2026
Summary:

Split `XNNWeightsCache` out of its de-facto singleton lifetime (one instance per `XnnpackBackend`, shared across every PTE that opted into the file-backed cache) into a per-cache-file-path instance dispensed by a new `XNNWeightsCacheManager`. Mirrors the `XNNWorkspaceManager` PerModel pattern: same path → same shared instance; different paths → independent instances; empty path → one shared heap-only instance so XNNPACK's in-memory name dedup still works across PTEs.

The singleton design was forced today because `XnnpackBackendOptions::weights_cache_` is a by-value member and `XnnpackBackend` itself is a namespace-scope global (`XNNPACKBackend.cpp:246`). For PTEs that genuinely share a cache file the singleton's per-entry refcounting works, but two PTEs with **different** packed-cache paths hit the same `XNNWeightsCache` instance, so the second PTE's `initialize_for_runtime` calls `ftruncate(0)` on the file under the first PTE's still-live mmap regions — every subsequent access SIGBUSes (P2369924970 traces this). The clue the prior fix added — the warning in `XNNWeightsCache.h:147-151` "the path MUST be unique per `XNNWeightsCache` instance" — is enforced here at the manager level rather than relying on each caller to honor it.

Design:

`XNNWeightsCacheManager` (new, mirrors `XNNWorkspaceManager`):
- `std::unordered_map<std::string, std::weak_ptr<XNNWeightsCache>> caches_` keyed by absolute cache file path.
- `get_or_create(path)` looks up the path under a single `meta_mutex_`; if a live `weak_ptr` exists, returns the shared instance, otherwise constructs a new one, calls `set_packed_cache_path(path)` BEFORE registering it, and stores a `weak_ptr`.
- `meta_mutex_` is held only during the map op — never across any call into `XNNWeightsCache`, so different-path callers proceed in parallel after the brief window.
- `save_all()` snapshots live shared_ptrs under `meta_mutex_`, then iterates outside the meta lock and acquires each instance's own mutex around `save_packed_index()`.
- Empty path uses a separate `empty_path_cache_` (weak_ptr) + `empty_path_mutex_`: all heap-only callers (NGTTS sub-runners, FLLM classifier, PLLM methods when mmap MC is off) share one instance so XNNPACK's in-memory `look_up_or_insert` name dedup catches duplicate weights across PTEs and across methods within a PTE. Without this sharing, every `XnnpackBackend::init` allocated its own packed copy of every weight, regressing heap-only memory by ~500 MB on LoRA-multimethod PLLM (paste P2380809516: app_phys peak 1731 MB with per-instance vs ~1260 MB with the prior process-singleton). The hazard the per-path keying motivates — non-opt-in PTE inheriting an opt-in PTE's path and writing into its mmap file — never applies to empty-path callers because they hold no path / no fd / no mmap regions, so sharing among empty-path callers carries no isolation cost.
- Expired `weak_ptr` entries are erased opportunistically on the next `get_or_create` / `save_all` for that path. Stale entries from never-revisited paths linger; cost is one string + weak_ptr per dead entry. Acceptable per `XNNWorkspaceManager` precedent.

`XNNWeightsCache`:
- New `std::mutex instance_mutex_` member + `mutex()` accessor. The class has no internal synchronization; callers are responsible for holding `mutex()` around every method invocation, INCLUDING the XNNPACK callback paths (`look_up`, `reserve_space`, `look_up_or_insert`) that fire during `xnn_create_runtime`.
- `set_packed_cache_path` is documented as call-once-before-publish: production callers go through the manager, which sets the path before installing the `shared_ptr` in the map, so no other thread can observe the instance yet. Tests that construct the class directly must respect this contract.

`XnnpackBackendOptions`:
- Replaced `XNNWeightsCache weights_cache_` + `std::mutex weights_cache_mutex_` with `XNNWeightsCacheManager weights_cache_manager_`.
- New `get_or_create_weights_cache(path)` thin wrapper around the manager.
- `save_weights_cache_locked()` now walks every live cache via the manager's `save_all()`.
- `packed_cache_path_` keeps a small `path_mutex_` to serialize the `set_option(packed_cache_path_option_key)` → `init()` read; this is just transport for the option value, the path's authoritative home is per-instance inside each cache.

`XNNPACKBackend`:
- `init`: pulls the path from per-PTE `runtime_spec` (see "Per-PTE caller signal" below), asks the manager for the shared `XNNWeightsCache`, locks its `mutex()` for the entire init→compileModel sequence, then publishes the `shared_ptr` into the executor via the new `XNNExecutor::set_weights_cache`. Same-path PTEs serialize on the same instance mutex; different-path PTEs hold different mutexes and proceed in parallel — the singleton design forced full serialization here.
- `execute`: lock the per-executor cache's mutex (if any) instead of the global one. Concurrent execute on independent caches now runs in parallel.
- `destroy`: lock the per-executor cache's mutex, call `delete_packed_data`. The local `shared_ptr` keeps the instance alive across `delete_packed_data` even if dropping it from the executor was the last outside reference.

`XNNExecutor`:
- New `std::shared_ptr<XNNWeightsCache> weights_cache_` member, set once after `compileModel`. Forward-declared (rather than including `XNNWeightsCache.h`) to keep the transitive `pte_data_map.h` dependency out of the executor's public header — preserves the existing `xnnexecutor_test` build dep set.

Per-PTE caller signal (the FLLM / NGTTS isolation guarantee):

The manager's per-path dedup is necessary but not sufficient — if a non-opt-in PTE inherits an opt-in PTE's globally-set path, the manager hands it the same shared instance and the non-opt-in PTE's `reserve_space` writes into the opt-in model's mmap file. Investigation of the three on-device loaders confirms the concrete risk: cria PLLM pushes `packed_cache_path_option_key` globally (`runner_interface.h:365-373`), but NGTTS sub-runners (`AcousticRunner` / `HfMimiRunner` / `SemanticLmRunner` at `executorch/examples/models/fb/llama4/runner/*.cpp`) bypass cria entirely and never push a path, and the cria FLLM classifier path skips the push when `FactoryMetaData::useMmapPackedWeights = false` (`CriaHost.cpp:220-224`). All three loaders run in the same process and share one `XnnpackBackend` global (`XNNPACKBackend.cpp:246`).

Two complementary changes lock this down:

1. `XNNPACKBackend::init` no longer reads `options_.get_packed_cache_path()` (shared backend-singleton state). It reads the path strictly from `BackendInitContext::get_runtime_spec<const char*>(packed_cache_path_option_key)` — the only per-PTE signal that proves THIS PTE explicitly opted in. If `runtime_spec` carries no path, `cache_path` is empty and the manager hands the shared `empty_path_cache_` (per the empty-path branch above). Non-opt-in PTEs are guaranteed isolated from the mmap-path file regardless of what the global path happens to hold; they still dedupe against one another in the shared empty-path cache.

2. cria `runner_interface.h::loadModel()` no longer pushes XNNPACK options globally via `executorch::runtime::set_option`. It now builds a `BackendOptions<3>` carrying path / `weight_cache_option_key` / `workspace_sharing_mode_option_key`, wraps it in a `LoadBackendOptionsMap`, and passes that map to every `Module::load_method` call (primary, multimethod loop, YOCO prefill/decode). The `BackendOptions` and map both live on the `loadModel` stack frame, which extends through every `load_method` call — Span lifetime requirements satisfied. Per-PTE options propagate into the backend's `BackendInitContext::runtime_spec` via `Method::init`'s `LoadBackendOptionsMap` path (`method.cpp:957-963`). Non-opt-in cria PTEs and non-cria loaders (NGTTS, direct-Module) simply don't pass a map → empty runtime_spec → init forces empty path → shared heap-only instance with dedup.

Lock hierarchy (updated):
- `weights_cache_manager_.meta_mutex_` (leaf — only during path-keyed map ops, never held across calls into instances)
- `weights_cache_manager_.empty_path_mutex_` (leaf — only during empty-path weak_ptr lookup/store)
- `XNNWeightsCache::instance_mutex_` (one per cache)
- `workspace_meta_mutex_`
- `workspace_mutex_` (owned by executor)

Race-condition / corner-case coverage:
- Same-path concurrent `get_or_create`: serialize on `meta_mutex_`, both return the same shared instance.
- Different-path concurrent `get_or_create`: parallel after the brief `meta_mutex_` window.
- Mid-load contention: same-path callers serialize on the instance mutex around `initialize_for_runtime`.
- Cross-PTE clobbering (the original bug): impossible — each path owns its own instance.
- Cross-process same-path: existing `flock(LOCK_EX|LOCK_NB)` defense untouched.
- Cache file deleted on disk: existing mmap stays valid (unix unlink semantics); manager doesn't track disk state.
- Process shutdown mid-save: executor-held `shared_ptr` outlives the manager map; instance destruction follows the executor's normal teardown.
- XNNPACK seed mismatch / cache format bump: existing per-entry seed reject + v1-trailer reject paths untouched.
- Empty path: shared via `empty_path_cache_` weak_ptr; recreated when all shared_ptrs drop; never collides with any mmap-path instance.
- Concurrent same-cache execute + destroy: serialize on the instance mutex.
- Stale global path inherited by non-opt-in PTE: prevented by the runtime_spec-only path read in init.

Mirrored to `fbcode/executorch/backends/xnnpack/runtime/`. The cria change lives only under `xplat/cria/` (no fbcode mirror).

### Test plan
Built `fbsource//xplat/executorch/backends/xnnpack:xnnpack_backend` on linux, Apple, Android, and `fbcode//executorch/backends/xnnpack:xnnpack_backend` on linux — all green. Built downstream consumers to verify the API change is binary-compatible: `fbsource//xplat/cria/core:cria{Apple,Android}`, `fbsource//xplat/sgr/ml_service/modules/llm:lib_sgr_llmApple`, `fbsource//xplat/assistant/oacr/trims/modules/ondevice_modules:mwa_ondevice_moduleApple` — all green.

```
buck2 test \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache_manager \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache \
  fbcode//executorch/backends/xnnpack/test:test_workspace_manager \
  fbcode//executorch/backends/xnnpack/test:xnnexecutor_test
→ Pass 38. Fail 0. Build failure 0.
```

The new `test_xnn_weights_cache_manager` exercises 13 hazard cases the manager handles: `SamePathReturnsSameInstance`, `DifferentPathsReturnDifferentInstances`, `EmptyPathSharedAcrossCallers`, `EmptyPathRecreatedAfterAllRefsDrop`, `EmptyPathDoesNotShareWithMmapPath`, `ExpiredEntryDoesNotLeak`, `ExpiredEntryRecreatedOnNextCall`, `ConcurrentSamePathSameInstance` (16-thread fan-in), `ConcurrentDifferentPathsIndependent` (8-thread fan-out), `SaveAllNoLiveInstancesIsOk`, `SaveAllWalksLiveCaches`, `SaveAllSkipsExpiredEntries`, `NonEmptyPathRegistersInMap`.

Cria runner tests (`fbsource//xplat/cria/core/runner/tests/...`): Pass 938, Fail 0. The 18 `Fatal` entries reported by buck2 (`PrefillReturnsLogits`, `PrefillMapsParams`, `PrefillStringPrompt`, etc.) reproduce identically on this commit's parent (605126226e, no cria change, no init guard) with the same OpenMP/MKL/ASan SEGV stack — `kmp_basic_flag_native::done_check` → `__kmp_hyper_barrier_release` triggered by `mkl_blas_sgemm_omp_driver_v1` racing with `pthread_create` from `pthreadpool_create_v2`. These are pre-existing flakes in the asan-ubsan platform configuration, not caused by either the manager refactor or the runtime_spec migration.

`arc lint -a` clean across all 22 changed/added files (11 xplat + 11 fbcode mirrors; cria is xplat-only).

Differential Revision: D108431510
@doggeral doggeral force-pushed the export-D108431510 branch from 685283c to 32c7f95 Compare June 18, 2026 02:27
doggeral added a commit to doggeral/executorch that referenced this pull request Jun 18, 2026
Summary:

Split `XNNWeightsCache` out of its de-facto singleton lifetime (one instance per `XnnpackBackend`, shared across every PTE that opted into the file-backed cache) into a per-cache-file-path instance dispensed by a new `XNNWeightsCacheManager`. Mirrors the `XNNWorkspaceManager` PerModel pattern: same path → same shared instance; different paths → independent instances; empty path → one shared heap-only instance so XNNPACK's in-memory name dedup still works across PTEs.

The singleton design was forced today because `XnnpackBackendOptions::weights_cache_` is a by-value member and `XnnpackBackend` itself is a namespace-scope global (`XNNPACKBackend.cpp:246`). For PTEs that genuinely share a cache file the singleton's per-entry refcounting works, but two PTEs with **different** packed-cache paths hit the same `XNNWeightsCache` instance, so the second PTE's `initialize_for_runtime` calls `ftruncate(0)` on the file under the first PTE's still-live mmap regions — every subsequent access SIGBUSes (P2369924970 traces this). The clue the prior fix added — the warning in `XNNWeightsCache.h:147-151` "the path MUST be unique per `XNNWeightsCache` instance" — is enforced here at the manager level rather than relying on each caller to honor it.

Design:

`XNNWeightsCacheManager` (new, mirrors `XNNWorkspaceManager`):
- `std::unordered_map<std::string, std::weak_ptr<XNNWeightsCache>> caches_` keyed by absolute cache file path.
- `get_or_create(path)` looks up the path under a single `meta_mutex_`; if a live `weak_ptr` exists, returns the shared instance, otherwise constructs a new one, calls `set_packed_cache_path(path)` BEFORE registering it, and stores a `weak_ptr`.
- `meta_mutex_` is held only during the map op — never across any call into `XNNWeightsCache`, so different-path callers proceed in parallel after the brief window.
- `save_all()` snapshots live shared_ptrs under `meta_mutex_`, then iterates outside the meta lock and acquires each instance's own mutex around `save_packed_index()`.
- Empty path uses a separate `empty_path_cache_` (weak_ptr) + `empty_path_mutex_`: all heap-only callers (NGTTS sub-runners, FLLM classifier, PLLM methods when mmap MC is off) share one instance so XNNPACK's in-memory `look_up_or_insert` name dedup catches duplicate weights across PTEs and across methods within a PTE. Without this sharing, every `XnnpackBackend::init` allocated its own packed copy of every weight, regressing heap-only memory by ~500 MB on LoRA-multimethod PLLM (paste P2380809516: app_phys peak 1731 MB with per-instance vs ~1260 MB with the prior process-singleton). The hazard the per-path keying motivates — non-opt-in PTE inheriting an opt-in PTE's path and writing into its mmap file — never applies to empty-path callers because they hold no path / no fd / no mmap regions, so sharing among empty-path callers carries no isolation cost.
- Expired `weak_ptr` entries are erased opportunistically on the next `get_or_create` / `save_all` for that path. Stale entries from never-revisited paths linger; cost is one string + weak_ptr per dead entry. Acceptable per `XNNWorkspaceManager` precedent.

`XNNWeightsCache`:
- New `std::mutex instance_mutex_` member + `mutex()` accessor. The class has no internal synchronization; callers are responsible for holding `mutex()` around every method invocation, INCLUDING the XNNPACK callback paths (`look_up`, `reserve_space`, `look_up_or_insert`) that fire during `xnn_create_runtime`.
- `set_packed_cache_path` is documented as call-once-before-publish: production callers go through the manager, which sets the path before installing the `shared_ptr` in the map, so no other thread can observe the instance yet. Tests that construct the class directly must respect this contract.

`XnnpackBackendOptions`:
- Replaced `XNNWeightsCache weights_cache_` + `std::mutex weights_cache_mutex_` with `XNNWeightsCacheManager weights_cache_manager_`.
- New `get_or_create_weights_cache(path)` thin wrapper around the manager.
- `save_weights_cache_locked()` now walks every live cache via the manager's `save_all()`.
- `packed_cache_path_` keeps a small `path_mutex_` to serialize the `set_option(packed_cache_path_option_key)` → `init()` read; this is just transport for the option value, the path's authoritative home is per-instance inside each cache.

`XNNPACKBackend`:
- `init`: pulls the path from per-PTE `runtime_spec` (see "Per-PTE caller signal" below), asks the manager for the shared `XNNWeightsCache`, locks its `mutex()` for the entire init→compileModel sequence, then publishes the `shared_ptr` into the executor via the new `XNNExecutor::set_weights_cache`. Same-path PTEs serialize on the same instance mutex; different-path PTEs hold different mutexes and proceed in parallel — the singleton design forced full serialization here.
- `execute`: lock the per-executor cache's mutex (if any) instead of the global one. Concurrent execute on independent caches now runs in parallel.
- `destroy`: lock the per-executor cache's mutex, call `delete_packed_data`. The local `shared_ptr` keeps the instance alive across `delete_packed_data` even if dropping it from the executor was the last outside reference.

`XNNExecutor`:
- New `std::shared_ptr<XNNWeightsCache> weights_cache_` member, set once after `compileModel`. Forward-declared (rather than including `XNNWeightsCache.h`) to keep the transitive `pte_data_map.h` dependency out of the executor's public header — preserves the existing `xnnexecutor_test` build dep set.

Per-PTE caller signal (the FLLM / NGTTS isolation guarantee):

The manager's per-path dedup is necessary but not sufficient — if a non-opt-in PTE inherits an opt-in PTE's globally-set path, the manager hands it the same shared instance and the non-opt-in PTE's `reserve_space` writes into the opt-in model's mmap file. Investigation of the three on-device loaders confirms the concrete risk: cria PLLM pushes `packed_cache_path_option_key` globally (`runner_interface.h:365-373`), but NGTTS sub-runners (`AcousticRunner` / `HfMimiRunner` / `SemanticLmRunner` at `executorch/examples/models/fb/llama4/runner/*.cpp`) bypass cria entirely and never push a path, and the cria FLLM classifier path skips the push when `FactoryMetaData::useMmapPackedWeights = false` (`CriaHost.cpp:220-224`). All three loaders run in the same process and share one `XnnpackBackend` global (`XNNPACKBackend.cpp:246`).

Two complementary changes lock this down:

1. `XNNPACKBackend::init` no longer reads `options_.get_packed_cache_path()` (shared backend-singleton state). It reads the path strictly from `BackendInitContext::get_runtime_spec<const char*>(packed_cache_path_option_key)` — the only per-PTE signal that proves THIS PTE explicitly opted in. If `runtime_spec` carries no path, `cache_path` is empty and the manager hands the shared `empty_path_cache_` (per the empty-path branch above). Non-opt-in PTEs are guaranteed isolated from the mmap-path file regardless of what the global path happens to hold; they still dedupe against one another in the shared empty-path cache.

2. cria `runner_interface.h::loadModel()` no longer pushes XNNPACK options globally via `executorch::runtime::set_option`. It now builds a `BackendOptions<3>` carrying path / `weight_cache_option_key` / `workspace_sharing_mode_option_key`, wraps it in a `LoadBackendOptionsMap`, and passes that map to every `Module::load_method` call (primary, multimethod loop, YOCO prefill/decode). The `BackendOptions` and map both live on the `loadModel` stack frame, which extends through every `load_method` call — Span lifetime requirements satisfied. Per-PTE options propagate into the backend's `BackendInitContext::runtime_spec` via `Method::init`'s `LoadBackendOptionsMap` path (`method.cpp:957-963`). Non-opt-in cria PTEs and non-cria loaders (NGTTS, direct-Module) simply don't pass a map → empty runtime_spec → init forces empty path → shared heap-only instance with dedup.

Lock hierarchy (updated):
- `weights_cache_manager_.meta_mutex_` (leaf — only during path-keyed map ops, never held across calls into instances)
- `weights_cache_manager_.empty_path_mutex_` (leaf — only during empty-path weak_ptr lookup/store)
- `XNNWeightsCache::instance_mutex_` (one per cache)
- `workspace_meta_mutex_`
- `workspace_mutex_` (owned by executor)

Race-condition / corner-case coverage:
- Same-path concurrent `get_or_create`: serialize on `meta_mutex_`, both return the same shared instance.
- Different-path concurrent `get_or_create`: parallel after the brief `meta_mutex_` window.
- Mid-load contention: same-path callers serialize on the instance mutex around `initialize_for_runtime`.
- Cross-PTE clobbering (the original bug): impossible — each path owns its own instance.
- Cross-process same-path: existing `flock(LOCK_EX|LOCK_NB)` defense untouched.
- Cache file deleted on disk: existing mmap stays valid (unix unlink semantics); manager doesn't track disk state.
- Process shutdown mid-save: executor-held `shared_ptr` outlives the manager map; instance destruction follows the executor's normal teardown.
- XNNPACK seed mismatch / cache format bump: existing per-entry seed reject + v1-trailer reject paths untouched.
- Empty path: shared via `empty_path_cache_` weak_ptr; recreated when all shared_ptrs drop; never collides with any mmap-path instance.
- Concurrent same-cache execute + destroy: serialize on the instance mutex.
- Stale global path inherited by non-opt-in PTE: prevented by the runtime_spec-only path read in init.

Mirrored to `fbcode/executorch/backends/xnnpack/runtime/`. The cria change lives only under `xplat/cria/` (no fbcode mirror).

### Test plan
Built `fbsource//xplat/executorch/backends/xnnpack:xnnpack_backend` on linux, Apple, Android, and `fbcode//executorch/backends/xnnpack:xnnpack_backend` on linux — all green. Built downstream consumers to verify the API change is binary-compatible: `fbsource//xplat/cria/core:cria{Apple,Android}`, `fbsource//xplat/sgr/ml_service/modules/llm:lib_sgr_llmApple`, `fbsource//xplat/assistant/oacr/trims/modules/ondevice_modules:mwa_ondevice_moduleApple` — all green.

```
buck2 test \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache_manager \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache \
  fbcode//executorch/backends/xnnpack/test:test_workspace_manager \
  fbcode//executorch/backends/xnnpack/test:xnnexecutor_test
→ Pass 38. Fail 0. Build failure 0.
```

The new `test_xnn_weights_cache_manager` exercises 13 hazard cases the manager handles: `SamePathReturnsSameInstance`, `DifferentPathsReturnDifferentInstances`, `EmptyPathSharedAcrossCallers`, `EmptyPathRecreatedAfterAllRefsDrop`, `EmptyPathDoesNotShareWithMmapPath`, `ExpiredEntryDoesNotLeak`, `ExpiredEntryRecreatedOnNextCall`, `ConcurrentSamePathSameInstance` (16-thread fan-in), `ConcurrentDifferentPathsIndependent` (8-thread fan-out), `SaveAllNoLiveInstancesIsOk`, `SaveAllWalksLiveCaches`, `SaveAllSkipsExpiredEntries`, `NonEmptyPathRegistersInMap`.

Cria runner tests (`fbsource//xplat/cria/core/runner/tests/...`): Pass 938, Fail 0. The 18 `Fatal` entries reported by buck2 (`PrefillReturnsLogits`, `PrefillMapsParams`, `PrefillStringPrompt`, etc.) reproduce identically on this commit's parent (605126226e, no cria change, no init guard) with the same OpenMP/MKL/ASan SEGV stack — `kmp_basic_flag_native::done_check` → `__kmp_hyper_barrier_release` triggered by `mkl_blas_sgemm_omp_driver_v1` racing with `pthread_create` from `pthreadpool_create_v2`. These are pre-existing flakes in the asan-ubsan platform configuration, not caused by either the manager refactor or the runtime_spec migration.

`arc lint -a` clean across all 22 changed/added files (11 xplat + 11 fbcode mirrors; cria is xplat-only).

Differential Revision: D108431510
@doggeral doggeral force-pushed the export-D108431510 branch from 32c7f95 to ebb80b8 Compare June 18, 2026 02:29
Summary:

Split `XNNWeightsCache` out of its de-facto singleton lifetime (one instance per `XnnpackBackend`, shared across every PTE that opted into the file-backed cache) into a per-cache-file-path instance dispensed by a new `XNNWeightsCacheManager`. Mirrors the `XNNWorkspaceManager` PerModel pattern: same path → same shared instance; different paths → independent instances; empty path → one shared heap-only instance so XNNPACK's in-memory name dedup still works across PTEs.

The singleton design was forced today because `XnnpackBackendOptions::weights_cache_` is a by-value member and `XnnpackBackend` itself is a namespace-scope global (`XNNPACKBackend.cpp:246`). For PTEs that genuinely share a cache file the singleton's per-entry refcounting works, but two PTEs with **different** packed-cache paths hit the same `XNNWeightsCache` instance, so the second PTE's `initialize_for_runtime` calls `ftruncate(0)` on the file under the first PTE's still-live mmap regions — every subsequent access SIGBUSes (P2369924970 traces this). The clue the prior fix added — the warning in `XNNWeightsCache.h:147-151` "the path MUST be unique per `XNNWeightsCache` instance" — is enforced here at the manager level rather than relying on each caller to honor it.

Design:

`XNNWeightsCacheManager` (new, mirrors `XNNWorkspaceManager`):
- `std::unordered_map<std::string, std::weak_ptr<XNNWeightsCache>> caches_` keyed by absolute cache file path.
- `get_or_create(path)` looks up the path under a single `meta_mutex_`; if a live `weak_ptr` exists, returns the shared instance, otherwise constructs a new one, calls `set_packed_cache_path(path)` BEFORE registering it, and stores a `weak_ptr`.
- `meta_mutex_` is held only during the map op — never across any call into `XNNWeightsCache`, so different-path callers proceed in parallel after the brief window.
- `save_all()` snapshots live shared_ptrs under `meta_mutex_`, then iterates outside the meta lock and acquires each instance's own mutex around `save_packed_index()`.
- Empty path uses a separate `empty_path_cache_` (weak_ptr) + `empty_path_mutex_`: all heap-only callers (NGTTS sub-runners, FLLM classifier, PLLM methods when mmap MC is off) share one instance so XNNPACK's in-memory `look_up_or_insert` name dedup catches duplicate weights across PTEs and across methods within a PTE. Without this sharing, every `XnnpackBackend::init` allocated its own packed copy of every weight, regressing heap-only memory by ~500 MB on LoRA-multimethod PLLM (paste P2380809516: app_phys peak 1731 MB with per-instance vs ~1260 MB with the prior process-singleton). The hazard the per-path keying motivates — non-opt-in PTE inheriting an opt-in PTE's path and writing into its mmap file — never applies to empty-path callers because they hold no path / no fd / no mmap regions, so sharing among empty-path callers carries no isolation cost.
- Expired `weak_ptr` entries are erased opportunistically on the next `get_or_create` / `save_all` for that path. Stale entries from never-revisited paths linger; cost is one string + weak_ptr per dead entry. Acceptable per `XNNWorkspaceManager` precedent.

`XNNWeightsCache`:
- New `std::mutex instance_mutex_` member + `mutex()` accessor. The class has no internal synchronization; callers are responsible for holding `mutex()` around every method invocation, INCLUDING the XNNPACK callback paths (`look_up`, `reserve_space`, `look_up_or_insert`) that fire during `xnn_create_runtime`.
- `set_packed_cache_path` is documented as call-once-before-publish: production callers go through the manager, which sets the path before installing the `shared_ptr` in the map, so no other thread can observe the instance yet. Tests that construct the class directly must respect this contract.

`XnnpackBackendOptions`:
- Replaced `XNNWeightsCache weights_cache_` + `std::mutex weights_cache_mutex_` with `XNNWeightsCacheManager weights_cache_manager_`.
- New `get_or_create_weights_cache(path)` thin wrapper around the manager.
- `save_weights_cache_locked()` now walks every live cache via the manager's `save_all()`.
- `packed_cache_path_` keeps a small `path_mutex_` to serialize the `set_option(packed_cache_path_option_key)` → `init()` read; this is just transport for the option value, the path's authoritative home is per-instance inside each cache.

`XNNPACKBackend`:
- `init`: pulls the path from per-PTE `runtime_spec` (see "Per-PTE caller signal" below), asks the manager for the shared `XNNWeightsCache`, locks its `mutex()` for the entire init→compileModel sequence, then publishes the `shared_ptr` into the executor via the new `XNNExecutor::set_weights_cache`. Same-path PTEs serialize on the same instance mutex; different-path PTEs hold different mutexes and proceed in parallel — the singleton design forced full serialization here.
- `execute`: lock the per-executor cache's mutex (if any) instead of the global one. Concurrent execute on independent caches now runs in parallel.
- `destroy`: lock the per-executor cache's mutex, call `delete_packed_data`. The local `shared_ptr` keeps the instance alive across `delete_packed_data` even if dropping it from the executor was the last outside reference.

`XNNExecutor`:
- New `std::shared_ptr<XNNWeightsCache> weights_cache_` member, set once after `compileModel`. Forward-declared (rather than including `XNNWeightsCache.h`) to keep the transitive `pte_data_map.h` dependency out of the executor's public header — preserves the existing `xnnexecutor_test` build dep set.

Per-PTE caller signal (the FLLM / NGTTS isolation guarantee):

The manager's per-path dedup is necessary but not sufficient — if a non-opt-in PTE inherits an opt-in PTE's globally-set path, the manager hands it the same shared instance and the non-opt-in PTE's `reserve_space` writes into the opt-in model's mmap file. Investigation of the three on-device loaders confirms the concrete risk: cria PLLM pushes `packed_cache_path_option_key` globally (`runner_interface.h:365-373`), but NGTTS sub-runners (`AcousticRunner` / `HfMimiRunner` / `SemanticLmRunner` at `executorch/examples/models/fb/llama4/runner/*.cpp`) bypass cria entirely and never push a path, and the cria FLLM classifier path skips the push when `FactoryMetaData::useMmapPackedWeights = false` (`CriaHost.cpp:220-224`). All three loaders run in the same process and share one `XnnpackBackend` global (`XNNPACKBackend.cpp:246`).

Two complementary changes lock this down:

1. `XNNPACKBackend::init` no longer reads `options_.get_packed_cache_path()` (shared backend-singleton state). It reads the path strictly from `BackendInitContext::get_runtime_spec<const char*>(packed_cache_path_option_key)` — the only per-PTE signal that proves THIS PTE explicitly opted in. If `runtime_spec` carries no path, `cache_path` is empty and the manager hands the shared `empty_path_cache_` (per the empty-path branch above). Non-opt-in PTEs are guaranteed isolated from the mmap-path file regardless of what the global path happens to hold; they still dedupe against one another in the shared empty-path cache.

2. cria `runner_interface.h::loadModel()` no longer pushes XNNPACK options globally via `executorch::runtime::set_option`. It now builds a `BackendOptions<3>` carrying path / `weight_cache_option_key` / `workspace_sharing_mode_option_key`, wraps it in a `LoadBackendOptionsMap`, and passes that map to every `Module::load_method` call (primary, multimethod loop, YOCO prefill/decode). The `BackendOptions` and map both live on the `loadModel` stack frame, which extends through every `load_method` call — Span lifetime requirements satisfied. Per-PTE options propagate into the backend's `BackendInitContext::runtime_spec` via `Method::init`'s `LoadBackendOptionsMap` path (`method.cpp:957-963`). Non-opt-in cria PTEs and non-cria loaders (NGTTS, direct-Module) simply don't pass a map → empty runtime_spec → init forces empty path → shared heap-only instance with dedup.

Lock hierarchy (updated):
- `weights_cache_manager_.meta_mutex_` (leaf — only during path-keyed map ops, never held across calls into instances)
- `weights_cache_manager_.empty_path_mutex_` (leaf — only during empty-path weak_ptr lookup/store)
- `XNNWeightsCache::instance_mutex_` (one per cache)
- `workspace_meta_mutex_`
- `workspace_mutex_` (owned by executor)

Race-condition / corner-case coverage:
- Same-path concurrent `get_or_create`: serialize on `meta_mutex_`, both return the same shared instance.
- Different-path concurrent `get_or_create`: parallel after the brief `meta_mutex_` window.
- Mid-load contention: same-path callers serialize on the instance mutex around `initialize_for_runtime`.
- Cross-PTE clobbering (the original bug): impossible — each path owns its own instance.
- Cross-process same-path: existing `flock(LOCK_EX|LOCK_NB)` defense untouched.
- Cache file deleted on disk: existing mmap stays valid (unix unlink semantics); manager doesn't track disk state.
- Process shutdown mid-save: executor-held `shared_ptr` outlives the manager map; instance destruction follows the executor's normal teardown.
- XNNPACK seed mismatch / cache format bump: existing per-entry seed reject + v1-trailer reject paths untouched.
- Empty path: shared via `empty_path_cache_` weak_ptr; recreated when all shared_ptrs drop; never collides with any mmap-path instance.
- Concurrent same-cache execute + destroy: serialize on the instance mutex.
- Stale global path inherited by non-opt-in PTE: prevented by the runtime_spec-only path read in init.

Mirrored to `fbcode/executorch/backends/xnnpack/runtime/`. The cria change lives only under `xplat/cria/` (no fbcode mirror).

### Test plan
Built `fbsource//xplat/executorch/backends/xnnpack:xnnpack_backend` on linux, Apple, Android, and `fbcode//executorch/backends/xnnpack:xnnpack_backend` on linux — all green. Built downstream consumers to verify the API change is binary-compatible: `fbsource//xplat/cria/core:cria{Apple,Android}`, `fbsource//xplat/sgr/ml_service/modules/llm:lib_sgr_llmApple`, `fbsource//xplat/assistant/oacr/trims/modules/ondevice_modules:mwa_ondevice_moduleApple` — all green.

```
buck2 test \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache_manager \
  fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache \
  fbcode//executorch/backends/xnnpack/test:test_workspace_manager \
  fbcode//executorch/backends/xnnpack/test:xnnexecutor_test
→ Pass 38. Fail 0. Build failure 0.
```

The new `test_xnn_weights_cache_manager` exercises 13 hazard cases the manager handles: `SamePathReturnsSameInstance`, `DifferentPathsReturnDifferentInstances`, `EmptyPathSharedAcrossCallers`, `EmptyPathRecreatedAfterAllRefsDrop`, `EmptyPathDoesNotShareWithMmapPath`, `ExpiredEntryDoesNotLeak`, `ExpiredEntryRecreatedOnNextCall`, `ConcurrentSamePathSameInstance` (16-thread fan-in), `ConcurrentDifferentPathsIndependent` (8-thread fan-out), `SaveAllNoLiveInstancesIsOk`, `SaveAllWalksLiveCaches`, `SaveAllSkipsExpiredEntries`, `NonEmptyPathRegistersInMap`.

Cria runner tests (`fbsource//xplat/cria/core/runner/tests/...`): Pass 938, Fail 0. The 18 `Fatal` entries reported by buck2 (`PrefillReturnsLogits`, `PrefillMapsParams`, `PrefillStringPrompt`, etc.) reproduce identically on this commit's parent (605126226e, no cria change, no init guard) with the same OpenMP/MKL/ASan SEGV stack — `kmp_basic_flag_native::done_check` → `__kmp_hyper_barrier_release` triggered by `mkl_blas_sgemm_omp_driver_v1` racing with `pthread_create` from `pthreadpool_create_v2`. These are pre-existing flakes in the asan-ubsan platform configuration, not caused by either the manager refactor or the runtime_spec migration.

`arc lint -a` clean across all 22 changed/added files (11 xplat + 11 fbcode mirrors; cria is xplat-only).

Differential Revision: D108431510
@doggeral doggeral force-pushed the export-D108431510 branch from ebb80b8 to 57736af Compare June 18, 2026 02:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant