turbo-tasks: task-storage memory and dispatch-path overhead reductions #93720
Draft
lukesandberg wants to merge 9 commits into canary from
Conversation
The `task_cache` map and the `persistent_task_type` field both stored `Arc<CachedTaskType>`. `triomphe::Arc` is already a workspace dep used by `ReadRef` / `SharedReference`, and `CachedTaskType` is never wrapped in a `Weak`. triomphe saves one `usize` per allocation (no weak count) and avoids the weak-count CAS in `drop_slow`. Profiles showed `Arc<CachedTaskType>::drop_slow` taking ~1.4% of overhead samples on the `task_overhead/turbo-uncached-parallel` benchmark. Migrating removes the weak-count CAS and shrinks the per-task allocation by 8 bytes.

Wraps `triomphe::Arc<CachedTaskType>` in a newtype `CachedTaskTypeArc` so we can implement the foreign `bincode::Encode` / `Decode` traits: the orphan rule blocks an `impl<Context> Decode<Context> for triomphe::Arc<...>` because `Context` is not covered by a local type. Also forwards `Hash`, `Eq`, `Borrow<CachedTaskType>`, `Display`, `Debug`, `Deref`, and `Clone`.
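A minimal sketch of the newtype pattern described above, assuming a stub `CachedTaskType`; the bincode `Encode`/`Decode` impls and the `Hash`/`Display` forwarding are omitted, and the real definitions live in turbo-tasks:

```rust
use std::{borrow::Borrow, ops::Deref};

// Illustrative stub; the real type lives in turbo-tasks.
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct CachedTaskType {}

/// Local newtype over the foreign `triomphe::Arc` so the (also foreign)
/// bincode `Encode` / `Decode` traits can be implemented here without
/// violating the orphan rule. Those impls are omitted in this sketch.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct CachedTaskTypeArc(triomphe::Arc<CachedTaskType>);

impl Deref for CachedTaskTypeArc {
    type Target = CachedTaskType;
    fn deref(&self) -> &CachedTaskType {
        &self.0
    }
}

impl Borrow<CachedTaskType> for CachedTaskTypeArc {
    fn borrow(&self) -> &CachedTaskType {
        &self.0
    }
}
```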
The `cell_dependencies` set stored `(CellRef, Option<u64>)` tuples and `cell_dependents` stored `(CellId, Option<u64>, TaskId)` tuples. The `Option<u64>` cost a full 16 B (8 B discriminant + 8 B value, aligned), making each element 32 B. Going through `AutoSet`, that brought the largest `LazyField` variants to 48 B + 8 B discriminant = the enum's pinned size of 56 B.

Replace the tuples with explicit enums `CellDependency` (`All(CellRef)` / `Hash(CellRef, u64)`) and `CellDependent` (`All(CellId, TaskId)` / `Hash(CellId, TaskId, u64)`). The layout algorithm reuses the niche on `ValueTypeId` (`NonZero<u16>`) inside `CellRef.cell.type_id` for the variant tag, so the elements drop to 24 B and the corresponding `AutoSet`s to 40 B. Net effect: `LazyField` shrinks from 56 B to 48 B, and `TaskStorage` from 344 B to 312 B (saves 32 B per task at `LAZY_INLINE_CAPACITY = 4`).

This updates the backend's wire format. Persistent caches are auto-invalidated by the existing git-version-keyed cache directory scheme.
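A compileable sketch of the niche-packing effect, with illustrative stand-ins for the id types (the real `TaskId`/`CellId` definitions live in turbo-tasks and may differ): because `ValueTypeId` wraps a `NonZero<u16>`, rustc can hide the `All`/`Hash` discriminant in that niche instead of adding a tag word.

```rust
#![allow(dead_code)]
use std::num::NonZeroU16;

// Illustrative stand-ins for the real turbo-tasks id types.
#[derive(Clone, Copy)]
struct TaskId(u32);
#[derive(Clone, Copy)]
struct ValueTypeId(NonZeroU16); // NonZero leaves a niche for the enum tag
#[derive(Clone, Copy)]
struct CellId {
    type_id: ValueTypeId,
    index: u32,
}
#[derive(Clone, Copy)]
struct CellRef {
    task: TaskId,
    cell: CellId,
}

enum CellDependency {
    All(CellRef),
    Hash(CellRef, u64),
}

fn main() {
    // The old tuple form pays 16 B for the Option<u64> alone.
    println!("{}", std::mem::size_of::<(CellRef, Option<u64>)>()); // 32 on current rustc
    // The enum needs no separate tag: the Hash payload (CellRef + u64,
    // 8-byte aligned) sets the size. Exact layout is compiler-dependent.
    println!("{}", std::mem::size_of::<CellDependency>()); // 24 on current rustc
}
```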
`ResolveRawVcFuture::poll` and `ReadRawVcFuture::poll` are on the per-poll hot path of every Vc await. Each poll did:

- `with_turbo_tasks(...)` (a `LocalKey::with` on the `TURBO_TASKS` task-local) on every poll, and
- for `RawVc::LocalOutput` polls (the common case for `foo(bar).await` patterns): `try_read_local_output`, which itself does a `CURRENT_TASK_STATE.with(...)` on every call.

Add lazily initialized `tt: Option<Arc<dyn TurboTasksApi>>` and `cts: Option<Arc<RwLock<CurrentTaskState>>>` fields on both futures. Initialize on first poll via `with_turbo_tasks` / `current_task_state` respectively, then reuse on every subsequent poll. The cached values are valid for the lifetime of the future because a future cannot legally migrate between turbo-tasks scopes.

The `LocalOutput` resolution path now bypasses both the `LocalKey::with` on `CURRENT_TASK_STATE` and the `dyn TurboTasksApi` virtual dispatch in `try_read_local_output`. A new free function `try_read_local_output_in_state` is the underlying implementation; the trait method on `TurboTasksApi` delegates to it.

Measured cumulative impact (across this and the preceding three commits) on `task_overhead/turbo`:

- turbo-uncached-parallel/1µs: -25%
- turbo-uncached-parallel/10µs: -26%
- turbo-cached-different-keys/1µs: -37%
- turbo-cached-different-keys/10µs: -29%
- turbo-cached-different-keys/1000µs: -32%
- turbo-cached-same-keys/1µs: -10%

Higher-duration variants (100µs and 1ms) where `busy_task` dominates show no significant change, as expected.
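A condensed sketch of the caching pattern, assuming a `thread_local!` stand-in for the real `TURBO_TASKS` task-local and a simplified future type; names other than `with_turbo_tasks` and `TurboTasksApi` are illustrative:

```rust
use std::{cell::RefCell, sync::Arc};

trait TurboTasksApi {}

thread_local! {
    // Stand-in for the real `TURBO_TASKS` task-local.
    static TURBO_TASKS: RefCell<Option<Arc<dyn TurboTasksApi>>> = RefCell::new(None);
}

fn with_turbo_tasks<T>(f: impl FnOnce(&Arc<dyn TurboTasksApi>) -> T) -> T {
    TURBO_TASKS.with(|tt| {
        f(tt.borrow()
            .as_ref()
            .expect("called outside of a turbo-tasks scope"))
    })
}

struct ReadFuture {
    // `None` until the first poll; filled in once and reused afterwards.
    tt: Option<Arc<dyn TurboTasksApi>>,
}

impl ReadFuture {
    fn tt(&mut self) -> &Arc<dyn TurboTasksApi> {
        // The task-local lookup (`LocalKey::with`) now runs at most once
        // per future instead of once per poll.
        self.tt.get_or_insert_with(|| with_turbo_tasks(Arc::clone))
    }
}
```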
`TaskStorage::lazy` only ever holds at most ~25 elements (one per declared lazy field in the schema), and growth never exceeds that, so a `Vec`'s 24 B `(ptr, len, cap)` header is wasteful. Replace it with a hand-rolled `LazyVec<T>`: `(ptr, len: u8, cap: u8)` plus 6 B of padding, 16 B total. `size_of::<TaskStorage>()` drops from 136 B to 128 B. Multiplied across the several million `TaskStorage`s live during a Next.js build, the saving recovers dozens of MB of resident memory.

The API is intentionally a strict subset of `Vec`, covering only what the task-storage callers and the `#[task_storage]` macro output need: `len`, `is_empty`, `iter`, `iter_mut`, `push`, `swap_remove`, `last_mut`, `Index`/`IndexMut`, `Extend`, `IntoIterator`, `reserve`, `Default`, `Debug`, and `ShrinkToFit`. Capacity overflow above 255 panics, which leaves plenty of headroom over the ~25-element maximum.
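A layout-only sketch of the idea; the field names and `NonNull` representation are assumptions, and the real type also carries the allocation, growth, and drop logic:

```rust
use std::{marker::PhantomData, ptr::NonNull};

/// Vec-like container whose header is 16 B instead of Vec's 24 B:
/// length and capacity are u8 because the schema never needs more
/// than ~25 lazy fields per task.
pub struct TinyVec<T> {
    ptr: NonNull<T>,
    len: u8,
    cap: u8,
    _marker: PhantomData<T>,
}

// 8 B pointer + 1 B len + 1 B cap + 6 B padding = 16 B.
const _: () = assert!(std::mem::size_of::<TinyVec<u64>>() == 16);
```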
- Merge `CellDependent` into `CellDependency` — `CellDependent::All(CellId,
TaskId)` is `CellRef { task: TaskId, cell: CellId }` with field order
swapped. Both `cell_dependencies` and `cell_dependents` now store
`CellDependency`; the `CellRef.task` field's meaning differs by direction
(dependee vs. dependent) but the bits are identical. Drops the
`CellDependent` enum and 4 accessor methods.
- Rename `LazyVec` to `TinyVec`. Move the file accordingly.
- `CachedTaskTypeArc`: trim the orphan-rule paragraph and `derive(Hash)`
instead of the manual impl.
- Add `print_schema_sizes` test and update `test_schema_size` with the
  per-field breakdown and slack analysis (TaskStorage = 128 B; naive sum
  134 B; layout packs −6 B); a minimal sketch of this kind of check
  follows this list.
- Simplify the macro's `lazy` field comment.
- `raw_vc.rs`: add a TODO acknowledging the `cts` cache should be
unnecessary if `try_read_local_output` returned a future itself.
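A minimal sketch of the kind of size check `test_schema_size` performs; the numbers come from the bullet above, and the `use super::TaskStorage` path is illustrative:

```rust
#[cfg(test)]
mod tests {
    use super::TaskStorage;

    #[test]
    fn test_schema_size() {
        // Naive sum of the per-field sizes is 134 B; the packed layout
        // recovers 6 B of slack, landing at 128 B.
        assert_eq!(std::mem::size_of::<TaskStorage>(), 128);
    }
}
```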
Stats from current PR: ✅ No significant changes detected (Commit: dbced22)
`ResolveRawVcFuture::tt` was an `Option<Arc<dyn TurboTasksApi>>` populated on first poll via `with_turbo_tasks(Arc::clone)`. Since the future is always constructed inside a turbo-tasks scope (entry points are `RawVc::resolve` and `RawVc::into_read`), there's no reason to defer: capture eagerly in the constructor and drop the `Option`. Removes one branch from the per-poll hot path.

`ReadRawVcFuture` previously had its own `Option<Arc<dyn TurboTasksApi>>`, adding a second `LocalKey::with` plus a redundant `Arc::clone`. Drop that field entirely; phase 2 reuses the inner `ResolveRawVcFuture`'s `tt` via `&self.resolve.tt`.

Also derive `PartialEq` + `Eq` on `CachedTaskTypeArc` instead of hand-rolling the `ptr_eq` short-circuit. The cache map's bucket lookup uses `eq_components` directly, so the short-circuit was unreachable.
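A small sketch of the eager-capture change described above, reusing the illustrative `TurboTasksApi` stand-in from earlier; the constructor shape is an assumption:

```rust
use std::sync::Arc;

trait TurboTasksApi {}

// Before: `tt: Option<Arc<dyn TurboTasksApi>>`, filled in on first poll.
// After: capture at construction, since construction always happens
// inside a turbo-tasks scope.
struct ResolveRawVcFuture {
    tt: Arc<dyn TurboTasksApi>,
    // ... remaining future state ...
}

impl ResolveRawVcFuture {
    fn new(tt: Arc<dyn TurboTasksApi>) -> Self {
        // In the real code the handle comes from `with_turbo_tasks(Arc::clone)`
        // at the `RawVc::resolve` / `RawVc::into_read` entry points.
        Self { tt }
    }
}
```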
`task_cache: FxDashMap::default()` falls through to dashmap's default shard amount of `num_cpus * 4` (e.g. 64 shards on a 14-core machine), while `storage.map` uses our `compute_shard_amount` heuristic, which is quadratic in worker count for a target ~3% collision probability (e.g. 4096 shards on the same machine). The 64× mismatch made `task_cache` lookups contend on every cache hit even when `storage.map` accesses were uncontended. Profiles attributed ~10% of overhead samples to `dashmap::lock_exclusive_slow` on `task_cache`'s shards, which is implausible for a properly sharded map at this thread count.

Use `with_capacity_and_hasher_and_shard_amount` on `task_cache` with the same `shard_amount` we already pass to `Storage::new`.
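A sketch of the construction, assuming `FxDashMap` is a `dashmap::DashMap` keyed with an `FxHasher` build-hasher; the helper name `new_task_cache` is illustrative:

```rust
use std::hash::{BuildHasherDefault, Hash};

use dashmap::DashMap;
use rustc_hash::FxHasher;

type FxDashMap<K, V> = DashMap<K, V, BuildHasherDefault<FxHasher>>;

// Build the cache with an explicit shard count instead of
// `FxDashMap::default()`, which falls back to dashmap's own heuristic.
fn new_task_cache<K: Eq + Hash, V>(shard_amount: usize) -> FxDashMap<K, V> {
    // dashmap requires `shard_amount` to be a power of two greater than 1;
    // the 4096 figure quoted above satisfies this.
    FxDashMap::with_capacity_and_hasher_and_shard_amount(
        0,
        BuildHasherDefault::default(),
        shard_amount,
    )
}
```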
Summary
A series of small, independent perf and memory wins for turbo-tasks-backend's task storage and dispatch path. Each commit stands alone; the benchmark numbers below cover the cumulative effect of all 7 commits.

What changed
Use `triomphe::Arc<CachedTaskType>` (commit 1)

`triomphe::Arc` is already a workspace dep used elsewhere (`ReadRef`, `SharedReference`). `CachedTaskType` never appears in a `Weak<...>`, so we can drop the weak count and the corresponding CAS in `drop_slow`. Saves one `usize` per allocation. Migration is via a thin `CachedTaskTypeArc` newtype to satisfy the bincode `Encode`/`Decode` orphan rules.

Niche-encode `CellDependency` (commit 2)

`cell_dependencies`/`outdated_cell_dependencies` previously stored `(CellRef, Option<u64>)` tuples; the `Option<u64>` cost a full 16 B (8 B discriminant + 8 B value, aligned), making elements 32 B. The dependent-side `cell_dependents` stored a parallel `(CellId, Option<u64>, TaskId)` tuple.

A single `CellDependency` enum (`All(CellRef)` / `Hash(CellRef, u64)`) now backs both sides; the dependents map reuses the same enum with `CellRef.task` re-pointed at the dependent task. The layout algorithm reuses the niche on `ValueTypeId` (`NonZero<u16>`) inside `CellRef.cell.type_id` for the variant tag. Elements drop from 32 B → 24 B, and `LazyField`'s pinned size from 56 B → 48 B (those `AutoSet`s were the largest variants).
Cache `TURBO_TASKS` and `CURRENT_TASK_STATE` in Resolve/Read futures (commit 3)

`ResolveRawVcFuture::poll` and `ReadRawVcFuture::poll` are on the per-poll path of every `Vc` await. Each poll did a `with_turbo_tasks(...)` (a `LocalKey::with` on the `TURBO_TASKS` task-local), and on `RawVc::LocalOutput` polls (the common case for `foo(bar).await` patterns) `try_read_local_output` did another `LocalKey::with` on `CURRENT_TASK_STATE` plus a `dyn TurboTasksApi` virtual dispatch.

Both futures cache the `Arc<RwLock<CurrentTaskState>>` lazily on first poll and reuse it. The `LocalOutput` path additionally bypasses the trait object via a free `try_read_local_output_in_state` function. A future cannot legally migrate between turbo-tasks scopes, so the cached values stay valid for the future's lifetime.
Replace `TaskStorage::lazy: Vec<LazyField>` with a 16 B `TinyVec` (commit 4)

`TaskStorage::lazy` only ever holds at most ~25 elements (one per declared lazy field in the schema), and growth never exceeds that. Replacing `Vec<LazyField>`'s 24 B `(ptr, len, cap)` header with `(ptr, len: u8, cap: u8)` + 6 B padding gives 16 B. Drops `size_of::<TaskStorage>()` from 136 → 128 B. Multiplied by ~6.5M tasks live during a typical Next.js build, that's roughly 50 MB of resident memory recovered. The `TinyVec` API is a strict subset of `Vec` covering only what the schema's macro output and the storage helpers need.
Capture `Arc<dyn TurboTasksApi>` at `RawVc` future construction (commit 5)

Continuation of commit 3: instead of caching the turbo-tasks handle lazily on first poll, capture it eagerly when constructing `ResolveRawVcFuture` / `ReadRawVcFuture`. Construction always happens inside a turbo-tasks scope, so a single `with_turbo_tasks(Arc::clone)` at construction replaces the `Option<Arc<...>>` and the per-poll fast path. `ReadRawVcFuture` shares the `Arc` with its inner `ResolveRawVcFuture` rather than holding its own.
Match `task_cache` shard count to `storage.map` (commit 6)

`Storage::new` carefully computes a quadratic `shard_amount` (k=16) sized for `num_cpus`. On a 14-core box that's 4096 shards. The `task_cache` `FxDashMap`, however, fell through to dashmap's default heuristic (`num_cpus * 4` ≈ 64 shards), so cache lookups contended on a fraction of the storage map's shard count. The cache is now constructed with `with_capacity_and_hasher_and_shard_amount` using the same value as `storage.map`.
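A hypothetical reconstruction of a quadratic shard heuristic like the one described; the exact formula in `compute_shard_amount` is not shown in this PR description, so the constant and rounding below are assumptions chosen to match the k=16 / 4096-shard figures above:

```rust
#![allow(dead_code)]

// Assumed shape of a quadratic shard heuristic: with k = 16 and 14 workers
// this gives (14 * 14 * 16).next_power_of_two() = 4096, matching the
// figures quoted above. The real `compute_shard_amount` may differ.
fn compute_shard_amount(workers: usize) -> usize {
    const K: usize = 16;
    (workers * workers * K).next_power_of_two().max(2)
}

fn main() {
    assert_eq!(compute_shard_amount(14), 4096);
    // dashmap's own default would land near workers * 4 shards instead.
    println!("{}", compute_shard_amount(14));
}
```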
Renames
LazyVec→TinyVec, mergesCellDependentintoCellDependency, simplifies a few comments, drops a stale orphan-rule paragraph, and tightens thetask_storagemacro output.Benchmark results
`task_overhead` benchmark (`cargo bench -p turbo-tasks-backend --bench mod -- task_overhead/turbo`) on an Apple M4 Pro (14 cores), `--sample-size 200`, run under `caffeinate -dimsu nice -n -20` with Spotlight indexing paused and `powermetrics` co-recorded. No thermal pressure events recorded during either run. Significance: 95% CIs on the mean.

The cached paths show the largest improvements; that's where every `Vc` await goes through the per-poll TLS/dispatch path most heavily. `turbo-cached-different-keys/100` lands at −19.1% and `turbo-cached-same-keys/100` at −8.1%. `turbo-uncached-parallel/10` and `/100` (−5.8%, −6.3%) reflect the `task_cache` shard fix reducing contention on the cache lookup path.

Where the workload itself dominates (`turbo-uncached/100`, `/1000`), `busy_task` is the entire budget and overhead is rounding error; flat is expected and correct. No significant regressions: the +% rows all have overlapping confidence intervals with their baselines.

vercel-site build results
Test plan
- `cargo test -p turbo-tasks-backend --lib`: 46 passed
- `cargo test -p turbo-tasks-backend --tests`: all integration tests passed
- `cargo check --workspace`: clean