CUDA Streams by MrNeRF · Pull Request #1302 · MrNeRF/LichtFeld-Studio

MrNeRF · 2026-06-13T21:31:10Z

No description provided.

waitForCUDAStream created and destroyed a cudaEvent per call — measurable churn on hot cross-stream paths. Lift DeferredFreeQueue's private event pool into an exported CudaEventPool singleton shared across DSOs, move waitForCUDAStream out-of-line onto it, and switch DeferredFreeQueue to the shared pool. Groundwork for multi-stream allocator safety. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Slab and bucket tiers recycled freed blocks with no stream tracking — correct only while everything runs on one stream. Free lists are now per-stream (same-stream reuse stays free by stream ordering); reuse on another stream records a pooled event on the owner stream and waits on the consumer (bridgeStreams), so cross-stream recycling is GPU-ordered with zero host syncs and no deferred VRAM retention. CudaMemoryPool routes frees from its allocation registry instead of the deleter-captured stream: record_stream() marks cross-stream uses, rehome_stream() moves a block's home, and free_routed() bridges every recorded use into the home stream before recycling. Bucket eviction now frees on the entry's tag stream (previously nullptr — unordered with the block's last use under multi-stream). Also fixes compute-sanitizer being silently disabled for every binary: gpu_config.hpp initialized CUDA at .so load via an eager inline global; GPUConfig::get() is now called lazily. initcheck remains impractical on lichtfeld_tests (hangs); gates are memcheck, racecheck, and the gated-stream pattern-verify tests, which fail deterministically without the event edges. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…homing set_stream TensorState::stream is now the home stream — where the tensor's pending writes are ordered — stored atomically (copyable wrapper) to remove cross-thread torn reads. set_stream() re-homes the allocation in the pool registry so frees follow the declared stream; the new record_stream() marks read-only cross-stream uses; sync_to_stream() bundles the event-edge wait with the record and replaces every waitForCUDAStream(exec, tensor.stream()) consume site. Streams referenced by pool memory must now be severed before cudaStreamDestroy: CudaMemoryPool::release_stream() syncs the stream, drops it from recorded uses, re-homes its live allocations to the legacy stream, and migrates cached free-list entries (slab → virgin, bucket retag). trim_cached_memory() does the same device-wide after its existing sync, so the per-test cleanup listener resets allocator stream state automatically. The stream tests destroyed streams while tensors homed on them were alive — valid under the old stream-blind allocator, a driver-level use-after-free now (SIGSEGV in cudaStreamWaitEvent, ~50% repro); they use the new sever helper. 10/10 clean sweeps, memcheck and racecheck clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A pinned block used by several streams (H2D upload on one, D2H readback on another) was only fenced against the single stream passed to deallocate — the other stream's pending transfer could still touch the block after reuse. Blocks now carry one pooled event per recorded using stream and are reusable only when all have completed. record_stream() registers extra uses (also reachable via Tensor::record_stream for pinned CPU tensors); release_stream() severs a stream before destruction and is invoked from CudaMemoryPool::release_stream so one call covers both allocators. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Zero references outside its own header; the live rasterizer arena is RasterizerMemoryArena in src/core/cuda/memory_arena. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…device sync begin_frame stalled the whole device with cudaDeviceSynchronize before every arena frame — once per training step and per viewport frame. Stream-aware frames now chain: end_frame records a completion event on the frame's stream, the next begin_frame waits on it GPU-side before the offset reset reuses memory. Frames without a stream (and every error path) break the chain, falling back to the device sync, so the legacy behavior is preserved until callers run under a current-stream guard. LFS_ARENA_LEGACY_SYNC=1 forces the old sync. Call sites pass getCurrentCUDAStream(): correct once the trainer/viewer guards land, nullptr (legacy sync) until then. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ent stream Per-camera image-loading streams were created with the blocking flag, implicitly synchronizing with the legacy default stream where all compute runs; they are now non-blocking and severed from the allocator before destruction. Every kernel launcher in src/training/kernels falls back to getCurrentCUDAStream() via resolve_stream() instead of the legacy stream — a no-op until a stream guard is installed, after which unparameterized launches follow the training stream instead of racing on stream 0. Known pre-existing failures (verified on the branch base, unrelated): MCMCTest.RemoveGaussiansSoftDeletesRows, MCMCRelocateOptimizerStateTest.ResetBothSourceAndDestinationRows. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

waitForCUDAStream treats a nullptr dependency as "nothing to wait for", but a tensor homed on the legacy stream can have pending work that a non-blocking execution stream would race. Use bridgeStreams, which records the edge for nullptr producers too. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Trainer::train() installs training_stream_ as the thread's current stream (CUDAStreamGuard) after a one-time device-sync bridge for init-time legacy work; everything the training thread touches — tensor ops, optimizer, kernel launchers via resolve_stream — inherits it. LFS_TRAIN_STREAM_LEGACY=1 rolls back to the legacy stream. Streams are nvtx-named (lfs.train) for nsys. The stream uses the default (blocking) flag deliberately: ~64 blocking cudaMemcpy readbacks in the tensor lib and the fastgs/ssim kernels (hardcoded to the legacy stream) rely on implicit legacy-stream ordering — a non-blocking stream reproducibly degraded training loss 2x (stale cross-stream reads). Blocking keeps those implicitly ordered at zero cost to the overlap targets: the loader decode and viewer render streams are non-blocking and bridge with explicit events. Porting the vendored fastgs kernels onto the training stream is follow-up work. launch_normalize_by_device_scalar drops its embedded stream sync — the median temp's home stream now equals the launch stream, so the pool free is stream-ordered after the kernel. Verified: 7k bicycle smoke at loss parity with two baselines (identical final splat count, 239 it/s unchanged); nsys shows tensor ops on lfs.train; stream gtests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Readers of the live model (viewport packs, metric renders) get a bracketed, GPU-ordered protocol: beginModelRead waits on params_ready_event_ — recorded on the training stream at end-of-step and at the end of each exclusive-locked refine scope, so reads see exactly step-i parameters — and endModelRead records into a 4-slot reader-done ring that waitForModelReaders drains into the training stream before the next step's in-place writes. The viewer's exported render-complete timeline plugs in via setViewerReleaseFence / publishViewerBorrow and is waited the same way. All edges are GPU-side; the only CPU lock is the leaf stream_sync_mutex_ (render_mutex_ → stream_sync_mutex_ order). computeCameraMetrics adopts the bracket. Verified: 1500-iter smoke at loss and throughput parity (328 vs 336 it/s baseline, within noise). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The hot GPU decode path (nvjpeg + format conversion) ran on the legacy stream and ended with cudaDeviceSynchronize, coupling image availability to whatever training kernels were in flight. The decode thread now installs a non-blocking decode stream as its current stream, passes it through the decode/encode calls, and load_image_from_memory_gpu syncs only that stream before handoff — images stay materialized-on-arrival for the trainer, but decode and training no longer serialize each other. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The every-10-iterations loss display called .item() — a device-wide sync stalling the whole pipeline. The sample now goes through a 4-slot pinned ring: cudaMemcpyAsync + event on the training stream at submit, polled and reported on later intervals, host-synced only when the ring backs up, at loop exit, and after OOM recovery. NaN/Inf detection lags by at most LOSS_RING * LOSS_SYNC_INTERVAL iterations — the run is unrecoverable either way. The MRNF densification error map normalized via mean().item() + host branch every iteration; launch_normalize_by_device_scalar gains a skip_below guard evaluated in-kernel, so the mean never leaves the GPU. Contended A/B at 3k iters: identical runtime; 7k MCMC and 1.5k MRNF smokes at loss parity. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Both multinomial samplers block-reduced the weights, copied partials D2H, summed on the CPU, and synced the stream — every refine iteration, draining the whole training pipeline mid-step. The inclusive cumsum the samplers already build contains the total in its last element, so the kernels now read prob_sum from cumsum[n-1] and handle the all-zero case in-kernel (zero outputs, matching the old memset branch). The separate reduction kernels and the host readback are gone. gtest: sampling frequencies match the weight distribution, gathered opacity/scale values are exact, zero-weight gaussians are never sampled, all-zero weights produce zero outputs. memcheck clean; 3k MCMC smoke with identical final splat count (183773) vs pre-change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

VksplatViewportRenderer owns a non-blocking render_stream_ (created in ensureInitialized, severed+destroyed in reset). render(), rerenderSelectionOverlay() and buildSelectionMask() install it as the thread's current stream, and renderVulkanFrame wraps the frame so render-state tensor ops follow it. All explicit stream picks — input packing, opacity copies, LOD page uploads, overlay staging, selection queries, upload-timeline signals — now run on the render stream with producer tensors bridged in via event edges instead of inheriting whatever stream the source tensor was homed on (which silently relied on legacy-stream FIFO ordering). Behavior-neutral: the training-mode forced syncs remain (now syncing just the render stream) until the release-fence lands. Verified with a 35s live train+viewport GUI session (clean frames, no startup deadlock, clean shutdown) plus packer gtests and headless smoke. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The render-complete timeline becomes an exportable Vulkan timeline imported into CUDA (renderCompleteFence/renderCompleteValue) so the trainer can enqueue "wait render_complete >= borrow" GPU-side before its next in-place writes — the reverse edge of the handshake. The raw VkSemaphore view is unchanged for all existing waits and readback tags. Two lifetime hazards that the CPU fence wait used to mask get explicit GPU-side fixes: zero-copy input storages bound to a frame are kept alive in a retire-list keyed by that frame's completion value (trainer topology reallocations can drop their references while the batch is in flight), and uploads into the persistent LOD page buffer wait on the last frame that read it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

During training the viewport kept itself safe with a forced cudaStreamSynchronize after input uploads plus a _mm_pause busy-wait on the Vulkan batch fence — CPU stalls on the UI thread every frame. Both are gone. renderVulkanFrame wires the handshake under the shared render lock: registers the render-complete fence with the trainer, waits the params-ready event on the render stream before any model read (forward edge; the upload-timeline signal is ordered after it, covering Vulkan's zero-copy reads), and a scope-exit publisher hands the frame's completion value to the trainer, which waits it GPU-side before the next step's in-place writes (reverse edge). Storage lifetime across topology reallocations is covered by the retire-list; the selection overlay extends the last bind's retirement to its own submit. The render stream now exists from construction so the first frame is covered, and the trainer drops the fence handle before a renderer reset destroys the import. Verified: full 7k bicycle training with live viewport completed in 30.1s (headless baseline 29.2s), loss 0.108 and splat count identical to baselines, no errors, clean shutdown. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The selection service's cross-stream copy created and destroyed two cudaEvents per call; it now uses the shared bridgeStreams edges and records the cross-stream write with the allocator. New gtest mirrors the trainer↔viewer protocol exactly — writer rewrites a 32MB buffer in place each iteration (params-ready record + reader-done wait), reader snapshots on a second non-blocking stream under the same bracket — and asserts 50+ snapshots each contain exactly one iteration's value. This is the cross-stream race oracle racecheck can't provide; memcheck clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…iewer Camera movement during --gut GUI training reproducibly killed the GPU (Xid 109 CTX SWITCH TIMEOUT → VK_ERROR_DEVICE_LOST at ~iter 2300, 2/2 runs). Root cause: with the CPU fence wait removed, the viewport releases its arena frame at submit while the batch is still executing on shared scratch. The trainer's borrow wait only runs at step top, so a trainer already mid-step acquires the arena, and its begin_frame fallback — cudaDeviceSynchronize — cannot see in-flight Vulkan work: it resets the offset and training kernels overwrite scratch the running sort/blend shaders read, which then loop on garbage ranges until the channel times out. The arena now carries the release: the viewport notes {render-complete timeline, completion value} before its frame ends (note_external_release), and the next begin_frame waits that value GPU-side on its stream — or via the legacy stream ahead of the streamless device-sync path. The selection-overlay re-render gets its own arena guard for the resident sort buffers it re-reads. Borrow publishing also moves to a per-submit renderer callback (before the arena guard releases), publishViewerBorrow becomes monotonic, and a re-registered fence resets the stale borrow value (a fresh timeline never reaches the old one — latent trainer hang). Exception paths bounded-wait the timeline instead of publishing possibly-unsignaled values. Verified: --gut GUI 7k completed under sustained scripted camera churn (10 pause/resume cycles, loss 0.127, no Xid — previously dead at ~25s), a second 60s churn round clean, fastgs GUI 7k at parity (32.6s, loss 0.115), arena gtests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

item<T>() and TensorRowProxy::item_as<T>() did a blocking cudaMemcpy with no synchronization — that only orders against the legacy stream, so values produced on a non-blocking stream (the upcoming metrics stream; any future non-legacy reader) would read stale. Sync the tensor's home stream first; nullptr homes keep relying on legacy ordering as before. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

All 18 SSIM/fused-L1 kernel launches went to the legacy default stream — correct only via the training stream's blocking-flag implicit ordering, and wrong for any non-blocking caller (the upcoming metrics stream). Every launch now targets getCurrentCUDAStream() inline; the entry points need no signature changes since every caller wants its thread's stream, and the reduction sub-launches already resolve the same way. Gates: 42 SSIM/loss gtests green; 3x GUI train+view 7k sessions at loss parity (0.083-0.130) with zero errors. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

try_begin_frame failed instantly when the trainer held the arena frame, so under training contention the viewport often skipped its frame entirely. Add try_begin_frame_for(timeout): the pending-render flag already blocks the trainer from STARTING a new frame, so a bounded wait is normally one iteration. The viewport uses a 15ms timeout — long enough to win the frame in the common case, short enough to bail instead of deadlocking on refining iterations where the trainer holds the frame while blocked on the exclusive render lock the viewport's shared lock excludes (the lock→frame inversion the metrics work will close properly). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

forward_raw and forward() gain a stream parameter threaded into all four kernels (preprocess, create_instances, extract_instance_ranges, blend), the two CUB calls (InclusiveSum, RadixSort), the status/range memsets, and the n_instances sizing readback (now stream-scoped async-copy + stream sync instead of a legacy-stream blocking memcpy). fast_rasterizer derives the stream as getCurrentCUDAStream() (tensor fallback) and passes it; ForwardContext stores it so release paths (sorted indices free, arena end_frame) use the context's stream, not the caller's current one. The static memset side-stream + event is deleted: its whole purpose was overlapping a ~64KB instance-ranges memset, and it relied on legacy-stream implicit ordering with the previous frame's reads of the same arena memory — a hazard once the kernels move off the legacy stream. The memset now runs inline on the frame's stream. StreamOrderedDeviceBuffer carries its stream through allocate/reset/dtor so the sort scratch is freed stream-ordered, not on a nullptr stream unordered with the sort that used it. Backward still runs on the legacy stream (next commit) — safe meanwhile via the training stream's blocking-flag implicit ordering. Gates: 3x GUI train+view 7k at loss parity (0.10-0.14), zero errors; FastGSGradientTest.Numerical_Means is a pre-existing full-suite flake (fails identically at the branch base). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

backward() and backward_raw() take a stream, threaded into both kernels (blend_backward, preprocess_backward); backward_raw uses forward_ctx.stream so the backward shares the same stream and arena frame the forward chained. The six grad-helper zeroing cudaMemsets and the grad_w2c clear become cudaMemsetAsync on that stream. With forward (prev commit) and backward both off the legacy stream, fastgs is fully stream-resident. nsys confirms all 800 blend kernels in a 400-iter run land on lfs.train, none on the legacy stream. 3x GUI train+view 7k at loss parity (0.118-0.130), zero errors. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Mirrors the fastgs forward port for the edge variant (used by the MRNF and igs+ strategies): edge_forward/edge_forward_raw take a stream threaded into all four kernels (preprocess, create_instances, extract_instance_ranges, edge_blend) and the two CUB calls; the static memset side-stream + event is deleted (inline cudaMemsetAsync on the frame stream); the n_instances readback is stream-scoped. The three arena end_frame calls (incl. error paths) pass the stream so the arena chain stays intact instead of falling back to a device sync. Gates: GUI MRNF and igs+ 1500-iter sessions converge at parity, zero errors. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The gsplat backward's v_colors scratch used a synchronous cudaMalloc/cudaFree (ordered only against the legacy stream); now cudaMallocAsync/cudaFreeAsync on the backward stream. Forward and backward derive their stream getCurrentCUDAStream()-first (tensor fallback) instead of straight from means.stream(), so a metrics-thread gut render lands its kernels, arena frame, and consumers on one stream. GsplatRasterizeContext carries that stream; the inference-only release (gsplat_rasterizer.hpp) and the training backward free the isect/flatten buffers and end the arena frame on it — the inference path previously used a streamless end_frame that broke the arena chain and forced a device sync on the UI thread every render. Intersect.cpp's stream-scoped sizing sync is left as-is (control-flow readback). Gates: gsplat gtests green; 3x GUI train+view --gut 7k at parity (loss 0.128-0.165, ~75s), zero errors. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…deadlock computeCameraMetrics (PSNR/SSIM panel + camera-click) now installs a non-blocking lfs.metrics stream guard and brackets its render with the beginModelRead/endModelRead handshake on that stream, so the metric render's kernels/tensor ops overlap training and item() readbacks (now home-stream-aware, Port-1) stay correct off the legacy stream. Deadlock fix: the metric render holds render_mutex_ shared then needs the arena frame; a refining trainer holds the arena frame then needs render_mutex_ exclusive — a cycle when begin_frame waits forever. Instead of restructuring the trainer hot-loop lock order (high risk), RasterizerMemoryArena::ScopedBeginFrameTimeout caps begin_frame on the metrics thread only, so it bails (skipping that metric sample) rather than hanging; training threads keep blocking acquisition. A new gtest reproduces the opposite-order contention and asserts deadlock-freedom + bounded bail. UAF fix: waitForModelReaders() is now also drained inside the refining exclusive-lock scopes, so a metric read that finished CPU-side between the step-top drain and the realloc has its GPU work ordered before grow/prune frees model tensors. ppisp controller-pool predict copies move off the legacy stream (getCurrentCUDAStream) — they feed conv/ linear ops that run on the metrics stream under appearance correction. Validation: contention gtest + memcheck clean; 5 GUI train+view 7k sessions (3 fastgs, 2 gut) at loss parity, zero errors, clean teardown. The metrics-panel-during-training integration path is best exercised live (split-view GT metrics / camera click); its mechanism is gtested. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The per-step quantized Adam update already ran on the trainer's stream, but the four state-maintenance helpers (quantize_adam_moments[_swizzled], zero_rows / zero_quantized_rows) launched on the legacy default stream. With quantized-Adam + MCMC this fires every densification interval via relocate_params_at_indices_gpu, serializing the relocation against the training stream and, worse, stalling the device with bare cudaMalloc/ cudaFree for the index buffer. Thread an explicit stream into the four adam_api helpers and resolve it at the call sites from getCurrentCUDAStream(), bridging the touched state tensors with waitForCUDAStream and re-homing them with set_stream after. The relocation index buffers move to cudaMallocAsync/MemcpyAsync/FreeAsync on the same stream, removing the device-wide alloc/free syncs. Densification stays bit-identical (1,293,710 splats at 7k, loss in band); memcheck clean on the relocate/zero/adam kernels. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PyTensor::dlpack accepted a stream argument but ignored it: it only did a full cudaDeviceSynchronize when stream was None, and when a consumer passed a real stream it performed no synchronization at all — a silent race against whatever stream the tensor was last written on (e.g. lfs.train). Export now implements the DLPack stream-exchange protocol: None syncs just the tensor's home stream, the -1 sentinel skips sync, and a consumer stream (legacy/per-thread/explicit) is ordered via the pooled bridgeStreams edge instead of stalling the whole device. Import passes our current stream into the producer's __dlpack__(stream=) so the producer orders its work onto the stream we will consume on, falling back to the no-arg call for producers without the kwarg. Adds a deterministic gtest reproducing the export edge (gated producer write must be visible to the consumer readback) and a torch-free Python self-roundtrip across the None/-1/legacy branches. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The test predates the quantized Adam state (#1261) and read exp_avg as float32, but the optimizer stores moments as uint8 with a per-row float scale. Reinterpreting the quantized zero-point bytes as float produced a NaN momentum sum, failing the pre-relocation assertion (independent of the relocation logic under test). Check exp_avg_scale instead: a row that accumulated momentum has a non-zero scale, and relocation zeros the scale for the touched rows (the documented zero_quantized_rows reset), which forces their dequantized momentum to zero. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

This PR introduces stream-aware CUDA execution and memory-management improvements across the tensor core, training kernels, and the Vulkan viewer/trainer interop, and adds targeted stress/regression tests for multi-stream correctness.

Changes:

Add pooled CUDA event utilities and stream-aware tensor “home stream” tracking (sync/record/rehome) to make cross-stream tensor usage and allocator reuse safe without global synchronization.
Make multiple training/rasterization/optimizer kernels and arenas consistently honor the caller/current CUDA stream (and avoid legacy-default-stream implicit ordering).
Add new CUDA stream/event/multi-stream correctness tests and wire them into the test build.

Reviewed changes

Copilot reviewed 79 out of 79 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/test_tensor_stream.cpp	Ensure test streams are released from the tensor allocator before destruction.
tests/test_tensor_multistream.cpp	Add multi-stream allocator ordering/steal/reuse/handshake stress tests.
tests/test_mcmc_relocate_optimizer_state_bug.cpp	Update optimizer-state test to validate quantized moment scale reset semantics.
tests/test_mcmc_multinomial.cpp	Add multinomial sampling+gather regression tests (including all-zero weights).
tests/test_cuda_event_pool.cpp	Add CUDA event pool reuse/concurrency and cross-stream wait tests.
tests/test_arena_metrics_contention.cpp	Add bounded begin_frame contention test to prevent deadlocks.
tests/CMakeLists.txt	Register newly added tests in the test target.
src/visualizer/selection/selection_service.cpp	Replace ad-hoc event bridging with pooled event stream-bridging + allocator stream recording.
src/visualizer/rendering/vksplat_viewport_renderer.hpp	Expose render CUDA stream and viewer-release fence/value for trainer↔viewer handshake.
src/visualizer/rendering/rendering_manager_vulkan.cpp	Install per-frame render-stream guards and trainer↔viewer GPU handshake publishing.
src/visualizer/rendering/lod_upload_engine.cpp	Release long-lived CUDA stream from tensor pool before destruction.
src/training/trainer.hpp	Add model-access mutex and GPU-side model-read handshake API/state.
src/training/rasterization/gsplat/Rasterization.cpp	Use stream-ordered cudaMallocAsync/cudaFreeAsync for scratch when available.
src/training/rasterization/gsplat_rasterizer.hpp	Persist the forward stream in the context so backward/end_frame can match it.
src/training/rasterization/gsplat_rasterizer.cpp	Begin/end arena frames and launches in a current-stream-first manner; store ctx stream.
src/training/rasterization/fastgs/rasterization/src/rasterization_api.cu	Thread stream through arena frames, forward/backward, and stream-ordered frees/memsets.
src/training/rasterization/fastgs/rasterization/src/forward.cu	Make forward explicitly stream-driven; remove side-stream memset trick; stream-order scratch alloc/free.
src/training/rasterization/fastgs/rasterization/src/backward.cu	Thread stream into kernel launches for backward path.
src/training/rasterization/fastgs/rasterization/include/rasterization_api.h	Extend forward context and forward_raw API to carry/accept CUDA stream.
src/training/rasterization/fastgs/rasterization/include/forward.h	Add CUDA stream parameter to forward declaration.
src/training/rasterization/fastgs/rasterization/include/backward.h	Add CUDA stream parameter to backward declaration.
src/training/rasterization/fastgs/optimizer/src/adam_api.cu	Thread stream into quantize/zero-row kernels to avoid legacy-stream launches.
src/training/rasterization/fastgs/optimizer/include/adam_api.h	Extend optimizer kernel APIs with optional CUDA stream parameters.
src/training/rasterization/fast_rasterizer.cpp	Select raster stream from current guard first, tensor stream as fallback.
src/training/rasterization/edge_rasterizer.cpp	End arena frame on current stream for stream-aware chaining.
src/training/rasterization/edge_compute/rasterization/src/rasterization_api.cu	Add stream parameter; begin/end arena frames and forward launches on that stream.
src/training/rasterization/edge_compute/rasterization/src/forward.cu	Run edge forward kernels/memsets on an explicit stream.
src/training/rasterization/edge_compute/rasterization/include/forward.h	Add CUDA stream parameter (defaulted) to edge forward declaration.
src/training/rasterization/edge_compute/rasterization/include/edge_rasterization_api.h	Add CUDA stream parameter (defaulted) to public edge raster API.
src/training/optimizer/adam_optimizer.cpp	Make optimizer state init/zero/relocate stream-aware and rehome state tensors.
src/training/kernels/ssim.cu	Launch SSIM kernels on the current CUDA stream (guard-aware).
src/training/kernels/ssim_reduction.cu	Resolve/normalize stream parameters via shared resolve_stream helper.
src/training/kernels/regularization.cu	Resolve stream parameters consistently for regularization kernels.
src/training/kernels/pruning_kernels.cu	Resolve stream parameters consistently for pruning kernels.
src/training/kernels/ppisp.cu	Resolve stream parameters consistently for PPISP kernels.
src/training/kernels/mrnf_kernels.cu	Resolve stream parameters consistently for MRNF kernels and thrust usage.
src/training/kernels/mcmc_kernels.cu	Resolve stream parameters; remove host reduction; use scan total for multinomial sampling.
src/training/kernels/l1_loss.cu	Resolve stream parameters consistently for L1-loss kernels.
src/training/kernels/kernel_stream.hpp	Add shared resolve_stream helpers that default to the thread’s current CUDA stream.
src/training/kernels/image_kernels.hpp	Extend normalize-by-device-scalar API with skip threshold parameter.
src/training/kernels/image_kernels.cu	Make image kernels stream-resolved and remove host sync from normalize-by-scalar.
src/training/kernels/grad_alpha.cu	Resolve stream parameters consistently for grad/compose/permute/resize helpers.
src/training/kernels/depth_loss.cu	Resolve stream parameters consistently for depth loss kernels.
src/training/kernels/densification_kernels.cu	Resolve stream parameters consistently for densification kernels.
src/training/kernels/camera_loss_heatmap.cu	Resolve stream parameters consistently for loss heatmap kernel.
src/training/kernels/bilateral_grid_tv.cu	Resolve stream parameters consistently for bilateral TV kernels.
src/training/kernels/bilateral_grid_forward.cu	Resolve stream parameters consistently for bilateral forward kernels.
src/training/kernels/bilateral_grid_backward.cu	Resolve stream parameters consistently for bilateral backward/adam/accumulate kernels.
src/training/components/ppisp_controller_pool.cu	Ensure tensor copies/host uploads occur on the current guarded stream.
src/rendering/cuda_vulkan_interop.hpp	Expose raw CUDA external semaphore handle for consumers that enqueue waits.
src/python/lfs/py_tensor.cpp	Implement DLPack stream handshake/bridging; home imported tensors on the consumer stream.
src/io/pipelined_image_loader.cpp	Add decode CUDA stream, guard decode thread work onto it, and release it from pool on shutdown.
src/io/nvcodec_image_loader.cpp	Replace device-wide sync with stream sync when a caller stream is provided.
src/io/include/io/pipelined_image_loader.hpp	Store decode CUDA stream member for GPU decode path.
src/core/tensor/tensor.cpp	Add Tensor stream rehome/record/sync methods integrated with allocators and stream bridging.
src/core/tensor/tensor_unified_ops.cpp	Replace manual waitForCUDAStream calls with Tensor::sync_to_stream prologue.
src/core/tensor/tensor_masking_ops.cpp	Use Tensor::sync_to_stream for multi-tensor stream alignment before CUDA masking ops.
src/core/tensor/tensor_broadcast.cpp	Use Tensor::sync_to_stream to align broadcast execution stream.
src/core/tensor/pinned_memory_allocator.cpp	Track per-block multi-stream usage via pooled events; add record_stream/release_stream.
src/core/tensor/lazy_executor.cpp	Use Tensor::sync_to_stream to align pointwise execution stream.
src/core/tensor/internal/tensor_impl.hpp	Make tensor home stream atomic; add host readback synchronization on home stream.
src/core/tensor/internal/tensor_expr_impl.hpp	Use Tensor::sync_to_stream when preparing execution streams for expressions.
src/core/tensor/internal/size_bucketed_pool.hpp	Make bucket cache stream-tagged; bridge on cross-stream reuse; free on tag stream.
src/core/tensor/internal/memory_pool.hpp	Add allocator-wide stream tracking (record/rehome/release_stream) and routed frees.
src/core/tensor/internal/gpu_slab_allocator.hpp	Add per-stream free lists with GPU-side “steal” bridging for cross-stream reuse.
src/core/tensor/internal/gpu_config.hpp	Remove global inline GPUConfig instance usage; call GPUConfig::get() in macros.
src/core/tensor/internal/gpu_arena_allocator.hpp	Remove obsolete GPU arena allocator header.
src/core/tensor/internal/deferred_free_queue.hpp	Switch deferred-free event handling to shared CudaEventPool.
src/core/tensor/internal/cuda_stream_context.hpp	Move waitForCUDAStream implementation out-of-header; document pooled-event behavior.
src/core/tensor/internal/cuda_event_pool.hpp	Add global pooled cudaEvent management + bridgeStreams API.
src/core/tensor/cuda_stream_context.cpp	Implement waitForCUDAStream using pooled events (host-sync fallback).
src/core/tensor/cuda_event_pool.cpp	Implement pooled event acquire/release and stream bridging helper.
src/core/include/core/pinned_memory_allocator.hpp	Update pinned allocator API/docs for multi-stream tracking and stream release.
src/core/cuda/memory_arena.hpp	Add stream-aware begin/end frame chaining, bounded acquisition, and Vulkan external release fencing.
src/core/cuda/memory_arena.cu	Implement stream-aware chaining with completion events + bounded waits + external release drain.
src/core/CMakeLists.txt	Add cuda_event_pool.cpp to core tensor sources.
src/core/camera.cpp	Use non-blocking camera stream and release it from tensor pool before destruction.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

        // Begin arena frame for memory allocation
        auto& arena = core::GlobalArenaManager::instance().get_arena();
-        uint64_t frame_id = arena.begin_frame();
+        uint64_t frame_id = arena.begin_frame(core::getCurrentCUDAStream());
        auto arena_allocator = arena.get_allocator(frame_id);


+            // End on the same stream begin_frame used (same guard → same value),
+            // not the streamless device-sync path, so the arena frame chain stays
+            // intact for the next frame instead of falling back to a full sync.
+            arena.end_frame(frame_id, core::getCurrentCUDAStream());


        if (nvcodec_available) {
+            cudaStreamCreateWithFlags(&decode_stream_, cudaStreamNonBlocking);
            gpu_decode_thread_ = std::thread([this] { gpu_batch_decode_thread_func(); });
        }


+        // Non-blocking so image loading doesn't serialize with the legacy stream
+        cudaStreamCreateWithFlags(&_stream, cudaStreamNonBlocking);


+        // Non-blocking so image loading doesn't serialize with the legacy stream
+        cudaStreamCreateWithFlags(&_stream, cudaStreamNonBlocking);


MrNeRF and others added 30 commits June 10, 2026 15:59

Delete dead GPUArenaAllocator

5703a06

Zero references outside its own header; the live rasterizer arena is RasterizerMemoryArena in src/core/cuda/memory_arena. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge branch 'master' into stream_impl

d8fa847

MrNeRF added 6 commits June 11, 2026 14:10

fixes

51f9b92

fix

5f90c20

fix

61fab44

fixes

2886340

sync master

4daa62b

Merge remote-tracking branch 'origin/master' into stream_impl

54c084c

Copilot AI review requested due to automatic review settings June 13, 2026 21:31

Copilot started reviewing on behalf of MrNeRF June 13, 2026 21:31 View session

Copilot AI reviewed Jun 13, 2026

View reviewed changes

fix gt image flickering

c813a8d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA Streams#1302

CUDA Streams#1302
MrNeRF wants to merge 37 commits into
masterfrom
stream_impl

MrNeRF commented Jun 13, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		// Non-blocking so image loading doesn't serialize with the legacy stream
		cudaStreamCreateWithFlags(&_stream, cudaStreamNonBlocking);

Conversation

MrNeRF commented Jun 13, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants