Skip to content

CUDA Streams#1302

Open
MrNeRF wants to merge 37 commits into
masterfrom
stream_impl
Open

CUDA Streams#1302
MrNeRF wants to merge 37 commits into
masterfrom
stream_impl

Conversation

@MrNeRF

@MrNeRF MrNeRF commented Jun 13, 2026

Copy link
Copy Markdown
Owner

No description provided.

MrNeRF and others added 30 commits June 10, 2026 15:59
waitForCUDAStream created and destroyed a cudaEvent per call — measurable
churn on hot cross-stream paths. Lift DeferredFreeQueue's private event
pool into an exported CudaEventPool singleton shared across DSOs, move
waitForCUDAStream out-of-line onto it, and switch DeferredFreeQueue to
the shared pool. Groundwork for multi-stream allocator safety.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Slab and bucket tiers recycled freed blocks with no stream tracking —
correct only while everything runs on one stream. Free lists are now
per-stream (same-stream reuse stays free by stream ordering); reuse on
another stream records a pooled event on the owner stream and waits on
the consumer (bridgeStreams), so cross-stream recycling is GPU-ordered
with zero host syncs and no deferred VRAM retention.

CudaMemoryPool routes frees from its allocation registry instead of the
deleter-captured stream: record_stream() marks cross-stream uses,
rehome_stream() moves a block's home, and free_routed() bridges every
recorded use into the home stream before recycling. Bucket eviction now
frees on the entry's tag stream (previously nullptr — unordered with
the block's last use under multi-stream).

Also fixes compute-sanitizer being silently disabled for every binary:
gpu_config.hpp initialized CUDA at .so load via an eager inline global;
GPUConfig::get() is now called lazily. initcheck remains impractical on
lichtfeld_tests (hangs); gates are memcheck, racecheck, and the
gated-stream pattern-verify tests, which fail deterministically without
the event edges.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…homing set_stream

TensorState::stream is now the home stream — where the tensor's pending
writes are ordered — stored atomically (copyable wrapper) to remove
cross-thread torn reads. set_stream() re-homes the allocation in the
pool registry so frees follow the declared stream; the new
record_stream() marks read-only cross-stream uses; sync_to_stream()
bundles the event-edge wait with the record and replaces every
waitForCUDAStream(exec, tensor.stream()) consume site.

Streams referenced by pool memory must now be severed before
cudaStreamDestroy: CudaMemoryPool::release_stream() syncs the stream,
drops it from recorded uses, re-homes its live allocations to the
legacy stream, and migrates cached free-list entries (slab → virgin,
bucket retag). trim_cached_memory() does the same device-wide after its
existing sync, so the per-test cleanup listener resets allocator stream
state automatically. The stream tests destroyed streams while tensors
homed on them were alive — valid under the old stream-blind allocator,
a driver-level use-after-free now (SIGSEGV in cudaStreamWaitEvent, ~50%
repro); they use the new sever helper. 10/10 clean sweeps, memcheck and
racecheck clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A pinned block used by several streams (H2D upload on one, D2H readback
on another) was only fenced against the single stream passed to
deallocate — the other stream's pending transfer could still touch the
block after reuse. Blocks now carry one pooled event per recorded using
stream and are reusable only when all have completed. record_stream()
registers extra uses (also reachable via Tensor::record_stream for
pinned CPU tensors); release_stream() severs a stream before
destruction and is invoked from CudaMemoryPool::release_stream so one
call covers both allocators.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Zero references outside its own header; the live rasterizer arena is
RasterizerMemoryArena in src/core/cuda/memory_arena.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…device sync

begin_frame stalled the whole device with cudaDeviceSynchronize before
every arena frame — once per training step and per viewport frame.
Stream-aware frames now chain: end_frame records a completion event on
the frame's stream, the next begin_frame waits on it GPU-side before
the offset reset reuses memory. Frames without a stream (and every
error path) break the chain, falling back to the device sync, so the
legacy behavior is preserved until callers run under a current-stream
guard. LFS_ARENA_LEGACY_SYNC=1 forces the old sync.

Call sites pass getCurrentCUDAStream(): correct once the trainer/viewer
guards land, nullptr (legacy sync) until then.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ent stream

Per-camera image-loading streams were created with the blocking flag,
implicitly synchronizing with the legacy default stream where all
compute runs; they are now non-blocking and severed from the allocator
before destruction. Every kernel launcher in src/training/kernels falls
back to getCurrentCUDAStream() via resolve_stream() instead of the
legacy stream — a no-op until a stream guard is installed, after which
unparameterized launches follow the training stream instead of racing
on stream 0.

Known pre-existing failures (verified on the branch base, unrelated):
MCMCTest.RemoveGaussiansSoftDeletesRows,
MCMCRelocateOptimizerStateTest.ResetBothSourceAndDestinationRows.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
waitForCUDAStream treats a nullptr dependency as "nothing to wait for",
but a tensor homed on the legacy stream can have pending work that a
non-blocking execution stream would race. Use bridgeStreams, which
records the edge for nullptr producers too.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Trainer::train() installs training_stream_ as the thread's current
stream (CUDAStreamGuard) after a one-time device-sync bridge for
init-time legacy work; everything the training thread touches — tensor
ops, optimizer, kernel launchers via resolve_stream — inherits it.
LFS_TRAIN_STREAM_LEGACY=1 rolls back to the legacy stream. Streams are
nvtx-named (lfs.train) for nsys.

The stream uses the default (blocking) flag deliberately: ~64 blocking
cudaMemcpy readbacks in the tensor lib and the fastgs/ssim kernels
(hardcoded to the legacy stream) rely on implicit legacy-stream
ordering — a non-blocking stream reproducibly degraded training loss 2x
(stale cross-stream reads). Blocking keeps those implicitly ordered at
zero cost to the overlap targets: the loader decode and viewer render
streams are non-blocking and bridge with explicit events. Porting the
vendored fastgs kernels onto the training stream is follow-up work.

launch_normalize_by_device_scalar drops its embedded stream sync — the
median temp's home stream now equals the launch stream, so the pool
free is stream-ordered after the kernel.

Verified: 7k bicycle smoke at loss parity with two baselines (identical
final splat count, 239 it/s unchanged); nsys shows tensor ops on
lfs.train; stream gtests green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Readers of the live model (viewport packs, metric renders) get a
bracketed, GPU-ordered protocol: beginModelRead waits on
params_ready_event_ — recorded on the training stream at end-of-step
and at the end of each exclusive-locked refine scope, so reads see
exactly step-i parameters — and endModelRead records into a 4-slot
reader-done ring that waitForModelReaders drains into the training
stream before the next step's in-place writes. The viewer's exported
render-complete timeline plugs in via setViewerReleaseFence /
publishViewerBorrow and is waited the same way. All edges are GPU-side;
the only CPU lock is the leaf stream_sync_mutex_ (render_mutex_ →
stream_sync_mutex_ order). computeCameraMetrics adopts the bracket.

Verified: 1500-iter smoke at loss and throughput parity (328 vs
336 it/s baseline, within noise).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The hot GPU decode path (nvjpeg + format conversion) ran on the legacy
stream and ended with cudaDeviceSynchronize, coupling image
availability to whatever training kernels were in flight. The decode
thread now installs a non-blocking decode stream as its current stream,
passes it through the decode/encode calls, and
load_image_from_memory_gpu syncs only that stream before handoff —
images stay materialized-on-arrival for the trainer, but decode and
training no longer serialize each other.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The every-10-iterations loss display called .item() — a device-wide
sync stalling the whole pipeline. The sample now goes through a 4-slot
pinned ring: cudaMemcpyAsync + event on the training stream at submit,
polled and reported on later intervals, host-synced only when the ring
backs up, at loop exit, and after OOM recovery. NaN/Inf detection lags
by at most LOSS_RING * LOSS_SYNC_INTERVAL iterations — the run is
unrecoverable either way.

The MRNF densification error map normalized via mean().item() + host
branch every iteration; launch_normalize_by_device_scalar gains a
skip_below guard evaluated in-kernel, so the mean never leaves the GPU.

Contended A/B at 3k iters: identical runtime; 7k MCMC and 1.5k MRNF
smokes at loss parity.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Both multinomial samplers block-reduced the weights, copied partials
D2H, summed on the CPU, and synced the stream — every refine iteration,
draining the whole training pipeline mid-step. The inclusive cumsum the
samplers already build contains the total in its last element, so the
kernels now read prob_sum from cumsum[n-1] and handle the all-zero case
in-kernel (zero outputs, matching the old memset branch). The separate
reduction kernels and the host readback are gone.

gtest: sampling frequencies match the weight distribution, gathered
opacity/scale values are exact, zero-weight gaussians are never
sampled, all-zero weights produce zero outputs. memcheck clean; 3k
MCMC smoke with identical final splat count (183773) vs pre-change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
VksplatViewportRenderer owns a non-blocking render_stream_ (created in
ensureInitialized, severed+destroyed in reset). render(),
rerenderSelectionOverlay() and buildSelectionMask() install it as the
thread's current stream, and renderVulkanFrame wraps the frame so
render-state tensor ops follow it. All explicit stream picks — input
packing, opacity copies, LOD page uploads, overlay staging, selection
queries, upload-timeline signals — now run on the render stream with
producer tensors bridged in via event edges instead of inheriting
whatever stream the source tensor was homed on (which silently relied
on legacy-stream FIFO ordering).

Behavior-neutral: the training-mode forced syncs remain (now syncing
just the render stream) until the release-fence lands. Verified with a
35s live train+viewport GUI session (clean frames, no startup deadlock,
clean shutdown) plus packer gtests and headless smoke.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The render-complete timeline becomes an exportable Vulkan timeline
imported into CUDA (renderCompleteFence/renderCompleteValue) so the
trainer can enqueue "wait render_complete >= borrow" GPU-side before
its next in-place writes — the reverse edge of the handshake. The raw
VkSemaphore view is unchanged for all existing waits and readback tags.

Two lifetime hazards that the CPU fence wait used to mask get explicit
GPU-side fixes: zero-copy input storages bound to a frame are kept
alive in a retire-list keyed by that frame's completion value (trainer
topology reallocations can drop their references while the batch is in
flight), and uploads into the persistent LOD page buffer wait on the
last frame that read it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
During training the viewport kept itself safe with a forced
cudaStreamSynchronize after input uploads plus a _mm_pause busy-wait on
the Vulkan batch fence — CPU stalls on the UI thread every frame. Both
are gone. renderVulkanFrame wires the handshake under the shared render
lock: registers the render-complete fence with the trainer, waits the
params-ready event on the render stream before any model read (forward
edge; the upload-timeline signal is ordered after it, covering Vulkan's
zero-copy reads), and a scope-exit publisher hands the frame's
completion value to the trainer, which waits it GPU-side before the
next step's in-place writes (reverse edge). Storage lifetime across
topology reallocations is covered by the retire-list; the selection
overlay extends the last bind's retirement to its own submit. The
render stream now exists from construction so the first frame is
covered, and the trainer drops the fence handle before a renderer
reset destroys the import.

Verified: full 7k bicycle training with live viewport completed in
30.1s (headless baseline 29.2s), loss 0.108 and splat count identical
to baselines, no errors, clean shutdown.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The selection service's cross-stream copy created and destroyed two
cudaEvents per call; it now uses the shared bridgeStreams edges and
records the cross-stream write with the allocator.

New gtest mirrors the trainer↔viewer protocol exactly — writer rewrites
a 32MB buffer in place each iteration (params-ready record + reader-done
wait), reader snapshots on a second non-blocking stream under the same
bracket — and asserts 50+ snapshots each contain exactly one iteration's
value. This is the cross-stream race oracle racecheck can't provide;
memcheck clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…iewer

Camera movement during --gut GUI training reproducibly killed the GPU
(Xid 109 CTX SWITCH TIMEOUT → VK_ERROR_DEVICE_LOST at ~iter 2300, 2/2
runs). Root cause: with the CPU fence wait removed, the viewport
releases its arena frame at submit while the batch is still executing
on shared scratch. The trainer's borrow wait only runs at step top, so
a trainer already mid-step acquires the arena, and its begin_frame
fallback — cudaDeviceSynchronize — cannot see in-flight Vulkan work: it
resets the offset and training kernels overwrite scratch the running
sort/blend shaders read, which then loop on garbage ranges until the
channel times out.

The arena now carries the release: the viewport notes
{render-complete timeline, completion value} before its frame ends
(note_external_release), and the next begin_frame waits that value
GPU-side on its stream — or via the legacy stream ahead of the
streamless device-sync path. The selection-overlay re-render gets its
own arena guard for the resident sort buffers it re-reads. Borrow
publishing also moves to a per-submit renderer callback (before the
arena guard releases), publishViewerBorrow becomes monotonic, and a
re-registered fence resets the stale borrow value (a fresh timeline
never reaches the old one — latent trainer hang). Exception paths
bounded-wait the timeline instead of publishing possibly-unsignaled
values.

Verified: --gut GUI 7k completed under sustained scripted camera churn
(10 pause/resume cycles, loss 0.127, no Xid — previously dead at ~25s),
a second 60s churn round clean, fastgs GUI 7k at parity (32.6s, loss
0.115), arena gtests green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
item<T>() and TensorRowProxy::item_as<T>() did a blocking cudaMemcpy
with no synchronization — that only orders against the legacy stream,
so values produced on a non-blocking stream (the upcoming metrics
stream; any future non-legacy reader) would read stale. Sync the
tensor's home stream first; nullptr homes keep relying on legacy
ordering as before.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
All 18 SSIM/fused-L1 kernel launches went to the legacy default stream
— correct only via the training stream's blocking-flag implicit
ordering, and wrong for any non-blocking caller (the upcoming metrics
stream). Every launch now targets getCurrentCUDAStream() inline; the
entry points need no signature changes since every caller wants its
thread's stream, and the reduction sub-launches already resolve the
same way.

Gates: 42 SSIM/loss gtests green; 3x GUI train+view 7k sessions at loss
parity (0.083-0.130) with zero errors.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
try_begin_frame failed instantly when the trainer held the arena frame,
so under training contention the viewport often skipped its frame
entirely. Add try_begin_frame_for(timeout): the pending-render flag
already blocks the trainer from STARTING a new frame, so a bounded wait
is normally one iteration. The viewport uses a 15ms timeout — long
enough to win the frame in the common case, short enough to bail
instead of deadlocking on refining iterations where the trainer holds
the frame while blocked on the exclusive render lock the viewport's
shared lock excludes (the lock→frame inversion the metrics work will
close properly).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
forward_raw and forward() gain a stream parameter threaded into all
four kernels (preprocess, create_instances, extract_instance_ranges,
blend), the two CUB calls (InclusiveSum, RadixSort), the status/range
memsets, and the n_instances sizing readback (now stream-scoped
async-copy + stream sync instead of a legacy-stream blocking memcpy).
fast_rasterizer derives the stream as getCurrentCUDAStream() (tensor
fallback) and passes it; ForwardContext stores it so release paths
(sorted indices free, arena end_frame) use the context's stream, not
the caller's current one.

The static memset side-stream + event is deleted: its whole purpose
was overlapping a ~64KB instance-ranges memset, and it relied on
legacy-stream implicit ordering with the previous frame's reads of the
same arena memory — a hazard once the kernels move off the legacy
stream. The memset now runs inline on the frame's stream.

StreamOrderedDeviceBuffer carries its stream through allocate/reset/dtor
so the sort scratch is freed stream-ordered, not on a nullptr stream
unordered with the sort that used it.

Backward still runs on the legacy stream (next commit) — safe meanwhile
via the training stream's blocking-flag implicit ordering. Gates: 3x
GUI train+view 7k at loss parity (0.10-0.14), zero errors;
FastGSGradientTest.Numerical_Means is a pre-existing full-suite flake
(fails identically at the branch base).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
backward() and backward_raw() take a stream, threaded into both kernels
(blend_backward, preprocess_backward); backward_raw uses
forward_ctx.stream so the backward shares the same stream and arena
frame the forward chained. The six grad-helper zeroing cudaMemsets and
the grad_w2c clear become cudaMemsetAsync on that stream.

With forward (prev commit) and backward both off the legacy stream,
fastgs is fully stream-resident. nsys confirms all 800 blend kernels in
a 400-iter run land on lfs.train, none on the legacy stream. 3x GUI
train+view 7k at loss parity (0.118-0.130), zero errors.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mirrors the fastgs forward port for the edge variant (used by the MRNF
and igs+ strategies): edge_forward/edge_forward_raw take a stream
threaded into all four kernels (preprocess, create_instances,
extract_instance_ranges, edge_blend) and the two CUB calls; the static
memset side-stream + event is deleted (inline cudaMemsetAsync on the
frame stream); the n_instances readback is stream-scoped. The three
arena end_frame calls (incl. error paths) pass the stream so the arena
chain stays intact instead of falling back to a device sync.

Gates: GUI MRNF and igs+ 1500-iter sessions converge at parity, zero
errors.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The gsplat backward's v_colors scratch used a synchronous
cudaMalloc/cudaFree (ordered only against the legacy stream); now
cudaMallocAsync/cudaFreeAsync on the backward stream. Forward and
backward derive their stream getCurrentCUDAStream()-first (tensor
fallback) instead of straight from means.stream(), so a metrics-thread
gut render lands its kernels, arena frame, and consumers on one stream.
GsplatRasterizeContext carries that stream; the inference-only release
(gsplat_rasterizer.hpp) and the training backward free the isect/flatten
buffers and end the arena frame on it — the inference path previously
used a streamless end_frame that broke the arena chain and forced a
device sync on the UI thread every render.

Intersect.cpp's stream-scoped sizing sync is left as-is (control-flow
readback). Gates: gsplat gtests green; 3x GUI train+view --gut 7k at
parity (loss 0.128-0.165, ~75s), zero errors.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…deadlock

computeCameraMetrics (PSNR/SSIM panel + camera-click) now installs a
non-blocking lfs.metrics stream guard and brackets its render with the
beginModelRead/endModelRead handshake on that stream, so the metric
render's kernels/tensor ops overlap training and item() readbacks (now
home-stream-aware, Port-1) stay correct off the legacy stream.

Deadlock fix: the metric render holds render_mutex_ shared then needs
the arena frame; a refining trainer holds the arena frame then needs
render_mutex_ exclusive — a cycle when begin_frame waits forever.
Instead of restructuring the trainer hot-loop lock order (high risk),
RasterizerMemoryArena::ScopedBeginFrameTimeout caps begin_frame on the
metrics thread only, so it bails (skipping that metric sample) rather
than hanging; training threads keep blocking acquisition. A new gtest
reproduces the opposite-order contention and asserts deadlock-freedom +
bounded bail.

UAF fix: waitForModelReaders() is now also drained inside the refining
exclusive-lock scopes, so a metric read that finished CPU-side between
the step-top drain and the realloc has its GPU work ordered before
grow/prune frees model tensors. ppisp controller-pool predict copies
move off the legacy stream (getCurrentCUDAStream) — they feed conv/
linear ops that run on the metrics stream under appearance correction.

Validation: contention gtest + memcheck clean; 5 GUI train+view 7k
sessions (3 fastgs, 2 gut) at loss parity, zero errors, clean teardown.
The metrics-panel-during-training integration path is best exercised
live (split-view GT metrics / camera click); its mechanism is gtested.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The per-step quantized Adam update already ran on the trainer's stream,
but the four state-maintenance helpers (quantize_adam_moments[_swizzled],
zero_rows / zero_quantized_rows) launched on the legacy default stream.
With quantized-Adam + MCMC this fires every densification interval via
relocate_params_at_indices_gpu, serializing the relocation against the
training stream and, worse, stalling the device with bare cudaMalloc/
cudaFree for the index buffer.

Thread an explicit stream into the four adam_api helpers and resolve it at
the call sites from getCurrentCUDAStream(), bridging the touched state
tensors with waitForCUDAStream and re-homing them with set_stream after.
The relocation index buffers move to cudaMallocAsync/MemcpyAsync/FreeAsync
on the same stream, removing the device-wide alloc/free syncs.

Densification stays bit-identical (1,293,710 splats at 7k, loss in band);
memcheck clean on the relocate/zero/adam kernels.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PyTensor::dlpack accepted a stream argument but ignored it: it only did a
full cudaDeviceSynchronize when stream was None, and when a consumer passed
a real stream it performed no synchronization at all — a silent race against
whatever stream the tensor was last written on (e.g. lfs.train).

Export now implements the DLPack stream-exchange protocol: None syncs just
the tensor's home stream, the -1 sentinel skips sync, and a consumer stream
(legacy/per-thread/explicit) is ordered via the pooled bridgeStreams edge
instead of stalling the whole device. Import passes our current stream into
the producer's __dlpack__(stream=) so the producer orders its work onto the
stream we will consume on, falling back to the no-arg call for producers
without the kwarg.

Adds a deterministic gtest reproducing the export edge (gated producer write
must be visible to the consumer readback) and a torch-free Python
self-roundtrip across the None/-1/legacy branches.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The test predates the quantized Adam state (#1261) and read exp_avg as
float32, but the optimizer stores moments as uint8 with a per-row float
scale. Reinterpreting the quantized zero-point bytes as float produced a
NaN momentum sum, failing the pre-relocation assertion (independent of the
relocation logic under test).

Check exp_avg_scale instead: a row that accumulated momentum has a non-zero
scale, and relocation zeros the scale for the touched rows (the documented
zero_quantized_rows reset), which forces their dequantized momentum to zero.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 13, 2026 21:31

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces stream-aware CUDA execution and memory-management improvements across the tensor core, training kernels, and the Vulkan viewer/trainer interop, and adds targeted stress/regression tests for multi-stream correctness.

Changes:

  • Add pooled CUDA event utilities and stream-aware tensor “home stream” tracking (sync/record/rehome) to make cross-stream tensor usage and allocator reuse safe without global synchronization.
  • Make multiple training/rasterization/optimizer kernels and arenas consistently honor the caller/current CUDA stream (and avoid legacy-default-stream implicit ordering).
  • Add new CUDA stream/event/multi-stream correctness tests and wire them into the test build.

Reviewed changes

Copilot reviewed 79 out of 79 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/test_tensor_stream.cpp Ensure test streams are released from the tensor allocator before destruction.
tests/test_tensor_multistream.cpp Add multi-stream allocator ordering/steal/reuse/handshake stress tests.
tests/test_mcmc_relocate_optimizer_state_bug.cpp Update optimizer-state test to validate quantized moment scale reset semantics.
tests/test_mcmc_multinomial.cpp Add multinomial sampling+gather regression tests (including all-zero weights).
tests/test_cuda_event_pool.cpp Add CUDA event pool reuse/concurrency and cross-stream wait tests.
tests/test_arena_metrics_contention.cpp Add bounded begin_frame contention test to prevent deadlocks.
tests/CMakeLists.txt Register newly added tests in the test target.
src/visualizer/selection/selection_service.cpp Replace ad-hoc event bridging with pooled event stream-bridging + allocator stream recording.
src/visualizer/rendering/vksplat_viewport_renderer.hpp Expose render CUDA stream and viewer-release fence/value for trainer↔viewer handshake.
src/visualizer/rendering/rendering_manager_vulkan.cpp Install per-frame render-stream guards and trainer↔viewer GPU handshake publishing.
src/visualizer/rendering/lod_upload_engine.cpp Release long-lived CUDA stream from tensor pool before destruction.
src/training/trainer.hpp Add model-access mutex and GPU-side model-read handshake API/state.
src/training/rasterization/gsplat/Rasterization.cpp Use stream-ordered cudaMallocAsync/cudaFreeAsync for scratch when available.
src/training/rasterization/gsplat_rasterizer.hpp Persist the forward stream in the context so backward/end_frame can match it.
src/training/rasterization/gsplat_rasterizer.cpp Begin/end arena frames and launches in a current-stream-first manner; store ctx stream.
src/training/rasterization/fastgs/rasterization/src/rasterization_api.cu Thread stream through arena frames, forward/backward, and stream-ordered frees/memsets.
src/training/rasterization/fastgs/rasterization/src/forward.cu Make forward explicitly stream-driven; remove side-stream memset trick; stream-order scratch alloc/free.
src/training/rasterization/fastgs/rasterization/src/backward.cu Thread stream into kernel launches for backward path.
src/training/rasterization/fastgs/rasterization/include/rasterization_api.h Extend forward context and forward_raw API to carry/accept CUDA stream.
src/training/rasterization/fastgs/rasterization/include/forward.h Add CUDA stream parameter to forward declaration.
src/training/rasterization/fastgs/rasterization/include/backward.h Add CUDA stream parameter to backward declaration.
src/training/rasterization/fastgs/optimizer/src/adam_api.cu Thread stream into quantize/zero-row kernels to avoid legacy-stream launches.
src/training/rasterization/fastgs/optimizer/include/adam_api.h Extend optimizer kernel APIs with optional CUDA stream parameters.
src/training/rasterization/fast_rasterizer.cpp Select raster stream from current guard first, tensor stream as fallback.
src/training/rasterization/edge_rasterizer.cpp End arena frame on current stream for stream-aware chaining.
src/training/rasterization/edge_compute/rasterization/src/rasterization_api.cu Add stream parameter; begin/end arena frames and forward launches on that stream.
src/training/rasterization/edge_compute/rasterization/src/forward.cu Run edge forward kernels/memsets on an explicit stream.
src/training/rasterization/edge_compute/rasterization/include/forward.h Add CUDA stream parameter (defaulted) to edge forward declaration.
src/training/rasterization/edge_compute/rasterization/include/edge_rasterization_api.h Add CUDA stream parameter (defaulted) to public edge raster API.
src/training/optimizer/adam_optimizer.cpp Make optimizer state init/zero/relocate stream-aware and rehome state tensors.
src/training/kernels/ssim.cu Launch SSIM kernels on the current CUDA stream (guard-aware).
src/training/kernels/ssim_reduction.cu Resolve/normalize stream parameters via shared resolve_stream helper.
src/training/kernels/regularization.cu Resolve stream parameters consistently for regularization kernels.
src/training/kernels/pruning_kernels.cu Resolve stream parameters consistently for pruning kernels.
src/training/kernels/ppisp.cu Resolve stream parameters consistently for PPISP kernels.
src/training/kernels/mrnf_kernels.cu Resolve stream parameters consistently for MRNF kernels and thrust usage.
src/training/kernels/mcmc_kernels.cu Resolve stream parameters; remove host reduction; use scan total for multinomial sampling.
src/training/kernels/l1_loss.cu Resolve stream parameters consistently for L1-loss kernels.
src/training/kernels/kernel_stream.hpp Add shared resolve_stream helpers that default to the thread’s current CUDA stream.
src/training/kernels/image_kernels.hpp Extend normalize-by-device-scalar API with skip threshold parameter.
src/training/kernels/image_kernels.cu Make image kernels stream-resolved and remove host sync from normalize-by-scalar.
src/training/kernels/grad_alpha.cu Resolve stream parameters consistently for grad/compose/permute/resize helpers.
src/training/kernels/depth_loss.cu Resolve stream parameters consistently for depth loss kernels.
src/training/kernels/densification_kernels.cu Resolve stream parameters consistently for densification kernels.
src/training/kernels/camera_loss_heatmap.cu Resolve stream parameters consistently for loss heatmap kernel.
src/training/kernels/bilateral_grid_tv.cu Resolve stream parameters consistently for bilateral TV kernels.
src/training/kernels/bilateral_grid_forward.cu Resolve stream parameters consistently for bilateral forward kernels.
src/training/kernels/bilateral_grid_backward.cu Resolve stream parameters consistently for bilateral backward/adam/accumulate kernels.
src/training/components/ppisp_controller_pool.cu Ensure tensor copies/host uploads occur on the current guarded stream.
src/rendering/cuda_vulkan_interop.hpp Expose raw CUDA external semaphore handle for consumers that enqueue waits.
src/python/lfs/py_tensor.cpp Implement DLPack stream handshake/bridging; home imported tensors on the consumer stream.
src/io/pipelined_image_loader.cpp Add decode CUDA stream, guard decode thread work onto it, and release it from pool on shutdown.
src/io/nvcodec_image_loader.cpp Replace device-wide sync with stream sync when a caller stream is provided.
src/io/include/io/pipelined_image_loader.hpp Store decode CUDA stream member for GPU decode path.
src/core/tensor/tensor.cpp Add Tensor stream rehome/record/sync methods integrated with allocators and stream bridging.
src/core/tensor/tensor_unified_ops.cpp Replace manual waitForCUDAStream calls with Tensor::sync_to_stream prologue.
src/core/tensor/tensor_masking_ops.cpp Use Tensor::sync_to_stream for multi-tensor stream alignment before CUDA masking ops.
src/core/tensor/tensor_broadcast.cpp Use Tensor::sync_to_stream to align broadcast execution stream.
src/core/tensor/pinned_memory_allocator.cpp Track per-block multi-stream usage via pooled events; add record_stream/release_stream.
src/core/tensor/lazy_executor.cpp Use Tensor::sync_to_stream to align pointwise execution stream.
src/core/tensor/internal/tensor_impl.hpp Make tensor home stream atomic; add host readback synchronization on home stream.
src/core/tensor/internal/tensor_expr_impl.hpp Use Tensor::sync_to_stream when preparing execution streams for expressions.
src/core/tensor/internal/size_bucketed_pool.hpp Make bucket cache stream-tagged; bridge on cross-stream reuse; free on tag stream.
src/core/tensor/internal/memory_pool.hpp Add allocator-wide stream tracking (record/rehome/release_stream) and routed frees.
src/core/tensor/internal/gpu_slab_allocator.hpp Add per-stream free lists with GPU-side “steal” bridging for cross-stream reuse.
src/core/tensor/internal/gpu_config.hpp Remove global inline GPUConfig instance usage; call GPUConfig::get() in macros.
src/core/tensor/internal/gpu_arena_allocator.hpp Remove obsolete GPU arena allocator header.
src/core/tensor/internal/deferred_free_queue.hpp Switch deferred-free event handling to shared CudaEventPool.
src/core/tensor/internal/cuda_stream_context.hpp Move waitForCUDAStream implementation out-of-header; document pooled-event behavior.
src/core/tensor/internal/cuda_event_pool.hpp Add global pooled cudaEvent management + bridgeStreams API.
src/core/tensor/cuda_stream_context.cpp Implement waitForCUDAStream using pooled events (host-sync fallback).
src/core/tensor/cuda_event_pool.cpp Implement pooled event acquire/release and stream bridging helper.
src/core/include/core/pinned_memory_allocator.hpp Update pinned allocator API/docs for multi-stream tracking and stream release.
src/core/cuda/memory_arena.hpp Add stream-aware begin/end frame chaining, bounded acquisition, and Vulkan external release fencing.
src/core/cuda/memory_arena.cu Implement stream-aware chaining with completion events + bounded waits + external release drain.
src/core/CMakeLists.txt Add cuda_event_pool.cpp to core tensor sources.
src/core/camera.cpp Use non-blocking camera stream and release it from tensor pool before destruction.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 32 to 35
// Begin arena frame for memory allocation
auto& arena = core::GlobalArenaManager::instance().get_arena();
uint64_t frame_id = arena.begin_frame();
uint64_t frame_id = arena.begin_frame(core::getCurrentCUDAStream());
auto arena_allocator = arena.get_allocator(frame_id);
Comment on lines +515 to +518
// End on the same stream begin_frame used (same guard → same value),
// not the streamless device-sync path, so the arena frame chain stays
// intact for the next frame instead of falling back to a full sync.
arena.end_frame(frame_id, core::getCurrentCUDAStream());
Comment on lines 310 to 313
if (nvcodec_available) {
cudaStreamCreateWithFlags(&decode_stream_, cudaStreamNonBlocking);
gpu_decode_thread_ = std::thread([this] { gpu_batch_decode_thread_func(); });
}
Comment thread src/core/camera.cpp Outdated
Comment on lines +118 to +119
// Non-blocking so image loading doesn't serialize with the legacy stream
cudaStreamCreateWithFlags(&_stream, cudaStreamNonBlocking);
Comment thread src/core/camera.cpp Outdated
Comment on lines +250 to +251
// Non-blocking so image loading doesn't serialize with the legacy stream
cudaStreamCreateWithFlags(&_stream, cudaStreamNonBlocking);
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants