@oschulz oschulz commented Oct 31, 2025

This doesn't fully work yet. Allocating unified memory works once the limits on XLA_REACTANT_GPU_MEM_FRACTION in __init__() in XLA.jl are relaxed. With

export TF_FORCE_UNIFIED_MEMORY=1
export XLA_REACTANT_GPU_MEM_FRACTION=4

we can run

using Reactant

Reactant.set_default_backend("cuda")

Reactant.XLA.default_device()
Reactant.XLA.XLA_REACTANT_GPU_MEM_FRACTION[]

A = ConcreteRArray{Float32}(undef, 6*10^10)
sizeof(eltype(A)) * length(A) / 1024^3

and successfully allocate 224 GiB on an NVIDIA GH200 system with 96GB GPU RAM and 480GB CPU RAM.
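As a sanity check on the 224 GiB figure, the size can be worked out in plain Julia without Reactant or a GPU. This is only a sketch of the arithmetic behind the snippet above; the variable names are illustrative.

```julia
# Size arithmetic only; no Reactant or GPU required.
n = 6 * 10^10                      # number of elements requested above
bytes = sizeof(Float32) * n        # 4 bytes per Float32 element
gib = bytes / 1024^3               # convert bytes to GiB

println(bytes)                     # 240000000000
println(round(gib; digits=1))      # 223.5
```

The byte count, 240000000000, is the same number that shows up in the compiler error further down.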

nvtop shows the GPU RAM filling up and then flattening out once it is full, and free shows that the rest of the array has been allocated in CPU RAM. (Note: nvidia-smi is not helpful here. In unified memory mode it only shows 578MiB allocated by the Julia process, but from what I've read that is expected.)

But when I try to fill and sum the array

fill_sum = @compile sum(fill!(A, one(eltype(A))))
fill_sum(A)

compilation fails with

E0000 00:00:1761906288.544836 1566209 gpu_hlo_schedule.cc:817] The byte size of input/output arguments (240000000000) exceeds the base limit (81604378624). This indicates an error in the calculation!

so the compiler still limits argument sizes to the GPU RAM instead of the total unified memory.
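To put the two numbers from the error message side by side, here is a small sketch in plain Julia. It makes no assumption about how XLA derives the base limit; it only converts and compares the values copied from the log, and the variable names are illustrative.

```julia
# Numbers copied verbatim from the error message above.
requested  = 240_000_000_000       # byte size of input/output arguments
base_limit = 81_604_378_624        # "base limit" reported by gpu_hlo_schedule.cc

println(base_limit / 1024^3)                        # 76.0
println(round(requested / base_limit; digits=2))    # 2.94
```

The base limit works out to exactly 76 GiB, which is below the 96GB of GPU RAM, and the array is roughly 2.94 times larger than that. This would be consistent with the check not taking XLA_REACTANT_GPU_MEM_FRACTION=4 into account, though that is an interpretation rather than something the log states.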

@wsmoses I think I need some help here.

@oschulz oschulz marked this pull request as draft October 31, 2025 10:26

codecov bot commented Oct 31, 2025

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.52%. Comparing base (b39a1fc) to head (a2ea596).
⚠️ Report is 117 commits behind head on main.

Files with missing lines Patch % Lines
src/xla/XLA.jl 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1812      +/-   ##
==========================================
- Coverage   68.16%   64.52%   -3.65%     
==========================================
  Files         109      113       +4     
  Lines       11779    12557     +778     
==========================================
+ Hits         8029     8102      +73     
- Misses       3750     4455     +705     
