
Remove malloc/free #588

Open
segasai wants to merge 6 commits into adrn:main from segasai:malloc_fix
Conversation

@segasai
Contributor

@segasai segasai commented May 6, 2026

Describe your changes

The coordinate transformation code in c_gradient calls malloc/free on every invocation, which shows up in CPU profiles when running orbits/streams. I don't think there is a need for heap allocation here, since stack arrays would work.
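To illustrate the kind of change being proposed, here is a minimal sketch, not gala's actual code: the function names and the stand-in "transform" are hypothetical, but the pattern (a per-call malloc/free of a small fixed-size temporary replaced by a stack array) is the one described above.

```c
#include <stdlib.h>

#define NDIM 3  /* illustrative: a small, compile-time-known scratch size */

/* Before: heap-allocating scratch on every call. The allocator traffic
 * dominates profiles when this runs inside a tight orbit-integration loop. */
double norm_sq_heap(const double *q) {
    double *tmp = malloc(NDIM * sizeof(double));
    double s = 0.0;
    for (int i = 0; i < NDIM; i++) tmp[i] = 2.0 * q[i];  /* stand-in transform */
    for (int i = 0; i < NDIM; i++) s += tmp[i] * tmp[i];
    free(tmp);
    return s;
}

/* After: same work, scratch lives on the stack, no allocator calls. */
double norm_sq_stack(const double *q) {
    double tmp[NDIM];
    double s = 0.0;
    for (int i = 0; i < NDIM; i++) tmp[i] = 2.0 * q[i];
    for (int i = 0; i < NDIM; i++) s += tmp[i] * tmp[i];
    return s;
}
```

This is safe when the scratch size is small and known at compile time; the discussion below is about what happens when it scales with N.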

Checklist

  • Did you add tests? No
  • Did you add documentation for your changes? No
  • Did you reference any relevant issues? No
  • Did you add a changelog entry? (see CHANGES.rst) No (not sure it's worth an entry)
  • Are the CI tests passing? Yes
  • Is the milestone set? No

@segasai segasai changed the title Malloc fix Remove malloc/free May 6, 2026
@adrn
Owner

adrn commented May 7, 2026

Thanks! I think this would be a performance improvement for small N orbits that require transformation, but won't this then risk causing stack overflows for direct Potential.gradient() calls with large numbers of phase-space positions? I guess we could have a fallback to heap allocation if N is greater than some threshold...
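The fallback idea could look something like the following sketch. This is hypothetical, not gala code; the threshold constant and function are illustrative. Small inputs use a fixed stack buffer, large ones fall back to the heap so huge N can't overflow the stack.

```c
#include <stdlib.h>

#define STACK_THRESHOLD 1024  /* illustrative cutoff, not a gala constant */

/* Use stack scratch for n <= STACK_THRESHOLD, heap otherwise. */
double sum_scaled(const double *x, size_t n) {
    double stack_buf[STACK_THRESHOLD];
    double *buf = (n <= STACK_THRESHOLD) ? stack_buf
                                         : malloc(n * sizeof(double));
    if (buf == NULL) return 0.0;  /* allocation failure; real code would signal an error */

    double s = 0.0;
    for (size_t i = 0; i < n; i++) buf[i] = 2.0 * x[i];  /* stand-in transform */
    for (size_t i = 0; i < n; i++) s += buf[i];

    if (buf != stack_buf) free(buf);
    return s;
}
```

The branch is cheap and the common small-N path never touches the allocator, at the cost of one extra code path to test.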

@adrn adrn added the benchmark label May 7, 2026
@adrn adrn requested a review from lgarrison May 7, 2026 18:19
@adrn
Owner

adrn commented May 7, 2026

Adding @lgarrison to see if he has any opinions here too :).

@lgarrison
Collaborator

Yeah, I think there's a risk of a stack overflow here. Could we instead allocate scratch space on the heap higher in the call chain (outside of the hot loop) and pass that down to c_gradient?

@segasai
Contributor Author

segasai commented May 8, 2026

Yes, I can see that if N is in the millions, it could be an issue.
If we just do the malloc inside allocate_cpotential(), it could only be a fixed-size buffer, since the number of particles N is not known at that time. And that would complicate the logic elsewhere.

@lgarrison
Collaborator

I asked Claude to implement the idea of allocating scratch space at a higher level. The results weren't encouraging; the change surface is pretty big:

❯ git diff main..c-grad-scratch --stat
 src/gala/dynamics/mockstream/mockstream.pyx         | 32 ++++++++++++++++++++++----------
 src/gala/dynamics/nbody/nbody.pyx                   | 14 +++++++++-----
 src/gala/integrate/cyintegrators/dop853.pxd         |  5 +++++
 src/gala/integrate/cyintegrators/dop853.pyx         | 16 +++++++++++++---
 src/gala/integrate/cyintegrators/dopri/dop853.cpp   | 19 +++++++++----------
 src/gala/integrate/cyintegrators/dopri/dop853.h     |  9 +++++++++
 src/gala/integrate/cyintegrators/leapfrog.pxd       | 12 ++++++++----
 src/gala/integrate/cyintegrators/leapfrog.pyx       | 39 +++++++++++++++++++++++++++------------
 src/gala/integrate/cyintegrators/ruth4.pxd          |  3 ++-
 src/gala/integrate/cyintegrators/ruth4.pyx          | 22 ++++++++++++++++------
 src/gala/potential/hamiltonian/src/chamiltonian.cpp | 23 +++++++++++++++--------
 src/gala/potential/hamiltonian/src/chamiltonian.h   |  3 ++-
 src/gala/potential/potential/cpotential.pxd         |  3 ++-
 src/gala/potential/potential/cpotential.pyx         |  8 ++++++--
 src/gala/potential/potential/src/cpotential.cpp     | 49 +++++++++++++++++--------------------------------
 src/gala/potential/potential/src/cpotential.h       |  3 ++-
 16 files changed, 164 insertions(+), 96 deletions(-)

So what about this instead: we could allocate static thread_local scratch space that grows as needed. That way, the API doesn't change, and typically only the first call pays the dynamic memory allocation cost. The downside is somewhat higher memory usage: the scratch space would stick around for the duration of the program. But I suspect that's not so bad, as long as users aren't integrating GB of orbits. And if we needed to, we could provide an explicit API to drop the scratch space.

Any thoughts on this?
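A minimal sketch of the grow-as-needed thread-local scratch idea, assuming C11 `_Thread_local` (the names here are illustrative, not gala's actual API):

```c
#include <stdlib.h>

/* Per-thread scratch buffer that grows monotonically. Only calls that need
 * more space than any previous call on this thread pay an allocation. */
static _Thread_local double *scratch = NULL;
static _Thread_local size_t scratch_len = 0;

double *get_scratch(size_t n) {
    if (n > scratch_len) {
        free(scratch);  /* free(NULL) is a no-op on the first call */
        scratch = malloc(n * sizeof(double));
        scratch_len = (scratch != NULL) ? n : 0;
    }
    return scratch;
}

/* Explicit release, per the "API to drop the scratch space" note above. */
void drop_scratch(void) {
    free(scratch);
    scratch = NULL;
    scratch_len = 0;
}
```

Because the buffer is thread-local, concurrent callers never share it, and the public function signatures don't change; the trade-off, as noted, is that the high-water-mark allocation persists until `drop_scratch()` (or thread exit).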

@adrn
Owner

adrn commented May 8, 2026

Thanks @lgarrison for doing that experiment! I think your thread-local scratch space idea would be fine and covers most practical use cases.
