
Remove malloc/free #588

Open
segasai wants to merge 6 commits into adrn:main from segasai:malloc_fix
Conversation

@segasai
Contributor

@segasai segasai commented May 6, 2026

Describe your changes

The coordinate transformation code in c_gradient calls malloc/free on every invocation, which shows up in CPU profiles when running orbits/streams. I don't think there is a need for heap allocation here, since stack arrays would work.
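To illustrate the kind of change being proposed, here is a minimal sketch, not gala's actual code: the function names and the stand-in "transform" are hypothetical, but the pattern (a per-call malloc/free of a small fixed-size temporary replaced by a stack array) is the one described above.

```c
#include <stdlib.h>

#define NDIM 3  /* illustrative: a small, compile-time-known scratch size */

/* Before: heap-allocating scratch on every call. The allocator traffic
 * dominates profiles when this runs inside a tight orbit-integration loop. */
double norm_sq_heap(const double *q) {
    double *tmp = malloc(NDIM * sizeof(double));
    double s = 0.0;
    for (int i = 0; i < NDIM; i++) tmp[i] = 2.0 * q[i];  /* stand-in transform */
    for (int i = 0; i < NDIM; i++) s += tmp[i] * tmp[i];
    free(tmp);
    return s;
}

/* After: same work, scratch lives on the stack, no allocator calls. */
double norm_sq_stack(const double *q) {
    double tmp[NDIM];
    double s = 0.0;
    for (int i = 0; i < NDIM; i++) tmp[i] = 2.0 * q[i];
    for (int i = 0; i < NDIM; i++) s += tmp[i] * tmp[i];
    return s;
}
```

This is safe when the scratch size is small and known at compile time; the discussion below is about what happens when it scales with N.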

Checklist

  • Did you add tests? No
  • Did you add documentation for your changes? No
  • Did you reference any relevant issues? No
  • Did you add a changelog entry? (see CHANGES.rst) No (not sure it's worth an entry)
  • Are the CI tests passing? Yes
  • Is the milestone set? No

@segasai segasai changed the title Malloc fix Remove malloc/free May 6, 2026
@adrn
Owner

adrn commented May 7, 2026

Thanks! I think this would be a performance improvement for small N orbits that require transformation, but won't this then risk causing stack overflows for direct Potential.gradient() calls with large numbers of phase-space positions? I guess we could have a fallback to heap allocation if N is greater than some threshold...
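The fallback idea could look something like the following sketch. This is hypothetical, not gala code; the threshold constant and function are illustrative. Small inputs use a fixed stack buffer, large ones fall back to the heap so huge N can't overflow the stack.

```c
#include <stdlib.h>

#define STACK_THRESHOLD 1024  /* illustrative cutoff, not a gala constant */

/* Use stack scratch for n <= STACK_THRESHOLD, heap otherwise. */
double sum_scaled(const double *x, size_t n) {
    double stack_buf[STACK_THRESHOLD];
    double *buf = (n <= STACK_THRESHOLD) ? stack_buf
                                         : malloc(n * sizeof(double));
    if (buf == NULL) return 0.0;  /* allocation failure; real code would signal an error */

    double s = 0.0;
    for (size_t i = 0; i < n; i++) buf[i] = 2.0 * x[i];  /* stand-in transform */
    for (size_t i = 0; i < n; i++) s += buf[i];

    if (buf != stack_buf) free(buf);
    return s;
}
```

The branch is cheap and the common small-N path never touches the allocator, at the cost of one extra code path to test.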

@adrn adrn added the benchmark label May 7, 2026
@adrn adrn requested a review from lgarrison May 7, 2026 18:19
@adrn
Owner

adrn commented May 7, 2026

Adding @lgarrison to see if he has any opinions here too :).

@lgarrison
Collaborator

Yeah, I think there's a risk of a stack overflow here. Could we instead allocate scratch space on the heap higher in the call chain (outside of the hot loop) and pass that down to c_gradient?

@segasai
Contributor Author

segasai commented May 8, 2026

Yes, I can see that if N is in the millions, it could be an issue.
If we just do the malloc inside allocate_cpotential(), it could only be a fixed-size buffer, since the number of particles N is not known at that time. And that would complicate the logic elsewhere.

@lgarrison
Collaborator

I asked Claude to implement the idea of allocating scratch space at a higher level. The results weren't encouraging; the change surface is pretty big:

❯ git diff main..c-grad-scratch --stat
 src/gala/dynamics/mockstream/mockstream.pyx         | 32 ++++++++++++++++++++++----------
 src/gala/dynamics/nbody/nbody.pyx                   | 14 +++++++++-----
 src/gala/integrate/cyintegrators/dop853.pxd         |  5 +++++
 src/gala/integrate/cyintegrators/dop853.pyx         | 16 +++++++++++++---
 src/gala/integrate/cyintegrators/dopri/dop853.cpp   | 19 +++++++++----------
 src/gala/integrate/cyintegrators/dopri/dop853.h     |  9 +++++++++
 src/gala/integrate/cyintegrators/leapfrog.pxd       | 12 ++++++++----
 src/gala/integrate/cyintegrators/leapfrog.pyx       | 39 +++++++++++++++++++++++++++------------
 src/gala/integrate/cyintegrators/ruth4.pxd          |  3 ++-
 src/gala/integrate/cyintegrators/ruth4.pyx          | 22 ++++++++++++++++------
 src/gala/potential/hamiltonian/src/chamiltonian.cpp | 23 +++++++++++++++--------
 src/gala/potential/hamiltonian/src/chamiltonian.h   |  3 ++-
 src/gala/potential/potential/cpotential.pxd         |  3 ++-
 src/gala/potential/potential/cpotential.pyx         |  8 ++++++--
 src/gala/potential/potential/src/cpotential.cpp     | 49 +++++++++++++++++--------------------------------
 src/gala/potential/potential/src/cpotential.h       |  3 ++-
 16 files changed, 164 insertions(+), 96 deletions(-)

So what about this instead: we could allocate static thread_local scratch space that grows as needed. That way, the API doesn't change, and typically only the first call pays the dynamic memory allocation cost. The downside is somewhat higher memory usage: the scratch space would stick around for the duration of the program. But I suspect that's not so bad, as long as users aren't integrating GB of orbits. And if we needed to, we could provide an explicit API to drop the scratch space.

Any thoughts on this?
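A minimal sketch of the grow-as-needed thread-local scratch idea, assuming C11 `_Thread_local` (the names here are illustrative, not gala's actual API):

```c
#include <stdlib.h>

/* Per-thread scratch buffer that grows monotonically. Only calls that need
 * more space than any previous call on this thread pay an allocation. */
static _Thread_local double *scratch = NULL;
static _Thread_local size_t scratch_len = 0;

double *get_scratch(size_t n) {
    if (n > scratch_len) {
        free(scratch);  /* free(NULL) is a no-op on the first call */
        scratch = malloc(n * sizeof(double));
        scratch_len = (scratch != NULL) ? n : 0;
    }
    return scratch;
}

/* Explicit release, per the "API to drop the scratch space" note above. */
void drop_scratch(void) {
    free(scratch);
    scratch = NULL;
    scratch_len = 0;
}
```

Because the buffer is thread-local, concurrent callers never share it, and the public function signatures don't change; the trade-off, as noted, is that the high-water-mark allocation persists until `drop_scratch()` (or thread exit).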

@adrn
Owner

adrn commented May 8, 2026

Thanks @lgarrison for doing that experiment! I think your thread-local scratch space idea would be fine and covers most practical use cases.
