Description
The C++ API in `HDBSCAN` returns a struct containing buffers that are later exposed to Python by wrapping them as `UnownedMemory`. However, this is done incorrectly in a few ways:
- Most wrapped buffers set the `HDBSCAN` estimator itself as the owner. However, the lifetime of the underlying buffers is independent of the lifetime of the estimator: subsequent `fit` calls drop the buffers, causing the originally returned cupy arrays to become invalid.
- Some wrapped buffers accidentally set `None` as the owner (like in here, where the optional `owner` kwarg is not provided), meaning they may become invalid if the estimator is dropped before the arrays, since the arrays hold no reference to it.
To remedy both of these bugs, a non-user-facing Python class `HDBSCANOutput` (or a better name) should be created, and the buffer lifetimes tied to it. It can then be stored as a Python-level attribute on the `HDBSCAN` estimator. Upon refit, the attribute can be cleared/overwritten to decouple the references (letting Python's reference counting decide whether the buffers can be released yet).
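A minimal sketch of the proposed ownership scheme follows, again in plain Python so it runs without a GPU. Everything except the `HDBSCANOutput` name is a hypothetical stand-in: `NativeResult` models the C++ struct, `View` models a cupy array whose unowned memory names an owner, and `Estimator` models the estimator. In the real fix, the assumption is that each wrapped buffer would be given the output object as its owner instead of the estimator or `None`.

```python
import weakref


class NativeResult:
    """Hypothetical stand-in for the C++ struct that owns the output buffers."""

    def __init__(self, labels):
        self.labels = list(labels)


class View:
    """Hypothetical stand-in for an array wrapping memory it does not own."""

    def __init__(self, native, owner):
        self._ptr = weakref.ref(native)  # like a raw pointer: keeps nothing alive
        self.owner = owner               # keeps the owning output object alive

    def values(self):
        native = self._ptr()
        if native is None:
            raise RuntimeError("buffer was released while a view still exists")
        return native.labels


class HDBSCANOutput:
    """Non-user-facing holder; the buffers live exactly as long as it does."""

    def __init__(self, labels):
        self._native = NativeResult(labels)

    def labels_view(self):
        # Every view names this output object, not the estimator, as its owner.
        return View(self._native, owner=self)


class Estimator:
    def fit(self, labels):
        # Overwriting the attribute drops only the estimator's own reference;
        # views created earlier still hold the previous output object.
        self._output = HDBSCANOutput(labels)
        return self

    @property
    def labels_(self):
        return self._output.labels_view()


est = Estimator().fit([0, 1, 1])
old = est.labels_
est.fit([1, 0, 0])          # refit: the old output stays alive through `old`
new = est.labels_
del est                     # dropping the estimator invalidates nothing
old_values = old.values()
new_values = new.values()

# Once the last view is gone, reference counting releases the old output.
probe = weakref.ref(old.owner)
del old
released = probe() is None
```

The design choice here is that the estimator is just one more holder of the output object. Neither a refit nor dropping the estimator can free buffers that a returned array still refers to, and nothing is kept alive longer than its last reference (the final check relies on CPython's immediate reference counting).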
I have seen memory errors from this, but cannot seem to produce a clean reproducer (presumably because the original buffer location has yet to be reused). Looking at the code referenced above should (hopefully) be enough for others to clearly see the bug here.