Description
Hello, I have a use case in Python involving Arrow Flight that is exemplified by the snippet below:
import pyarrow as pa
import pyarrow.flight as flight
def do_some_work(…):
    # … set-up
    client = flight.FlightClient(xxx)
    ticket = xxx
    reader = client.do_get(ticket)
    # Assume the table is quite large - 500 MB
    arrow_table: pa.Table = reader.read_all()
    # Assume res is a very small object, compared to
    # the size of the table.
    res = do_something_quick_with(arrow_table)
    # (A) From this point onwards arrow_table is no longer needed…
    # … rest of the pipeline that uses res and does a lot of other things …
The above snippet is a slight simplification. The real-world scenario is a little more complex because the table is obtained in a library I don’t necessarily have easy control over and is passed to user-level code.
At point (A) above, benchmarking in high-volume scenarios has shown that it would be really good to free the memory held by arrow_table. The table itself does not have an explicit .close() method or anything else indicating that we can free the memory associated with it. A few things I have tried are:
- Obtaining the actual RecordBatchReader and calling close on it:
reader: RecordBatchReader = client.do_get(ticket).to_reader()
# … use the reader to obtain the arrow table …
# close the reader
reader.close()
- Deleting the reference to the arrow table via del and hoping that at some point the GC would kick in.
- Deleting the reference via del and explicitly calling the GC (just for testing; I am aware this is not a recommended practice).
In the last two cases above, just as a debugging exercise, I printed the number of references to the arrow_table object before calling del. I expected it to be 1, but it was more than that, so my assumption is that something is held internally within the Flight framework.
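For reference, here is a minimal sketch of that kind of reference-count check, assuming CPython. FakeTable is a hypothetical stand-in, not a real pa.Table, so it says nothing about what Flight holds internally. It only illustrates that sys.getrefcount counts every live name bound to the object (and generally its own temporary argument as well), so a reading above 1 does not by itself prove an internal reference.

```python
import sys
import weakref


class FakeTable:
    """Hypothetical stand-in for a pa.Table; it holds no Arrow memory."""


table = FakeTable()
tracker = weakref.ref(table)  # a weak reference does not keep the object alive

# The reported value generally includes a temporary reference created by
# passing the object to getrefcount, so compare readings rather than
# expecting an absolute value of 1.
count_one_name = sys.getrefcount(table)
alias = table  # e.g. a framework keeping its own handle to the object
count_two_names = sys.getrefcount(table)
print("extra references from alias:", count_two_names - count_one_name)

del table
still_alive = tracker() is not None  # the alias still keeps the object alive
del alias
freed = tracker() is None  # on CPython the object is gone once no names remain
print("alive after first del:", still_alive, "| freed after second del:", freed)
```

Comparing two readings, as above, sidesteps the question of what the absolute number means on a given interpreter version.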
That said, my question is: is there a deterministic way that always works to release the memory of a pyarrow.Table? I can imagine why, in most cases, doing this would be quite cumbersome and it would be best to rely on the reference counting mechanism plus the GC naturally kicking in, but in this particular case it would be quite useful.
I would also be grateful for some pointers on the lifetime implications of these objects in Python. It is not very clear from the documentation, for example, whether the lifetime of arrow_table above is tied to the lifetime of the reader, and vice versa. Again, I appreciate that in 99% of cases we shouldn’t need to care about it, but there’s still the 1% where having this explained in a little more depth would be of great use!
P.S. I had to do a near-identical thing in Java, where the VectorSchemaRoot API conveniently exposes a .close() method, which works quite nicely in my use case.
Component(s)
Documentation, FlightRPC, Python