Implement Arrow cache reader/writer by czaloom · Pull Request #856 · Striveworks/valor

czaloom · 2025-10-28T16:08:18Z

Broke out cache files from #855 #854 #853

rsbowman-striveworks · 2025-10-29T20:15:04Z

+        batch : pa.RecordBatch | dict[str, list | np.ndarray | pa.Array]
+            A batch of columnar data.
+        """
+        if isinstance(batch, dict):


Same question as above: do we need to handle multiple types here, or could we convert them before we get to this point?

I've added write_columns to deal with the conversion. We now have write_rows and write_columns that perform a conversion and then call write_batch.
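For readers following along, the shape described here (two converting entry points that both funnel into a single `write_batch`) can be sketched roughly as below. This is not the code from the PR: `SketchWriter` is a hypothetical name, and a batch is modeled as a plain dict of equal-length lists rather than a `pa.RecordBatch`, purely to keep the example dependency-free.

```python
from typing import Any


class SketchWriter:
    """Hypothetical writer; a batch here is a plain dict of equal-length lists."""

    def __init__(self) -> None:
        self.batches: list[dict[str, list[Any]]] = []

    def write_batch(self, batch: dict[str, list[Any]]) -> None:
        # Single sink: only one input shape is accepted at this point.
        lengths = {len(col) for col in batch.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.batches.append(batch)

    def write_columns(self, columns: dict[str, Any]) -> None:
        # Normalize any iterable column (tuple, array, ...) to a list.
        self.write_batch({name: list(col) for name, col in columns.items()})

    def write_rows(self, rows: list[dict[str, Any]]) -> None:
        # Pivot row dicts into columns; assumes every row has the same keys.
        if not rows:
            return
        names = list(rows[0])
        self.write_batch({name: [row[name] for row in rows] for name in names})
```

The point of the split is that type dispatch lives in the thin converters, so `write_batch` no longer needs an `isinstance` check.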

rsbowman-striveworks · 2025-10-29T20:20:53Z

+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        """Context manager exit - ensures data is flushed."""
+        self.flush()


If there's a problem with flush or indeed any call used during the time this object's _writer is open, it won't get closed properly. Not sure that's a big deal unless we want to try to handle errors that might happen, which sounds more difficult.

Yeah, I'm not sure this is worth diving into. The most likely cause of an error is malformed annotations, in which case we already cannot use the cache.
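If this is ever revisited, one common way to guarantee cleanup is a `try`/`finally` inside `__exit__`. The sketch below is illustrative only: `SketchWriter`, its `close` method, and the `fail_on_flush` switch are hypothetical and do not reflect the writer in this PR.

```python
class SketchWriter:
    """Hypothetical stand-in for the cache writer under review."""

    def __init__(self, fail_on_flush: bool = False) -> None:
        self.closed = False
        self._fail_on_flush = fail_on_flush

    def flush(self) -> None:
        if self._fail_on_flush:
            raise RuntimeError("flush failed")

    def close(self) -> None:
        # Assumed method: releases the underlying writer handle.
        self.closed = True

    def __enter__(self) -> "SketchWriter":
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        try:
            self.flush()
        finally:
            # Runs whether or not flush() raised, so the handle is released.
            self.close()
```

The exception from `flush` still propagates to the caller; the only change is that the handle is released first.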

rsbowman-striveworks · 2025-10-31T13:36:23Z

+        self.flush()
+        return MemoryCacheReader(
+            table=self._table, batch_size=self._batch_size
+        )


Does it make sense for the writer to continue to write after creating a reader like this? Should the writer's self._table be set to None or something?

PyArrow tables are immutable, so the memory reader is safe from the table being mutated. This is not the case for the file-based cache, though.

If we want to signal to the user or developer that a reader has been created and writing has ceased, we could set self._table to None. We would also need to maintain this flow on the file-based side, where we would either need to change file permissions or leave a flag in the cache .cfg file saying that writing is no longer allowed.
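A rough sketch of the flag-in-config option, for discussion only. Everything here is an assumption: the `.cfg` file is modeled as JSON with a made-up `sealed` key, and `SketchFileWriter` and `SealedCacheError` are hypothetical names, not part of this PR.

```python
import json
from pathlib import Path


class SealedCacheError(Exception):
    """Hypothetical error: raised when writing to a sealed cache."""


class SketchFileWriter:
    """Hypothetical file-backed writer; only the sealing logic is shown."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._cfg_path = path / ".cfg"
        if not self._cfg_path.exists():
            self._cfg_path.write_text(json.dumps({"sealed": False}))
        self.rows_written = 0

    def _is_sealed(self) -> bool:
        return bool(json.loads(self._cfg_path.read_text())["sealed"])

    def write_rows(self, rows: list) -> None:
        if self._is_sealed():
            raise SealedCacheError("a reader was created; writing is disabled")
        self.rows_written += len(rows)  # stand-in for the real write

    def to_reader(self) -> Path:
        # Persist the flag so every writer opened on this path sees it.
        self._cfg_path.write_text(json.dumps({"sealed": True}))
        return self._path  # stand-in for constructing a reader
```

Because the flag lives on disk, a second writer opened on the same directory is also blocked. The cost is a config read on every write, which a real implementation would probably cache.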

rsbowman-striveworks · 2025-10-31T13:39:04Z

+            cfg_path.unlink()
+
+        # delete empty cache directory
+        path.rmdir()


All this is okay I think, but if the config file is bad for some reason FileCacheReader.load will fail, and if there are any extra files under path, path.rmdir will fail. Maybe that's what we want, i.e. if something unexpected happens maybe we shouldn't delete the thing in the first place?

My intent was to fail if the file footprint did not match exactly what was created by the Writer 😁

Yeah, I think that sounds good.
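The fail-on-mismatch behavior agreed on here falls out of `pathlib` semantics: `Path.rmdir()` removes only empty directories. A minimal sketch, where `delete_cache` and its `expected_files` parameter are hypothetical and stand in for whatever file list the writer records:

```python
from pathlib import Path


def delete_cache(path: Path, expected_files: list[str]) -> None:
    """Delete only the files the writer is known to have created."""
    for name in expected_files:
        (path / name).unlink()  # raises FileNotFoundError if missing
    # Path.rmdir() only removes empty directories, so any unexpected
    # leftover file makes this raise OSError and the directory survives.
    path.rmdir()
```

Note that in the mismatch case the expected files are already gone by the time `rmdir` raises, so the failure leaves a partially deleted cache rather than an untouched one.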

rsbowman-striveworks · 2025-10-31T13:43:34Z


+class EmptyCacheError(Exception):
+    def __init__(self):
+        super().__init__("cache contains no data")


This does not seem to be used anywhere?

Not yet, at least. I copied in the common code for all task types.

broke out caching

dd9c479

czaloom self-assigned this Oct 28, 2025

czaloom requested a review from a team as a code owner October 28, 2025 16:08

czaloom changed the title ~~Implement Arrow Cache~~ Implement Arrow cache reader/writer Oct 28, 2025

czaloom added 2 commits October 28, 2025 12:36

cleaned up

d0d8e00

clean up

b9d9d26

sasbury reviewed Oct 28, 2025

Comment thread src/valor_lite/cache.py Outdated

czaloom added 3 commits October 28, 2025 17:21

split cache

f2ceefc

renamed

1178849

rename folder

cb0e957

rsbowman-striveworks reviewed Oct 29, 2025

czaloom added 4 commits October 30, 2025 11:46

updated

318d923

suggested changes + updates

b38eaf6

bugfix + flushing on to_reader

9ddeebc

update

6e1264b

rsbowman-striveworks reviewed Oct 31, 2025

rsbowman-striveworks approved these changes Oct 31, 2025

czaloom merged commit 2ade309 into main Oct 31, 2025
4 checks passed

czaloom deleted the czaloom-arrow-cache-10-28-2025 branch October 31, 2025 14:49

3 participants