Skip to content

Implement Arrow cache reader/writer#856

Merged
czaloom merged 10 commits into
mainfrom
czaloom-arrow-cache-10-28-2025
Oct 31, 2025
Merged

Implement Arrow cache reader/writer#856
czaloom merged 10 commits into
mainfrom
czaloom-arrow-cache-10-28-2025

Conversation

@czaloom

@czaloom czaloom commented Oct 28, 2025

Copy link
Copy Markdown
Collaborator

Broke out cache files from #855 #854 #853

@czaloom czaloom self-assigned this Oct 28, 2025
@czaloom czaloom requested a review from a team as a code owner October 28, 2025 16:08
@czaloom czaloom changed the title Implement Arrow Cache Implement Arrow cache reader/writer Oct 28, 2025
Comment thread src/valor_lite/cache.py Outdated
Comment thread src/valor_lite/common/ephemeral.py
Comment thread src/valor_lite/common/ephemeral.py Outdated
Comment thread src/valor_lite/common/persistent.py Outdated
Comment thread src/valor_lite/common/persistent.py Outdated
Comment thread src/valor_lite/common/persistent.py
Comment thread src/valor_lite/common/persistent.py Outdated
Comment thread src/valor_lite/common/persistent.py Outdated
batch : pa.RecordBatch | dict[str, list | np.ndarray | pa.Array]
A batch of columnar data.
"""
if isinstance(batch, dict):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question as above, do we need to be able to deal with multiple types here or could we convert them before we get to this point?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added write_columns to deal with the conversion. We now have write_rows and write_columns that perform a conversion and then call write_batch.


def __exit__(self, exc_type, exc_val, exc_tb):
"""Context manager exit - ensures data is flushed."""
self.flush()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's a problem with flush or indeed any call used during the time this object's _writer is open, it won't get closed properly. Not sure that's a big deal unless we want to try to handle errors that might happen, which sounds more difficult.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, im not sure this is worth diving into. The most likely cause of an error is malformed annotations, in which case we already cannot use the cache.

self.flush()
return MemoryCacheReader(
table=self._table, batch_size=self._batch_size
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense for the writer to continue to write after creating a reader like this? Should the writer's self._table be set to None or something?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyArrow tables are immutatable so the memory reader is safe from the table being mutated, this is not the case for the file-based cache though.

If we want to signal to the user / developer that a reader has been created and writing has ceased then we could do that set to None operation. We would also need to maintain this flow on the file-based side where we would either need to change file permissions or leave a flag within the cache .cfg file saying that writing is no longer allowed.

cfg_path.unlink()

# delete empty cache directory
path.rmdir()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this is okay I think, but if the config file is bad for some reason FileCacheReader.load will fail, and if there are any extra files under path, path.rmdir will fail. Maybe that's what we want, i.e. if something unexpected happens maybe we shouldn't delete the thing in the first place?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent was to fail if the file footprint did not match exactly what was created by the Writer 😁

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think that sounds good


class EmptyCacheError(Exception):
def __init__(self):
super().__init__("cache contains no data")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not seem to be used anywhere?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, at least. I copied in the common code for all task types

@czaloom czaloom merged commit 2ade309 into main Oct 31, 2025
4 checks passed
@czaloom czaloom deleted the czaloom-arrow-cache-10-28-2025 branch October 31, 2025 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants