Skip to content

Implement task output caching in Indexify #1383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

earhart
Copy link

@earhart earhart commented Apr 25, 2025

Context

The idea here is to cache function outputs and reuse them in subsequent function calls iff requested.

Fixes #1139

What

This PR allows the client to specify a caching key when specifying graph functions; this indicates to the server that it's allowed to cache the function applications. A new TaskCache component observes output ingestion (creating cache entries), and intercepts tasks between creation and executor allocation; when it sees a cache hit, it skips the allocation, and augments the scheduler update with the cached task outputs; when the state machine observes the augmented scheduler update, it resolves the task as though it had just completed, generating output ingestion events to release downstream tasks.

Testing

Currently, fairly basic: running a workload multiple times. This PR needs further tests before it lands.

Contribution Checklist

  • If a Python package was changed, please run make fmt in the package directory.
  • If the server was changed, please run make fmt in server/.
  • Make sure all PR Checks are passing.

@earhart earhart requested review from diptanu and eabatalov April 25, 2025 19:58
@earhart
Copy link
Author

earhart commented Apr 25, 2025

This isn't quite ready to go in as-is, but if you'd like to give it an early review, the basic pieces are here. The cache needs to respect namespaces and report cache hits to the client; we may also want to add cache-clearing and monitoring APIs in this PR.

@earhart earhart force-pushed the earhart/caching branch 3 times, most recently from 5061d30 to 4a446e3 Compare April 29, 2025 00:12
@earhart earhart marked this pull request as ready for review April 29, 2025 01:22
Copy link
Collaborator

@diptanu diptanu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need more time to review. Leaving some comments here.

@@ -335,6 +335,12 @@ pub struct DynamicEdgeRouter {
pub retry_policy: NodeRetryPolicy,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Hash, Eq)]
pub struct CacheKey {
pub algorithm: String,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be an enum? If you want Enum <> String conversion somewhere else, then take a look at the crate called Strum.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking here was that it's essentially an opaque string from the client; the server's just checking for equality. We could also replace this struct with a single string, the way the Content-Digest header works -- "{algorithm}={hash}". Thoughts?

@@ -436,11 +443,16 @@ impl Node {
input_key: &str,
reducer_output_id: Option<String>,
graph_version: &GraphVersion,
input_sha256_hash: Option<String>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to store this value here? They can be retrieved from the NodeOutputs columnfamily using the input key. RocksDB has MultiGet to make lookups across multiple column family faster. The issue every field in Task adds to storage space.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@earhart earhart force-pushed the earhart/caching branch from 6619412 to 038e7d9 Compare May 1, 2025 13:48
@earhart earhart force-pushed the earhart/caching branch from fb25767 to b356740 Compare May 2, 2025 05:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants