
Implement task output caching in Indexify #1383


Merged

Merged 2 commits into main on May 7, 2025
Conversation

earhart
Contributor

@earhart earhart commented Apr 25, 2025

Context

The idea here is to cache function outputs and reuse them in subsequent function calls iff requested.

Fixes #1339

What

This PR allows the client to specify a caching key when defining graph functions; the key indicates to the server that it is allowed to cache the function applications. A new TaskCache component observes output ingestion (creating cache entries) and intercepts tasks between creation and executor allocation. On a cache hit, it skips the allocation and augments the scheduler update with the cached task outputs; when the state machine observes the augmented scheduler update, it resolves the task as though it had just completed, generating output ingestion events to release downstream tasks.
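As a rough sketch of that flow (hypothetical, simplified types — `TaskCache`, `CachedTaskOutput`, and `SchedulerUpdateRequest` here are stand-ins for the richer structs in `server/`, not the actual implementation):

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real Indexify types.
type TaskId = String;
type CacheKey = String;

#[derive(Clone, Debug)]
pub struct CachedTaskOutput {
    pub payload: Vec<u8>,
}

#[derive(Default)]
pub struct SchedulerUpdateRequest {
    pub cached_task_outputs: HashMap<TaskId, CachedTaskOutput>,
}

#[derive(Default)]
pub struct TaskCache {
    entries: HashMap<CacheKey, CachedTaskOutput>,
}

impl TaskCache {
    /// Called on output ingestion: record the output under the function's cache key.
    pub fn record(&mut self, key: CacheKey, output: CachedTaskOutput) {
        self.entries.insert(key, output);
    }

    /// Called between task creation and executor allocation. On a hit, skip
    /// allocation and augment the scheduler update with the cached output;
    /// on a miss, fall through to normal allocation.
    pub fn try_allocate(
        &self,
        task_id: &TaskId,
        key: &CacheKey,
        update: &mut SchedulerUpdateRequest,
    ) -> bool {
        match self.entries.get(key) {
            Some(output) => {
                update
                    .cached_task_outputs
                    .insert(task_id.clone(), output.clone());
                true
            }
            None => false,
        }
    }
}
```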

Testing

Currently, fairly basic: running a workload multiple times. This PR needs further tests before it lands.

Contribution Checklist

  • If a Python package was changed, please run make fmt in the package directory.
  • If the server was changed, please run make fmt in server/.
  • Make sure all PR Checks are passing.

@earhart earhart requested review from diptanu and eabatalov April 25, 2025 19:58
@earhart
Contributor Author

earhart commented Apr 25, 2025

This isn't quite ready to go in as-is, but if you'd like to give it an early review, the basic pieces are here. The cache needs to respect namespaces and report cache hits to the client; we may also want to add cache-clearing and monitoring APIs in this PR.

@earhart earhart force-pushed the earhart/caching branch 3 times, most recently from 5061d30 to 4a446e3 on April 29, 2025 at 00:12
@earhart earhart marked this pull request as ready for review April 29, 2025 01:22
Collaborator

@diptanu diptanu left a comment


Need more time to review. Leaving some comments here.

@earhart earhart force-pushed the earhart/caching branch 4 times, most recently from 0690799 to 902d389 on May 6, 2025 at 18:42
Collaborator

@diptanu diptanu left a comment


@earhart Looks great, have some small comments. Thanks for working on this very patiently!

if sched_update.cached_task_outputs.contains_key(&task.id) {
    let _ = self.task_event_tx
        .send(InvocationStateChangeEvent::TaskMatchedCache(
Collaborator

Does it need to be a different event? This is for notifying the clients about what the scheduler is doing. We could add a new attribute to TaskCreated - cached_output: bool

Contributor Author

I kinda like having it as a different event: this way, TaskCreated always gets matched up with a corresponding TaskAssigned and TaskCompleted. Also, this event name is very prominent in the text output from the cli, so it's very, very obvious to users what's going on.

#[derive(Debug, Clone, Default)]
pub struct SchedulerUpdateRequest {
    pub new_allocations: Vec<Allocation>,
    pub remove_allocations: Vec<Allocation>,
    pub updated_tasks: HashMap<TaskId, Task>,
    pub cached_task_outputs: HashMap<TaskId, CachedTaskOutput>,
Collaborator

Assuming we are also updating the TaskOutcome to completed?

Contributor Author

Yes -- that actually happens in task_cache.rs::try_allocate(), where we insert the task into cached_task_outputs.
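A minimal illustration of that idea — marking the task completed at the same moment the cached output is attached, so the state machine resolves it as though it had just run. All names here are hypothetical simplifications; the real logic lives in task_cache.rs::try_allocate():

```rust
use std::collections::HashMap;

// Hypothetical, simplified types for illustration only.
type TaskId = String;

#[derive(Clone, Debug, PartialEq)]
pub enum TaskOutcome {
    Unknown,
    Success,
}

#[derive(Clone)]
pub struct Task {
    pub id: TaskId,
    pub outcome: TaskOutcome,
}

#[derive(Clone)]
pub struct CachedTaskOutput {
    pub payload: Vec<u8>,
}

#[derive(Default)]
pub struct SchedulerUpdateRequest {
    pub updated_tasks: HashMap<TaskId, Task>,
    pub cached_task_outputs: HashMap<TaskId, CachedTaskOutput>,
}

/// On a cache hit, flip the task's outcome to Success and attach the cached
/// output, so downstream tasks are released without an executor allocation.
pub fn apply_cache_hit(
    task: &Task,
    output: CachedTaskOutput,
    update: &mut SchedulerUpdateRequest,
) {
    let mut completed = task.clone();
    completed.outcome = TaskOutcome::Success;
    update.updated_tasks.insert(task.id.clone(), completed);
    update.cached_task_outputs.insert(task.id.clone(), output);
}
```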

@earhart earhart force-pushed the earhart/caching branch from fec8be4 to 2067080 on May 7, 2025 at 20:38
@earhart earhart merged commit 8eb476f into main May 7, 2025
9 checks passed
@earhart earhart deleted the earhart/caching branch May 7, 2025 22:20
Successfully merging this pull request may close these issues: Cache Requests