
Commit 2cfe00c

Refactors caching examples to be in a single place
Updates links and adds README.
1 parent c9ef352 commit 2cfe00c

14 files changed (+62, -20 lines)

docs/how-tos/cache-nodes.rst

+1 -1
@@ -6,4 +6,4 @@ Sometimes it is convenient to cache intermediate nodes. This is especially usefu

For example, if a particular node takes a long time to calculate (perhaps it extracts data from an outside source or performs some heavy computation), you can annotate it with the "cache" tag. The first time the DAG is executed, that node will be cached to disk. If you then do some development on any of the downstream nodes, subsequent executions will load the cached node instead of repeating the computation.

-See the full tutorial `here <https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/caching_nodes>`_.
+See the examples `here <https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/caching_nodes>`_.
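
To make the "cache" tag concrete, here is a minimal sketch of annotating a node; the node name `expensive_dataset`, its input, and the choice of serialization format are hypothetical, and the `tag` import path assumes a recent Hamilton version.

```python
# Minimal sketch (not from the docs themselves): a node annotated so a caching
# adapter can write its result to disk on the first run and reuse it afterwards.
import pandas as pd

from hamilton.function_modifiers import tag


@tag(cache="parquet")  # hypothetical choice of serialization format
def expensive_dataset(source_path: str) -> pd.DataFrame:
    """Slow node: pulls data from an outside source, so we cache it."""
    return pd.read_csv(source_path)
```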

examples/caching_nodes/README.md

+5 -15
@@ -1,18 +1,8 @@
-# Caching Graph Adapter
+Here you'll find two adapters that allow you to cache the results of your functions.

-You can use `CachingGraphAdapter` to cache certain nodes.
+The first one is the `DiskCacheAdapter`, which uses the `diskcache` library to store the results on disk.

-This is great for:
+The second one is the `CachingGraphAdapter`, which requires you to tag functions to cache along with the
+serialization format.

-1. Iterating during development, where you don't want to recompute certain expensive function calls.
-2. Providing some lightweight means to control recomputation in production, by controlling whether a "cached file" exists or not.
-
-For iterating during development, the general process would be:
-
-1. Write your functions.
-2. Mark them with `tag(cache="SERIALIZATION_FORMAT")`
-3. Use the CachingGraphAdapter and pass that to the Driver to turn on caching for these functions.
-   a. If at any point in your development you need to re-run a cached node, you can pass
-      its name to the adapter in the `force_compute` argument. Then, this node and its downstream
-      nodes will be computed instead of loaded from cache.
-4. When no longer required, you can just skip (3) and any caching behavior will be skipped.
+Both have their sweet spots and trade-offs. We invite you to play with them and provide feedback on which one you prefer.
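
For orientation, here is a minimal sketch of attaching the diskcache-based adapter to a Driver; the `hamilton.plugins.h_diskcache` import path and the `DiskCacheAdapter` constructor are assumptions that may differ by Hamilton version (the `CachingGraphAdapter` workflow is sketched further below).

```python
# Minimal sketch, not the example's actual code: every node result is cached,
# keyed on the node's source code and inputs. Import path is an assumption.
from hamilton import driver
from hamilton.plugins import h_diskcache  # assumed module location

import business_logic  # your DAG functions

dr = (
    driver.Builder()
    .with_modules(business_logic)
    .with_adapters(h_diskcache.DiskCacheAdapter())  # assumed default constructor
    .build()
)
# dr.execute(...) then behaves as usual; repeated runs with unchanged code and
# inputs are served from the on-disk cache.
```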

examples/caching_nodes/business_logic.py

-1
This file was deleted.
@@ -0,0 +1,18 @@
+# Caching Graph Adapter
+
+You can use `CachingGraphAdapter` to cache certain nodes.
+
+This is great for:
+
+1. Iterating during development, where you don't want to recompute certain expensive function calls.
+2. Providing some lightweight means to control recomputation in production, by controlling whether a "cached file" exists or not.
+
+For iterating during development, the general process would be:
+
+1. Write your functions.
+2. Mark them with `tag(cache="SERIALIZATION_FORMAT")`
+3. Use the CachingGraphAdapter and pass that to the Driver to turn on caching for these functions.
+   a. If at any point in your development you need to re-run a cached node, you can pass
+      its name to the adapter in the `force_compute` argument. Then, this node and its downstream
+      nodes will be computed instead of loaded from cache.
+4. When no longer required, you can just skip (3) and any caching behavior will be skipped.
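
As a rough illustration of steps 3 and 3a in the list above — a sketch only: the `hamilton.experimental.h_cache` import path and the `cache_path` / `force_compute` arguments are assumptions that may vary across Hamilton versions, and `spend_zero_mean` is just an example node name.

```python
# Minimal sketch of turning caching on (step 3) and forcing a recompute (step 3a).
# Import path and constructor arguments are assumptions.
from hamilton import driver
from hamilton.experimental.h_cache import CachingGraphAdapter  # assumed location

import business_logic  # functions tagged with tag(cache="...")

# Step 3: pass the adapter to the Driver to enable caching of tagged nodes.
dr = driver.Driver({}, business_logic, adapter=CachingGraphAdapter("./cache"))

# Step 3a: force one cached node (and everything downstream of it) to recompute
# instead of being loaded from cache.
dr_forced = driver.Driver(
    {},
    business_logic,
    adapter=CachingGraphAdapter("./cache", force_compute={"spend_zero_mean"}),
)
```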
@@ -0,0 +1,35 @@
+import pandas as pd
+
+"""
+Copied from the hello world example.
+"""
+
+
+def avg_3wk_spend(spend: pd.Series) -> pd.Series:
+    """Rolling 3 week average spend."""
+    return spend.rolling(3).mean()
+
+
+def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
+    """The cost per signup in relation to spend."""
+    return spend / signups
+
+
+def spend_mean(spend: pd.Series) -> float:
+    """Shows function creating a scalar. In this case it computes the mean of the entire column."""
+    return spend.mean()
+
+
+def spend_zero_mean(spend: pd.Series, spend_mean: float) -> pd.Series:
+    """Shows function that takes a scalar. In this case to zero mean spend."""
+    return spend - spend_mean
+
+
+def spend_std_dev(spend: pd.Series) -> float:
+    """Function that computes the standard deviation of the spend column."""
+    return spend.std()
+
+
+def spend_zero_mean_unit_variance(spend_zero_mean: pd.Series, spend_std_dev: float) -> pd.Series:
+    """Function showing one way to make spend have zero mean and unit variance."""
+    return spend_zero_mean / spend_std_dev
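
For reference, a minimal sketch of driving the module above; the spend/signups values are invented purely for illustration.

```python
# Minimal sketch: execute the functions above with Hamilton.
import pandas as pd

from hamilton import driver

import business_logic  # the module shown above

dr = driver.Driver({}, business_logic)
inputs = {
    "spend": pd.Series([10.0, 10.0, 20.0, 40.0, 40.0, 50.0]),  # made-up data
    "signups": pd.Series([1, 10, 50, 100, 200, 400]),  # made-up data
}
df = dr.execute(["spend_per_signup", "spend_zero_mean_unit_variance"], inputs=inputs)
print(df)
```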

examples/cache_hook/README.md → examples/caching_nodes/diskcache_adapter/README.md

+3 -3
@@ -1,5 +1,5 @@
-# Cache hook
-This hook uses the [diskcache](https://grantjenks.com/docs/diskcache/tutorial.html) to cache node execution on disk. The cache key is a tuple of the function's
+# DiskCache Adapter
+This adapter uses [diskcache](https://grantjenks.com/docs/diskcache/tutorial.html) to cache node execution on disk. The cache key is a tuple of the function's
`(source code, input a, ..., input n)`. This means a function will only be executed once for a given set of inputs
and source code hash. The cache is stored in a directory of your choice, and it can be shared across different runs of your
code. That way, as you develop, if the inputs and the code haven't changed, the function will not be executed again and
@@ -16,7 +16,7 @@ Disk cache has great features to:
> cache (both keys and values). Learn more about [caveats](https://grantjenks.com/docs/diskcache/tutorial.html#caveats).

> ❓ To store artifacts robustly, please use Hamilton materializers or the
-> [CachingGraphAdapter](https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/caching_nodes) instead.
+> [CachingGraphAdapter](https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/caching_nodes/caching_graph_adatper) instead.
> The `CachingGraphAdapter` stores tagged nodes directly on the file system using common formats (JSON, CSV, Parquet, etc.).
> However, it isn't aware of your function version and requires you to manually manage your disk space.
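
The cache-key behaviour described above can be pictured with plain `diskcache`; this sketch illustrates the idea only and is not the adapter's actual implementation.

```python
# Illustration only: cache a call keyed on the function's source code plus its
# inputs, so a change to either triggers recomputation on the next run.
import inspect

from diskcache import Cache

cache = Cache("./.hamilton_cache")  # directory of your choice; shared across runs


def cached_call(fn, *inputs):
    """Return a cached result when the same source code and inputs were seen before."""
    key = (inspect.getsource(fn), *inputs)
    if key in cache:
        return cache[key]
    result = fn(*inputs)
    cache[key] = result
    return result
```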
File renamed without changes.
