Deterministic checkpoint hashes for read_values, read_pandas, read_hf #1636

@ilongin

Description

Problem

Several read_* entry points produce a random checkpoint hash on every run, making checkpoint reuse impossible. The root cause is in read_records(): it creates a temp dataset via session.generate_temp_dataset_name(), which embeds a random UUID, and the starting hash is derived from that random dataset name.
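To illustrate the failure mode described above, here is a minimal stand-in (generate_temp_dataset_name and starting_hash below are hypothetical simplifications of the real session internals, not the actual API):

```python
import hashlib
import uuid

def generate_temp_dataset_name() -> str:
    # Stand-in for the session method: the temp name embeds a random UUID.
    return f"session_temp_{uuid.uuid4().hex}"

def starting_hash(dataset_name: str) -> str:
    # The starting checkpoint hash is derived from the dataset name.
    return hashlib.sha256(dataset_name.encode()).hexdigest()

# Two runs over identical input data still get two different starting
# hashes, so a previously saved checkpoint can never match.
h1 = starting_hash(generate_temp_dataset_name())
h2 = starting_hash(generate_temp_dataset_name())
assert h1 != h2
```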

Affected entry points:

| Entry point | Root cause |
| --- | --- |
| read_values() | Built on read_records() |
| read_pandas() | Built on read_values() |
| read_hf() | Built on read_values() |
| read_records() with concrete data | Same random temp name |
| read_database() | Built on read_records() with a generator |
| read_records() with generators | Same random temp name |

Working entry points (for reference):

| Entry point | Why it works |
| --- | --- |
| read_dataset() | Hash based on dataset name + version |
| read_storage() | Hash based on listing dataset name |
| read_csv(), read_parquet(), read_json() | Built on read_storage() |

Proposed solution

For entry points where data is concrete (in memory) at construction time, compute a deterministic hash from the input data upfront — before steps are applied. This preserves the current design where hash calculation is lightweight and doesn't require step application.

Concrete data (can fix now):

- read_values(): input is always concrete lists (**fr_map sequences)
- read_pandas(): input is a DataFrame converted to lists
- read_hf(): input is a list of split names
- read_records() when called with a list/dict

Approach: compute a streaming SHA256 hash of the serialized input rows at construction time. Python's hashlib.sha256 supports incremental .update() calls with constant memory. Hashing 1M rows of ~100 bytes each takes roughly 0.2s, which is negligible compared to DB insert time.
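A minimal sketch of the streaming approach (hash_rows is a hypothetical helper, and canonical JSON is used here only as a placeholder for whatever serialization we settle on):

```python
import hashlib
import json

def hash_rows(rows) -> str:
    """Fold serialized rows into a SHA256 digest with constant memory."""
    h = hashlib.sha256()
    for row in rows:
        # Sorted keys and fixed separators keep the JSON byte-identical
        # across runs for the same logical input.
        h.update(json.dumps(row, sort_keys=True, separators=(",", ":")).encode())
        h.update(b"\x00")  # row delimiter so concatenation is unambiguous
    return h.hexdigest()

rows = [{"id": i, "name": f"row-{i}"} for i in range(3)]
# Deterministic across runs, and works on any iterable of rows.
assert hash_rows(rows) == hash_rows(iter(rows))
```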

Challenges:

- Deterministic serialization of arbitrary Python objects (Pydantic models, datetimes, nested dicts). The _flatten_record + adjust_outputs pipeline already serializes these for DB insertion; we could reuse or mirror that logic for hashing.
- The hash needs to be available before read_dataset() is called on the temp dataset. We could either set it directly as the starting step hash on the DatasetQuery, or use a content-addressable dataset name.
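One way the serialization challenge could be handled, if we don't reuse the _flatten_record + adjust_outputs pipeline, is a canonical-JSON fallback. The names _canonical_default and record_digest below are illustrative, not existing API:

```python
import hashlib
import json
from datetime import date, datetime
from decimal import Decimal

def _canonical_default(obj):
    """Fallback serializer for types json doesn't handle natively."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return str(obj)
    if hasattr(obj, "model_dump"):  # Pydantic v2 models
        return obj.model_dump()
    raise TypeError(f"cannot hash value of type {type(obj)!r}")

def record_digest(record: dict) -> bytes:
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), default=_canonical_default
    ).encode()
    return hashlib.sha256(payload).digest()

# Nested dicts and datetimes serialize identically regardless of key order.
a = record_digest({"ts": datetime(2024, 1, 1), "meta": {"b": 2, "a": 1}})
b = record_digest({"meta": {"a": 1, "b": 2}, "ts": datetime(2024, 1, 1)})
assert a == b
```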

Generators/iterators (deferred to a separate discussion):

- read_database() passes a generator from SQL result iteration.
- Direct read_records() calls can also pass generators.
- These can't be hashed without consuming the iterator. Options include hashing during insertion (which breaks the lightweight hash-before-apply design) or leaving this as a known limitation. To be discussed separately.
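For reference, the hash-during-insertion option could wrap the generator so each row is folded into the digest as it streams toward the DB; the digest is only available after insertion completes, which is exactly why it conflicts with the hash-before-apply design. hashing_iter is a hypothetical helper:

```python
import hashlib
import json

def hashing_iter(rows, hasher):
    """Yield rows unchanged while folding each one into `hasher`."""
    for row in rows:
        hasher.update(json.dumps(row, sort_keys=True, separators=(",", ":")).encode())
        hasher.update(b"\x00")
        yield row

h = hashlib.sha256()
# Stand-in for the DB insert loop consuming the generator.
inserted = list(hashing_iter(({"id": i} for i in range(3)), h))
digest = h.hexdigest()  # only known after the iterator is fully consumed
```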

Related: #1629
