Problem
Several read_* entry points produce a random checkpoint hash on every run, making checkpoint reuse impossible. The root cause is read_records() — it creates a temp dataset via session.generate_temp_dataset_name() which includes a random UUID. The starting hash is derived from this random dataset name.
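A minimal sketch of the failure mode (names and bodies are illustrative, not the actual `session.generate_temp_dataset_name()` implementation): because the temp dataset name embeds a random UUID, any hash derived from it differs on every run.

```python
import hashlib
import uuid


def temp_dataset_name() -> str:
    # Illustrative stand-in for session.generate_temp_dataset_name():
    # the name embeds a random UUID, so it changes on every run.
    return f"session_tmp_{uuid.uuid4().hex}"


def starting_hash(dataset_name: str) -> str:
    # Illustrative: the starting checkpoint hash is derived from the
    # dataset name, so a random name yields a random hash.
    return hashlib.sha256(dataset_name.encode()).hexdigest()


# Two runs over identical input data still get different starting
# hashes, so checkpoints can never match between runs.
h1 = starting_hash(temp_dataset_name())
h2 = starting_hash(temp_dataset_name())
assert h1 != h2
```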
Affected entry points:

| Entry point | Root cause |
|---|---|
| `read_values()` | Built on `read_records()` |
| `read_pandas()` | Built on `read_values()` |
| `read_hf()` | Built on `read_values()` |
| `read_records()` with concrete data | Same random temp name |
| `read_database()` | Built on `read_records()` with a generator |
| `read_records()` with generators | Same random temp name |
Working entry points (for reference):

| Entry point | Why it works |
|---|---|
| `read_dataset()` | Hash based on dataset name + version |
| `read_storage()` | Hash based on listing dataset name |
| `read_csv()`, `read_parquet()`, `read_json()` | Built on `read_storage()` |
Proposed solution
For entry points where data is concrete (in memory) at construction time, compute a deterministic hash from the input data upfront — before steps are applied. This preserves the current design where hash calculation is lightweight and doesn't require step application.
Concrete data (can fix now):
- `read_values()` — input is always concrete lists (`**fr_map` sequences)
- `read_pandas()` — input is always a DataFrame converted to lists
- `read_hf()` — input is a list of split names
- `read_records()` when called with a list/dict
Approach: compute a streaming SHA256 hash of the serialized input rows at construction time. Python's `hashlib.sha256` supports incremental `.update()` calls with constant memory. Performance for 1M rows of ~100 bytes is roughly 0.2s, negligible compared to DB insert time.
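A sketch of the streaming approach, assuming JSON-compatible row values (richer types are the serialization challenge noted separately); the helper name is hypothetical:

```python
import hashlib
import json
from typing import Any, Iterable


def hash_rows(rows: Iterable[dict[str, Any]]) -> str:
    """Streaming SHA256 over serialized rows; constant memory."""
    h = hashlib.sha256()
    for row in rows:
        # sort_keys makes the bytes independent of dict insertion
        # order; fixed separators remove whitespace variation.
        h.update(json.dumps(row, sort_keys=True, separators=(",", ":")).encode())
        # Delimit rows so different row groupings can't collide.
        h.update(b"\x00")
    return h.hexdigest()
```

Key-order invariance means `hash_rows([{"b": 2, "a": 1}])` equals `hash_rows([{"a": 1, "b": 2}])`, while the per-row delimiter keeps `[{"a": 1}, {"b": 2}]` distinct from `[{"a": 1, "b": 2}]`.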
Challenges:
- Deterministic serialization of arbitrary Python objects (Pydantic models, datetimes, nested dicts). The `_flatten_record` + `adjust_outputs` pipeline already serializes these for DB insertion — we could reuse or mirror that logic for hashing.
- The hash needs to be available before `read_dataset()` is called on the temp dataset. Could either set it directly as the starting step hash on the `DatasetQuery`, or use a content-addressable dataset name.
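One way to approach the serialization challenge is a canonical-bytes helper with a `default=` fallback for non-JSON types. This is a hypothetical sketch covering a few common cases, not the `_flatten_record`/`adjust_outputs` logic itself:

```python
import json
from datetime import date, datetime
from decimal import Decimal
from typing import Any


def _canonical_default(obj: Any) -> Any:
    # Fallback for types json can't serialize natively. A real
    # implementation would mirror the existing DB-insertion
    # serialization; this covers a few common cases.
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return str(obj)
    if hasattr(obj, "model_dump"):  # Pydantic v2 models
        return obj.model_dump()
    raise TypeError(f"cannot canonicalize {type(obj).__name__}")


def canonical_bytes(record: dict[str, Any]) -> bytes:
    """Deterministic byte serialization of one record, for hashing."""
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), default=_canonical_default
    ).encode()
```

The main design risk is drift: if hashing serializes differently than insertion does, two logically identical inputs could hash differently, which is why reusing the existing pipeline is attractive.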
Generators/iterators (deferred — separate discussion):
- `read_database()` passes a generator from SQL result iteration
- Direct `read_records()` calls can pass generators
- These can't be hashed without consuming the iterator. Options include hashing during insertion (breaks the lightweight-hash-before-apply design) or leaving it as a known limitation. To be discussed separately.
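For reference, the "hash during insertion" option could look like the following hypothetical wrapper, which hashes rows as the consumer drains them; it illustrates the trade-off, since the digest only exists after insertion finishes:

```python
import hashlib
from typing import Iterator


class HashingIterator:
    """Wraps a row iterator and hashes each row as it is consumed.

    Hypothetical sketch of the hash-during-insertion option: the
    content hash only becomes available after the iterator is
    exhausted, which is exactly what breaks the
    lightweight-hash-before-apply design.
    """

    def __init__(self, rows: Iterator[bytes]):
        self._rows = rows
        self._sha = hashlib.sha256()

    def __iter__(self) -> Iterator[bytes]:
        for row in self._rows:
            self._sha.update(row)
            yield row

    def hexdigest(self) -> str:
        # Only meaningful once the wrapped iterator is fully consumed.
        return self._sha.hexdigest()


# Stand-in for DB insertion consuming the generator:
rows = HashingIterator(iter([b'{"a":1}', b'{"b":2}']))
consumed = list(rows)
digest = rows.hexdigest()
```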
Related: #1629