Deterministic checkpoint hashes for read_values, read_pandas, read_hf #1636

@ilongin

Description

Problem

Several read_* entry points produce a random checkpoint hash on every run, making checkpoint reuse impossible. The root cause is in read_records(): it creates a temp dataset via session.generate_temp_dataset_name(), which embeds a random UUID, and the starting hash is derived from that random dataset name.
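To illustrate the failure mode described above, here is a minimal stand-in (generate_temp_dataset_name and starting_hash below are hypothetical simplifications of the real session internals, not the actual API):

```python
import hashlib
import uuid

def generate_temp_dataset_name() -> str:
    # Stand-in for the session method: the temp name embeds a random UUID.
    return f"session_temp_{uuid.uuid4().hex}"

def starting_hash(dataset_name: str) -> str:
    # The starting checkpoint hash is derived from the dataset name.
    return hashlib.sha256(dataset_name.encode()).hexdigest()

# Two runs over identical input data still get two different starting
# hashes, so a previously saved checkpoint can never match.
h1 = starting_hash(generate_temp_dataset_name())
h2 = starting_hash(generate_temp_dataset_name())
assert h1 != h2
```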

Affected entry points:

| Entry point | Root cause |
| --- | --- |
| read_values() | Built on read_records() |
| read_pandas() | Built on read_values() |
| read_hf() | Built on read_values() |
| read_records() with concrete data | Same random temp name |
| read_database() | Built on read_records() with a generator |
| read_records() with generators | Same random temp name |

Working entry points (for reference):

| Entry point | Why it works |
| --- | --- |
| read_dataset() | Hash based on dataset name + version |
| read_storage() | Hash based on listing dataset name |
| read_csv(), read_parquet(), read_json() | Built on read_storage() |

Proposed solution

For entry points where data is concrete (in memory) at construction time, compute a deterministic hash from the input data upfront — before steps are applied. This preserves the current design where hash calculation is lightweight and doesn't require step application.

Concrete data (can fix now):

- read_values(): input is always concrete lists (**fr_map sequences)
- read_pandas(): input is a DataFrame converted to lists
- read_hf(): input is a list of split names
- read_records() when called with a list/dict

Approach: compute a streaming SHA256 hash of the serialized input rows at construction time. Python's hashlib.sha256 supports incremental .update() calls with constant memory. Hashing 1M rows of ~100 bytes each takes roughly 0.2s, which is negligible compared to DB insert time.
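A minimal sketch of the streaming approach (hash_rows is a hypothetical helper, and canonical JSON is used here only as a placeholder for whatever serialization we settle on):

```python
import hashlib
import json

def hash_rows(rows) -> str:
    """Fold serialized rows into a SHA256 digest with constant memory."""
    h = hashlib.sha256()
    for row in rows:
        # Sorted keys and fixed separators keep the JSON byte-identical
        # across runs for the same logical input.
        h.update(json.dumps(row, sort_keys=True, separators=(",", ":")).encode())
        h.update(b"\x00")  # row delimiter so concatenation is unambiguous
    return h.hexdigest()

rows = [{"id": i, "name": f"row-{i}"} for i in range(3)]
# Deterministic across runs, and works on any iterable of rows.
assert hash_rows(rows) == hash_rows(iter(rows))
```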

Challenges:

- Deterministic serialization of arbitrary Python objects (Pydantic models, datetimes, nested dicts). The _flatten_record + adjust_outputs pipeline already serializes these for DB insertion; we could reuse or mirror that logic for hashing.
- The hash needs to be available before read_dataset() is called on the temp dataset. We could either set it directly as the starting step hash on the DatasetQuery, or use a content-addressable dataset name.
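One way the serialization challenge could be handled, if we don't reuse the _flatten_record + adjust_outputs pipeline, is a canonical-JSON fallback. The names _canonical_default and record_digest below are illustrative, not existing API:

```python
import hashlib
import json
from datetime import date, datetime
from decimal import Decimal

def _canonical_default(obj):
    """Fallback serializer for types json doesn't handle natively."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return str(obj)
    if hasattr(obj, "model_dump"):  # Pydantic v2 models
        return obj.model_dump()
    raise TypeError(f"cannot hash value of type {type(obj)!r}")

def record_digest(record: dict) -> bytes:
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), default=_canonical_default
    ).encode()
    return hashlib.sha256(payload).digest()

# Nested dicts and datetimes serialize identically regardless of key order.
a = record_digest({"ts": datetime(2024, 1, 1), "meta": {"b": 2, "a": 1}})
b = record_digest({"meta": {"a": 1, "b": 2}, "ts": datetime(2024, 1, 1)})
assert a == b
```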

Generators/iterators (deferred to a separate discussion):

- read_database() passes a generator from SQL result iteration.
- Direct read_records() calls can also pass generators.
- These can't be hashed without consuming the iterator. Options include hashing during insertion (which breaks the lightweight hash-before-apply design) or leaving this as a known limitation. To be discussed separately.
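For reference, the hash-during-insertion option could wrap the generator so each row is folded into the digest as it streams toward the DB; the digest is only available after insertion completes, which is exactly why it conflicts with the hash-before-apply design. hashing_iter is a hypothetical helper:

```python
import hashlib
import json

def hashing_iter(rows, hasher):
    """Yield rows unchanged while folding each one into `hasher`."""
    for row in rows:
        hasher.update(json.dumps(row, sort_keys=True, separators=(",", ":")).encode())
        hasher.update(b"\x00")
        yield row

h = hashlib.sha256()
# Stand-in for the DB insert loop consuming the generator.
inserted = list(hashing_iter(({"id": i} for i in range(3)), h))
digest = h.hexdigest()  # only known after the iterator is fully consumed
```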

Related: #1629
