Skip to content

GridSearchOracle: persist _ordered_ids and _populate_next across reloads (#1055)#1056

Open
SAY-5 wants to merge 3 commits into
keras-team:masterfrom
SAY-5:say5-grid-search-state-persistence
Open

GridSearchOracle: persist _ordered_ids and _populate_next across reloads (#1055)#1056
SAY-5 wants to merge 3 commits into
keras-team:masterfrom
SAY-5:say5-grid-search-state-persistence

Conversation

@SAY-5
Copy link
Copy Markdown

@SAY-5 SAY-5 commented May 11, 2026

Closes #1055.

GridSearchOracle.__init__ builds two in-memory bookkeeping fields not present in Oracle:

  • _ordered_ids: a LinkedList of trial ids in hp-combo order
  • _populate_next: a queue of trial ids ready to produce their next combination

Neither is in Oracle.get_state / set_state and GridSearchOracle doesn't override them, so after a process restart with overwrite=False they reload empty. start_order rehydrates fine, the first end_trial pushes a trial id onto _populate_next, and the next populate_space calls _ordered_ids.next(old_trial_id) against an empty _data_to_indexKeyError: '<trial_id>' at gridsearch.py:80,197.

Override get_state / set_state on GridSearchOracle to persist both fields and rebuild the LinkedList from them on reload. Older state files (saved before this change) lazily rebuild from start_order so existing checkpoints keep working.

Regression tests in gridsearch_test.py cover:

  • the state round-trip (write → fresh oracle → resume, no KeyError)
  • legacy state files that lack the new keys still rehydrate via start_order

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements state persistence for GridSearch by adding get_state and set_state methods to the oracle, ensuring that search progress is correctly maintained across process restarts. It includes regression tests for state round-trips and backward compatibility with legacy checkpoints. Feedback was provided regarding the persistence of the LinkedList state, noting that directly saving the internal memory list may lead to corrupted trial ordering; a traversal-based approach was suggested to correctly capture the logical sequence of trial IDs.

# `GridSearch(..., overwrite=False)` after a kernel restart) keeps
# working without `KeyError` from `_ordered_ids.next()` (#1055).
state = super().get_state()
state["ordered_ids"] = list(self._ordered_ids._memory)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The _ordered_ids._memory list stores trial IDs in the order they were inserted into the LinkedList, which does not necessarily correspond to the logical sequence of the grid traversal. This happens when new hyperparameters are discovered or when using conditional search spaces, causing trials to be inserted in the middle of the list. Persisting _memory directly and then rebuilding the list by assuming its order is the sequence will result in a corrupted linked list upon search resumption. Instead, the linked list should be traversed using its next pointers to capture the correct logical sequence of trial IDs.

Suggested change
state["ordered_ids"] = list(self._ordered_ids._memory)
ordered_ids = []
current_index = self._ordered_ids._next_index[None]
while current_index is not None:
ordered_ids.append(self._ordered_ids._memory[current_index])
current_index = self._ordered_ids._next_index[current_index]
state["ordered_ids"] = ordered_ids

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KeyError on GridSearch resume: _ordered_ids and _populate_next aren't persisted

1 participant