KeyError on GridSearch resume: _ordered_ids and _populate_next aren't persisted #1055

@ahalp90

Describe the bug
kt.GridSearch(..., overwrite=False) in a fresh process (kernel restart, cell re-run, new script) reloads the prior trials fine, but the resumed process then fails as soon as its first trial completes, with KeyError: '<trial_id>' raised from LinkedList.next inside GridSearchOracle.populate_space.

Version: keras_tuner 1.4.8, Python 3.12, installed via pip

To Reproduce
No Colab link, sorry--I've been working off a private notebook with a long image download sequence (unrelated) that I doubt you'd want to repeat.
The bug is purely in GridSearchOracle's internal state, and reproducing it cleanly needs an interrupted trial (the process killed while a trial is running).

To reproduce:

  1. Run any kt.GridSearch(...) to completion of at least one trial.
  2. Restart the Python process (or re-run the cell that builds kt.GridSearch(..., overwrite=False)).
  3. Call .search(). After the first completed trial in the new process you get:

  File ".../keras_tuner/src/tuners/gridsearch.py", line 197, in populate_space
    next_id = self._ordered_ids.next(old_trial_id)
  File ".../keras_tuner/src/tuners/gridsearch.py", line 80, in next
    index = self._data_to_index[data]
KeyError: '0001'

Expected behavior
Resumed search picks up at the next un-tried grid combo, same as if the original process had kept going.

Additional context
Traced through the source. GridSearchOracle.__init__ creates two in-memory fields:

  • _ordered_ids: LinkedList — trial IDs in hp-combo order
  • _populate_next: list — queue of trial IDs ready to spawn the next combo

Neither is included in Oracle.get_state / set_state, and GridSearchOracle doesn't override either method. So on resume they come back empty, while start_order rehydrates fine. As soon as end_trial fires (e.g. for an interrupted trial retried via _retry_queue), it pushes a trial_id onto _populate_next.
The next populate_space call pops that id and looks it up in the empty _ordered_ids._data_to_index, which raises the KeyError.
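
To make the failure path concrete without needing keras_tuner or a killed process, here is a minimal sketch of the mechanism. MiniLinkedList and MiniOracle are hypothetical stand-ins that model only the fields named in this issue; they are not the real classes, and the real insert/next logic is more involved.

```python
class MiniLinkedList:
    """Simplified stand-in for the LinkedList in gridsearch.py, not the real class."""

    def __init__(self):
        self._memory = []          # trial IDs in grid order
        self._data_to_index = {}   # trial ID -> position in _memory

    def insert(self, data, data_pre=None):
        # Insert after data_pre, or append at the end when data_pre is None.
        if data_pre is None:
            pos = len(self._memory)
        else:
            pos = self._data_to_index[data_pre] + 1
        self._memory.insert(pos, data)
        self._data_to_index = {d: i for i, d in enumerate(self._memory)}

    def next(self, data):
        index = self._data_to_index[data]  # KeyError for an unknown ID
        if index + 1 < len(self._memory):
            return self._memory[index + 1]
        return None


class MiniOracle:
    """Hypothetical oracle that models only the state discussed in this issue."""

    def __init__(self):
        self.start_order = []
        self.end_order = []
        self._ordered_ids = MiniLinkedList()
        self._populate_next = []

    def get_state(self):
        # Mirrors the reported gap: the two private fields are not saved.
        return {
            "start_order": list(self.start_order),
            "end_order": list(self.end_order),
        }

    def set_state(self, state):
        self.start_order = list(state["start_order"])
        self.end_order = list(state["end_order"])

    def end_trial(self, trial_id):
        self.end_order.append(trial_id)
        self._populate_next.append(trial_id)

    def populate_space(self):
        old_trial_id = self._populate_next.pop(0)
        return self._ordered_ids.next(old_trial_id)


# First process: two trials created, the first one finishes.
first = MiniOracle()
first.start_order += ["0000", "0001"]
first._ordered_ids.insert("0000")
first._ordered_ids.insert("0001", "0000")
first.end_trial("0000")
next_in_first = first.populate_space()  # lookup works in the original process
saved = first.get_state()

# Second process: state is reloaded, then the interrupted trial completes.
resumed = MiniOracle()
resumed.set_state(saved)
resumed.end_trial("0001")
try:
    resumed.populate_space()
    error = None
except KeyError as exc:
    error = exc

print(repr(error))
```

In the stand-in, start_order survives the round trip but _ordered_ids does not, so the lookup in the resumed oracle fails on the first finished trial, matching the traceback above.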

gridsearch.py is byte-identical at v1.4.7, v1.4.8 and current master, so the bug is still present on master.

Two possible fixes:

a) Override get_state / set_state on GridSearchOracle to persist both fields plus rebuild the LinkedList in set_state.
or
b) Lazily reconstruct on first populate_space after reload — walk start_order to fill _ordered_ids in insertion order via _ordered_ids.insert(tid, prev_tid), seed _populate_next with end_order[-1]. Smaller change.
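
For option (a), a rough sketch of the shape I have in mind. Everything here is a stand-in so it runs on its own: StubOracle plays the role of the base Oracle, OrderedIds plays the role of LinkedList, and to_list plus the "ordered_ids" / "populate_next" state keys are hypothetical names, not existing keras_tuner API. It assumes the state dict has to stay JSON-serializable.

```python
import json


class OrderedIds:
    """Stand-in for LinkedList; to_list is a hypothetical helper."""

    def __init__(self):
        self._items = []

    def insert(self, data, data_pre=None):
        # Insert after data_pre, or append when data_pre is None.
        if data_pre is None:
            pos = len(self._items)
        else:
            pos = self._items.index(data_pre) + 1
        self._items.insert(pos, data)

    def next(self, data):
        pos = self._items.index(data) + 1
        return self._items[pos] if pos < len(self._items) else None

    def to_list(self):
        return list(self._items)


class StubOracle:
    """Stand-in for the base Oracle: persists only start_order and end_order."""

    def __init__(self):
        self.start_order = []
        self.end_order = []

    def get_state(self):
        return {
            "start_order": list(self.start_order),
            "end_order": list(self.end_order),
        }

    def set_state(self, state):
        self.start_order = list(state["start_order"])
        self.end_order = list(state["end_order"])


class PersistentGridOracle(StubOracle):
    """Sketch of option (a): persist both fields, rebuild the list on load."""

    def __init__(self):
        super().__init__()
        self._ordered_ids = OrderedIds()
        self._populate_next = []

    def get_state(self):
        state = super().get_state()
        state["ordered_ids"] = self._ordered_ids.to_list()
        state["populate_next"] = list(self._populate_next)
        return state

    def set_state(self, state):
        super().set_state(state)
        # Rebuild the linked order from the flat list saved in get_state.
        self._ordered_ids = OrderedIds()
        prev_id = None
        for trial_id in state.get("ordered_ids", []):
            self._ordered_ids.insert(trial_id, prev_id)
            prev_id = trial_id
        self._populate_next = list(state.get("populate_next", []))


# Round trip through JSON, as a saved oracle state would be.
oracle = PersistentGridOracle()
for tid, prev in [("0000", None), ("0001", "0000"), ("0002", "0001")]:
    oracle._ordered_ids.insert(tid, prev)
    oracle.start_order.append(tid)
oracle._populate_next.append("0001")

restored = PersistentGridOracle()
restored.set_state(json.loads(json.dumps(oracle.get_state())))
next_id = restored._ordered_ids.next(restored._populate_next[0])
print(next_id)
```

The state.get(..., []) defaults are there so that a state file written before the change would still load. In that case the fields come back empty, so option (b) would still be needed as a fallback for old checkpoints.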

Workaround I've got in my notebook:

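    def reseed_grid_picker(tuner):
        """Rebuild GridSearchOracle's in-memory ordering after a reload."""
        oracle = tuner.oracle
        # Already populated (same process), so there is nothing to rebuild.
        if oracle._ordered_ids._memory:
            return
        # Re-link trial IDs in the order the trials were started.
        prev_id = None
        for trial_id in oracle.start_order:
            oracle._ordered_ids.insert(trial_id, prev_id)
            prev_id = trial_id
        # Queue the last finished trial so populate_space continues from it.
        if oracle.end_order:
            oracle._populate_next.append(oracle.end_order[-1])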

Would you like to help us fix it?
Happy to open a PR.
