Describe the bug
Building kt.GridSearch(..., overwrite=False) in a fresh process (kernel restart, cell re-run, new script) reloads the prior trials fine, but the search then dies as soon as the first trial completes in the resumed process, with KeyError: '<trial_id>' raised from LinkedList.next inside GridSearchOracle.populate_space.
Version: keras_tuner 1.4.8, Python 3.12, installed via pip
To Reproduce
No Colab link, sorry--I've been working off a private notebook with a long image download sequence (unrelated) that I doubt you'd want to repeat.
Bug is purely in GridSearchOracle internal state and reproducing it cleanly needs an interrupted trial (process killed while a trial is running).
To reproduce:
- Run any kt.GridSearch(...) to completion of at least one trial.
- Restart the Python process (or re-run the cell that builds kt.GridSearch(..., overwrite=False)).
- Call .search(). After the first completed trial in the new process you get:
  File ".../keras_tuner/src/tuners/gridsearch.py", line 197, in populate_space
    next_id = self._ordered_ids.next(old_trial_id)
  File ".../keras_tuner/src/tuners/gridsearch.py", line 80, in next
    index = self._data_to_index[data]
KeyError: '0001'
Expected behavior
Resumed search picks up at the next un-tried grid combo, same as if the original process had kept going.
Additional context
Traced through the source. GridSearchOracle.__init__ creates two in-memory fields:
- _ordered_ids: LinkedList — trial IDs in hp-combo order
- _populate_next: list — queue of trial IDs ready to spawn the next combo
Neither is in Oracle.get_state / set_state, and GridSearchOracle doesn't override either. So on resume they come back empty while start_order rehydrates fine. As soon as end_trial fires (e.g. from an interrupted trial retried via _retry_queue) it pushes a trial_id onto _populate_next.
The next populate_space call pops that id and looks it up in the empty _ordered_ids._data_to_index, which raises the KeyError.
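To make the failure mode concrete, here is a minimal sketch that does not import keras_tuner at all. ToyOrderedIds is a hypothetical stand-in I wrote for this issue; it only mimics the insert(data, data_pre) / next(data) shape and the _data_to_index lookup described above, and the trial ids are made up.

```python
class ToyOrderedIds:
    """Hypothetical stand-in for the ordered-id structure; NOT keras_tuner code."""

    def __init__(self):
        self._items = []          # ids in order
        self._data_to_index = {}  # id -> position in _items

    def insert(self, data, data_pre=None):
        # Insert after data_pre, or append at the end when data_pre is None.
        if data_pre is None:
            pos = len(self._items)
        else:
            pos = self._data_to_index[data_pre] + 1
        self._items.insert(pos, data)
        self._data_to_index = {item: i for i, item in enumerate(self._items)}

    def next(self, data):
        index = self._data_to_index[data]  # KeyError if data was never inserted
        if index + 1 < len(self._items):
            return self._items[index + 1]
        return None


# State that survives the reload (made-up ids).
start_order = ["0000", "0001"]
end_order = ["0000", "0001"]

# State that does not survive: the ordered ids come back empty.
resumed = ToyOrderedIds()
populate_next = [end_order[-1]]  # what end_trial would push

try:
    resumed.next(populate_next[0])
    failed = False
    missing_key = None
except KeyError as exc:
    failed = True
    missing_key = exc.args[0]

# Rebuilding from start_order makes the same lookup succeed.
prev_id = None
for trial_id in start_order:
    resumed.insert(trial_id, prev_id)
    prev_id = trial_id

next_after_rebuild = resumed.next("0000")
```

The lookup only fails because the index is empty; once the ids are re-inserted, next() resolves again, which is the idea behind the workaround further down.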
gridsearch.py is byte-identical at v1.4.7, v1.4.8 and current master, so it's still there.
Two possible fixes:
a) Override get_state / set_state on GridSearchOracle to persist both fields plus rebuild the LinkedList in set_state.
or
b) Lazily reconstruct on first populate_space after reload — walk start_order to fill _ordered_ids in insertion order via _ordered_ids.insert(tid, prev_tid), seed _populate_next with end_order[-1]. Smaller change.
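For option (a), a rough sketch of the shape I have in mind, using toy classes rather than the real ones. ToyOracle and ToyGridOracle are hypothetical stand-ins, the state keys "ordered_ids" and "populate_next" are names I made up, and a plain list stands in for the LinkedList; the actual patch would have to rebuild the real LinkedList in set_state.

```python
import json


class ToyOracle:
    """Hypothetical stand-in for the base oracle; NOT keras_tuner code."""

    def __init__(self):
        self.start_order = []
        self.end_order = []

    def get_state(self):
        return {
            "start_order": list(self.start_order),
            "end_order": list(self.end_order),
        }

    def set_state(self, state):
        self.start_order = list(state["start_order"])
        self.end_order = list(state["end_order"])


class ToyGridOracle(ToyOracle):
    """Adds the two grid-specific fields to the persisted state."""

    def __init__(self):
        super().__init__()
        self._ordered_ids = []    # plain list standing in for the LinkedList
        self._populate_next = []

    def get_state(self):
        state = super().get_state()
        state["ordered_ids"] = list(self._ordered_ids)      # assumed key name
        state["populate_next"] = list(self._populate_next)  # assumed key name
        return state

    def set_state(self, state):
        super().set_state(state)
        # .get() keeps state written before this change loadable.
        self._ordered_ids = list(state.get("ordered_ids", []))
        self._populate_next = list(state.get("populate_next", []))


original = ToyGridOracle()
original.start_order = ["0000", "0001"]
original.end_order = ["0000"]
original._ordered_ids = ["0000", "0001"]
original._populate_next = ["0000"]

# Round-trip through JSON to mimic saving and reloading in a new process.
saved = json.dumps(original.get_state())
resumed = ToyGridOracle()
resumed.set_state(json.loads(saved))
```

State saved before such a change would still load, but with both fields empty, so a fallback like option (b) would probably still be needed for existing project directories.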
Workaround I've got in my notebook:
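def reseed_grid_picker(tuner):
    oracle = tuner.oracle
    # Already populated (same process, nothing was lost): leave it alone.
    if oracle._ordered_ids._memory:
        return
    # Re-insert the reloaded trial ids, each one after the previous.
    prev_id = None
    for trial_id in oracle.start_order:
        oracle._ordered_ids.insert(trial_id, prev_id)
        prev_id = trial_id
    # Queue the last finished trial so the next combo is spawned from it.
    if oracle.end_order:
        oracle._populate_next.append(oracle.end_order[-1])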
Would you like to help us fix it?
Happy to open a PR.