
Commit 8091386

esantorella authored and facebook-github-bot committed
Track cost; order oracle trace by completion order
Summary:

I am not sure if this is what we will want in the long run, but it will unblock benchmarking early stopping.

# What's wrong with the current behavior

**Ordering by start order vs. completion order:** Currently, the oracle trace is ordered by trial start order and has one entry for each trial. The inference trace has always been ordered by completion order because it is updated every time a trial ceases running. The order of completion (including early stopping) seems preferable for both, and it's a little weird for the oracle trace to have a different ordering than the inference trace. See here for discussion on this: https://fb.workplace.com/groups/1294299434097422/posts/2563368300523856

**Inability to compare more costly vs. less costly strategies:** Separately, tracking cost is necessary to fairly compare more aggressive vs. less aggressive early-stopping strategies, or to compare stopping early against not stopping at all. I am bundling these two changes (reordering the oracle trace and introducing cost) because the oracle trace should now only be compared against the cost; ordering by completion order doesn't make much sense without a notion of cost when multiple trials can complete at the same time.

# New behavior

| time | first trial running | second trial running | objective values | best point   |
| ---- | ------------------- | -------------------- | ---------------- | ------------ |
| 0    | 0                   | 1                    |                  |              |
| 1    | 0                   | 2                    | y_1              | y_a          |
| 2    | 0                   | 2                    | y_1              | not computed |
| 3    | 0                   | 2                    | y_1, y_0, y_2    | y_b          |

Assuming higher is better, this produces

```
BenchmarkResult:
    cost_trace: [1, 3]
    oracle_trace: [y_1, max(y_1, y_0, y_2)]
    inference_trace: [y_a, y_b]
```

Now traces are only updated when a trial completes, so there are 2 trace elements with 3 trials. (We could also just duplicate elements when multiple trials complete at the same time to preserve the length.) See docstrings for more detail.

# What's not ideal about this

I want to flag that a few things are not great about this setup.

* It makes plotting hard: if one replication produces a cost_trace of [3, 5] and another produces a cost_trace of [2, 6], how do we aggregate their optimization traces? We can do this by left-interpolating the optimization traces onto [2, 3, ..., 6] and then aggregating as usual, but it is clunky (a minimal sketch of this interpolation appears below).
* Even aside from the issue of different replications producing different cost traces, plotting is harder because plotting must now be against cost.
* People are typically interested in epoch-by-epoch results for early stopping, and those are not available here.

# Better long-term solution

Two alternatives are:

* Storing trace values for each time step, which would remove the need to track cost at all: element `i` of the trace would have happened at virtual second `i`.
* Storing cost/time information at each step in MapData, and then deriving a proper trace from there (we may already have this -- need to check).

# Internal:

Differential Revision: D69489720
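The aggregation issue flagged above can be made concrete. The helper below is a hypothetical sketch (not part of Ax): assuming numpy, it left-interpolates each replication's optimization trace onto the union of observed costs, carrying the last observed value forward, and then averages across replications.

```python
import numpy as np


def aggregate_traces_by_cost(
    cost_traces: list[np.ndarray],
    opt_traces: list[np.ndarray],
) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical helper: align per-replication traces on a shared cost grid.

    Each replication has a cumulative cost trace and an optimization trace of
    the same length. Build the union of all observed costs, carry each
    optimization trace's last observed value forward onto that grid (left
    interpolation), and average across replications. Grid points that precede
    a replication's first completion are NaN for that replication.
    """
    grid = np.unique(np.concatenate(cost_traces))
    aligned = []
    for cost, trace in zip(cost_traces, opt_traces):
        # Index of the last completion whose cumulative cost is <= the grid point.
        idx = np.searchsorted(cost, grid, side="right") - 1
        aligned.append(np.where(idx >= 0, trace[np.clip(idx, 0, None)], np.nan))
    return grid, np.nanmean(np.vstack(aligned), axis=0)


# The two replications from the example above, with made-up objective values:
grid, mean_trace = aggregate_traces_by_cost(
    cost_traces=[np.array([3.0, 5.0]), np.array([2.0, 6.0])],
    opt_traces=[np.array([0.4, 0.9]), np.array([0.3, 0.8])],
)
print(grid)        # [2. 3. 5. 6.]
print(mean_trace)  # [0.3  0.35 0.6  0.85]
```

An integer grid such as [2, 3, ..., 6], as described in the summary, would work the same way; the carry-forward logic is unchanged.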
1 parent 63a1eaf commit 8091386

File tree

4 files changed: +180 −133 lines changed


ax/benchmark/benchmark.py

Lines changed: 101 additions & 69 deletions
@@ -30,7 +30,7 @@
 from ax.benchmark.benchmark_method import BenchmarkMethod
 from ax.benchmark.benchmark_problem import BenchmarkProblem
 from ax.benchmark.benchmark_result import AggregatedBenchmarkResult, BenchmarkResult
-from ax.benchmark.benchmark_runner import BenchmarkRunner
+from ax.benchmark.benchmark_runner import BenchmarkRunner, get_total_runtime
 from ax.benchmark.benchmark_test_function import BenchmarkTestFunction
 from ax.benchmark.methods.sobol import get_sobol_generation_strategy
 from ax.core.arm import Arm
@@ -39,13 +39,14 @@
 from ax.core.optimization_config import OptimizationConfig
 from ax.core.search_space import SearchSpace
 from ax.core.trial_status import TrialStatus
-from ax.core.types import TParameterization, TParamValue
+from ax.core.types import TParamValue
 from ax.core.utils import get_model_times
 from ax.service.scheduler import Scheduler
 from ax.service.utils.best_point_mixin import BestPointMixin
 from ax.service.utils.scheduler_options import SchedulerOptions, TrialType
 from ax.utils.common.logger import DEFAULT_LOG_LEVEL, get_logger
 from ax.utils.common.random import with_rng_seed
+from pyre_extensions import assert_is_instance
 
 logger: Logger = get_logger(__name__)
 
@@ -172,23 +173,6 @@ def get_oracle_experiment_from_params(
     return experiment
 
 
-def get_oracle_experiment_from_experiment(
-    problem: BenchmarkProblem, experiment: Experiment
-) -> Experiment:
-    """
-    Get an ``Experiment`` that is the same as the original experiment but has
-    metrics evaluated at oracle values (noiseless ground-truth values
-    evaluated at the target task and fidelity)
-    """
-    return get_oracle_experiment_from_params(
-        problem=problem,
-        dict_of_dict_of_params={
-            trial.index: {arm.name: arm.parameters for arm in trial.arms}
-            for trial in experiment.trials.values()
-        },
-    )
-
-
 def get_benchmark_scheduler_options(
     method: BenchmarkMethod,
     include_sq: bool = False,
@@ -225,6 +209,35 @@ def get_benchmark_scheduler_options(
     )
 
 
+def _get_cumulative_cost(
+    previous_cost: float,
+    new_trials: set[int],
+    experiment: Experiment,
+) -> float:
+    """
+    Get the total cost of running a benchmark where `new_trials` have just
+    completed, and the cost up to that point was `previous_cost`.
+
+    If a backend simulator is used to track runtime the cost is just the
+    simulated time. If there is no backend simulator, it is still possible that
+    trials have varying runtimes without that being simulated, so in that case,
+    runtimes are computed.
+    """
+    runner = assert_is_instance(experiment.runner, BenchmarkRunner)
+    if runner.simulated_backend_runner is not None:
+        return runner.simulated_backend_runner.simulator.time
+
+    per_trial_times = (
+        get_total_runtime(
+            trial=experiment.trials[i],
+            step_runtime_function=runner.step_runtime_function,
+            n_steps=runner.test_function.n_steps,
+        )
+        for i in new_trials
+    )
+    return previous_cost + sum(per_trial_times)
+
+
 def benchmark_replication(
     problem: BenchmarkProblem,
     method: BenchmarkMethod,
@@ -284,16 +297,22 @@ def benchmark_replication(
         options=scheduler_options,
     )
 
-    # list of parameters for each trial
-    best_params_by_trial: list[list[TParameterization]] = []
+    # Each of these lists is added to when a trial completes or stops early.
+    # Since multiple trials can complete at once, there may be fewer elements in
+    # these traces than the number of trials run.
+    cost_trace: list[float] = []
+    best_params_list: list[Mapping[str, TParamValue]] = []  # For inference trace
+    evaluated_arms_list: list[set[Arm]] = []  # For oracle trace
 
     is_mf_or_mt = len(problem.target_fidelity_and_task) > 0
-    trials_used_for_best_point: set[int] = set()
 
     # Run the optimization loop.
     timeout_hours = method.timeout_hours
     remaining_hours = timeout_hours
 
+    previously_completed_trials = set()
+    cost = 0.0
+
     with with_rng_seed(seed=seed), warnings.catch_warnings():
         warnings.filterwarnings(
             "ignore",
@@ -302,28 +321,15 @@
             module="ax.modelbridge.cross_validation",
         )
         start = monotonic()
-        # These next several lines do the same thing as `run_n_trials`, but
+        # These next several lines do the same thing as
+        # `scheduler.run_n_trials`, but
         # decrement the timeout with each step, so that the timeout refers to
         # the total time spent in the optimization loop, not time per trial.
         scheduler.poll_and_process_results()
         for _ in scheduler.run_trials_and_yield_results(
             max_trials=problem.num_trials,
             timeout_hours=remaining_hours,
         ):
-            if timeout_hours is not None:
-                elapsed_hours = (monotonic() - start) / 3600
-                remaining_hours = timeout_hours - elapsed_hours
-                if remaining_hours <= 0.0:
-                    logger.warning("The optimization loop timed out.")
-                    break
-
-            if problem.is_moo or is_mf_or_mt:
-                # Inference trace is not supported for MOO.
-                # It's also not supported for multi-fidelity or multi-task
-                # problems, because Ax's best-point functionality doesn't know
-                # to predict at the target task or fidelity.
-                continue
-
             currently_completed_trials = {
                 t.index
                 for t in experiment.trials.values()
@@ -334,45 +340,70 @@
                 )
             }
             newly_completed_trials = (
-                currently_completed_trials - trials_used_for_best_point
-            )
-            if len(newly_completed_trials) == 0:
-                continue
-            for t in newly_completed_trials:
-                trials_used_for_best_point.add(t)
-
-            best_params = method.get_best_parameters(
-                experiment=experiment,
-                optimization_config=problem.optimization_config,
-                n_points=problem.n_best_points,
+                currently_completed_trials - previously_completed_trials
             )
-            # If multiple trials complete at the same time, add that number of
-            # points to the inference trace so that the trace has length equal to
-            # the number of trials.
-            for _ in newly_completed_trials:
-                best_params_by_trial.append(best_params)
+            previously_completed_trials = currently_completed_trials
+
+            if len(newly_completed_trials) > 0:
+                cost = _get_cumulative_cost(
+                    new_trials=newly_completed_trials,
+                    experiment=experiment,
+                    previous_cost=cost,
+                )
+                cost_trace.append(cost)
+
+                # Track what params are newly evaluated from those trials, for
+                # the oracle trace
+                params = {
+                    arm
+                    for i in newly_completed_trials
+                    for arm in experiment.trials[i].arms
+                }
+                evaluated_arms_list.append(params)
+
+                # Inference trace: Not supported for MOO.
+                # It's also not supported for multi-fidelity or multi-task
+                # problems, because Ax's best-point functionality doesn't know
+                # to predict at the target task or fidelity.
+                if not (problem.is_moo or is_mf_or_mt):
+                    best_params = method.get_best_parameters(
+                        experiment=experiment,
+                        optimization_config=problem.optimization_config,
+                        n_points=problem.n_best_points,
+                    )[0]
+                    best_params_list.append(best_params)
+
+            if timeout_hours is not None:
+                elapsed_hours = (monotonic() - start) / 3600
+                remaining_hours = timeout_hours - elapsed_hours
+                if remaining_hours <= 0.0:
+                    logger.warning("The optimization loop timed out.")
+                    break
 
     scheduler.summarize_final_result()
 
     # Construct inference trace from best parameters
-    inference_trace = np.full(problem.num_trials, np.nan)
-    for trial_index, best_params in enumerate(best_params_by_trial):
-        if len(best_params) == 0:
-            inference_trace[trial_index] = np.nan
-            continue
-        # Construct an experiment with one BatchTrial
-        best_params_oracle_experiment = get_oracle_experiment_from_params(
-            problem=problem,
-            dict_of_dict_of_params={0: {str(i): p for i, p in enumerate(best_params)}},
+    single_params_as_experiments = (
+        get_oracle_experiment_from_params(
+            problem=problem, dict_of_dict_of_params={0: {"0_0": params}}
         )
-        # Get the optimization trace. It will have only one point.
-        inference_trace[trial_index] = BestPointMixin._get_trace(
-            experiment=best_params_oracle_experiment,
-            optimization_config=problem.optimization_config,
-        )[0]
+        for params in best_params_list
+    )
+    inference_trace = np.array(
+        [
+            BestPointMixin._get_trace(
+                experiment=exp, optimization_config=problem.optimization_config
+            )[0]
+            for exp in single_params_as_experiments
+        ]
+    )
 
-    actual_params_oracle_experiment = get_oracle_experiment_from_experiment(
-        problem=problem, experiment=experiment
+    actual_params_oracle_experiment = get_oracle_experiment_from_params(
+        problem=problem,
+        dict_of_dict_of_params={
            i: {arm.name: arm.parameters for arm in arms}
            for i, arms in enumerate(evaluated_arms_list)
        },
     )
     oracle_trace = np.array(
         BestPointMixin._get_trace(
@@ -404,6 +435,7 @@ def benchmark_replication(
         inference_trace=inference_trace,
         optimization_trace=optimization_trace,
         score_trace=score_trace,
+        cost_trace=np.array(cost_trace),
         fit_time=fit_time,
         gen_time=gen_time,
     )
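For intuition about the bookkeeping in the loop above, here is a toy, Ax-free sketch of how `cost_trace` and the per-completion lists stay aligned when several trials finish in the same polling step. The trial runtimes and polling results are made up, and the cost is simply the sum of per-trial runtimes (mirroring the fallback path when no backend simulator is present).

```python
# Toy illustration only; no Ax dependency. Trials are keyed by index, each
# with a runtime (its cost contribution in the non-simulated fallback).
trial_runtime = {0: 3.0, 1: 1.0, 2: 2.0}  # trial 0 finishes last

# Each polling step reports the set of trials that have ceased running so far;
# trials 0 and 2 finish in the same step.
polls = [{1}, {1}, {0, 1, 2}]

previously_completed: set[int] = set()
cost = 0.0
cost_trace: list[float] = []
completion_groups: list[set[int]] = []  # stands in for evaluated_arms_list

for currently_completed in polls:
    newly_completed = currently_completed - previously_completed
    previously_completed = currently_completed
    if not newly_completed:
        continue  # nothing finished since the last poll; traces unchanged
    cost += sum(trial_runtime[i] for i in newly_completed)
    cost_trace.append(cost)
    completion_groups.append(newly_completed)

print(cost_trace)         # [1.0, 6.0] -- two trace elements for three trials
print(completion_groups)  # [{1}, {0, 2}]
```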

ax/benchmark/benchmark_result.py

Lines changed: 43 additions & 25 deletions
@@ -32,40 +32,57 @@ class BenchmarkResult(Base):
         name: Name of the benchmark. Should make it possible to determine the
             problem and the method.
         seed: Seed used for determinism.
-        oracle_trace: For single-objective problems, element i of the
-            optimization trace is the best oracle value of the arms evaluated
-            after the first i trials. For multi-objective problems, element i
-            of the optimization trace is the hypervolume of the oracle values of
-            the arms in the first i trials (which may be ``BatchTrial``s).
-            Oracle values are typically ground-truth (rather than noisy) and
-            evaluated at the target task and fidelity.
-        inference_trace: Inference trace comes from choosing a "best" point
-            based only on data that would be observable in realistic settings,
-            as specified by `BenchmarkMethod.get_best_parameters`,
-            and then evaluating the oracle value of that point according to the
-            problem's `OptimizationConfig`. For multi-objective problems, the
-            hypervolume of a set of points is considered.
+        oracle_trace: For single-objective problems, the oracle trace is the
+            cumulative best oracle objective value seen so far. For
+            multi-objective problems, it is the cumulative hypervolume of
+            feasible oracle objective values.
+
+            Oracle values are typically objective values that are at the ground
+            truth (not noisy) and evaluated at the target task and fidelity.
+
+            The trace may have fewer elements than the number of trials run if
+            multiple trials stop at the same time; the trace is updated whenever
+            trials stop (TrialStatus COMPLETED or EARLY_STOPPED). The number of
+            trials completed is reflected in the `cost_trace`, which is updated
+            at the same time as the `oracle_trace`. For example, if each trial
+            has a cost of 1, and `cost_trace[i] = 4`, then `oracle_trace[i]` is
+            the value of the best of the first four trials to complete, or the
+            feasible hypervolume of those trials.
+        inference_trace: Inference values come from choosing a "best" point or
+            points based only on data that would be observable in realistic
+            settings, as specified by `BenchmarkMethod.get_best_parameters`, and
+            then evaluating the oracle objective value of that point according
+            to the problem's `OptimizationConfig`.
 
             By default, if it is not overridden,
             `BenchmarkMethod.get_best_parameters` uses the empirical best point
             if `use_model_predictions_for_best_point` is False and the best
             point of those evaluated so far if it is True.
 
-            Note: This is not "inference regret", which is a lower-is-better value
-                that is relative to the best possible value. The inference value
-                trace is higher-is-better if the problem is a maximization problem
-                or if the problem is multi-objective (in which case hypervolume is
-                used). Hence, it is signed the same as ``oracle_trace`` and
-                ``optimization_trace``. ``score_trace`` is higher-is-better and
-                relative to the optimum.
-        optimization_trace: Either the ``oracle_trace`` or the
-            ``inference_trace``, depending on whether the ``BenchmarkProblem``
-            specifies ``report_inference_value``. Having ``optimization_trace``
-            specified separately is useful when we need just one value to
-            evaluate how well the benchmark went.
+            As with the oracle trace, the inference trace is updated whenever a
+            trial completes and may have fewer elements than the number of trials.
+
+            Note: This is scaled differently from "inference regret", which is a
+            lower-is-better value that is relative to the best possible value.
+            The inference value trace is higher-is-better if the problem is a
+            maximization problem or if the problem is multi-objective (in which
+            case hypervolume is used). Hence, it is signed the same as
+            `oracle_trace` and `optimization_trace`. `score_trace`, meanwhile,
+            is higher-is-better and relative to the optimum.
+        optimization_trace: Either the `oracle_trace` or the `inference_trace`,
+            depending on whether the `BenchmarkProblem` specifies
+            `report_inference_value`. Having `optimization_trace` specified
+            separately is useful when we need just one value to evaluate how
+            well the benchmark went.
         score_trace: The scores associated with the problem, typically either
             the optimization_trace or inference_value_trace normalized to a
             0-100 scale for comparability between problems.
+        cost_trace: The cumulative cost of completed trials. The `cost_trace` is
+            updated whenever a trial completes, so, like the `oracle_trace` and
+            `inference_trace`, it can have fewer elements than the number of
+            trials if multiple trials complete at the same time. Trials that do
+            not produce `MapData` have a cost of 1, and trials that produce
+            `MapData` have a cost equal to the length of the `MapData`.
         fit_time: Total time spent fitting models.
         gen_time: Total time spent generating candidates.
         experiment: If not ``None``, the Ax experiment associated with the
@@ -81,6 +98,7 @@ class BenchmarkResult(Base):
     inference_trace: npt.NDArray
     optimization_trace: npt.NDArray
     score_trace: npt.NDArray
+    cost_trace: npt.NDArray
 
     fit_time: float
     gen_time: float
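With the new `cost_trace` field, a replication's traces are naturally plotted against cumulative cost rather than trial index. A minimal plotting sketch, assuming matplotlib is available and using made-up trace values in place of a real `BenchmarkResult`:

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for result.cost_trace and result.oracle_trace.
cost_trace = np.array([1.0, 3.0])
oracle_trace = np.array([0.7, 0.95])

# "post" steps: the best-seen value holds until the next completion.
plt.step(cost_trace, oracle_trace, where="post")
plt.xlabel("cumulative cost (simulated time or summed trial runtimes)")
plt.ylabel("best oracle value so far")
plt.title("Oracle trace vs. cost, one replication")
plt.show()
```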

0 commit comments
