LubyTS is a rollout sampling algorithm, which samples actions for rollouts of depth following the Luby universal sequence, then reset and performs another rollout if a solution is found. This continues until the budget is exhausted.
- Orseau, Laurent, et al. "Single-agent policy tree search with guarantees." Advances in Neural Information Processing Systems 31 (2018).
LubyTS needs as input a valid environment and model with policy evaluations.
Environment objects need to support each of the following to satisfy the constraint
- Supports standard equality and hash operators
apply_actionupdates the internal state by applying the action and returns the step costget_observationgets the flat state observation, as a vector of floatget_hashgets the hash of the underlying stateis_solutionchecks for a solution stateis_terminalchecks for terminal non-solution states (this is often just callingis_solution)
All the environments in libpolicyts/env/ satisfies this concept.
template <typename T>
concept IsEnv = std::equality_comparable<T> && IsSTDHashable<T> && requires(T t, const T ct, const std::string &s) {
{ t.apply_action(makeval<int>()) } -> std::same_as<double>; // apply_action with int action and returns cost
{ ct.get_observation() } -> std::same_as<Observation>; // Observation for policy/heuristic inference
{ ct.get_hash() } -> std::same_as<uint64_t>; // get hash
{ ct.is_solution() } -> std::same_as<bool>; // Solution check
{ ct.is_terminal() } -> std::same_as<bool>; // Terminal check (both solution + non-solution)
*(&T::num_actions) == makeval<int>(); // Number of actions
};Models which evaluate policies need to support each of the following to satisfy the constraint.
- Has an inner type
InferenceInput, which can be constructed from anObservation inferencetakes either a vector ofInferenceInputand returns a vector of a struct such that it has an innerstd::vector<double> policy, OR takes a singleInferenceInputand returns a struct with an innerstd::vector<double> policy.
All of the provided model wrappers satisfies this concept.
template <typename T>
concept IsLubyModelValue = requires(T t, Observation &obs) {
{ t.inference(obs) } -> HasPolicy;
};
template <typename T>
concept IsLubyModelVector = requires(T t) {
// Has an inner type called InferenceInput
typename T::InferenceInput;
// Which is constructable from an observation
requires std::is_constructible_v<typename T::InferenceInput, Observation>;
// Inference takes as input a vector of inference inputs and must return a std::vector<...>
requires IsSpecialization<
std::remove_cvref_t<decltype(t.inference(makeval<std::vector<typename T::InferenceInput> &>()))>,
std::vector>;
// Returned vector element type must satisfy HasPolicy
requires HasPolicy<typename std::remove_cvref_t<
decltype(t.inference(makeval<std::vector<typename T::InferenceInput> &>()))>::value_type>;
};
// Luby model satisfies the following:
// Either takes an observation as input and returns a type holding a policy
// OR takes a vector of observations and returns a vector of types holding a policy
template <typename T>
concept IsLubyModel = IsLubyModelValue<T> || IsLubyModelVector<T>;Full example in examples/lubyts/.
// Policy which satisfies the constraint for LubyTS
template <int N>
struct Policy {
static_assert(N >= 1);
struct InferenceOutput {
std::vector<double> policy;
};
[[nodiscard]] auto inference([[maybe_unused]] libpts::Observation &observations) const -> InferenceOutput {
return {std::vector<double>(static_cast<std::size_t>(N), 1.0 / N)};
}
};
using State = ...
namespace luby = libpts::algorithm::luby;
luby::SearchInput<State, Policy<4>> search_input{
.puzzle_name = "puzzle_0",
.state = start_state,
.search_budget = budget,
.mix_epsilon = 0.0,
.seed = 0,
.stop_token = stop_token,
.policy_model = std::make_shared<SokobanPolicy>()
};
auto search_result = luby::search(search_input);For an example using the library policy model,
using Policy = libpts::model::PolicyConvNetWrapper;
auto model_config = Policy::get_default_json_config();
model_config["resnet_channels"] = 16;
model_config["resnet_blocks"] = 2;
model_config["policy_channels"] = 2;
model_config["policy_mlp_layers"] = std::vector<int>{8, 8};
model_config["use_batchnorm"] = false;
model_config["learning_rate"] = 3e-4;
model_config["l2_weight_decay"] = 1e-4;
// Policy which satisfies the constraint for LubyTS
auto model = std::make_shared<Policy>(
model_config,
start_state.observation_shape(),
start_state.num_actions,
"cpu",
""
);
luby::SearchInput<State, Policy> search_input{
.puzzle_name = "puzzle_0",
.state = start_state,
.search_budget = budget,
.mix_epsilon = 0.0,
.seed = 0,
.stop_token = stop_token,
.policy_model = std::make_shared<SokobanPolicy>()
};
auto search_result = luby::search(search_input);