A Gym-style environment and agent framework for the Wolfram Language. An
environment is a pure model (an Association of callbacks) wrapped in a
GymEnvironment object; the model keys are exactly the interface a state-space
search consumes, so env["Model"] is directly searchable and a search becomes a
policy. Policies are plain functions state -> action. Episodes, games,
and matches run them, including self-play. On top sits the full agent zoo:
bandits, dynamic programming, tabular model-free control, and deep RL.
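To make the "model is an Association of callbacks, policy is a function" idea concrete, here is a minimal self-contained sketch that uses only core Wolfram Language and does not load the paclet. The key names mirror the ones this README lists; the argument conventions of each callback, the toy `corridor` model, and the `rollout` helper are assumptions made for illustration, not the paclet's actual definitions.

```wolfram
(* Toy corridor: states are the integers 0..4; reaching 4 ends the episode. *)
(* Key names follow this README; the callback signatures are assumed. *)
corridor = <|
  "InitialState" -> 0,
  "Actions"   -> Function[s, If[s >= 4, {}, {"Left", "Right"}]],
  "Apply"     -> Function[{s, a}, Clip[s + If[a === "Right", 1, -1], {0, 4}]],
  "TerminalQ" -> Function[s, s >= 4],
  "Reward"    -> Function[s, If[s >= 4, 1, 0]]
|>;

(* A policy is any function state -> action *)
alwaysRight = Function[s, "Right"];
randomWalk  = Function[s, RandomChoice[corridor["Actions"][s]]];

(* Hypothetical helper: apply the policy until the model reports a terminal
   state, with a step cap as a safety net *)
rollout[model_, policy_, maxSteps_ : 100] :=
  NestWhileList[
    model["Apply"][#, policy[#]] &,
    model["InitialState"],
    Not[model["TerminalQ"][#]] &,
    1, maxSteps]

rollout[corridor, alwaysRight]  (* {0, 1, 2, 3, 4} *)
```

Because the model is plain data and the callbacks are pure, the same Association can be handed to a rollout, a planner, or a learner without any adapter code.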
PacletDirectoryLoad["."]; (* from this directory; or PacletInstall the paclet *)
Needs["Wolfram`Gym`"]
ttt = GymEnvironment["TicTacToe"];
ttt["Actions", ttt["InitialState"]] (* legal moves: {1, ..., 9} *)
(* random self-play, and a 100-game match summary *)
PlayGame[ttt, {RandomPolicy[ttt], RandomPolicy[ttt]}]["Outcome"]
PlayMatch[ttt, {RandomPolicy[ttt], RandomPolicy[ttt]}, 100]
Each learner returns an Agent that is itself callable as a policy. The
companion tutorial builds them up one rung at a time on classic environments.
AgentTrain is the one training entry point; its Method selects the algorithm,
and an interrupted run hands back the agent learned so far.
gw = GymEnvironment["GridWorld", {4, 4}];
ValueIteration[gw] (* dynamic programming: exact, model-based *)
AgentTrain[gw, Method -> "QLearning", "Episodes" -> 1200] (* tabular model-free, from experience *)
BanditAgent[GymEnvironment["Bandit"], "UCB"] (* k-armed bandit *)
cp = GymEnvironment["CartPole"];
AgentTrain[cp, Method -> "CrossEntropy"] (* gradient-free policy search *)
AgentTrain[cp, Method -> "DQN"] (* deep Q-network, gradients from THVMLink *)
The search-as-policy seam (with the sibling TreeSearch resource):
mcts = SearchPolicy[ttt, TreeSearch, Method -> "MonteCarlo", MaxIterations -> 1000];
PlayGame[ttt, {mcts, RandomPolicy[ttt]}]["Outcome"] (* the planner beats random *)

- Environment (GymEnvironment): a model Association of pure callbacks ("Actions", "Apply", "TerminalQ", "Reward", "Player", optional "StepReward", "Transitions", "Reset", "Observation"), plus presentation "Render", spaces, "InitialState", and "Players". States are plain expressions. env["prop"] reads a property, env["prop", args] applies a callback, env["Model"] returns the searchable Association. Built-ins: TicTacToe, GridWorld, Bandit, FrozenLake, CartPole, Pendulum, PushT (continuous control / manipulation), board games, AtariEnvironment (real Atari via the Arcade Learning Environment), and ARCEnvironment (ARC-AGI reasoning tasks and interactive games). GymEnvironment[] / AtariEnvironment[] / ARCEnvironment[] list the available environments as a Dataset, with notebook argument auto-completion.
- Policy: any state -> action function: RandomPolicy, HeuristicPolicy, SearchPolicy[env, planner, opts], HumanPolicy.
- Orchestration: RunEpisode (single agent), PlayGame (multi-agent), PlayMatch (repeated, mean per-player outcome); self-play is PlayGame[env, {p, p}].
- Learners: BanditAgent, ValueIteration, PolicyIteration, and AgentTrain, the super-function whose Method selects "QLearning", "Sarsa", "CrossEntropy", "DQN", "AlphaZero" (self-play for two-player games), "MPC" (model-predictive control planning), or a custom Association. Each returns an Agent; an aborted AgentTrain run returns the partially trained agent.
- World models: WorldModel[env] learns a model of an environment and returns it as a learned GymEnvironment, so it composes: plan in it (AgentTrain[WorldModel[env], Method -> "MPC"]) or learn inside it (AgentTrain[WorldModel[env], Method -> "DQN"], learning in imagination).
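As an illustration of what the model-based learners in the list above compute, the following is a generic value-iteration sketch in core Wolfram Language. It is not the paclet's ValueIteration implementation; the four-state chain, the helper names (`actions`, `step`, `reward`, `backup`, `greedy`), and the discount `gamma` are all hypothetical choices for this example.

```wolfram
(* Hypothetical deterministic chain: states 0..3, state 3 is terminal *)
states = Range[0, 3];
actions[s_] := If[s == 3, {}, {-1, 1}];
step[s_, a_] := Clip[s + a, {0, 3}];
reward[s_, a_] := If[step[s, a] == 3, 1., 0.];  (* 1 on entering the goal *)
gamma = 0.9;

(* One Bellman optimality backup over every state; terminal states keep value 0 *)
backup[v_] := Association @ Table[
    s -> If[actions[s] === {}, 0.,
      Max[Table[reward[s, a] + gamma*v[step[s, a]], {a, actions[s]}]]],
    {s, states}];

(* Iterate the backup until the values stop changing *)
v0 = AssociationMap[0. &, states];
vStar = FixedPoint[backup, v0, 100,
   SameTest -> (Max[Abs[Values[#1] - Values[#2]]] < 10^-9 &)];

(* Greedy policy with respect to the converged values: a function state -> action *)
greedy[s_] := First @ MaximalBy[actions[s], reward[s, #] + gamma*vStar[step[s, #]] &];

vStar             (* approximately <|0 -> 0.81, 1 -> 0.9, 2 -> 1., 3 -> 0.|> *)
greedy /@ {0, 1, 2}  (* {1, 1, 1}: always move toward the goal *)
```

The result of planning is again just a function from state to action, which is why a learner's output can be used anywhere a policy is expected.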
The paclet uses StructuredPackageFormat: Kernel/Gym.wl two-pass-reads the
feature files (Spaces, Environments, Policies, Rollout, Agents,
DeepRL, Training, SelfPlay, WorldModel, Atari, ARC) into the
Wolfram`Gym` context.
AgentTrain[env, Method -> "DQN"] trains a neural action-value network by
gradient descent; the gradient of the temporal-difference loss is computed by
THVMLink, the local experimental deep-learning runtime. Install it as a paclet
so Needs/PackageImport resolve it (the DeepRL feature file does
PackageImport["THVMLink"]). Set the DEV environment variable to metal to
train on the GPU.
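For reference, the textbook one-step temporal-difference objective for a Q-network with parameters $\theta$ is shown below. This is the standard formulation, stated as an assumption about what "temporal-difference loss" means here; this README does not say whether the paclet adds a target network, a replay buffer, or a different loss shape such as Huber.

$$
y = r + \gamma \max_{a'} Q_\theta(s', a'), \qquad
L(\theta) = \tfrac{1}{2}\bigl(y - Q_\theta(s, a)\bigr)^2, \qquad
\nabla_\theta L = -\bigl(y - Q_\theta(s, a)\bigr)\,\nabla_\theta Q_\theta(s, a)
$$

Here $(s, a, r, s')$ is one observed transition, $\gamma$ is the discount factor, and the target $y$ is treated as a constant when differentiating (with $y = r$ when $s'$ is terminal).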
wl -f run_tests.wls
runs the Tests/*.wlt VerificationTest suite via TestReport and exits
non-zero on failure (75 tests across 8 files; the DeepRL file trains a DQN and
a cross-entropy policy, so it runs for a couple of minutes).
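The reporting pattern described above can be sketched with built-in testing functions alone. This is not the contents of run_tests.wls; the two inline tests are hypothetical stand-ins for what a .wlt file might hold, and `exitCode` is a name chosen for this example.

```wolfram
(* Hypothetical stand-ins for the tests a Tests/*.wlt file would contain *)
tests = {
  VerificationTest[1 + 1, 2],
  VerificationTest[StringJoin["Gym", "Env"], "GymEnv"]
};

(* TestReport aggregates the individual results into one report object *)
report = TestReport[tests];

(* Map overall success onto a process exit code *)
exitCode = If[TrueQ[report["AllTestsSucceeded"]], 0, 1]  (* 0 when every test passes *)
```

A script would typically finish with Exit[exitCode] so that a CI job sees the failure.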
Literate-markdown sources live in docs/ (Guides/, Symbols/, Tutorials/);
build.wls converts them to notebooks under Documentation/English/ with
MarkdownToNotebook.
Part of the WolframInstitute example collection; pairs with the
TreeSearch resource (search becomes a policy).