WolframInstitute/Gym

Gym

A Gym-style environment and agent framework for the Wolfram Language. An environment is a pure model (an Association of callbacks) wrapped in a GymEnvironment object; the model keys are exactly the interface a state-space search consumes, so env["Model"] is directly searchable and a search becomes a policy. Policies are plain functions state -> action. Episodes, games, and matches run them, including self-play. On top sits the full agent zoo: bandits, dynamic programming, tabular model-free control, and deep RL.

PacletDirectoryLoad["."];   (* from this directory; or PacletInstall the paclet *)
Needs["Wolfram`Gym`"]

ttt = GymEnvironment["TicTacToe"];
ttt["Actions", ttt["InitialState"]]          (* legal moves: {1, ..., 9} *)

(* random self-play, and a 100-game match summary *)
PlayGame[ttt, {RandomPolicy[ttt], RandomPolicy[ttt]}]["Outcome"]
PlayMatch[ttt, {RandomPolicy[ttt], RandomPolicy[ttt]}, 100]
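To make the "model is an Association of callbacks, policy is a function" idea concrete, here is a minimal stand-alone sketch that does not load the paclet. The key names follow the ones this README lists, but the callback signatures, the Nim game, and the helper names (nim, playOut, optimalPolicy, randomPolicy) are assumptions made for illustration, not the paclet's implementation.

```wolfram
(* Hypothetical model: 7-stone Nim, take 1-3 stones, taking the last stone wins. *)
nim = <|
  "InitialState" -> <|"Pile" -> 7, "Player" -> 1|>,
  "Actions" -> Function[s, Range[Min[3, s["Pile"]]]],
  "Apply" -> Function[{s, a},
    <|"Pile" -> s["Pile"] - a, "Player" -> 3 - s["Player"]|>],
  "TerminalQ" -> Function[s, s["Pile"] == 0],
  "Player" -> Function[s, s["Player"]],
  (* at a terminal state the player to move has lost; rewards are {player 1, player 2} *)
  "Reward" -> Function[s, If[s["Player"] == 1, {-1, 1}, {1, -1}]]
|>;

(* Policies are plain functions state -> action. *)
randomPolicy = Function[s, RandomChoice[nim["Actions"][s]]];
(* Leave the opponent a multiple of 4 whenever possible. *)
optimalPolicy = Function[s, With[{r = Mod[s["Pile"], 4]}, If[r == 0, 1, r]]];

(* Roll a game forward: ask the policy of the player to move, apply its action. *)
playOut[model_, policies_] := Module[{state = model["InitialState"]},
  While[! model["TerminalQ"][state],
    state = model["Apply"][state, policies[[model["Player"][state]]][state]]];
  model["Reward"][state]];

playOut[nim, {optimalPolicy, randomPolicy}]   (* {1, -1} *)
```

From 7 stones the first player can always leave a multiple of 4, so the result above does not depend on the random opponent's choices.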

The agent zoo

Each learner returns an Agent that is itself callable as a policy. The companion tutorial builds them up one rung at a time on classic environments. AgentTrain is the one training entry point; its Method selects the algorithm, and an interrupted run hands back the agent learned so far.

gw = GymEnvironment["GridWorld", {4, 4}];
ValueIteration[gw]                                   (* dynamic programming: exact, model-based *)
AgentTrain[gw, Method -> "QLearning", "Episodes" -> 1200]   (* tabular model-free, from experience *)
BanditAgent[GymEnvironment["Bandit"], "UCB"]         (* k-armed bandit *)

cp = GymEnvironment["CartPole"];
AgentTrain[cp, Method -> "CrossEntropy"]             (* gradient-free policy search *)
AgentTrain[cp, Method -> "DQN"]                      (* deep Q-network, gradients from THVMLink *)
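As a rough illustration of the dynamic-programming rung, the sketch below runs value iteration by hand on a five-cell corridor. It is stand-alone Wolfram Language and is not the paclet's ValueIteration; the corridor, the -1 step cost, the undiscounted return, and all names (n, step, terminalQ, backup) are assumptions chosen to keep the example small.

```wolfram
(* Hypothetical corridor: cells 1..5, cell 5 is terminal, each move costs 1. *)
n = 5;
actions = {-1, 1};                       (* step left or right *)
step[s_, a_] := Clip[s + a, {1, n}];     (* walls clamp the move *)
terminalQ[s_] := s == n;

(* One Bellman optimality backup over all states (no discounting). *)
backup[v_] := Table[
  If[terminalQ[s], 0, Max[Table[-1 + v[[step[s, a]]], {a, actions}]]],
  {s, n}];

(* Iterate the backup until the value table stops changing. *)
values = FixedPoint[backup, ConstantArray[0, n]]   (* {-4, -3, -2, -1, 0} *)

(* Greedy policy for the non-terminal cells: move toward the best next value. *)
greedy = Table[
  First[MaximalBy[actions, Function[a, values[[step[s, a]]]]]],
  {s, n - 1}]                                        (* {1, 1, 1, 1} *)
```

Each value is minus the number of steps to the goal, and the greedy policy always moves right, which is what an exact model-based solution should give here.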

The search-as-policy seam (with the sibling TreeSearch resource):

mcts = SearchPolicy[ttt, TreeSearch, Method -> "MonteCarlo", MaxIterations -> 1000];
PlayGame[ttt, {mcts, RandomPolicy[ttt]}]["Outcome"]   (* the planner beats random *)
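The seam can be pictured without either paclet: any search that scores the successors of a state yields a policy by picking the best-scoring action. The sketch below uses a memoized exhaustive negamax in place of Monte Carlo tree search, on a hypothetical Nim model; the model, its callback signatures, and the names negamax and searchPolicy are illustrative assumptions, not the SearchPolicy or TreeSearch implementation.

```wolfram
(* Hypothetical model; a state is the plain expression {stones left, player to move}. *)
model = <|
  "InitialState" -> {7, 1},
  "Actions" -> Function[s, Range[Min[3, First[s]]]],
  "Apply" -> Function[{s, a}, {First[s] - a, 3 - Last[s]}],
  "TerminalQ" -> Function[s, First[s] == 0]
|>;

(* Value of a state for the player to move. At a terminal state the previous
   mover took the last stone, so the player to move has lost (-1). *)
negamax[m_, s_] := negamax[m, s] =
  If[m["TerminalQ"][s],
    -1,
    Max[Table[-negamax[m, m["Apply"][s, a]], {a, m["Actions"][s]}]]];

(* Wrap the search as a policy: state -> action with the best searched value. *)
searchPolicy[m_] := Function[s,
  First[MaximalBy[m["Actions"][s],
    Function[a, -negamax[m, m["Apply"][s, a]]]]]];

policy = searchPolicy[model];
policy[model["InitialState"]]   (* 3: leaves the opponent 4 stones *)
```

Because the result is an ordinary function of the state, it can stand wherever a hand-written or random policy would.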

Design

  • Environment (GymEnvironment): a model Association of pure callbacks ("Actions", "Apply", "TerminalQ", "Reward", "Player", optional "StepReward", "Transitions", "Reset", "Observation"), plus presentation "Render", spaces, "InitialState", and "Players". States are plain expressions. env["prop"] reads a property, env["prop", args] applies a callback, env["Model"] returns the searchable Association. Built-ins: TicTacToe, GridWorld, Bandit, FrozenLake, CartPole, Pendulum, PushT (continuous control / manipulation), board games, AtariEnvironment (real Atari via the Arcade Learning Environment), and ARCEnvironment (ARC-AGI reasoning tasks and interactive games). GymEnvironment[] / AtariEnvironment[] / ARCEnvironment[] list the available environments as a Dataset, with notebook argument auto-completion.
  • Policy: any state -> action function: RandomPolicy, HeuristicPolicy, SearchPolicy[env, planner, opts], HumanPolicy.
  • Orchestration: RunEpisode (single agent), PlayGame (multi-agent), PlayMatch (repeated, mean per-player outcome); self-play is PlayGame[env, {p, p}].
  • Learners: BanditAgent, ValueIteration, PolicyIteration, and AgentTrain - the super-function whose Method selects "QLearning", "Sarsa", "CrossEntropy", "DQN", "AlphaZero" (self-play for two-player games), "MPC" (model-predictive control planning), or a custom Association. Each returns an Agent; an aborted AgentTrain run returns the partially trained agent.
  • World models: WorldModel[env] learns a model of an environment and returns it as a learned GymEnvironment, so it composes - plan in it (AgentTrain[WorldModel[env], Method -> "MPC"]) or learn inside it (AgentTrain[WorldModel[env], Method -> "DQN"], learning in imagination).

The paclet uses StructuredPackageFormat: Kernel/Gym.wl reads the feature files (Spaces, Environments, Policies, Rollout, Agents, DeepRL, Training, SelfPlay, WorldModel, Atari, ARC) in two passes into the Wolfram`Gym` context.

Deep RL and THVMLink

AgentTrain[env, Method -> "DQN"] trains a neural action-value network by gradient descent; the gradient of the temporal-difference loss is computed by THVMLink, the local experimental deep-learning runtime. Install it as a paclet so Needs/PackageImport resolve it (the DeepRL feature file does PackageImport["THVMLink"]). Set the DEV environment variable to metal to train on the GPU.
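One way to set the variable from inside a session is the built-in SetEnvironment. This is a sketch under assumptions: it takes the variable to be named DEV with the value metal, and it assumes THVMLink reads the variable when it is loaded, so the call is placed before any Needs. Neither detail is confirmed here; exporting the variable in the shell before starting the kernel is the conservative alternative.

```wolfram
(* Assumption: THVMLink consults DEV at load time, so set it first. *)
SetEnvironment["DEV" -> "metal"];

(* Confirm what the kernel process now sees. *)
Environment["DEV"]   (* "metal" *)
```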

Tests

wl -f run_tests.wls

runs the Tests/*.wlt VerificationTest suite via TestReport and exits non-zero on failure (75 tests across 8 files; the DeepRL file trains a DQN and a cross-entropy policy, so it runs for a couple of minutes).
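For orientation, the general shape of such a runner can be sketched with the built-in VerificationTest and TestReport. The two tests below are self-contained placeholders, not tests from this repository, and the exit-code handling is an assumption about how run_tests.wls might signal failure.

```wolfram
(* Placeholder tests; real ones would exercise the paclet's functions. *)
tests = {
  VerificationTest[Range[3], {1, 2, 3}, TestID -> "placeholder-range"],
  VerificationTest[Max[{-1, -4, -2}], -1, TestID -> "placeholder-max"]
};

(* Run the tests and summarize the outcome. *)
report = TestReport[tests];
report["TestsSucceededCount"]

(* Exit non-zero when any test fails, so a calling script or CI job notices. *)
If[! report["AllTestsSucceeded"], Exit[1]]
```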

Documentation

Literate-markdown sources live in docs/ (Guides/, Symbols/, Tutorials/); build.wls converts them to notebooks under Documentation/English/ with MarkdownToNotebook.

Part of the WolframInstitute example collection; pairs with the TreeSearch resource (search becomes a policy).
