Gym

A Gym-style environment and agent framework for the Wolfram Language. An environment is a pure model (an Association of callbacks) wrapped in a GymEnvironment object; the model keys are exactly the interface a state-space search consumes, so env["Model"] is directly searchable and a search becomes a policy. Policies are plain functions state -> action. Episodes, games, and matches run them, including self-play. On top sits the full agent zoo: bandits, dynamic programming, tabular model-free control, and deep RL.

PacletDirectoryLoad["."];   (* from this directory; or PacletInstall the paclet *)
Needs["Wolfram`Gym`"]

ttt = GymEnvironment["TicTacToe"];
ttt["Actions", ttt["InitialState"]]          (* legal moves: {1, ..., 9} *)

(* random self-play, and a 100-game match summary *)
PlayGame[ttt, {RandomPolicy[ttt], RandomPolicy[ttt]}]["Outcome"]
PlayMatch[ttt, {RandomPolicy[ttt], RandomPolicy[ttt]}, 100]

The agent zoo

Each learner returns an Agent that is itself callable as a policy. The companion tutorial builds them up one rung at a time on classic environments. AgentTrain is the one training entry point; its Method selects the algorithm, and an interrupted run hands back the agent learned so far.

gw = GymEnvironment["GridWorld", {4, 4}];
ValueIteration[gw]                                   (* dynamic programming: exact, model-based *)
AgentTrain[gw, Method -> "QLearning", "Episodes" -> 1200]   (* tabular model-free, from experience *)
BanditAgent[GymEnvironment["Bandit"], "UCB"]         (* k-armed bandit *)

cp = GymEnvironment["CartPole"];
AgentTrain[cp, Method -> "CrossEntropy"]             (* gradient-free policy search *)
AgentTrain[cp, Method -> "DQN"]                      (* deep Q-network, gradients from THVMLink *)
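
For readers new to the tabular rung, the standard Q-learning update can be sketched with built-ins only. This is an illustration of the textbook rule, not the paclet's implementation; the name qUpdate, its argument order, and the default alpha and gamma are all assumptions made for this example.

```wolfram
(* Hypothetical helper: one textbook Q-learning update.
   q is an Association keyed by {state, action}; missing entries count as 0.
   actions is the list of legal actions in the next state s2 ({} if s2 is terminal). *)
qUpdate[q_Association, {s_, a_, r_, s2_}, actions_List, alpha_ : 0.1, gamma_ : 0.99] :=
  Module[{old, best},
    old  = Lookup[q, Key[{s, a}], 0.];
    best = If[actions === {}, 0., Max[Lookup[q, Key[{s2, #}], 0.] & /@ actions]];
    (* move the old estimate toward the bootstrapped target r + gamma * best *)
    Association[q, {s, a} -> old + alpha (r + gamma best - old)]
  ];

(* one transition: in state 1, action "right" gave reward 1 and led to state 2 *)
q1 = qUpdate[<||>, {1, "right", 1., 2}, {"left", "right"}]
```

Starting from an empty table, the single update moves the entry for {1, "right"} a fraction alpha of the way toward the observed reward.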

The search-as-policy seam (with the sibling TreeSearch resource):

mcts = SearchPolicy[ttt, TreeSearch, Method -> "MonteCarlo", MaxIterations -> 1000];
PlayGame[ttt, {mcts, RandomPolicy[ttt]}]["Outcome"]   (* the planner should win or draw against a random opponent *)
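
The idea that a search becomes a policy can be shown without either paclet. The sketch below is an assumption-laden toy, not the SearchPolicy or TreeSearch API: the nim model, the value function, and searchPolicy are hypothetical names, and the search is a plain memoized negamax rather than Monte Carlo tree search.

```wolfram
(* Hypothetical model: one-pile Nim. The state is the number of stones left;
   a move takes 1 to 3 stones and whoever takes the last stone wins. *)
nim = <|
  "Actions"   -> Function[s, Range[Min[3, s]]],
  "Apply"     -> Function[{s, a}, s - a],
  "TerminalQ" -> Function[s, s == 0]
|>;

(* negamax value of a state for the player about to move (memoized) *)
value[m_, s_] := value[m, s] =
  If[m["TerminalQ"][s],
    -1,  (* no stones left: the previous player just won *)
    Max[-value[m, m["Apply"][s, #]] & /@ m["Actions"][s]]];

(* closing the search over the model yields a plain state -> action function *)
searchPolicy[m_] :=
  Function[s, First @ MaximalBy[m["Actions"][s], -value[m, m["Apply"][s, #]] &]];

searchPolicy[nim][7]   (* takes 3, leaving the opponent a multiple of 4 *)
```

Because the result is an ordinary function of the state, it can be used anywhere a policy is expected, which is the property the seam relies on.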

Design

Environment (GymEnvironment): a model Association of pure callbacks ("Actions", "Apply", "TerminalQ", "Reward", "Player", optional "StepReward", "Transitions", "Reset", "Observation"), plus presentation "Render", spaces, "InitialState", and "Players". States are plain expressions. env["prop"] reads a property, env["prop", args] applies a callback, env["Model"] returns the searchable Association. Built-ins: TicTacToe, GridWorld, Bandit, FrozenLake, CartPole, Pendulum, PushT (continuous control / manipulation), board games, AtariEnvironment (real Atari via the Arcade Learning Environment), and ARCEnvironment (ARC-AGI reasoning tasks and interactive games). GymEnvironment[] / AtariEnvironment[] / ARCEnvironment[] list the available environments as a Dataset, with notebook argument auto-completion.
Policy: any state -> action function: RandomPolicy, HeuristicPolicy, SearchPolicy[env, planner, opts], HumanPolicy.
Orchestration: RunEpisode (single agent), PlayGame (multi-agent), PlayMatch (repeated, mean per-player outcome); self-play is PlayGame[env, {p, p}].
Learners: BanditAgent, ValueIteration, PolicyIteration, and AgentTrain - the super-function whose Method selects "QLearning", "Sarsa", "CrossEntropy", "DQN", "AlphaZero" (self-play for two-player games), "MPC" (model-predictive control planning), or a custom Association. Each returns an Agent; an aborted AgentTrain run returns the partially trained agent.
World models: WorldModel[env] learns a model of an environment and returns it as a learned GymEnvironment, so it composes - plan in it (AgentTrain[WorldModel[env], Method -> "MPC"]) or learn inside it (AgentTrain[WorldModel[env], Method -> "DQN"], learning in imagination).
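
To make the "model Association of pure callbacks" concrete, here is a minimal sketch written with built-ins only. The key names come from the list above, but the argument conventions of each callback (for example that "Reward" takes just the state) are assumptions for this example and may differ from what GymEnvironment expects; corridor and randomWalk are hypothetical names.

```wolfram
(* Hypothetical single-agent model: a corridor of cells 1..5 with the goal at cell 5.
   States are plain integers; every callback is a pure function. *)
corridor = <|
  "InitialState" -> 1,
  "Actions"   -> Function[s, {-1, 1}],                      (* step left or right *)
  "Apply"     -> Function[{s, a}, Clip[s + a, {1, 5}]],     (* walls at both ends *)
  "TerminalQ" -> Function[s, s == 5],
  "Reward"    -> Function[s, If[s == 5, 1, 0]]
|>;

(* a policy is any function state -> action *)
randomWalk = Function[s, RandomChoice[corridor["Actions"][s]]];

(* roll out one episode, capped at 200 steps *)
trajectory = NestWhileList[
  Function[s, corridor["Apply"][s, randomWalk[s]]],
  corridor["InitialState"],
  Function[s, ! corridor["TerminalQ"][s]],
  1, 200];

{Length[trajectory] - 1, corridor["Reward"][Last[trajectory]]}   (* {steps taken, final reward} *)
```

Nothing in the rollout depends on the corridor itself, only on the callback keys, which is why the same Association can be handed to a planner or an orchestration function.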

The paclet uses StructuredPackageFormat: Kernel/Gym.wl two-pass-reads the feature files (Spaces, Environments, Policies, Rollout, Agents, DeepRL, Training, SelfPlay, WorldModel, Atari, ARC) into the Wolfram`Gym` context.

Deep RL and THVMLink

AgentTrain[env, Method -> "DQN"] trains a neural action-value network by gradient descent; the gradient of the temporal-difference loss is computed by THVMLink, the local experimental deep-learning runtime. Install it as a paclet so Needs/PackageImport resolve it (the DeepRL feature file does PackageImport["THVMLink"]). Set the DEV environment variable to metal to train on the GPU.
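
The temporal-difference loss mentioned above can be written out numerically with built-ins. This is a sketch of the standard DQN target and squared-error loss on a toy batch, assuming nothing about how the paclet or THVMLink structure it; tdTargets and tdLoss are hypothetical names.

```wolfram
(* Hypothetical helpers for a batch of transitions.
   rewards: list of r; nextQ: one row of Q-values per next state;
   done: True where the episode ended, so no bootstrapping happens there. *)
tdTargets[rewards_List, nextQ_List, done_List, gamma_ : 0.99] :=
  rewards + gamma (1 - Boole[done]) (Max /@ nextQ);

(* mean squared error between predicted Q(s, a) and the targets *)
tdLoss[predicted_List, targets_List] := Mean[(predicted - targets)^2];

targets = tdTargets[{1., 0.}, {{0.5, 2.}, {1., 3.}}, {False, True}, 0.9]
loss    = tdLoss[{2., 1.}, targets]
```

In training, the gradient of this loss with respect to the network weights is what the deep-learning runtime supplies; the sketch only shows the quantity being minimized.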

Tests

wl -f run_tests.wls

runs the Tests/*.wlt VerificationTest suite via TestReport and exits non-zero on failure (75 tests across 8 files; the DeepRL file trains a DQN and a cross-entropy policy, so it runs for a couple of minutes).
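
A runner with this behavior can be sketched in a few lines. This is not the contents of the repository's run_tests.wls, only an assumed minimal equivalent built from TestReport and Exit; the variable names are made up for the example.

```wolfram
(* Sketch of a test runner: run every .wlt file under Tests/ and
   exit with a non-zero status if any test failed. *)
files   = FileNames["*.wlt", "Tests"];
reports = TestReport /@ files;

(* print a per-file summary *)
MapThread[
  Print[FileNameTake[#1], ": ", #2["TestsSucceededCount"], " passed, ",
        #2["TestsFailedCount"], " failed"] &,
  {files, reports}];

failed = Total[#["TestsFailedCount"] & /@ reports];
Exit[If[failed > 0, 1, 0]]
```

Exiting with a non-zero status lets a shell or CI job detect failure without parsing the printed summary.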

Documentation

Literate-markdown sources live in docs/ (Guides/, Symbols/, Tutorials/); build.wls converts them to notebooks under Documentation/English/ with MarkdownToNotebook.

Part of the WolframInstitute example collection; pairs with the TreeSearch resource (search becomes a policy).

