
Conversation

vj-09 commented Jan 23, 2026

Summary

This PR adds CodeDark, the first data analytics environment for the OpenEnv Hub. CodeDark challenges AI agents to analyze real CSV datasets using Python/Pandas through multi-turn tool-use conversations, testing their ability to act as data scientists rather than mere code executors.

Key Features

  • Real Business Tasks: Bank marketing (750K rows) and road safety (500K rows) datasets with genuine analytical questions
  • Multi-Turn Interaction: Agents explore data, save notes, ask clarifications, and submit answers over multiple turns
  • Shaped Rewards: 80% correctness + 10% efficiency + 10% token cost for RL training (see the sketch after this list)
  • Pre-Benchmarked: 25 curated L4-L6 difficulty tasks validated on 11+ models with 1,844 completions
  • Live Demo: Already deployed at https://huggingface.co/spaces/albert-einstein-09/codedark
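
For RL consumers, here is a minimal sketch of how an 80/10/10 shaped reward can be combined. The weights and the 10-turn cap come from this PR; the component formulas and the token budget are illustrative assumptions, not the actual scoring.py.

```python
def shaped_reward(correct: bool, turns_used: int, tokens_used: int,
                  max_turns: int = 10, token_budget: int = 20_000) -> float:
    """Sketch only: the 80/10/10 weights are from this PR; the component
    formulas and token_budget are assumptions, not the real scoring.py."""
    correctness = 1.0 if correct else 0.0
    efficiency = max(0.0, 1.0 - turns_used / max_turns)        # fewer turns score higher
    token_score = max(0.0, 1.0 - tokens_used / token_budget)   # cheaper episodes score higher
    return 0.8 * correctness + 0.1 * efficiency + 0.1 * token_score
```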

Benchmark Results

Pre-benchmarked performance on the 25-task benchmark:

| Model | Accuracy | Avg Turns | Cost/Task |
| --- | --- | --- | --- |
| Claude Opus 4.5 | 77.3% | 4.2 | $0.89 |
| Qwen3 Max | 46.7% | 5.1 | $0.12 |
| Mistral Large | 45.3% | 5.8 | $0.18 |
| Llama 4 Maverick | 38.7% | 6.2 | $0.08 |

Full leaderboard: https://www.analytics-rl.com

Tools

| Tool | Description |
| --- | --- |
| `run_python` | Execute Python/pandas code (sandboxed) |
| `read_notes` | Read saved notes from previous turns |
| `save_note` | Save observations for later recall |
| `clarify` | Ask clarifying questions (max 2/episode) |
| `submit_answer` | Submit final answer (ends episode) |
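
The Quick Start below shows `run_python` and `submit_answer`; assuming the client exposes the remaining tools as methods of the same name (an assumption, since only those two appear in this description), a typical exploration turn might look like:

```python
from codedark_env import CodeDarkEnv

env = CodeDarkEnv("https://albert-einstein-09-codedark.hf.space")
obs = env.reset()

# Hedged sketch: method names mirror the tool names above; exact
# signatures and observation fields are assumptions.
obs = env.run_python("result = df.columns.tolist()")   # explore the schema
env.save_note(f"Columns: {obs['stdout']}")             # persist across turns
notes = env.read_notes()                               # recall earlier findings
obs = env.clarify("Does 'rate' mean the share of y == 'yes'?")  # max 2 per episode
```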

Task Difficulty Levels

| Level | Complexity | Example |
| --- | --- | --- |
| L4 | Quartile/binned | "Subscription rate in Q1 balance?" |
| L5 | Multi-condition | "Rate for month='may' AND job='management'?" |
| L6 | Nested extrema | "In lowest subscription month, what's avg day?" |
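
For intuition, the pandas an agent needs for the L5 and L6 examples might look like the sketch below. Column names (`month`, `job`, `day`, `y`) are assumptions about the bank dataset's schema, inferred from the example questions rather than confirmed by this PR.

```python
import pandas as pd

df = pd.read_csv("bank.csv")  # hypothetical path; the env actually loads data server-side

# L5: multi-condition rate (column names assumed)
mask = (df["month"] == "may") & (df["job"] == "management")
l5_rate = df.loc[mask, "y"].eq("yes").mean()

# L6: nested extrema -- find the lowest-subscription month, then aggregate within it
by_month = df.groupby("month")["y"].apply(lambda s: s.eq("yes").mean())
worst_month = by_month.idxmin()
l6_avg_day = df.loc[df["month"] == worst_month, "day"].mean()
```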

Quick Start

```python
from codedark_env import CodeDarkEnv

# Connect to environment
env = CodeDarkEnv("https://albert-einstein-09-codedark.hf.space")

# Reset for new task
obs = env.reset()
print(f"Task: {obs['question']}")

# Execute Python code
obs = env.run_python("result = df.shape")
print(f"Shape: {obs['stdout']}")

# Submit answer
obs = env.submit_answer(2.44)
print(f"Reward: {obs['reward']}")
```

Files Added

```
envs/codedark_env/
├── __init__.py           # Package exports
├── models.py             # Action, Observation, State dataclasses
├── client.py             # HTTP client implementation
├── README.md             # Full documentation
└── server/
    ├── __init__.py
    ├── environment.py    # Core environment logic
    ├── tools.py          # Tool implementations
    ├── scoring.py        # Reward computation
    ├── app.py            # FastAPI server
    ├── requirements.txt  # Python dependencies
    └── Dockerfile        # Container spec
```

Checklist

- [x] Environment follows OpenEnv spec with reset(), step(), state API
- [x] Pydantic models for Action, Observation, State
- [x] FastAPI server with standard endpoints (a sketch follows this list)
- [x] Docker container builds and runs
- [x] README with action/observation specs
- [x] Pre-benchmarked on multiple models
- [x] Live demo on HuggingFace Spaces
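
The standard endpoints referenced above are, per the sequence diagram later in this thread, POST /reset, POST /step, and GET /state. Below is a minimal sketch of that surface, assuming a server-side `CodeDarkEnvironment` class as listed under Files Added; the import path, handler signatures, and payload fields are assumptions, not the actual app.py.

```python
from fastapi import FastAPI

from environment import CodeDarkEnvironment  # hypothetical import; see server/environment.py

app = FastAPI()
env = CodeDarkEnvironment()

@app.post("/reset")
def reset(body: dict):
    # The sequence diagram shows reset(task_id, seed); payload shape is an assumption.
    return env.reset(task_id=body.get("task_id"), seed=body.get("seed"))

@app.post("/step")
def step(body: dict):
    # The sequence diagram shows POST /step {tool, args}.
    return env.step(tool=body["tool"], args=body.get("args", {}))

@app.get("/state")
def state():
    return env.state
```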

Author

Vijay Athithya (@vj-09)


cc @jspisak

CodeDark is the first data analytics environment for OpenEnv Hub.
It challenges AI agents to analyze real CSV datasets using Python/Pandas
through multi-turn tool-use conversations.

Key features:
- Real business tasks: Bank marketing (750K rows) + Road safety (500K rows)
- Multi-turn interaction: run_python, save_note, read_notes, clarify, submit_answer
- Shaped rewards: 80% correctness + 10% efficiency + 10% token cost
- Pre-benchmarked: 25 curated L4-L6 tasks, 11+ models, 1,844 completions
- Live demo: https://huggingface.co/spaces/albert-einstein-09/codedark

Benchmark results:
- Claude Opus 4.5: 77.3% accuracy
- Qwen3 Max: 46.7%
- Mistral Large: 45.3%
- Llama 4 Maverick: 38.7%

Co-Authored-By: Claude Opus 4.5 <[email protected]>

meta-cla bot commented Jan 23, 2026

Hi @vj-09!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


greptile-apps bot commented Jan 23, 2026

Greptile Summary

CodeDark adds a multi-turn data analytics environment where agents analyze CSV datasets using Python/Pandas tools. The implementation follows OpenEnv's core Gym-like API (reset, step, state) and correctly encapsulates rewards inside the environment boundary.

Key Changes:

  • New environment with 5 tools: run_python, read_notes, save_note, clarify, submit_answer
  • Shaped reward system: 80% correctness + 10% efficiency + 10% token cost
  • HTTP-only server using FastAPI (no WebSocket implementation yet)
  • Custom HTTP client instead of the MCP-based client pattern used by EchoEnv (see the sketch after this list)
  • Pre-benchmarked with 25 tasks across bank (750K rows) and road (500K rows) datasets
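
Since the custom client diverges from the MCP pattern, it is worth seeing how thin it likely is. Here is a sketch under the endpoint shapes shown in the sequence diagram below; the payload fields are assumptions, and the real client.py may differ.

```python
import requests

class CodeDarkClient:
    """Hedged sketch of a bare HTTP client; endpoints come from the
    sequence diagram below, payload fields are assumptions."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def reset(self, task_id=None, seed=None):
        r = requests.post(f"{self.base_url}/reset",
                          json={"task_id": task_id, "seed": seed})
        return r.json()

    def step(self, tool: str, args: dict):
        r = requests.post(f"{self.base_url}/step",
                          json={"tool": tool, "args": args})
        return r.json()

    def state(self):
        return requests.get(f"{self.base_url}/state").json()
```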

Issues Found:

  • CRITICAL: Security vulnerability in tools.py:53-55 - globals, locals, dir, vars in SAFE_BUILTINS allow sandbox escape
  • Client implementation uses custom HTTP pattern rather than MCPToolClient pattern
  • Missing WebSocket support (HTTP-only for now; per INVARIANTS.md, the HTTP transport is still being transitioned)

Architectural Notes:

  • Follows "rewards inside environment" principle correctly
  • No client-server separation violations
  • Doesn't expose reset() to agents (properly isolated)
  • Container isolation properly configured

Confidence Score: 3/5

  • This PR has one critical security vulnerability that must be fixed before merge
  • Score of 3 reflects a critical security issue in the sandbox (globals/locals exposure) that allows code execution escape. The architecture follows OpenEnv patterns correctly (rewards inside, dual API boundary respected), but the code execution vulnerability is a blocking issue. Additionally, the HTTP-only approach and custom client pattern diverge from emerging MCP standards, though these are architectural choices rather than bugs.
  • Pay close attention to envs/codedark_env/server/tools.py - the SAFE_BUILTINS dictionary exposes dangerous introspection functions

Important Files Changed

| Filename | Overview |
| --- | --- |
| envs/codedark_env/client.py | Custom HTTP client implementation instead of OpenEnv's standard MCP client pattern. Missing generic types and doesn't follow the EnvClient pattern. |
| envs/codedark_env/server/environment.py | Environment follows Gym-like API (reset, step, state) correctly. Rewards computed inside environment boundary as required. Minor: missing error handling in some edge cases. |
| envs/codedark_env/server/app.py | FastAPI server with standard endpoints. HTTP-only (no WebSocket). CORS allows all origins, which is acceptable for a demo but should be noted. |
| envs/codedark_env/server/tools.py | Code execution using exec with restricted builtins. Security concern: globals and locals exposed in SAFE_BUILTINS allow sandbox escape. |
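
For context on the tools.py finding: the restricted-exec pattern named in the table typically looks like the sketch below (the allowlist contents and function names here are assumptions, not the actual file). The review's point is that adding introspection builtins such as `globals`, `locals`, `dir`, or `vars` to this allowlist hands agent code a way back into the surrounding execution context.

```python
import contextlib
import io

import pandas as pd

# Assumed shape of the allowlist; the real SAFE_BUILTINS differs and,
# per the review, mistakenly included globals/locals/dir/vars.
SAFE_BUILTINS = {"len": len, "min": min, "max": max, "sum": sum,
                 "range": range, "print": print}  # no __import__, open, eval

def run_python(code: str, df: pd.DataFrame):
    """Execute agent code under a restricted builtin allowlist."""
    scope = {"__builtins__": SAFE_BUILTINS, "df": df, "pd": pd}
    out = io.StringIO()
    try:
        with contextlib.redirect_stdout(out):
            exec(code, scope)
        return out.getvalue(), "", 0          # (stdout, stderr, exit_code)
    except Exception as exc:
        return out.getvalue(), str(exc), 1
```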

Sequence Diagram

```mermaid
sequenceDiagram
    participant Agent
    participant Client as CodeDarkEnv<br/>(HTTP Client)
    participant Server as FastAPI Server
    participant Env as CodeDarkEnvironment
    participant Tools as Tool Functions
    participant Scoring as Reward System

    Agent->>Client: reset(task_id, seed)
    Client->>Server: POST /reset
    Server->>Env: reset(task_id, seed)
    Env->>Env: Load task from JSONL
    Env->>Env: Load CSV (bank/road)
    Env->>Env: Initialize state
    Env-->>Server: CodeDarkObservation
    Server-->>Client: JSON observation
    Client-->>Agent: Dict[str, Any]

    loop Multi-turn episode (max 10 turns)
        Agent->>Client: run_python(code)
        Client->>Server: POST /step {tool, args}
        Server->>Env: step(CodeDarkAction)
        Env->>Tools: run_python(code, df)
        Tools->>Tools: exec() with SAFE_BUILTINS
        Tools-->>Env: (stdout, stderr, exit_code)
        Env->>Env: Increment turn_count
        Env-->>Server: CodeDarkObservation
        Server-->>Client: JSON observation
        Client-->>Agent: Dict[str, Any]

        opt Save note
            Agent->>Client: save_note(content)
            Client->>Server: POST /step
            Server->>Env: step(CodeDarkAction)
            Env->>Tools: save_note(content, notes)
            Tools->>Tools: Append to notes list
            Tools-->>Env: Success
            Env-->>Server: CodeDarkObservation
            Server-->>Client: JSON
            Client-->>Agent: Dict[str, Any]
        end

        opt Ask clarification
            Agent->>Client: clarify(question)
            Client->>Server: POST /step
            Server->>Env: step(CodeDarkAction)
            Env->>Tools: clarify(question, ...)
            Tools->>Tools: Match to ambiguities
            Tools-->>Env: Clarification response
            Env-->>Server: CodeDarkObservation
            Server-->>Client: JSON
            Client-->>Agent: Dict[str, Any]
        end

        Agent->>Client: submit_answer(value)
        Client->>Server: POST /step
        Server->>Env: step(CodeDarkAction)
        Env->>Tools: submit_answer(answer_str)
        Tools->>Tools: Parse answer
        Tools-->>Env: Parsed answer
        Env->>Env: Set submitted=True
        Env->>Scoring: compute_reward(submitted, expected, ...)
        Scoring->>Scoring: score_correctness (80%)
        Scoring->>Scoring: score_efficiency (10%)
        Scoring->>Scoring: score_token_cost (10%)
        Scoring-->>Env: (reward, correctness, efficiency, cost)
        Env-->>Server: CodeDarkObservation (done=True, reward)
        Server-->>Client: JSON with reward
        Client-->>Agent: Dict[str, Any]
    end

    Agent->>Client: state()
    Client->>Server: GET /state
    Server->>Env: @property state
    Env-->>Server: CodeDarkState
    Server-->>Client: JSON state
    Client-->>Agent: Dict[str, Any]
```

greptile-apps bot left a comment

12 files reviewed, 1 comment


Comment on lines 53 to 55:

```python
    "locals": locals,
    "globals": globals,
    "dir": dir,
```

**logic:** Security risk: `locals`, `globals`, `dir`, and `vars` in `SAFE_BUILTINS` allow sandbox escape.

These introspection functions can access `__builtins__`, `__import__`, and other dangerous objects that bypass the restricted execution context.

Suggested change:

```diff
-    "locals": locals,
-    "globals": globals,
-    "dir": dir,
+    # "locals": locals,  # Removed - allows sandbox escape
+    # "globals": globals,  # Removed - allows sandbox escape
+    # "dir": dir,  # Removed - allows introspection of restricted objects
+    # "vars": vars,  # Removed - allows introspection of restricted objects
```

Remove globals, locals, dir, and vars from SAFE_BUILTINS as they allow
sandbox escape in the run_python tool. These introspection functions
could be used to access the full execution environment and bypass the
sandboxed code execution.

Addresses security concern raised in Greptile review.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
meta-cla bot added the CLA Signed label Jan 23, 2026
vj-09 (Author) commented Jan 31, 2026

Next steps?
