
Conversation

vj-09 commented Jan 23, 2026

Summary

This PR adds CodeDark, the first data analytics environment for the OpenEnv Hub. CodeDark challenges AI agents to analyze real CSV datasets using Python/Pandas through multi-turn tool-use conversations, testing their ability to act as data scientists rather than mere code executors.

Key Features

  • Real Business Tasks: Bank marketing (750K rows) and road safety (500K rows) datasets with genuine analytical questions
  • Multi-Turn Interaction: Agents explore data, save notes, ask clarifications, and submit answers over multiple turns
  • Shaped Rewards: 80% correctness + 10% efficiency + 10% token cost for RL training (see the sketch after this list)
  • Pre-Benchmarked: 25 curated L4-L6 difficulty tasks validated on 11+ models with 1,844 completions
  • Live Demo: Already deployed at https://huggingface.co/spaces/albert-einstein-09/codedark
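
For RL consumers, here is a minimal sketch of how an 80/10/10 shaped reward can be combined. The weights and the 10-turn cap come from this PR; the component formulas and the token budget are illustrative assumptions, not the actual scoring.py.

```python
def shaped_reward(correct: bool, turns_used: int, tokens_used: int,
                  max_turns: int = 10, token_budget: int = 20_000) -> float:
    """Sketch only: the 80/10/10 weights are from this PR; the component
    formulas and token_budget are assumptions, not the real scoring.py."""
    correctness = 1.0 if correct else 0.0
    efficiency = max(0.0, 1.0 - turns_used / max_turns)        # fewer turns score higher
    token_score = max(0.0, 1.0 - tokens_used / token_budget)   # cheaper episodes score higher
    return 0.8 * correctness + 0.1 * efficiency + 0.1 * token_score
```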

Benchmark Results

Pre-benchmarked performance on the 25-task benchmark:

| Model | Accuracy | Avg Turns | Cost/Task |
| --- | --- | --- | --- |
| Claude Opus 4.5 | 77.3% | 4.2 | $0.89 |
| Qwen3 Max | 46.7% | 5.1 | $0.12 |
| Mistral Large | 45.3% | 5.8 | $0.18 |
| Llama 4 Maverick | 38.7% | 6.2 | $0.08 |

Full leaderboard: https://www.analytics-rl.com

Tools

| Tool | Description |
| --- | --- |
| `run_python` | Execute Python/pandas code (sandboxed) |
| `read_notes` | Read saved notes from previous turns |
| `save_note` | Save observations for later recall |
| `clarify` | Ask clarifying questions (max 2/episode) |
| `submit_answer` | Submit final answer (ends episode) |
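
The Quick Start below shows `run_python` and `submit_answer`; assuming the client exposes the remaining tools as methods of the same name (an assumption, since only those two appear in this description), a typical exploration turn might look like:

```python
from codedark_env import CodeDarkEnv

env = CodeDarkEnv("https://albert-einstein-09-codedark.hf.space")
obs = env.reset()

# Hedged sketch: method names mirror the tool names above; exact
# signatures and observation fields are assumptions.
obs = env.run_python("result = df.columns.tolist()")   # explore the schema
env.save_note(f"Columns: {obs['stdout']}")             # persist across turns
notes = env.read_notes()                               # recall earlier findings
obs = env.clarify("Does 'rate' mean the share of y == 'yes'?")  # max 2 per episode
```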

Task Difficulty Levels

| Level | Complexity | Example |
| --- | --- | --- |
| L4 | Quartile/binned | "Subscription rate in Q1 balance?" |
| L5 | Multi-condition | "Rate for month='may' AND job='management'?" |
| L6 | Nested extrema | "In lowest subscription month, what's avg day?" |
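
For intuition, the pandas an agent needs for the L5 and L6 examples might look like the sketch below. Column names (`month`, `job`, `day`, `y`) are assumptions about the bank dataset's schema, inferred from the example questions rather than confirmed by this PR.

```python
import pandas as pd

df = pd.read_csv("bank.csv")  # hypothetical path; the env actually loads data server-side

# L5: multi-condition rate (column names assumed)
mask = (df["month"] == "may") & (df["job"] == "management")
l5_rate = df.loc[mask, "y"].eq("yes").mean()

# L6: nested extrema -- find the lowest-subscription month, then aggregate within it
by_month = df.groupby("month")["y"].apply(lambda s: s.eq("yes").mean())
worst_month = by_month.idxmin()
l6_avg_day = df.loc[df["month"] == worst_month, "day"].mean()
```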

Quick Start

```python
from codedark_env import CodeDarkEnv

# Connect to environment
env = CodeDarkEnv("https://albert-einstein-09-codedark.hf.space")

# Reset for new task
obs = env.reset()
print(f"Task: {obs['question']}")

# Execute Python code
obs = env.run_python("result = df.shape")
print(f"Shape: {obs['stdout']}")

# Submit answer
obs = env.submit_answer(2.44)
print(f"Reward: {obs['reward']}")
```

Files Added

```
envs/codedark_env/
├── __init__.py           # Package exports
├── models.py             # Action, Observation, State dataclasses
├── client.py             # HTTP client implementation
├── README.md             # Full documentation
└── server/
    ├── __init__.py
    ├── environment.py    # Core environment logic
    ├── tools.py          # Tool implementations
    ├── scoring.py        # Reward computation
    ├── app.py            # FastAPI server
    ├── requirements.txt  # Python dependencies
    └── Dockerfile        # Container spec
```

Checklist

- [x] Environment follows OpenEnv spec with reset(), step(), state API
- [x] Pydantic models for Action, Observation, State
- [x] FastAPI server with standard endpoints (a sketch follows this list)
- [x] Docker container builds and runs
- [x] README with action/observation specs
- [x] Pre-benchmarked on multiple models
- [x] Live demo on HuggingFace Spaces
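
The standard endpoints referenced above are, per the sequence diagram later in this thread, POST /reset, POST /step, and GET /state. Below is a minimal sketch of that surface, assuming a server-side `CodeDarkEnvironment` class as listed under Files Added; the import path, handler signatures, and payload fields are assumptions, not the actual app.py.

```python
from fastapi import FastAPI

from environment import CodeDarkEnvironment  # hypothetical import; see server/environment.py

app = FastAPI()
env = CodeDarkEnvironment()

@app.post("/reset")
def reset(body: dict):
    # The sequence diagram shows reset(task_id, seed); payload shape is an assumption.
    return env.reset(task_id=body.get("task_id"), seed=body.get("seed"))

@app.post("/step")
def step(body: dict):
    # The sequence diagram shows POST /step {tool, args}.
    return env.step(tool=body["tool"], args=body.get("args", {}))

@app.get("/state")
def state():
    return env.state
```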

Author

Vijay Athithya (@vj-09)


cc @jspisak

CodeDark is the first data analytics environment for OpenEnv Hub.
It challenges AI agents to analyze real CSV datasets using Python/Pandas
through multi-turn tool-use conversations.

Key features:
- Real business tasks: Bank marketing (750K rows) + Road safety (500K rows)
- Multi-turn interaction: run_python, save_note, read_notes, clarify, submit_answer
- Shaped rewards: 80% correctness + 10% efficiency + 10% token cost
- Pre-benchmarked: 25 curated L4-L6 tasks, 11+ models, 1,844 completions
- Live demo: https://huggingface.co/spaces/albert-einstein-09/codedark

Benchmark results:
- Claude Opus 4.5: 77.3% accuracy
- Qwen3 Max: 46.7%
- Mistral Large: 45.3%
- Llama 4 Maverick: 38.7%

Co-Authored-By: Claude Opus 4.5 <[email protected]>

meta-cla bot commented Jan 23, 2026

Hi @vj-09!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


greptile-apps bot commented Jan 23, 2026

Greptile Summary

CodeDark adds a multi-turn data analytics environment where agents analyze CSV datasets using Python/Pandas tools. The implementation follows OpenEnv's core Gym-like API (reset, step, state) and correctly encapsulates rewards inside the environment boundary.

Key Changes:

  • New environment with 5 tools: run_python, read_notes, save_note, clarify, submit_answer
  • Shaped reward system: 80% correctness + 10% efficiency + 10% token cost
  • HTTP-only server using FastAPI (no WebSocket implementation yet)
  • Custom HTTP client instead of the MCP-based client pattern used by EchoEnv (see the sketch after this list)
  • Pre-benchmarked with 25 tasks across bank (750K rows) and road (500K rows) datasets
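
Since the custom client diverges from the MCP pattern, it is worth seeing how thin it likely is. Here is a sketch under the endpoint shapes shown in the sequence diagram below; the payload fields are assumptions, and the real client.py may differ.

```python
import requests

class CodeDarkClient:
    """Hedged sketch of a bare HTTP client; endpoints come from the
    sequence diagram below, payload fields are assumptions."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def reset(self, task_id=None, seed=None):
        r = requests.post(f"{self.base_url}/reset",
                          json={"task_id": task_id, "seed": seed})
        return r.json()

    def step(self, tool: str, args: dict):
        r = requests.post(f"{self.base_url}/step",
                          json={"tool": tool, "args": args})
        return r.json()

    def state(self):
        return requests.get(f"{self.base_url}/state").json()
```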

Issues Found:

  • CRITICAL: Security vulnerability in tools.py:53-55 - globals, locals, dir, vars in SAFE_BUILTINS allow sandbox escape
  • Client implementation uses custom HTTP pattern rather than MCPToolClient pattern
  • Missing WebSocket support (HTTP-only for now; per INVARIANTS.md, the HTTP transport is still being transitioned)

Architectural Notes:

  • Follows "rewards inside environment" principle correctly
  • No client-server separation violations
  • Doesn't expose reset() to agents (properly isolated)
  • Container isolation properly configured

Confidence Score: 3/5

  • This PR has one critical security vulnerability that must be fixed before merge
  • Score of 3 reflects a critical security issue in the sandbox (globals/locals exposure) that allows code execution escape. The architecture follows OpenEnv patterns correctly (rewards inside, dual API boundary respected), but the code execution vulnerability is a blocking issue. Additionally, the HTTP-only approach and custom client pattern diverge from emerging MCP standards, though these are architectural choices rather than bugs.
  • Pay close attention to envs/codedark_env/server/tools.py - the SAFE_BUILTINS dictionary exposes dangerous introspection functions

Important Files Changed

| Filename | Overview |
| --- | --- |
| envs/codedark_env/client.py | Custom HTTP client implementation instead of OpenEnv's standard MCP client pattern. Missing generic types and doesn't follow the EnvClient pattern. |
| envs/codedark_env/server/environment.py | Environment follows Gym-like API (reset, step, state) correctly. Rewards computed inside environment boundary as required. Minor: missing error handling in some edge cases. |
| envs/codedark_env/server/app.py | FastAPI server with standard endpoints. HTTP-only (no WebSocket). CORS allows all origins, which is acceptable for a demo but should be noted. |
| envs/codedark_env/server/tools.py | Code execution using exec with restricted builtins. Security concern: globals and locals exposed in SAFE_BUILTINS allow sandbox escape. |
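
For context on the tools.py finding: the restricted-exec pattern named in the table typically looks like the sketch below (the allowlist contents and function names here are assumptions, not the actual file). The review's point is that adding introspection builtins such as `globals`, `locals`, `dir`, or `vars` to this allowlist hands agent code a way back into the surrounding execution context.

```python
import contextlib
import io

import pandas as pd

# Assumed shape of the allowlist; the real SAFE_BUILTINS differs and,
# per the review, mistakenly included globals/locals/dir/vars.
SAFE_BUILTINS = {"len": len, "min": min, "max": max, "sum": sum,
                 "range": range, "print": print}  # no __import__, open, eval

def run_python(code: str, df: pd.DataFrame):
    """Execute agent code under a restricted builtin allowlist."""
    scope = {"__builtins__": SAFE_BUILTINS, "df": df, "pd": pd}
    out = io.StringIO()
    try:
        with contextlib.redirect_stdout(out):
            exec(code, scope)
        return out.getvalue(), "", 0          # (stdout, stderr, exit_code)
    except Exception as exc:
        return out.getvalue(), str(exc), 1
```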

Sequence Diagram

```mermaid
sequenceDiagram
    participant Agent
    participant Client as CodeDarkEnv<br/>(HTTP Client)
    participant Server as FastAPI Server
    participant Env as CodeDarkEnvironment
    participant Tools as Tool Functions
    participant Scoring as Reward System

    Agent->>Client: reset(task_id, seed)
    Client->>Server: POST /reset
    Server->>Env: reset(task_id, seed)
    Env->>Env: Load task from JSONL
    Env->>Env: Load CSV (bank/road)
    Env->>Env: Initialize state
    Env-->>Server: CodeDarkObservation
    Server-->>Client: JSON observation
    Client-->>Agent: Dict[str, Any]

    loop Multi-turn episode (max 10 turns)
        Agent->>Client: run_python(code)
        Client->>Server: POST /step {tool, args}
        Server->>Env: step(CodeDarkAction)
        Env->>Tools: run_python(code, df)
        Tools->>Tools: exec() with SAFE_BUILTINS
        Tools-->>Env: (stdout, stderr, exit_code)
        Env->>Env: Increment turn_count
        Env-->>Server: CodeDarkObservation
        Server-->>Client: JSON observation
        Client-->>Agent: Dict[str, Any]

        opt Save note
            Agent->>Client: save_note(content)
            Client->>Server: POST /step
            Server->>Env: step(CodeDarkAction)
            Env->>Tools: save_note(content, notes)
            Tools->>Tools: Append to notes list
            Tools-->>Env: Success
            Env-->>Server: CodeDarkObservation
            Server-->>Client: JSON
            Client-->>Agent: Dict[str, Any]
        end

        opt Ask clarification
            Agent->>Client: clarify(question)
            Client->>Server: POST /step
            Server->>Env: step(CodeDarkAction)
            Env->>Tools: clarify(question, ...)
            Tools->>Tools: Match to ambiguities
            Tools-->>Env: Clarification response
            Env-->>Server: CodeDarkObservation
            Server-->>Client: JSON
            Client-->>Agent: Dict[str, Any]
        end

        Agent->>Client: submit_answer(value)
        Client->>Server: POST /step
        Server->>Env: step(CodeDarkAction)
        Env->>Tools: submit_answer(answer_str)
        Tools->>Tools: Parse answer
        Tools-->>Env: Parsed answer
        Env->>Env: Set submitted=True
        Env->>Scoring: compute_reward(submitted, expected, ...)
        Scoring->>Scoring: score_correctness (80%)
        Scoring->>Scoring: score_efficiency (10%)
        Scoring->>Scoring: score_token_cost (10%)
        Scoring-->>Env: (reward, correctness, efficiency, cost)
        Env-->>Server: CodeDarkObservation (done=True, reward)
        Server-->>Client: JSON with reward
        Client-->>Agent: Dict[str, Any]
    end

    Agent->>Client: state()
    Client->>Server: GET /state
    Server->>Env: @property state
    Env-->>Server: CodeDarkState
    Server-->>Client: JSON state
    Client-->>Agent: Dict[str, Any]
```

greptile-apps bot left a comment

12 files reviewed, 1 comment


Comment on lines 53 to 55:

```python
    "locals": locals,
    "globals": globals,
    "dir": dir,
```

**logic:** Security risk: `locals`, `globals`, `dir`, and `vars` in `SAFE_BUILTINS` allow sandbox escape.

These introspection functions can access `__builtins__`, `__import__`, and other dangerous objects that bypass the restricted execution context.

Suggested change:

```diff
-    "locals": locals,
-    "globals": globals,
-    "dir": dir,
+    # "locals": locals,  # Removed - allows sandbox escape
+    # "globals": globals,  # Removed - allows sandbox escape
+    # "dir": dir,  # Removed - allows introspection of restricted objects
+    # "vars": vars,  # Removed - allows introspection of restricted objects
```

Remove globals, locals, dir, and vars from SAFE_BUILTINS as they allow
sandbox escape in the run_python tool. These introspection functions
could be used to access the full execution environment and bypass the
sandboxed code execution.

Addresses security concern raised in Greptile review.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
meta-cla bot added the CLA Signed label Jan 23, 2026
vj-09 (Author) commented Jan 31, 2026

Next steps?
