feat: Add CodeDark multi-turn data analytics environment #322
base: main
CodeDark is the first data analytics environment for OpenEnv Hub. It challenges AI agents to analyze real CSV datasets using Python/Pandas through multi-turn tool-use conversations.

Key features:
- Real business tasks: Bank marketing (750K rows) + Road safety (500K rows)
- Multi-turn interaction: run_python, save_note, read_notes, clarify, submit_answer
- Shaped rewards: 80% correctness + 10% efficiency + 10% token cost
- Pre-benchmarked: 25 curated L4-L6 tasks, 11+ models, 1,844 completions
- Live demo: https://huggingface.co/spaces/albert-einstein-09/codedark

Benchmark results:
- Claude Opus 4.5: 77.3% accuracy
- Qwen3 Max: 46.7%
- Mistral Large: 45.3%
- Llama 4 Maverick: 38.7%

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Hi @vj-09! Thank you for your pull request and welcome to our community.

Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Greptile Summary
CodeDark adds a multi-turn data analytics environment where agents analyze CSV datasets using Python/Pandas tools. The implementation follows OpenEnv's core Gym-like API (
Key Changes:
Issues Found:
Architectural Notes:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Agent
participant Client as CodeDarkEnv<br/>(HTTP Client)
participant Server as FastAPI Server
participant Env as CodeDarkEnvironment
participant Tools as Tool Functions
participant Scoring as Reward System
Agent->>Client: reset(task_id, seed)
Client->>Server: POST /reset
Server->>Env: reset(task_id, seed)
Env->>Env: Load task from JSONL
Env->>Env: Load CSV (bank/road)
Env->>Env: Initialize state
Env-->>Server: CodeDarkObservation
Server-->>Client: JSON observation
Client-->>Agent: Dict[str, Any]
loop Multi-turn episode (max 10 turns)
Agent->>Client: run_python(code)
Client->>Server: POST /step {tool, args}
Server->>Env: step(CodeDarkAction)
Env->>Tools: run_python(code, df)
Tools->>Tools: exec() with SAFE_BUILTINS
Tools-->>Env: (stdout, stderr, exit_code)
Env->>Env: Increment turn_count
Env-->>Server: CodeDarkObservation
Server-->>Client: JSON observation
Client-->>Agent: Dict[str, Any]
opt Save note
Agent->>Client: save_note(content)
Client->>Server: POST /step
Server->>Env: step(CodeDarkAction)
Env->>Tools: save_note(content, notes)
Tools->>Tools: Append to notes list
Tools-->>Env: Success
Env-->>Server: CodeDarkObservation
Server-->>Client: JSON
Client-->>Agent: Dict[str, Any]
end
opt Ask clarification
Agent->>Client: clarify(question)
Client->>Server: POST /step
Server->>Env: step(CodeDarkAction)
Env->>Tools: clarify(question, ...)
Tools->>Tools: Match to ambiguities
Tools-->>Env: Clarification response
Env-->>Server: CodeDarkObservation
Server-->>Client: JSON
Client-->>Agent: Dict[str, Any]
end
Agent->>Client: submit_answer(value)
Client->>Server: POST /step
Server->>Env: step(CodeDarkAction)
Env->>Tools: submit_answer(answer_str)
Tools->>Tools: Parse answer
Tools-->>Env: Parsed answer
Env->>Env: Set submitted=True
Env->>Scoring: compute_reward(submitted, expected, ...)
Scoring->>Scoring: score_correctness (80%)
Scoring->>Scoring: score_efficiency (10%)
Scoring->>Scoring: score_token_cost (10%)
Scoring-->>Env: (reward, correctness, efficiency, cost)
Env-->>Server: CodeDarkObservation (done=True, reward)
Server-->>Client: JSON with reward
Client-->>Agent: Dict[str, Any]
end
Agent->>Client: state()
Client->>Server: GET /state
Server->>Env: @property state
Env-->>Server: CodeDarkState
Server-->>Client: JSON state
Client-->>Agent: Dict[str, Any]
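The reward step in the diagram combines three component scores with the 80% / 10% / 10% weights. A minimal sketch of that weighted sum follows, assuming each component score is already normalized to the range [0, 1]. The function name `combine_reward` and the validation step are hypothetical; the PR's actual `compute_reward`, `score_correctness`, `score_efficiency`, and `score_token_cost` are not shown here and may differ.

```python
from typing import Tuple

# Weights taken from the PR description: correctness, efficiency, token cost.
WEIGHTS: Tuple[float, float, float] = (0.8, 0.1, 0.1)


def combine_reward(
    correctness: float,
    efficiency: float,
    token_cost: float,
    weights: Tuple[float, float, float] = WEIGHTS,
) -> float:
    """Return the weighted sum of three component scores.

    Hypothetical helper: assumes every component is already in [0, 1].
    """
    components = {
        "correctness": correctness,
        "efficiency": efficiency,
        "token_cost": token_cost,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    w_correct, w_eff, w_cost = weights
    return w_correct * correctness + w_eff * efficiency + w_cost * token_cost
```

With these weights, a wrong answer can earn at most 0.2 regardless of how efficient the episode was, so correctness dominates the signal.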
12 files reviewed, 1 comment
envs/codedark_env/server/tools.py
Outdated
"locals": locals,
"globals": globals,
"dir": dir,
logic: Security risk: locals, globals, dir, and vars in SAFE_BUILTINS allow sandbox escape.
These introspection functions can access __builtins__, __import__, and other dangerous objects that bypass the restricted execution context.
- "locals": locals,
- "globals": globals,
- "dir": dir,
+ # "locals": locals, # Removed - allows sandbox escape
+ # "globals": globals, # Removed - allows sandbox escape
+ # "dir": dir, # Removed - allows introspection of restricted objects
+ # "vars": vars, # Removed - allows introspection of restricted objects
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/codedark_env/server/tools.py
Line: 53:55
Comment:
**logic:** Security risk: `locals`, `globals`, `dir`, and `vars` in `SAFE_BUILTINS` allow sandbox escape.
These introspection functions can access `__builtins__`, `__import__`, and other dangerous objects that bypass the restricted execution context.
```suggestion
# "locals": locals, # Removed - allows sandbox escape
# "globals": globals, # Removed - allows sandbox escape
# "dir": dir, # Removed - allows introspection of restricted objects
# "vars": vars, # Removed - allows introspection of restricted objects
```
How can I resolve this? If you propose a fix, please make it concise.

Remove globals, locals, dir, and vars from SAFE_BUILTINS as they allow sandbox escape in the run_python tool. These introspection functions could be used to access the full execution environment and bypass the sandboxed code execution.

Addresses security concern raised in Greptile review.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
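To illustrate the effect of that removal, here is a minimal sketch of `exec()` with an allowlisted builtins dictionary. It is not the PR's `tools.py`: the `SAFE_BUILTINS` contents below and the helper name `run_restricted` are hypothetical, and only the `(stdout, stderr, exit_code)` return shape is borrowed from the sequence diagram. The sketch shows name lookup only; an allowlist of builtins on its own should not be assumed to provide full isolation.

```python
import contextlib
import io
from typing import Tuple

# Hypothetical allowlist for illustration; introspection helpers such as
# globals, locals, dir, and vars are deliberately left out.
SAFE_BUILTINS = {"len": len, "print": print, "range": range, "sum": sum}


def run_restricted(code: str) -> Tuple[str, str, int]:
    """Run code with only SAFE_BUILTINS visible as builtins.

    Returns (stdout, stderr, exit_code), where exit_code is 0 on success.
    """
    env = {"__builtins__": dict(SAFE_BUILTINS)}
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, env)
    except Exception as exc:  # report the failure instead of raising
        return buffer.getvalue(), f"{type(exc).__name__}: {exc}", 1
    return buffer.getvalue(), "", 0
```

Calling `run_restricted("print(len([1, 2, 3]))")` succeeds, while `run_restricted("globals()")` returns exit code 1 with a `NameError` in the stderr slot, because the name is no longer resolvable inside the restricted namespace.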
Next steps?
Summary
This PR adds CodeDark, the first data analytics environment for the OpenEnv Hub. CodeDark challenges AI agents to analyze real CSV datasets using Python/Pandas through multi-turn tool-use conversations, testing their ability to be data scientists rather than just code executors.
Key Features
Benchmark Results
Pre-benchmarked performance on the 25-task benchmark:
Full leaderboard: https://www.analytics-rl.com
Tools
- run_python
- read_notes
- save_note
- clarify
- submit_answer
Task Difficulty Levels
Quick Start
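A minimal sketch of one episode against the HTTP endpoints shown in the sequence diagram (`POST /reset`, `POST /step`), using only the standard library. Several details are assumptions rather than the PR's confirmed API: the base URL, the argument keys `code` and `value`, and the helper names `http_post`, `step_payload`, and `run_episode`. The `{tool, args}` payload shape and the `done` field come from the diagram.

```python
import json
from typing import Any, Callable, Dict
from urllib import request

BASE_URL = "http://localhost:8000"  # assumption: local server address

PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def http_post(path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST a JSON payload to the server and decode the JSON response."""
    req = request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def step_payload(tool: str, **args: Any) -> Dict[str, Any]:
    """Build the {tool, args} body for POST /step."""
    return {"tool": tool, "args": args}


def run_episode(
    post: PostFn, task_id: str, code: str, answer: str, seed: int = 0
) -> Dict[str, Any]:
    """Reset, run one snippet, then submit an answer; return the last observation."""
    post("/reset", {"task_id": task_id, "seed": seed})
    obs = post("/step", step_payload("run_python", code=code))
    if not obs.get("done"):
        obs = post("/step", step_payload("submit_answer", value=answer))
    return obs
```

Passing the transport in as `post` keeps the episode logic testable without a running server; against a live server, `run_episode(http_post, ...)` would be the call, with a real task ID substituted.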
Files Added
Checklist
reset(), step(), state API
Links
Author
Vijay Athithya (@vj-09)
cc @jspisak