-
Notifications
You must be signed in to change notification settings - Fork 3
docs: draft of readme #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,75 @@ | ||
# itbench-leaderboard | ||
Code repository for leaderboard as part of ITBench | ||
# ITBench leaderboard | ||
|
||
This repository powers the community leaderboard for [ITBench](https://github.com/IBM/ITBench), an open benchmark for evaluating AI agents on real-world IT automation tasks. | ||
|
||
The leaderboard showcases how different agents perform across officially defined scenarios spanning Site Reliability Engineering (SRE), Financial Operations (FinOps), and Compliance/Security (CISO) domains. | ||
|
||
## 🏆 About the leaderboard | ||
|
||
The leaderboard is automatically updated from submitted results and displays performance metrics for each agent, such as: | ||
|
||
- Percentage of scenarios successfully completed | ||
- Scenario-specific scores | ||
- Runtime or efficiency metrics (if applicable) | ||
|
||
For details on the benchmarks, environment setup, and submitting your agent to the leaderboard, refer to the [ITBench main repo](https://github.com/IBM/ITBench). | ||
|
||
## 📊 Sample leaderboard (placeholder) | ||
|
||
| Rank | Agent Name | Overall Score | SRE | FinOps | CISO | Notes | | ||
|------|------------------------|---------------|-----|--------|------|---------------------------| | ||
| 1 | Baseline SRE Agent | 85% | ✅✅✅✅✅✅ | N/A | N/A | Solved all SRE scenarios | | ||
| 2 | Baseline CISO Agent | 80% | N/A | N/A | ✅✅✅ | High compliance coverage | | ||
| 3 | Your Agent Here | TBD | TBD | TBD | TBD | Submit to find out | | ||
|
||
> This is placeholder data. Actual results will be posted after public submissions open. | ||
|
||
--- | ||
|
||
<!-- ## 📤 Submitting your results | ||
|
||
Ready to submit your agent? Follow these steps: | ||
|
||
1. **Run the official ITBench scenarios** | ||
Use [ITBench-Scenarios](https://github.com/IBM/ITBench-Scenarios) and run your agent (e.g., [SRE agent](https://github.com/IBM/itbench-sre-agent) or [CISO agent](https://github.com/IBM/itbench-ciso-caa-agent)) on each task. | ||
|
||
2. **Collect your results** | ||
Capture outcome data (e.g., scenario success/failure, logs, scores). You can use the utilities provided in [ITBench-Utilities](https://github.com/IBM/ITBench-Utilities) to format and summarize results. | ||
|
||
3. **Fork this repository and submit a pull request** | ||
Include: | ||
- Your agent name and affiliation (if applicable) | ||
- A link to your agent’s source repo | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this needed for CISO? For SRE this is optional. |
||
- A structured summary of your results (in `data/` or `leaderboard.md`) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this needed for CISO? For SRE this is optional. |
||
- Any logs or metadata that aid verification | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this needed for CISO? For SRE this is optional. |
||
|
||
4. **Verification** | ||
The ITBench team may re-run your agent on selected scenarios. Verified submissions will be added to the public leaderboard. | ||
|
||
--- --> | ||
|
||
<!-- | ||
## 📁 Repository structure | ||
|
||
``` | ||
. | ||
├── data/ # Submitted results and metadata | ||
├── leaderboard.md # Markdown version of the public leaderboard | ||
├── scripts/ # (Optional) Helper scripts for formatting or validation | ||
└── README.md | ||
``` | ||
--> | ||
|
||
## 📄 License | ||
|
||
This project is licensed under the [Apache License 2.0](LICENSE). | ||
|
||
## 🙋 Contributing | ||
|
||
To improve the leaderboard infrastructure or submission process: | ||
|
||
- Open an issue to propose enhancements | ||
- Submit a pull request for code, formatting improvements, or submission guidelines | ||
- Follow the [contribution guidelines in the main ITBench repo](https://github.com/IBM/ITBench) | ||
|
||
By contributing, you agree to license your work under Apache 2.0 and follow the ITBench project’s code of conduct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why would they need to submit a pull request, Connor (@connor-leech).
cc: @yuji-watanabe-jp