Two production-quality AI agent benchmark tasks built on the Harbor evaluation framework.
Both tasks pass full Harbor validation: Oracle → 1.0, NOP → 0.0, ruff → clean.
- What Is This Project?
- How Harbor Works
- Project Structure
- Quick Start
- Task 1 — user_data_pipeline
- Task 2 — ecommerce_analytics_pipeline
- Scoring System
- Architecture Deep Dive
- Dataset Details
- Developer Workflow
- Troubleshooting

## What Is This Project?

This repository contains two Harbor benchmark tasks designed to evaluate whether an AI agent can correctly process structured data and produce accurate analytical outputs.

| Task | Difficulty | Features | Scoring |
|---|---|---|---|
| `user_data_pipeline` | Level 1 | 8 data-processing features | Binary (pass/fail) |
| `ecommerce_analytics_pipeline` | Level 2 | 7 advanced analytics features | Partial credit (0.0–1.0) |

Each task follows the Harbor specification:
- An Oracle agent runs the reference solution and must score 1.0
- A NOP agent (does nothing) must score 0.0
- All processing is dynamic — results are computed from the input data, never hardcoded

## How Harbor Works

A single `harbor run` proceeds through seven steps:

1. Builds the Docker image from `environment/Dockerfile`
2. Starts the container, copies input data into `/app/`
3. The agent runs (Oracle: executes `solution/solve.sh`)
4. The verifier copies `tests/` → `/tests/` inside the container
5. The verifier executes `/tests/test.sh`
6. `test.sh` runs `test_outputs.py` and writes a float to `/logs/verifier/reward.txt` (bind-mounted to the host)
7. Harbor reads `reward.txt` → final score

| Path | Purpose |
|---|---|
| `/app/` | Task working directory (input + output) |
| `/solution/` | Oracle's reference solution (injected by Harbor at runtime) |
| `/tests/` | Test suite (injected by Harbor after the agent runs) |
| `/logs/verifier/reward.txt` | Score file — Harbor reads this float as the final score |
| `/logs/verifier/` | Bind-mounted to the host's trial directory |

⚠️ **Critical:** `reward.txt` must always be written by `test.sh`, even when tests fail.
If it is missing, Harbor throws `RewardFileNotFoundError` and counts it as an error, not a `0.0` score.
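
One way to honor this invariant (a sketch only; the actual verifiers here are shell-driven, as shown in the Architecture Deep Dive below) is to let the Python side write the file on every code path. `run_checks()` is a hypothetical stand-in for real validation logic:

```python
# Defensive reward writing: reward.txt exists no matter how the checks end.
import pathlib
import traceback

REWARD_FILE = pathlib.Path("/logs/verifier/reward.txt")

def run_checks() -> float:
    """Return a score in [0.0, 1.0]; may raise on unexpected failure."""
    return 1.0  # placeholder for real validation logic

if __name__ == "__main__":
    try:
        score = run_checks()
    except Exception:
        traceback.print_exc()  # surfaces in the verifier log
        score = 0.0            # a crash still yields a valid score file
    REWARD_FILE.parent.mkdir(parents=True, exist_ok=True)
    REWARD_FILE.write_text(f"{score}\n")
```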

## Project Structure

```
Harbor agent/
├── Makefile                     ← One-command runner with timestamped job names
├── pyproject.toml               ← uv Python project manifest
├── harbor_tasks/
│   ├── user_data_pipeline/      ← Level 1 task
│   │   ├── task.toml            ← Harbor config (resources, timeouts)
│   │   ├── instruction.md       ← Agent instructions (absolute paths)
│   │   ├── environment/
│   │   │   ├── Dockerfile       ← python:3.11-slim, stdlib only
│   │   │   └── input.json       ← 18-record user dataset
│   │   ├── solution/
│   │   │   └── solve.sh         ← Oracle reference solution
│   │   └── tests/
│   │       ├── test.sh          ← Harbor verifier entry point
│   │       └── test_outputs.py  ← 27-assertion test suite (binary score)
│   │
│   └── ecommerce_analytics_pipeline/  ← Level 2 task
│       ├── task.toml
│       ├── instruction.md
│       ├── environment/
│       │   ├── Dockerfile       ← python:3.11-slim + pandas + numpy
│       │   └── input/
│       │       ├── orders.json      ← 35 orders across 9 months
│       │       └── customers.json   ← 14 customers across 4 regions
│       ├── solution/
│       │   └── solve.sh         ← pandas-powered analytics
│       └── tests/
│           ├── test.sh
│           └── test_outputs.py  ← Partial-credit test suite (0.0–1.0)
```

## Quick Start

Prerequisites:

- Docker Desktop — must be running (not just installed)
- Python 3.11+

```bash
# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"

# 2. Add Docker to PATH (Harbor calls docker as a subprocess — this is required)
export PATH="/usr/local/bin:$PATH"
# Make it permanent:
echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.zshrc

# 3. Install dependencies
uv add harbor

# 4. Run everything
make all
```

## Task 1 — user_data_pipeline

Read `/app/input.json` (an array of user records), validate and process the data, and write 8 structured output files to `/app/output/`.

`input.json` — 18 user records with the schema: `id`, `name`, `age`, `email`

| Category | Count | Details |
|---|---|---|
| Valid records | 16 | All 4 required fields present |
| Invalid (missing fields) | 2 | ID 14 missing age; ID 16 missing name |
| Duplicate emails | 4 records | IDs 1+13 share alice@example.com; IDs 2+11 share bob@gmail.com |

| Output File | Description | Key Logic |
|---|---|---|
| `users.csv` | Valid records in CSV format | Header: `id,name,age,email` |
| `users_sorted.csv` | Valid records sorted by age ascending | Same data, sorted |
| `invalid_records.json` | Records that failed validation | Missing any required field |
| `duplicates.json` | Records with shared email addresses | All copies, not just the extras |
| `stats.json` | `total_users`, `average_age`, `min_age`, `max_age` | Valid records only |
| `age_groups.json` | Count per bracket: 18-25, 26-35, 36-50, 50+ | Inclusive on both ends |
| `domain_counts.json` | Email domain → occurrence count | Valid records only |
| `process.log` | Timestamped execution log | Must contain "Pipeline started" and "completed" |

Written entirely with Python stdlib (no pip packages required). Processing pipeline:

1. `json.load()` → load all 18 records
2. Field validation → separate valid (16) from invalid (2)
3. `defaultdict` group-by → detect duplicate emails
4. `csv.DictWriter` → write `users.csv` and `users_sorted.csv`
5. `sum`/`min`/`max` arithmetic → compute `stats.json`
6. Range comparisons → segment `age_groups.json`
7. `str.split("@")[-1]` → extract `domain_counts.json`
8. `datetime.now(UTC)` → timestamped `process.log`
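
Condensed into a sketch (illustrative variable names; the real `solve.sh` additionally writes all eight output files, and the 2-decimal rounding is an assumption consistent with the dataset details below):

```python
import json
from collections import Counter, defaultdict

REQUIRED = ("id", "name", "age", "email")

with open("/app/input.json") as f:
    records = json.load(f)

# Step 2 — split valid from invalid by required-field presence
valid = [r for r in records if all(k in r for k in REQUIRED)]
invalid = [r for r in records if not all(k in r for k in REQUIRED)]

# Step 3 — group by email; every member of a shared-email group is a duplicate
by_email = defaultdict(list)
for r in valid:
    by_email[r["email"]].append(r)
duplicates = [r for grp in by_email.values() if len(grp) > 1 for r in grp]

# Step 5 — stats over valid records only
ages = [r["age"] for r in valid]
stats = {
    "total_users": len(valid),
    "average_age": round(sum(ages) / len(ages), 2),
    "min_age": min(ages),
    "max_age": max(ages),
}

# Step 7 — email domain counts
domain_counts = Counter(r["email"].split("@")[-1] for r in valid)
```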
27 assertions — pure Python stdlib, zero external dependencies.

```
Section 1 — File existence ........... 8 assertions
Section 2 — users.csv ................ 3 assertions
Section 3 — users_sorted.csv ......... 2 assertions
Section 4 — invalid_records .......... 2 assertions
Section 5 — duplicates.json .......... 2 assertions
Section 6 — stats.json ............... 4 assertions
Section 7 — age_groups.json .......... 4 assertions
Section 8 — domain_counts.json ....... 5 assertions
Section 9 — process.log .............. 3 assertions
```

Ground truth is computed dynamically from input.json at test time. If you change the dataset, tests automatically update — no hardcoded expected values.
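
For example, a stats check in this style (illustrative, not the literal test code) derives its expectations from `input.json` at runtime:

```python
import json

with open("/app/input.json") as f:
    records = json.load(f)
valid = [r for r in records if all(k in r for k in ("id", "name", "age", "email"))]

with open("/app/output/stats.json") as f:
    stats = json.load(f)

# Ground truth is derived from the dataset, never hardcoded
assert stats["total_users"] == len(valid)
assert stats["min_age"] == min(r["age"] for r in valid)
assert stats["max_age"] == max(r["age"] for r in valid)
```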
version = "1.0"
[environment]
memory_mb = 512 # RAM limit for the container
storage_mb = 256 # Disk limit
[verifier]
timeout_sec = 120.0 # Max seconds test.sh may takeJoin two datasets (orders.json + customers.json) from /app/input/ and produce 7 analytical output files in /app/output/ covering revenue, customer lifetime value, regional performance, and churn risk.

`orders.json` — 35 orders spanning January–September 2024

| Field | Type | Values |
|---|---|---|
| `order_id` | int | Unique identifier |
| `customer_id` | int | Foreign key → customers |
| `product` | string | Laptop Pro 15, Smart Watch, Running Shoes, Wireless Earbuds, Coffee Maker, USB-C Hub, Yoga Mat, Novel Collection, Desk Lamp |
| `category` | string | Electronics / Apparel / Home / Sports / Books |
| `amount` | float | USD value |
| `date` | `YYYY-MM-DD` | 2024-01-05 to 2024-09-30 |
| `status` | string | completed (30) / cancelled (3) / pending (2) |

`customers.json` — 14 customers

| Field | Type | Values |
|---|---|---|
| `customer_id` | int | 1–14 |
| `name` | string | Full name |
| `email` | string | Contact email |
| `region` | string | North / South / East / West |
| `signup_date` | `YYYY-MM-DD` | 2023-10-01 to 2024-06-01 |

**Rule:** Only `completed` orders count toward revenue, LTV, rankings, and regional stats. `order_status_dist.json` counts all orders.

| Output File | Algorithm Summary |
|---|---|
| `monthly_revenue.json` | `{YYYY-MM: revenue}` — group by month, sum amounts, sort chronologically |
| `top_products.json` | `[{product, revenue}]` — aggregate by product, sort descending, take top 5 |
| `customer_ltv.json` | `[{customer_id, name, ltv}]` — sum by customer_id, merge names, sort by ltv desc |
| `regional_summary.json` | `{region: {revenue, order_count}}` — join orders→customers on customer_id, group by region |
| `order_status_dist.json` | `{status: count}` — value counts across all 35 orders |
| `churn_risk.json` | Customers whose last completed order is > 60 days before the most recent order in the dataset, or who have no completed orders at all |
| `analytics_report.md` | Markdown BI summary combining all the above metrics |
| `pipeline.log` | UTC-timestamped execution log |

Uses pandas 2.2.3 + numpy 1.26.4, pre-installed in the Docker image. The snippets below sketch each feature; the actual `solve.sh` also serializes results to the output files:

```python
import pandas as pd

# Core data loading
orders_df = pd.read_json("/app/input/orders.json")
customers_df = pd.read_json("/app/input/customers.json")
completed = orders_df[orders_df["status"] == "completed"]

# Feature 1 — Monthly revenue: group by YYYY-MM, sum, chronological order
month = pd.to_datetime(completed["date"]).dt.strftime("%Y-%m")
monthly_revenue = completed.groupby(month)["amount"].sum().sort_index()

# Feature 2 — Top products: aggregate by product, take the 5 largest
top_products = completed.groupby("product")["amount"].sum().nlargest(5)

# Feature 3 — Customer LTV: sum per customer, merge names, sort descending
customer_ltv = (
    completed.groupby("customer_id")["amount"].sum().rename("ltv").reset_index()
    .merge(customers_df[["customer_id", "name"]]).sort_values("ltv", ascending=False)
)

# Feature 4 — Regional summary: join orders → customers, group by region
regional_summary = completed.merge(customers_df, on="customer_id").groupby("region").agg(
    revenue=("amount", "sum"), order_count=("order_id", "count")
)

# Feature 5 — Status distribution (all 35 orders, not just completed)
status_dist = orders_df["status"].value_counts()

# Feature 6 — Churn risk: last completed order > 60 days before the newest
# order in the dataset, or no completed orders at all
max_date = pd.to_datetime(orders_df["date"]).max()
cutoff = max_date - pd.Timedelta(days=60)
last_order = customers_df["customer_id"].map(
    pd.to_datetime(completed["date"]).groupby(completed["customer_id"]).max()
)
churn_risk = customers_df[last_order.isna() | (last_order < cutoff)]
```

Each of the 7 features is independently verified. Score = (features passed) / 7.

| Feature | Weight | Validation |
|---|---|---|
| File existence | 1/7 | All 8 output files present |
| Monthly revenue | 1/7 | Correct months, values (±0.05), sorted chronologically |
| Top 5 products | 1/7 | Exact product ranking order and revenue values |
| Customer LTV | 1/7 | Correct LTV values, descending sort by ltv |
| Regional summary | 1/7 | Correct revenue + order_count per region |
| Order status dist | 1/7 | Exact counts: 30 completed, 3 cancelled, 2 pending |
| Churn risk | 1/7 | Correct set of at-risk customer IDs |
Examples:

- 7/7 features → `1.0` (Oracle)
- 5/7 features → `0.714`
- 0/7 features → `0.0` (NOP)
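
Under the hood this means the Level 2 verifier writes the fraction itself rather than a fixed `1.0`/`0.0`. A minimal sketch of that pattern, with hypothetical `check_*` names (the real `test_outputs.py` differs):

```python
# Sketch of a partial-credit verifier: each check_* callable returns
# True/False, and the fraction of passes becomes the reward.
import pathlib

def check_file_existence() -> bool:
    return True  # placeholder — real logic inspects /app/output/

def check_monthly_revenue() -> bool:
    return True  # placeholder — real logic recomputes ground truth

CHECKS = [check_file_existence, check_monthly_revenue]  # 7 checks in the real suite

def run(check) -> bool:
    try:
        return bool(check())
    except Exception:
        return False  # a crashing check simply scores 0 for that feature

if __name__ == "__main__":
    score = sum(run(c) for c in CHECKS) / len(CHECKS)
    out = pathlib.Path("/logs/verifier/reward.txt")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(f"{score:.3f}\n")  # e.g. 5/7 → 0.714
```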
version = "1.0"
[environment]
memory_mb = 1024 # Higher limit — pandas loads both datasets into memory
storage_mb = 512
[verifier]
timeout_sec = 180.0 # pandas-based solution needs more time than stdlibContainer Host (bind-mount)
───────── ─────────────────
test.sh
→ python3 /tests/test_outputs.py
→ writes float to → jobs/<job>/trial/verifier/reward.txt
/logs/verifier/reward.txt
Harbor reads this file → final score

The `test.sh` pattern that guarantees the reward file is always written:

```bash
#!/usr/bin/env bash
set -uo pipefail   # set -e is OFF — intentional

mkdir -p /logs/verifier

set +e                                # ← disable exit-on-error around the test run
python3 /tests/test_outputs.py 2>&1
TEST_EXIT=$?                          # ← capture the exit code
set -e                                # ← re-enable strict mode

if [ "${TEST_EXIT}" -eq 0 ]; then
    echo "1.0" > /logs/verifier/reward.txt
else
    echo "0.0" > /logs/verifier/reward.txt
fi
```

Why this matters: with `set -euo pipefail`, a `python3 test.py` exit code of 1 kills the script immediately, before the `if` block runs. `reward.txt` is never written, and Harbor reports `RewardFileNotFoundError` as an error (not a score of `0.0`).

## Dataset Details

### Task 1 — user_data_pipeline

```
Records total: 18
Valid: 16
Invalid: 2 (IDs 14, 16)
Duplicates by email:
alice@example.com → IDs 1, 13
bob@gmail.com → IDs 2, 11
Age groups (valid records):
18–25: 4 users
26–35: 6 users
36–50: 4 users
50+: 2 users
Stats:
total_users: 16
average_age: 35.12
min_age: 19
max_age: 60
Email domains:
example.com 6 occurrences
gmail.com 5 occurrences
yahoo.com 3 occurrences
outlook.com 2 occurrences
```
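
These figures are quick to reproduce from a repo checkout (path per the project tree above; expected values per the summary):

```python
import json
from collections import Counter

with open("harbor_tasks/user_data_pipeline/environment/input.json") as f:
    records = json.load(f)

valid = [r for r in records if all(k in r for k in ("id", "name", "age", "email"))]
ages = [r["age"] for r in valid]

print(len(records), len(valid))                               # 18 16
print(round(sum(ages) / len(ages), 2), min(ages), max(ages))  # 35.12 19 60
print(Counter(r["email"].split("@")[-1] for r in valid))      # domain histogram
```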

### Task 2 — ecommerce_analytics_pipeline

```
Monthly revenue (completed orders):
2024-01: $1,389.98 2024-06: $ 209.48
2024-02: $1,464.96 2024-07: $1,494.96
2024-03: $ 489.48 2024-08: $ 244.48
2024-04: $ 169.97 2024-09: $1,724.96
2024-05: $1,489.96
Top 5 products by revenue:
1. Laptop Pro 15 $6,499.95
2. Smart Watch $ 599.98
3. Running Shoes $ 388.50
4. Wireless Earbuds $ 359.96
5. Coffee Maker $ 319.96
Regional performance (completed):
North $5,734.91 9 orders
South $1,994.43 8 orders
West $ 513.96 6 orders
East $ 434.93 7 orders
Order status: completed=30, cancelled=3, pending=2
Churn risk (7 customers):
Last completed orders before 2024-08-01 (cutoff = max_date - 60 days)
IDs: 1, 2, 3, 8, 9, 10, 11
```
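
The churn cutoff arithmetic is likewise easy to sanity-check (path per the project tree above):

```python
import pandas as pd

orders = pd.read_json("harbor_tasks/ecommerce_analytics_pipeline/environment/input/orders.json")

max_date = pd.to_datetime(orders["date"]).max()  # 2024-09-30 per the dataset
print(max_date - pd.Timedelta(days=60))          # 2024-08-01 → the churn cutoff
print(orders["status"].value_counts())           # completed 30, cancelled 3, pending 2
```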

## Developer Workflow

```bash
make oracle-l1   # L1 oracle → expect reward 1.0
make nop-l1      # L1 nop    → expect reward 0.0
make oracle-l2   # L2 oracle → expect reward 1.0
make nop-l2      # L2 nop    → expect reward 0.0
make lint        # ruff check both tasks
make all         # run all 4 tests + lint sequentially
make clean       # delete jobs/ cache
```

All make targets automatically:

- Export `PATH="/usr/local/bin:$PATH"` so Docker is found
- Generate timestamped job names (e.g. `l1-oracle-20260307-170651`) to avoid Harbor caching

To scaffold a new task:

```bash
mkdir -p harbor_tasks/my_new_task/{environment,solution,tests}
```

Minimum required files:

| File | Must contain |
|---|---|
| `task.toml` | `version = "1.0"`, `[environment]` with `memory_mb` and `storage_mb` |
| `instruction.md` | Agent instructions using absolute paths (`/app/`) |
| `environment/Dockerfile` | Build the container — do NOT copy `solution/` or `tests/` |
| `solution/solve.sh` | Reference solution (absolute paths only) |
| `tests/test.sh` | Must write a float to `/logs/verifier/reward.txt` |
| `tests/test_outputs.py` | Validation logic (stdlib only recommended) |

Before submitting a new task, verify:

- `uv run harbor run --agent oracle` → score `1.0`, Trials=1, Errors=0
- `uv run harbor run --agent nop` → score `0.0`, Trials=1, Errors=0
- `uv run ruff check harbor_tasks/<task>` → All checks passed
- No hardcoded values in `test_outputs.py` — all ground truth computed from input files
- `Dockerfile` copies only `environment/` contents, not `solution/` or `tests/`

## Troubleshooting

**`docker` binary not found**

Cause: Harbor's asyncio subprocess can't find the `docker` binary.

Fix:

```bash
export PATH="/usr/local/bin:$PATH"
# Permanent fix:
echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.zshrc
```

**`RewardFileNotFoundError`**

Cause: `set -e` is active in `test.sh` and the Python test exits with code 1, aborting the script before `reward.txt` is written.

Fix: Wrap the Python call with `set +e` / `set -e`:

```bash
set +e
python3 /tests/test_outputs.py 2>&1
TEST_EXIT=$?
set -e
```

**Stale results under a reused job name**

Cause: Harbor cached a previous failed run under the same job name.

Fix:

```bash
make clean   # removes all cached jobs/
make all     # Makefile auto-generates new timestamped job names
```

**`ruff` not found**

Cause: Using `python3 -m ruff` when the venv's Python doesn't have ruff installed.

Fix: Always use `uv run ruff check <path>` — `uv` resolves the tool independently of the active venv.

**Git metadata warning**

Cause: No git repo in the workspace; Harbor tries to read git metadata.

Fix:

```bash
git init && git add . && git commit -m "init"
```

This repo already has git initialized, so this warning should not appear.

MIT — free to use as templates for your own Harbor benchmark tasks.