Two production-quality AI agent benchmark tasks built on the Harbor evaluation framework.
Both tasks pass full Harbor validation: Oracle → 1.0, NOP → 0.0, ruff → clean.
- What Is This Project?
- How Harbor Works
- Project Structure
- Quick Start
- Task 1 — user_data_pipeline
- Task 2 — ecommerce_analytics_pipeline
- Scoring System
- Architecture Deep Dive
- Dataset Details
- Developer Workflow
- Troubleshooting

## What Is This Project?

This repository contains two Harbor benchmark tasks designed to evaluate whether an AI agent can correctly process structured data and produce accurate analytical outputs.

| Task | Difficulty | Features | Scoring |
|---|---|---|---|
| `user_data_pipeline` | Level 1 | 8 data-processing features | Binary (pass/fail) |
| `ecommerce_analytics_pipeline` | Level 2 | 7 advanced analytics features | Partial credit (0.0–1.0) |

Each task follows the Harbor specification:
- An Oracle agent runs the reference solution and must score 1.0
- A NOP agent (does nothing) must score 0.0
- All processing is dynamic — results are computed from the input data, never hardcoded

## How Harbor Works

A single `harbor run` proceeds through seven steps:

1. Builds the Docker image from `environment/Dockerfile`
2. Starts the container, copies input data into `/app/`
3. The agent runs (Oracle: executes `solution/solve.sh`)
4. The verifier copies `tests/` → `/tests/` inside the container
5. The verifier executes `/tests/test.sh`
6. `test.sh` runs `test_outputs.py` and writes a float to `/logs/verifier/reward.txt` (bind-mounted to the host)
7. Harbor reads `reward.txt` → final score

| Path | Purpose |
|---|---|
| `/app/` | Task working directory (input + output) |
| `/solution/` | Oracle's reference solution (injected by Harbor at runtime) |
| `/tests/` | Test suite (injected by Harbor after the agent runs) |
| `/logs/verifier/reward.txt` | Score file — Harbor reads this float as the final score |
| `/logs/verifier/` | Bind-mounted to the host's trial directory |

⚠️ **Critical:** `reward.txt` must always be written by `test.sh`, even when tests fail.
If it is missing, Harbor throws `RewardFileNotFoundError` and counts it as an error, not a `0.0` score.
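
One way to honor this invariant (a sketch only; the actual verifiers here are shell-driven, as shown in the Architecture Deep Dive below) is to let the Python side write the file on every code path. `run_checks()` is a hypothetical stand-in for real validation logic:

```python
# Defensive reward writing: reward.txt exists no matter how the checks end.
import pathlib
import traceback

REWARD_FILE = pathlib.Path("/logs/verifier/reward.txt")

def run_checks() -> float:
    """Return a score in [0.0, 1.0]; may raise on unexpected failure."""
    return 1.0  # placeholder for real validation logic

if __name__ == "__main__":
    try:
        score = run_checks()
    except Exception:
        traceback.print_exc()  # surfaces in the verifier log
        score = 0.0            # a crash still yields a valid score file
    REWARD_FILE.parent.mkdir(parents=True, exist_ok=True)
    REWARD_FILE.write_text(f"{score}\n")
```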

## Project Structure

```
Harbor agent/
├── Makefile                     ← One-command runner with timestamped job names
├── pyproject.toml               ← uv Python project manifest
├── harbor_tasks/
│   ├── user_data_pipeline/      ← Level 1 task
│   │   ├── task.toml            ← Harbor config (resources, timeouts)
│   │   ├── instruction.md       ← Agent instructions (absolute paths)
│   │   ├── environment/
│   │   │   ├── Dockerfile       ← python:3.11-slim, stdlib only
│   │   │   └── input.json       ← 18-record user dataset
│   │   ├── solution/
│   │   │   └── solve.sh         ← Oracle reference solution
│   │   └── tests/
│   │       ├── test.sh          ← Harbor verifier entry point
│   │       └── test_outputs.py  ← 27-assertion test suite (binary score)
│   │
│   └── ecommerce_analytics_pipeline/  ← Level 2 task
│       ├── task.toml
│       ├── instruction.md
│       ├── environment/
│       │   ├── Dockerfile       ← python:3.11-slim + pandas + numpy
│       │   └── input/
│       │       ├── orders.json      ← 35 orders across 9 months
│       │       └── customers.json   ← 14 customers across 4 regions
│       ├── solution/
│       │   └── solve.sh         ← pandas-powered analytics
│       └── tests/
│           ├── test.sh
│           └── test_outputs.py  ← Partial-credit test suite (0.0–1.0)
```

## Quick Start

Prerequisites:

- Docker Desktop — must be running (not just installed)
- Python 3.11+

```bash
# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"

# 2. Add Docker to PATH (Harbor calls docker as a subprocess — this is required)
export PATH="/usr/local/bin:$PATH"
# Make it permanent:
echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.zshrc

# 3. Install dependencies
uv add harbor

# 4. Run everything
make all
```

## Task 1 — user_data_pipeline

Read `/app/input.json` (an array of user records), validate and process the data, and write 8 structured output files to `/app/output/`.

`input.json` — 18 user records with the schema: `id`, `name`, `age`, `email`

| Category | Count | Details |
|---|---|---|
| Valid records | 16 | All 4 required fields present |
| Invalid (missing fields) | 2 | ID 14 missing age; ID 16 missing name |
| Duplicate emails | 4 records | IDs 1+13 share alice@example.com; IDs 2+11 share bob@gmail.com |

| Output File | Description | Key Logic |
|---|---|---|
| `users.csv` | Valid records in CSV format | Header: `id,name,age,email` |
| `users_sorted.csv` | Valid records sorted by age ascending | Same data, sorted |
| `invalid_records.json` | Records that failed validation | Missing any required field |
| `duplicates.json` | Records with shared email addresses | All copies, not just the extras |
| `stats.json` | `total_users`, `average_age`, `min_age`, `max_age` | Valid records only |
| `age_groups.json` | Count per bracket: 18-25, 26-35, 36-50, 50+ | Inclusive on both ends |
| `domain_counts.json` | Email domain → occurrence count | Valid records only |
| `process.log` | Timestamped execution log | Must contain "Pipeline started" and "completed" |

Written entirely with Python stdlib (no pip packages required). Processing pipeline:

1. `json.load()` → load all 18 records
2. Field validation → separate valid (16) from invalid (2)
3. `defaultdict` group-by → detect duplicate emails
4. `csv.DictWriter` → write `users.csv` and `users_sorted.csv`
5. `sum`/`min`/`max` arithmetic → compute `stats.json`
6. Range comparisons → segment `age_groups.json`
7. `str.split("@")[-1]` → extract `domain_counts.json`
8. `datetime.now(UTC)` → timestamped `process.log`
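
Condensed into a sketch (illustrative variable names; the real `solve.sh` additionally writes all eight output files, and the 2-decimal rounding is an assumption consistent with the dataset details below):

```python
import json
from collections import Counter, defaultdict

REQUIRED = ("id", "name", "age", "email")

with open("/app/input.json") as f:
    records = json.load(f)

# Step 2 — split valid from invalid by required-field presence
valid = [r for r in records if all(k in r for k in REQUIRED)]
invalid = [r for r in records if not all(k in r for k in REQUIRED)]

# Step 3 — group by email; every member of a shared-email group is a duplicate
by_email = defaultdict(list)
for r in valid:
    by_email[r["email"]].append(r)
duplicates = [r for grp in by_email.values() if len(grp) > 1 for r in grp]

# Step 5 — stats over valid records only
ages = [r["age"] for r in valid]
stats = {
    "total_users": len(valid),
    "average_age": round(sum(ages) / len(ages), 2),
    "min_age": min(ages),
    "max_age": max(ages),
}

# Step 7 — email domain counts
domain_counts = Counter(r["email"].split("@")[-1] for r in valid)
```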
27 assertions — pure Python stdlib, zero external dependencies.

```
Section 1 — File existence ........... 8 assertions
Section 2 — users.csv ................ 3 assertions
Section 3 — users_sorted.csv ......... 2 assertions
Section 4 — invalid_records .......... 2 assertions
Section 5 — duplicates.json .......... 2 assertions
Section 6 — stats.json ............... 4 assertions
Section 7 — age_groups.json .......... 4 assertions
Section 8 — domain_counts.json ....... 5 assertions
Section 9 — process.log .............. 3 assertions
```

Ground truth is computed dynamically from input.json at test time. If you change the dataset, tests automatically update — no hardcoded expected values.
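
For example, a stats check in this style (illustrative, not the literal test code) derives its expectations from `input.json` at runtime:

```python
import json

with open("/app/input.json") as f:
    records = json.load(f)
valid = [r for r in records if all(k in r for k in ("id", "name", "age", "email"))]

with open("/app/output/stats.json") as f:
    stats = json.load(f)

# Ground truth is derived from the dataset, never hardcoded
assert stats["total_users"] == len(valid)
assert stats["min_age"] == min(r["age"] for r in valid)
assert stats["max_age"] == max(r["age"] for r in valid)
```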
version = "1.0"
[environment]
memory_mb = 512 # RAM limit for the container
storage_mb = 256 # Disk limit
[verifier]
timeout_sec = 120.0 # Max seconds test.sh may takeJoin two datasets (orders.json + customers.json) from /app/input/ and produce 7 analytical output files in /app/output/ covering revenue, customer lifetime value, regional performance, and churn risk.

`orders.json` — 35 orders spanning January–September 2024

| Field | Type | Values |
|---|---|---|
| `order_id` | int | Unique identifier |
| `customer_id` | int | Foreign key → customers |
| `product` | string | Laptop Pro 15, Smart Watch, Running Shoes, Wireless Earbuds, Coffee Maker, USB-C Hub, Yoga Mat, Novel Collection, Desk Lamp |
| `category` | string | Electronics / Apparel / Home / Sports / Books |
| `amount` | float | USD value |
| `date` | `YYYY-MM-DD` | 2024-01-05 to 2024-09-30 |
| `status` | string | completed (30) / cancelled (3) / pending (2) |

`customers.json` — 14 customers

| Field | Type | Values |
|---|---|---|
| `customer_id` | int | 1–14 |
| `name` | string | Full name |
| `email` | string | Contact email |
| `region` | string | North / South / East / West |
| `signup_date` | `YYYY-MM-DD` | 2023-10-01 to 2024-06-01 |

**Rule:** Only `completed` orders count toward revenue, LTV, rankings, and regional stats. `order_status_dist.json` counts all orders.

| Output File | Algorithm Summary |
|---|---|
| `monthly_revenue.json` | `{YYYY-MM: revenue}` — group by month, sum amounts, sort chronologically |
| `top_products.json` | `[{product, revenue}]` — aggregate by product, sort descending, take top 5 |
| `customer_ltv.json` | `[{customer_id, name, ltv}]` — sum by customer_id, merge names, sort by ltv desc |
| `regional_summary.json` | `{region: {revenue, order_count}}` — join orders→customers on customer_id, group by region |
| `order_status_dist.json` | `{status: count}` — value counts across all 35 orders |
| `churn_risk.json` | Customers whose last completed order is > 60 days before the most recent order in the dataset, or who have no completed orders at all |
| `analytics_report.md` | Markdown BI summary combining all the above metrics |
| `pipeline.log` | UTC-timestamped execution log |

Uses pandas 2.2.3 + numpy 1.26.4, pre-installed in the Docker image. The snippets below sketch each feature; the actual `solve.sh` also serializes results to the output files:

```python
import pandas as pd

# Core data loading
orders_df = pd.read_json("/app/input/orders.json")
customers_df = pd.read_json("/app/input/customers.json")
completed = orders_df[orders_df["status"] == "completed"]

# Feature 1 — Monthly revenue: group by YYYY-MM, sum, chronological order
month = pd.to_datetime(completed["date"]).dt.strftime("%Y-%m")
monthly_revenue = completed.groupby(month)["amount"].sum().sort_index()

# Feature 2 — Top products: aggregate by product, take the 5 largest
top_products = completed.groupby("product")["amount"].sum().nlargest(5)

# Feature 3 — Customer LTV: sum per customer, merge names, sort descending
customer_ltv = (
    completed.groupby("customer_id")["amount"].sum().rename("ltv").reset_index()
    .merge(customers_df[["customer_id", "name"]]).sort_values("ltv", ascending=False)
)

# Feature 4 — Regional summary: join orders → customers, group by region
regional_summary = completed.merge(customers_df, on="customer_id").groupby("region").agg(
    revenue=("amount", "sum"), order_count=("order_id", "count")
)

# Feature 5 — Status distribution (all 35 orders, not just completed)
status_dist = orders_df["status"].value_counts()

# Feature 6 — Churn risk: last completed order > 60 days before the newest
# order in the dataset, or no completed orders at all
max_date = pd.to_datetime(orders_df["date"]).max()
cutoff = max_date - pd.Timedelta(days=60)
last_order = customers_df["customer_id"].map(
    pd.to_datetime(completed["date"]).groupby(completed["customer_id"]).max()
)
churn_risk = customers_df[last_order.isna() | (last_order < cutoff)]
```

Each of the 7 features is independently verified. Score = (features passed) / 7.

| Feature | Weight | Validation |
|---|---|---|
| File existence | 1/7 | All 8 output files present |
| Monthly revenue | 1/7 | Correct months, values (±0.05), sorted chronologically |
| Top 5 products | 1/7 | Exact product ranking order and revenue values |
| Customer LTV | 1/7 | Correct LTV values, descending sort by ltv |
| Regional summary | 1/7 | Correct revenue + order_count per region |
| Order status dist | 1/7 | Exact counts: 30 completed, 3 cancelled, 2 pending |
| Churn risk | 1/7 | Correct set of at-risk customer IDs |
Examples:

- 7/7 features → `1.0` (Oracle)
- 5/7 features → `0.714`
- 0/7 features → `0.0` (NOP)
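
Under the hood this means the Level 2 verifier writes the fraction itself rather than a fixed `1.0`/`0.0`. A minimal sketch of that pattern, with hypothetical `check_*` names (the real `test_outputs.py` differs):

```python
# Sketch of a partial-credit verifier: each check_* callable returns
# True/False, and the fraction of passes becomes the reward.
import pathlib

def check_file_existence() -> bool:
    return True  # placeholder — real logic inspects /app/output/

def check_monthly_revenue() -> bool:
    return True  # placeholder — real logic recomputes ground truth

CHECKS = [check_file_existence, check_monthly_revenue]  # 7 checks in the real suite

def run(check) -> bool:
    try:
        return bool(check())
    except Exception:
        return False  # a crashing check simply scores 0 for that feature

if __name__ == "__main__":
    score = sum(run(c) for c in CHECKS) / len(CHECKS)
    out = pathlib.Path("/logs/verifier/reward.txt")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(f"{score:.3f}\n")  # e.g. 5/7 → 0.714
```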
version = "1.0"
[environment]
memory_mb = 1024 # Higher limit — pandas loads both datasets into memory
storage_mb = 512
[verifier]
timeout_sec = 180.0 # pandas-based solution needs more time than stdlibContainer Host (bind-mount)
───────── ─────────────────
test.sh
→ python3 /tests/test_outputs.py
→ writes float to → jobs/<job>/trial/verifier/reward.txt
/logs/verifier/reward.txt
Harbor reads this file → final score

The `test.sh` pattern that guarantees the reward file is always written:

```bash
#!/usr/bin/env bash
set -uo pipefail   # set -e is OFF — intentional

mkdir -p /logs/verifier

set +e                                # ← disable exit-on-error around the test run
python3 /tests/test_outputs.py 2>&1
TEST_EXIT=$?                          # ← capture the exit code
set -e                                # ← re-enable strict mode

if [ "${TEST_EXIT}" -eq 0 ]; then
    echo "1.0" > /logs/verifier/reward.txt
else
    echo "0.0" > /logs/verifier/reward.txt
fi
```

Why this matters: with `set -euo pipefail`, a `python3 test.py` exit code of 1 kills the script immediately, before the `if` block runs. `reward.txt` is never written, and Harbor reports `RewardFileNotFoundError` as an error (not a score of `0.0`).

## Dataset Details

### Task 1 — user_data_pipeline

```
Records total: 18
Valid: 16
Invalid: 2 (IDs 14, 16)
Duplicates by email:
alice@example.com → IDs 1, 13
bob@gmail.com → IDs 2, 11
Age groups (valid records):
18–25: 4 users
26–35: 6 users
36–50: 4 users
50+: 2 users
Stats:
total_users: 16
average_age: 35.12
min_age: 19
max_age: 60
Email domains:
example.com 6 occurrences
gmail.com 5 occurrences
yahoo.com 3 occurrences
outlook.com 2 occurrences
```
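
These figures are quick to reproduce from a repo checkout (path per the project tree above; expected values per the summary):

```python
import json
from collections import Counter

with open("harbor_tasks/user_data_pipeline/environment/input.json") as f:
    records = json.load(f)

valid = [r for r in records if all(k in r for k in ("id", "name", "age", "email"))]
ages = [r["age"] for r in valid]

print(len(records), len(valid))                               # 18 16
print(round(sum(ages) / len(ages), 2), min(ages), max(ages))  # 35.12 19 60
print(Counter(r["email"].split("@")[-1] for r in valid))      # domain histogram
```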

### Task 2 — ecommerce_analytics_pipeline

```
Monthly revenue (completed orders):
2024-01: $1,389.98 2024-06: $ 209.48
2024-02: $1,464.96 2024-07: $1,494.96
2024-03: $ 489.48 2024-08: $ 244.48
2024-04: $ 169.97 2024-09: $1,724.96
2024-05: $1,489.96
Top 5 products by revenue:
1. Laptop Pro 15 $6,499.95
2. Smart Watch $ 599.98
3. Running Shoes $ 388.50
4. Wireless Earbuds $ 359.96
5. Coffee Maker $ 319.96
Regional performance (completed):
North $5,734.91 9 orders
South $1,994.43 8 orders
West $ 513.96 6 orders
East $ 434.93 7 orders
Order status: completed=30, cancelled=3, pending=2
Churn risk (7 customers):
Last completed orders before 2024-08-01 (cutoff = max_date - 60 days)
IDs: 1, 2, 3, 8, 9, 10, 11
```
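
The churn cutoff arithmetic is likewise easy to sanity-check (path per the project tree above):

```python
import pandas as pd

orders = pd.read_json("harbor_tasks/ecommerce_analytics_pipeline/environment/input/orders.json")

max_date = pd.to_datetime(orders["date"]).max()  # 2024-09-30 per the dataset
print(max_date - pd.Timedelta(days=60))          # 2024-08-01 → the churn cutoff
print(orders["status"].value_counts())           # completed 30, cancelled 3, pending 2
```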

## Developer Workflow

```bash
make oracle-l1   # L1 oracle → expect reward 1.0
make nop-l1      # L1 nop    → expect reward 0.0
make oracle-l2   # L2 oracle → expect reward 1.0
make nop-l2      # L2 nop    → expect reward 0.0
make lint        # ruff check both tasks
make all         # run all 4 tests + lint sequentially
make clean       # delete jobs/ cache
```

All make targets automatically:

- Export `PATH="/usr/local/bin:$PATH"` so Docker is found
- Generate timestamped job names (e.g. `l1-oracle-20260307-170651`) to avoid Harbor caching

To scaffold a new task:

```bash
mkdir -p harbor_tasks/my_new_task/{environment,solution,tests}
```

Minimum required files:

| File | Must contain |
|---|---|
| `task.toml` | `version = "1.0"`, `[environment]` with `memory_mb` and `storage_mb` |
| `instruction.md` | Agent instructions using absolute paths (`/app/`) |
| `environment/Dockerfile` | Build the container — do NOT copy `solution/` or `tests/` |
| `solution/solve.sh` | Reference solution (absolute paths only) |
| `tests/test.sh` | Must write a float to `/logs/verifier/reward.txt` |
| `tests/test_outputs.py` | Validation logic (stdlib only recommended) |

Before submitting a new task, verify:

- `uv run harbor run --agent oracle` → score `1.0`, Trials=1, Errors=0
- `uv run harbor run --agent nop` → score `0.0`, Trials=1, Errors=0
- `uv run ruff check harbor_tasks/<task>` → All checks passed
- No hardcoded values in `test_outputs.py` — all ground truth computed from input files
- `Dockerfile` copies only `environment/` contents, not `solution/` or `tests/`

## Troubleshooting

**`docker` binary not found**

Cause: Harbor's asyncio subprocess can't find the `docker` binary.

Fix:

```bash
export PATH="/usr/local/bin:$PATH"
# Permanent fix:
echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.zshrc
```

**`RewardFileNotFoundError`**

Cause: `set -e` is active in `test.sh` and the Python test exits with code 1, aborting the script before `reward.txt` is written.

Fix: Wrap the Python call with `set +e` / `set -e`:

```bash
set +e
python3 /tests/test_outputs.py 2>&1
TEST_EXIT=$?
set -e
```

**Stale results under a reused job name**

Cause: Harbor cached a previous failed run under the same job name.

Fix:

```bash
make clean   # removes all cached jobs/
make all     # Makefile auto-generates new timestamped job names
```

**`ruff` not found**

Cause: Using `python3 -m ruff` when the venv's Python doesn't have ruff installed.

Fix: Always use `uv run ruff check <path>` — `uv` resolves the tool independently of the active venv.

**Git metadata warning**

Cause: No git repo in the workspace; Harbor tries to read git metadata.

Fix:

```bash
git init && git add . && git commit -m "init"
```

This repo already has git initialized, so this warning should not appear.

MIT — free to use as templates for your own Harbor benchmark tasks.