Harbor Benchmark Tasks

Two production-quality AI agent benchmark tasks built on the Harbor evaluation framework.
Both tasks pass full Harbor validation: Oracle → 1.0, NOP → 0.0, ruff → clean.


Table of Contents

  1. What Is This Project?
  2. How Harbor Works
  3. Project Structure
  4. Quick Start
  5. Task 1 — user_data_pipeline
  6. Task 2 — ecommerce_analytics_pipeline
  7. Scoring System
  8. Dataset Details
  9. Developer Workflow
  10. Troubleshooting

What Is This Project?

This repository contains two Harbor benchmark tasks designed to evaluate whether an AI agent can correctly process structured data and produce accurate analytical outputs.

Task                          Difficulty  Features                       Scoring
user_data_pipeline            Level 1     8 data-processing features     Binary (pass/fail)
ecommerce_analytics_pipeline  Level 2     7 advanced analytics features  Partial credit (0.0–1.0)

Each task follows the Harbor specification:

  • An Oracle agent runs the reference solution and must score 1.0
  • A NOP agent (does nothing) must score 0.0
  • All processing is dynamic — results are computed from the input data, never hardcoded

How Harbor Works

┌─────────────────────────────────────────────────────────────────┐
│                        Harbor Run                               │
│                                                                 │
│  1. Builds Docker image from environment/Dockerfile             │
│  2. Starts container, copies input data into /app/              │
│  3. Agent runs (Oracle: executes solution/solve.sh)             │
│  4. Verifier copies tests/ → /tests/ inside container           │
│  5. Verifier executes /tests/test.sh                            │
│  6. test.sh runs test_outputs.py and writes a float to          │
│     /logs/verifier/reward.txt  (bind-mounted to host)           │
│  7. Harbor reads reward.txt → final score                       │
└─────────────────────────────────────────────────────────────────┘

Key Paths Inside the Container

Path                       Purpose
/app/                      Task working directory (input + output)
/solution/                 Oracle's reference solution (injected by Harbor at runtime)
/tests/                    Test suite (injected by Harbor after the agent runs)
/logs/verifier/reward.txt  Score file — Harbor reads this float as the final score
/logs/verifier/            Bind-mounted to the host's trial directory

⚠️ Critical: reward.txt must always be written by test.sh, even when tests fail.
If it is missing, Harbor throws RewardFileNotFoundError and counts it as an error, not a 0.0 score.


Project Structure

Harbor agent/
├── Makefile                                  ← One-command runner with timestamped job names
├── pyproject.toml                            ← uv Python project manifest
├── harbor_tasks/
│   ├── user_data_pipeline/                  ← Level 1 task
│   │   ├── task.toml                        ← Harbor config (resources, timeouts)
│   │   ├── instruction.md                   ← Agent instructions (absolute paths)
│   │   ├── environment/
│   │   │   ├── Dockerfile                   ← python:3.11-slim, stdlib only
│   │   │   └── input.json                  ← 18-record user dataset
│   │   ├── solution/
│   │   │   └── solve.sh                     ← Oracle reference solution
│   │   └── tests/
│   │       ├── test.sh                      ← Harbor verifier entry point
│   │       └── test_outputs.py              ← 27-assertion test suite (binary score)
│   │
│   └── ecommerce_analytics_pipeline/        ← Level 2 task
│       ├── task.toml
│       ├── instruction.md
│       ├── environment/
│       │   ├── Dockerfile                   ← python:3.11-slim + pandas + numpy
│       │   └── input/
│       │       ├── orders.json              ← 35 orders across 9 months
│       │       └── customers.json           ← 14 customers across 4 regions
│       ├── solution/
│       │   └── solve.sh                     ← pandas-powered analytics
│       └── tests/
│           ├── test.sh
│           └── test_outputs.py              ← Partial-credit test suite (0.0–1.0)

Quick Start

Prerequisites

  • Docker Desktop — must be running (not just installed)
  • Python 3.11+

Installation

# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"

# 2. Add Docker to PATH (Harbor calls docker as a subprocess — this is required)
export PATH="/usr/local/bin:$PATH"
# Make it permanent:
echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.zshrc

# 3. Install dependencies
uv add harbor

# 4. Run everything
make all

Task 1 — user_data_pipeline

Goal

Read /app/input.json (an array of user records), validate and process the data, and write 8 structured output files to /app/output/.

Input Dataset (input.json)

18 user records with the schema: id, name, age, email

Category                  Count      Details
Valid records             16         All 4 required fields present
Invalid (missing fields)  2          ID 14 missing age; ID 16 missing name
Duplicate emails          4 records  IDs 1+13 share alice@example.com; IDs 2+11 share bob@gmail.com
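
For illustration, records shaped like these fall into each category (the IDs and emails follow the table above; names and ages are hypothetical):

records = [
    {"id": 1,  "name": "Alice",  "age": 28, "email": "alice@example.com"},  # valid
    {"id": 13, "name": "Alicia", "age": 41, "email": "alice@example.com"},  # duplicate email with ID 1
    {"id": 14, "name": "Niko",   "email": "niko@yahoo.com"},                # invalid: missing "age"
]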

Required Outputs

Output File           Description                                  Key Logic
users.csv             Valid records in CSV format                  Header: id,name,age,email
users_sorted.csv      Valid records sorted by age ascending        Same data, sorted
invalid_records.json  Records that failed validation               Missing any required field
duplicates.json       Records with shared email addresses          All copies, not just the extras
stats.json            total_users, average_age, min_age, max_age   Valid records only
age_groups.json       Count per bracket: 18-25, 26-35, 36-50, 50+  Inclusive on both ends
domain_counts.json    Email domain → occurrence count              Valid records only
process.log           Timestamped execution log                    Must contain "Pipeline started" and "completed"
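
Because the brackets are inclusive on both ends, segmentation reduces to ordered range checks. A minimal sketch (the function name is illustrative):

def age_bracket(age: int) -> str:
    # Checks run in the order the spec lists the brackets, so age 50
    # lands in "36-50" rather than "50+"
    if 18 <= age <= 25:
        return "18-25"
    if 26 <= age <= 35:
        return "26-35"
    if 36 <= age <= 50:
        return "36-50"
    return "50+"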

Reference Solution (solve.sh)

Written entirely with Python stdlib (no pip packages required). Processing pipeline:

1. json.load()             → load all 18 records
2. Field validation        → separate valid (16) from invalid (2)
3. defaultdict groupby     → detect duplicate emails
4. csv.DictWriter          → write users.csv and users_sorted.csv
5. sum/min/max arithmetic  → compute stats.json
6. range comparisons       → segment age_groups.json
7. str.split("@")[-1]      → extract domain_counts.json
8. datetime.now(UTC)       → timestamped process.log
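
A condensed sketch of steps 2 and 3, assuming the schema above (variable names are illustrative, not copied from solve.sh):

import json
from collections import defaultdict

REQUIRED_FIELDS = ("id", "name", "age", "email")

with open("/app/input.json") as f:
    records = json.load(f)

# Step 2: split records on presence of all required fields
valid   = [r for r in records if all(k in r for k in REQUIRED_FIELDS)]
invalid = [r for r in records if any(k not in r for k in REQUIRED_FIELDS)]

# Step 3: group valid records by email; every group with more than one
# member is a duplicate set, and all copies are reported, not just extras
by_email = defaultdict(list)
for record in valid:
    by_email[record["email"]].append(record)
duplicates = [r for grp in by_email.values() if len(grp) > 1 for r in grp]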

Test Suite (test_outputs.py)

27 assertions — pure Python stdlib, zero external dependencies.

Section 1 — File existence     ............. 8 assertions
Section 2 — users.csv          ............. 3 assertions
Section 3 — users_sorted.csv   ............. 2 assertions
Section 4 — invalid_records    ............. 2 assertions
Section 5 — duplicates.json    ............. 2 assertions
Section 6 — stats.json         ............. 4 assertions
Section 7 — age_groups.json    ............. 4 assertions
Section 8 — domain_counts.json ............. 5 assertions
Section 9 — process.log        ............. 3 assertions

Ground truth is computed dynamically from input.json at test time. If you change the dataset, tests automatically update — no hardcoded expected values.
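
For example, the expected average age can be derived at test time like this (a sketch of the pattern; the real assertions live in test_outputs.py):

import json

REQUIRED = ("id", "name", "age", "email")

with open("/app/input.json") as f:
    records = json.load(f)
valid = [r for r in records if all(k in r for k in REQUIRED)]

with open("/app/output/stats.json") as f:
    stats = json.load(f)

# Ground truth is recomputed from the input, never pinned to a constant
assert stats["total_users"] == len(valid)
expected_avg = sum(r["age"] for r in valid) / len(valid)
assert abs(stats["average_age"] - expected_avg) < 0.01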

task.toml

version = "1.0"

[environment]
memory_mb = 512     # RAM limit for the container
storage_mb = 256    # Disk limit

[verifier]
timeout_sec = 120.0 # Max seconds test.sh may take

Task 2 — ecommerce_analytics_pipeline

Goal

Join two datasets (orders.json + customers.json) from /app/input/ and produce 7 analytical output files, plus a pipeline log, in /app/output/ covering revenue, customer lifetime value, regional performance, and churn risk.

Input Datasets

orders.json — 35 orders spanning January–September 2024

Field        Type        Values
order_id     int         Unique identifier
customer_id  int         Foreign key → customers
product      string      Laptop Pro 15, Smart Watch, Running Shoes, Wireless Earbuds,
                         Coffee Maker, USB-C Hub, Yoga Mat, Novel Collection, Desk Lamp
category     string      Electronics / Apparel / Home / Sports / Books
amount       float       USD value
date         YYYY-MM-DD  2024-01-05 to 2024-09-30
status       string      completed (30) / cancelled (3) / pending (2)

customers.json — 14 customers

Field        Type        Values
customer_id  int         1–14
name         string      Full name
email        string      Contact email
region       string      North / South / East / West
signup_date  YYYY-MM-DD  2023-10-01 to 2024-06-01

Required Outputs

Rule: Only completed orders count toward revenue, LTV, rankings, and regional stats.
order_status_dist.json counts all orders.

Output File             Algorithm Summary
monthly_revenue.json    {YYYY-MM: revenue} — group by month, sum amounts, sort chronologically
top_products.json       [{product, revenue}] — aggregate by product, sort descending, take top 5
customer_ltv.json       [{customer_id, name, ltv}] — sum by customer_id, merge names, sort by ltv desc
regional_summary.json   {region: {revenue, order_count}} — join orders→customers on customer_id, group by region
order_status_dist.json  {status: count} — value counts across all 35 orders
churn_risk.json         Customers whose last completed order is > 60 days before the most recent
                        order in the dataset, or who have no completed orders at all
analytics_report.md     Markdown BI summary combining all of the above metrics
pipeline.log            UTC-timestamped execution log

Reference Solution (solve.sh)

Uses pandas 2.2.3 + numpy 1.26.4, pre-installed in the Docker image. In outline (a runnable condensation of the approach, not the verbatim script):

import pandas as pd

# Core data loading
orders_df    = pd.read_json("/app/input/orders.json")
customers_df = pd.read_json("/app/input/customers.json")

# Only completed orders count toward revenue, LTV, rankings, regions
completed = orders_df[orders_df["status"] == "completed"].copy()
completed["date"] = pd.to_datetime(completed["date"])

# Feature 1 — Monthly revenue, sorted chronologically
monthly = (completed.groupby(completed["date"].dt.strftime("%Y-%m"))["amount"]
           .sum().sort_index())

# Feature 2 — Top products
top_products = (completed.groupby("product")["amount"]
                .sum().sort_values(ascending=False).head(5))

# Feature 3 — Customer LTV
ltv = (completed.groupby("customer_id")["amount"].sum().rename("ltv")
       .reset_index()
       .merge(customers_df[["customer_id", "name"]], on="customer_id")
       .sort_values("ltv", ascending=False))

# Feature 4 — Regional summary
regional = (completed.merge(customers_df, on="customer_id")
            .groupby("region")["amount"]
            .agg(revenue="sum", order_count="count"))

# Feature 5 — Status distribution (all orders, not just completed)
status_dist = orders_df["status"].value_counts()

# Feature 6 — Churn risk: no completed orders, or last completed order
# more than 60 days before the most recent order in the dataset
max_date = pd.to_datetime(orders_df["date"]).max()
cutoff   = max_date - pd.Timedelta(days=60)
last_completed = completed.groupby("customer_id")["date"].max()
churned = [cid for cid in customers_df["customer_id"]
           if cid not in last_completed.index or last_completed[cid] < cutoff]

Partial-Credit Scoring

Each of the 7 features is independently verified. Score = (features passed) / 7

Feature            Weight  Validation
File existence     1/7     All 8 output files present
Monthly revenue    1/7     Correct months and values (±0.05), sorted chronologically
Top 5 products     1/7     Exact product ranking order and revenue values
Customer LTV       1/7     Correct LTV values, sorted descending by ltv
Regional summary   1/7     Correct revenue + order_count per region
Order status dist  1/7     Exact counts: 30 completed, 3 cancelled, 2 pending
Churn risk         1/7     Correct set of at-risk customer IDs

Examples:

  • 7/7 features → 1.0 (Oracle)
  • 5/7 features → 0.714
  • 0/7 features → 0.0 (NOP)
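
A minimal sketch of how a partial-credit verifier can total and persist the score (the check function is a hypothetical placeholder, not code from this repo):

import os

def check_file_existence() -> bool:
    # Placeholder: each real check recomputes ground truth from
    # /app/input/ and compares it against one /app/output/ file
    return True

FEATURE_CHECKS = [check_file_existence]  # ...plus the other six checks

passed = sum(1 for check in FEATURE_CHECKS if check())
reward = passed / 7                      # e.g. 5 passes -> 0.714

os.makedirs("/logs/verifier", exist_ok=True)
with open("/logs/verifier/reward.txt", "w") as f:
    f.write(f"{reward:.3f}")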

task.toml

version = "1.0"

[environment]
memory_mb = 1024    # Higher limit — pandas loads both datasets into memory
storage_mb = 512

[verifier]
timeout_sec = 180.0 # pandas-based solution needs more time than stdlib

Scoring System

How the Score Travels from Container to Harbor

Container                               Host (bind-mount)
─────────                               ─────────────────
test.sh
  → python3 /tests/test_outputs.py
  → writes float to                →    jobs/<job>/trial/verifier/reward.txt
       /logs/verifier/reward.txt
                                        Harbor reads this file → final score
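
On the host side, the contract reduces to parsing that single file. Conceptually (a sketch, not Harbor's actual implementation):

from pathlib import Path

job = "l1-oracle-20260307-170651"   # example timestamped job name
reward_file = Path("jobs") / job / "trial" / "verifier" / "reward.txt"
score = float(reward_file.read_text().strip())   # "0.714" -> 0.714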

The set +e Pattern — Why It Matters

#!/usr/bin/env bash
set -uo pipefail   # set -e is OFF — intentional

mkdir -p /logs/verifier

set +e             # ← disable exit-on-error temporarily
python3 /tests/test_outputs.py 2>&1
TEST_EXIT=$?       # ← capture the exit code before re-enabling set -e
set -e             # ← re-enable strict mode

if [ "${TEST_EXIT}" -eq 0 ]; then
    echo "1.0" > /logs/verifier/reward.txt
else
    echo "0.0" > /logs/verifier/reward.txt
fi

If you write set -euo pipefail and the python3 test command exits with code 1, bash kills the script on the spot: the if block never runs, reward.txt is never written, and Harbor reports RewardFileNotFoundError as an error (not a score of 0.0).


Dataset Details

Level 1 Computed Statistics

Records total:   18
Valid:           16
Invalid:          2 (IDs 14, 16)

Duplicates by email:
  alice@example.com → IDs 1, 13
  bob@gmail.com     → IDs 2, 11

Age groups (valid records):
  18–25: 4 users
  26–35: 6 users
  36–50: 4 users
  50+:   2 users

Stats:
  total_users:  16
  average_age:  35.12
  min_age:      19
  max_age:      60

Email domains:
  example.com   6 occurrences
  gmail.com     5 occurrences
  yahoo.com     3 occurrences
  outlook.com   2 occurrences

Level 2 Computed Statistics

Monthly revenue (completed orders):
  2024-01: $1,389.98    2024-06: $  209.48
  2024-02: $1,464.96    2024-07: $1,494.96
  2024-03: $  489.48    2024-08: $  244.48
  2024-04: $  169.97    2024-09: $1,724.96
  2024-05: $1,489.96

Top 5 products by revenue:
  1. Laptop Pro 15      $6,499.95
  2. Smart Watch        $  599.98
  3. Running Shoes      $  388.50
  4. Wireless Earbuds   $  359.96
  5. Coffee Maker       $  319.96

Regional performance (completed):
  North  $5,734.91  9 orders
  South  $1,994.43  8 orders
  West   $  513.96  6 orders
  East   $  434.93  7 orders
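
(Cross-check: the four regional revenues sum to $8,678.23, matching the nine monthly figures above, and the regional order counts total 30, matching the completed-order count below.)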

Order status: completed=30, cancelled=3, pending=2

Churn risk (7 customers):
  Last completed orders before 2024-08-01 (cutoff = max_date - 60 days)
  IDs: 1, 2, 3, 8, 9, 10, 11

Developer Workflow

Makefile Targets

make oracle-l1    # L1 oracle → expect reward 1.0
make nop-l1       # L1 nop    → expect reward 0.0
make oracle-l2    # L2 oracle → expect reward 1.0
make nop-l2       # L2 nop    → expect reward 0.0
make lint         # ruff check both tasks
make all          # run all 4 tests + lint sequentially
make clean        # delete jobs/ cache

All make targets automatically:

  1. Export PATH="/usr/local/bin:$PATH" so Docker is found
  2. Generate timestamped job names (e.g. l1-oracle-20260307-170651) to avoid Harbor caching

Adding a New Harbor Task

mkdir -p harbor_tasks/my_new_task/{environment,solution,tests}

Minimum required files:

File                    Must contain
task.toml               version = "1.0", [environment] with memory_mb and storage_mb
instruction.md          Agent instructions using absolute paths (/app/)
environment/Dockerfile  Container build — must NOT copy solution/ or tests/
solution/solve.sh       Reference solution (absolute paths only)
tests/test.sh           Must write a float to /logs/verifier/reward.txt
tests/test_outputs.py   Validation logic (stdlib only recommended)

Validation Checklist

Before submitting a new task, verify:

  • uv run harbor run --agent oracle → score 1.0, Trials=1, Errors=0
  • uv run harbor run --agent nop → score 0.0, Trials=1, Errors=0
  • uv run ruff check harbor_tasks/<task> → All checks passed
  • No hardcoded values in test_outputs.py — all ground truth computed from input files
  • Dockerfile copies only environment/ contents, not solution/ or tests/

Troubleshooting

FileNotFoundError: 'docker'

Cause: Harbor's asyncio subprocess can't find the docker binary.

Fix:

export PATH="/usr/local/bin:$PATH"
# Permanent fix:
echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.zshrc

RewardFileNotFoundError on NOP run

Cause: set -e is active in test.sh and the Python test exits with code 1, aborting the script before reward.txt is written.

Fix: Wrap the python call with set +e / set -e:

set +e
python3 /tests/test_outputs.py 2>&1
TEST_EXIT=$?
set -e

Oracle returns 0.0 with old job name

Cause: Harbor cached a previous failed run under the same job name.

Fix:

make clean   # removes all cached jobs/
make all     # Makefile auto-generates new timestamped job names

uv run ruff check → No module named ruff

Cause: invoking python3 -m ruff when the active venv's Python doesn't have ruff installed.

Fix: Always use uv run ruff check <path> — uv handles the tool resolution independently of the active venv.

fatal: bad revision 'HEAD'

Cause: No git repo in the workspace. Harbor tries to read git metadata.

Fix:

git init && git add . && git commit -m "init"

This repo already has git initialized — this warning should not appear.


License

MIT — free to use as templates for your own Harbor benchmark tasks.
