
poRTLe

poRTLe is a platform for running, tracking, and analyzing AI agent performance across RTL benchmarks, datasets, and tasks.

It was developed by Dakota Barnes and Nik Belle during an internship at Barcelona Supercomputing Center.

Contact nbelle@ucsb.edu or dakotabarnes@ucsb.edu with any questions, suggestions, or ideas for discussion. We are excited to hear about potential use cases across the community.

Check out poRTLe's demo site at: https://portle.hpai.bsc.es

Motivation

Research on AI agents for RTL design is fragmented across benchmarks, toolchains, and evaluation practices, making fair comparison, iteration, and collaboration difficult. Agent failures are often hard to attribute to model limitations versus task ambiguity, and benchmarks lack feedback loops for improvement. poRTLe provides a unified platform to run agents across multiple RTL benchmarks, track results with rich metadata, and analyze failures systematically. It enables easy creation and modification of tasks, supports evaluation across different architectures, and serves as a shared collaboration layer between hardware engineers and AI agent developers.

Overview

poRTLe provides a unified interface for:

  • Running AI agents against multiple benchmarks (CVDP, TuRTLe, DIY, etc.)
  • Creating custom DIY tasks from your own RTL code
  • Tracking results in a structured database with rich metadata
  • Visualizing performance with an interactive UI
  • Managing benchmarks with persistent background execution
  • Filtering and analyzing tasks with advanced query capabilities

Quick Start

1. Clone External Dependencies

Clone the external projects that live in gitignored directories (only the runner is required for the Quick Start):

  • benchmark_runners/ – Required for Quick Start. Follow benchmark_runners/README.md to clone the CVDP runner (cvdp_benchmark) and set up its cvdp_env virtual environment.

  • benchmark_datasets/ – Benchmark task datasets (optional extras for Quick Start; see README).

  • agents/ – Agent implementations (optional; see README).

  • results/ – Shared results repository (optional for Quick Start; see README).

  • src/datasets/builders/dataset_repos/ – Source repos used by dataset builders. To build or regenerate TuRTLe-based datasets (RTLLM, VerilogEval via CVDP), clone TuRTLe here:

    git clone https://github.com/HPAI-BSC/TuRTLe src/datasets/builders/dataset_repos/TuRTLe

    Any new benchmark you integrate via CVDP should have its source data cloned into this shared dataset_repos/ directory.

If you’re creating a new runner (execution stack), follow the docs under benchmark_runners/. If you’re adding a new benchmark dataset for an existing runner (e.g., CVDP), see src/datasets/builders/README.md for the builder workflow and required folder layout.

Note: results/ is its own git repository so teams can share benchmark data independently of the main codebase.

2. Create Python Environment

Use a dedicated virtual environment so poRTLe’s dependencies stay isolated:

python3 -m venv .venv
source .venv/bin/activate        # macOS/Linux
# .\.venv\Scripts\activate        # Windows PowerShell
pip install --upgrade pip
pip install -r requirements.txt

Reactivate .venv in every new shell and run deactivate when you are finished.

3. Set Up Results Directory

Copy the example results to initialize the results directory:

cp -r results_example/. results/

4. Launch the UI Shell

Start Streamlit to drive most workflows through the graphical interface:

streamlit run src/ui/portle_ui.py

Open http://localhost:8501 (the app may take a few seconds to load on first launch).

Tip: Most operations can be performed directly inside the UI instead of running Python scripts manually.

5. Run Benchmarks and View Results

Note: You must have your Docker daemon running in the background for this to work.

Run the Example Agent

Start with the example agent to verify your setup:

  • UI path: In Commands ▸ Run Benchmark, select the cvdp_example benchmark and the example agent, configure options (background execution, task filters, execution mode), then click Run Benchmark. Monitor progress in the Process Monitor tab. Afterwards, explore results in the Plots tab (double-click cells for drill-down) or the Search tab for database queries.

  • CLI alternative: Configure and execute the dataset runner:

    cp src/run_example.yaml src/run.yaml        # configure once
    python src/run_dataset.py

Run the OpenCode Agent

Once you've verified the example agent works, try the OpenCode agent:

  1. First, follow the setup instructions in agents/opencode-agent/README.md to configure your API key.
  2. Then run benchmarks using the opencode agent in the UI or CLI.

Run LLM with Force-Copilot

To run benchmarks using direct LLM calls (no Docker agent):

  1. Set your OpenAI API key in benchmark_runners/cvdp_benchmark/.env:

    OPENAI_USER_KEY=sk-your-openai-api-key-here
  2. In Commands ▸ Run Benchmark, select the llm_gpt-5-mini agent (pre-configured).

  3. Set Execution Mode to LLM Mode (force-copilot).

  4. Run the benchmark.

To use other LLMs, add the model to the model factory in the benchmark runner. When you add a new agent, set its cvdp_llm_name to match the model's name in the CVDP model factory.

View Results

After a successful run:

  1. Check results/tmp/ for raw run output and logs.
  2. Go to Commands ▸ Build Database to load your results into the SQLite database.
  3. View results in the Plots tab (heatmaps, drill-down) or Search tab (database queries).

Project Structure

poRTLe/
├── agents/                    # AI agent implementations (not tracked)
├── benchmark_datasets/        # Benchmark task datasets (not tracked)
├── benchmark_runners/         # Benchmark execution scripts (not tracked)
├── results/                   # Generated results and database (not tracked)
│   ├── json/                  # JSON results by benchmark/dataset/run
│   ├── tmp/                   # Temporary run data
│   └── poRTLe.db              # SQLite database
├── src/
│   ├── benchmarks/           # Benchmark adapter system (NEW!)
│   │   ├── base.py          # Abstract adapter interface
│   │   ├── registry.py      # Adapter registry
│   │   └── cvdp_adapter.py  # CVDP implementation
│   ├── ui/                   # Streamlit UI components
│   ├── build_benchmark_json.py    # Build benchmark JSON from datasets
│   ├── build_datatable.py         # Build SQLite database from results
│   ├── run_dataset.py             # Run agents against benchmarks
│   ├── convert_run_results.py     # Convert raw results to JSON
│   └── run_example.yaml           # Example run configuration
└── README.md

Architecture: Benchmark Adapters

poRTLe uses a pluggable adapter system to support multiple benchmarks. Each benchmark has its own adapter that handles:

  • Dataset parsing (JSONL → JSON)
  • Task execution (Docker, local, etc.)
  • Result parsing (logs, metrics, token counts)
  • Metadata extraction (benchmark-specific fields)

Current Adapters:

  • CVDP – src/benchmarks/cvdp_adapter.py

Adding a new benchmark: See src/benchmarks/base.py for the adapter interface.
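
As a rough illustration of what an adapter involves, here is a minimal, self-contained sketch. The method names, signatures, and result format below are assumptions for illustration only; the authoritative interface is defined in src/benchmarks/base.py.

# Minimal sketch only: a stand-in for the abstract interface in src/benchmarks/base.py.
# Method names, signatures, and the result format are illustrative assumptions.
import json
from abc import ABC, abstractmethod

class BenchmarkAdapter(ABC):
    @abstractmethod
    def parse_dataset(self, dataset_path: str) -> list[dict]:
        """Convert a JSONL dataset into structured task records."""

    @abstractmethod
    def run_task(self, task: dict, agent_id: str) -> dict:
        """Execute a single task (e.g., in Docker) and return raw output."""

    @abstractmethod
    def parse_results(self, raw_output: dict) -> dict:
        """Extract score, runtime, token counts, and error info."""

class MyBenchAdapter(BenchmarkAdapter):
    def parse_dataset(self, dataset_path):
        with open(dataset_path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def run_task(self, task, agent_id):
        return {"log": f"{agent_id} attempted task {task.get('id', '?')}"}

    def parse_results(self, raw_output):
        return {"score": 0.0, "log": raw_output["log"]}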

Data

Benchmarks (CVDP, TuRTLe, etc.)
  ↓ contains
Datasets (Commercial/Non-Commercial, Agentic/Non-Agentic)
  ↓ contains
Tasks (Individual problems)
  ↓ tested by
Agents (AI models/systems)
  ↓ produces
Runs (Benchmark execution instances)
  ↓ generates
Datapoints (Results: score, time, tokens, errors)
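
Once results are loaded into results/poRTLe.db (see Build Database below), you can explore this hierarchy directly. The snippet below only lists the tables that exist and their row counts; it assumes nothing about poRTLe's schema beyond the database path documented in this README.

import sqlite3

# Inspect the SQLite database produced by Build Database.
# Only the path (results/poRTLe.db) is taken from this README; check the
# actual table and column names before writing your own queries.
with sqlite3.connect("results/poRTLe.db") as conn:
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    for table in tables:
        count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        print(f"{table}: {count} rows")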

Key Commands

All commands can be executed via the UI (Commands tab) or CLI. The UI provides additional features like metadata editing, task filtering, and background execution.

Build Benchmark JSON

python src/build_benchmark_json.py <benchmark_name>

UI: Commands ▸ Build Benchmark tab. Add optional metadata (keys, notes, custom fields).

Converts JSONL datasets in benchmark_datasets/<benchmark_name>/ to structured JSON.
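
For reference, JSONL stores one JSON object per line, so conceptually the conversion just collects those records into a single structured document. The sketch below is a hand-rolled illustration of that idea, not what build_benchmark_json.py literally does, and the file names are placeholders.

import json
from pathlib import Path

# Illustrative only: the real conversion and output layout are handled by
# src/build_benchmark_json.py; the file names below are placeholders.
records = [json.loads(line)
           for line in Path("benchmark_datasets/cvdp_example/dataset.jsonl").read_text().splitlines()
           if line.strip()]
Path("dataset.json").write_text(json.dumps(records, indent=2))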

Build Database

python src/build_datatable.py

UI: Commands ▸ Build Database tab.

Builds SQLite database from all JSON results in results/json/.

Add Agent

python src/add_agent.py --agent-id <id> --about <description> --backend-model <model> --agent-folder <path>

UI: Commands ▸ Add Agent tab. Add optional metadata and custom configuration.

Registers a new agent in the agents registry.

Run Benchmark

python src/run_dataset.py

UI: Commands ▸ Run Benchmark tab. Configure:

  • Task selection (manual or advanced filters)
  • Execution mode (Agent or LLM/force-copilot)
  • Background execution (persistent processes)
  • Metadata (keys, notes, custom fields)

Runs agents against benchmarks using configuration in src/run.yaml.

Create DIY Task

python src/create_diy_task.py --task-dir <path> --prompt <description> --benchmark-id <benchmark> --dataset-id <dataset>

UI: Commands ▸ Create DIY Task tab. Load from input.jsonl or configure manually.

Creates a custom task from your own RTL code and test infrastructure.

Add Existing Run

python src/convert_run_results.py --run-dir <path> --benchmark-id <benchmark> --dataset-id <dataset> --agent-id <agent> --hardware-info <hardware>

UI: Commands ▸ Add Existing Run tab.

Converts an existing run directory into a finalized JSON entry.

Monitor Processes

UI: Process Monitor tab.

View status, logs, and control background benchmark executions.

Launch UI

streamlit run src/ui/portle_ui.py

Interactive web interface for browsing, analyzing, and managing results.

UI Features

The Streamlit UI provides several powerful features for managing benchmarks:

Pages

  • 🔍 Search - Find and view any entry (agents, benchmarks, datasets, tasks, runs, datapoints)
  • 📊 Plots - Interactive heatmaps and visualizations with drill-down capabilities
  • ⚙️ Commands - Execute all Python scripts from the UI with rich configuration options
  • 🔄 Process Monitor - View status and logs for background benchmark executions
  • ℹ️ About - System overview and database statistics

Metadata System

Add rich metadata to any entity (agents, benchmarks, runs, tasks):

  • Keys - Categorize entries with tags (e.g., production, experimental, verified)
  • Notes - Add timestamped notes with author attribution
  • Custom Fields - Store arbitrary key-value pairs for additional metadata

Background Execution

Run benchmarks as persistent background processes:

  • Benchmarks continue running even if you close the UI
  • Monitor multiple benchmarks simultaneously
  • View live logs and status updates
  • Automatic cleanup of orphaned processes

Advanced Task Filtering

Filter tasks by metadata for targeted benchmark runs:

  • Filter by difficulty, categories, or custom fields
  • Combine multiple filters with AND/OR logic
  • Save filter configurations for reuse
  • Quick selection for manual task picking

LLM Mode (force-copilot)

For CVDP benchmarks, run in LLM-only mode without Docker agents:

  • Configure cvdp_llm_name in agent custom config
  • Run benchmarks with direct LLM integration
  • No Docker agent required
  • Useful for testing and rapid iteration

DIY Task Creation

Create custom tasks from your own RTL code:

  • Organize code in rtl/, docs/, verif/, src/ directories
  • Load configuration from input.jsonl templates
  • Automatically converts to JSONL format
  • Integrates seamlessly with existing benchmarks
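
A DIY task directory might look roughly like the tree below. The directory names come from this section; the per-directory descriptions are assumptions for illustration.

my_diy_task/
├── rtl/           # RTL sources under test
├── docs/          # Specs and documentation provided to the agent
├── verif/         # Testbenches and verification collateral
├── src/           # Supporting harness code
└── input.jsonl    # Optional configuration template loaded by the UI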

Configuration

Edit src/run.yaml (copy from src/run_example.yaml):

benchmark_id: CVDP
dataset_id: CVDP__dataset_name
agent_id: my-agent
hardware_info: local-machine
n: 3                    # Number of runs per task
threads: 1              # Parallel execution threads
adapter: cvdp           # Optional: specify adapter (defaults to benchmark_id.lower())
task_ids:               # Optional: specific tasks (omit for all)
  - CVDP__dataset__task_id_001
metadata:               # Optional: add metadata to the run
  keys:
    - experiment-1
    - production
  notes:
    - date_added: "01-15-25"
      author: "User"
      text: "Testing new configuration"
  custom:
    force_copilot: true  # For LLM mode
    difficulty: hard

Git Ignore

The following are not tracked by git (see .gitignore):

  • agents/* - Clone agents from external repositories
  • benchmark_datasets/* - Benchmark task datasets (except cvdp_example/ and the README)
  • benchmark_runners/* - Clone runners from external repositories
  • results/* - Generated locally from runs
  • src/run.yaml - Per-user configuration file
  • *.db - SQLite databases

See each directory's README for compatible repositories and commit hashes.

Contributing

To add support for a new benchmark:

  1. Implement a BenchmarkAdapter in src/benchmarks/<benchmark>_adapter.py
  2. Register it in src/benchmarks/__init__.py
  3. Clone the benchmark's dataset into benchmark_datasets/
  4. Clone the benchmark's runner into benchmark_runners/
  5. Document repositories and commits in respective READMEs

License

MIT License

