Add initial Inspect AI evaluation harness #87

Draft
thomas-bartlett wants to merge 1 commit into main from feat/add-inspect-eval-harness

@thomas-bartlett
Contributor

Summary

Adds an initial Inspect AI evaluation harness under evaluations/codeguard-inspect/.

This first version uses password storage as the initial secure-coding task and compares baseline model output with CodeGuard-guided output. It establishes the project structure, task wiring, deterministic scoring pattern, and local validation flow for future CodeGuard evaluation work.
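The baseline and CodeGuard-guided variants can be pictured as one task prompt, with or without a guidance block in front of it. The following is a minimal sketch under that assumption only. `TASK_PROMPT`, `GUIDANCE_BLOCK`, and `build_prompt` are hypothetical names and placeholder texts, not the prompts or helpers used in this PR.

```python
from typing import Optional

# Hypothetical placeholder texts; the PR's actual task prompt and
# CodeGuard guidance are not reproduced here.
TASK_PROMPT = "Write a Python function that stores a user's password."
GUIDANCE_BLOCK = (
    "Security guidance:\n"
    "- Hash passwords with a slow, salted function such as Argon2id, bcrypt, or scrypt.\n"
    "- Never hard-code credentials."
)


def build_prompt(task: str, guidance: Optional[str] = None) -> str:
    """Return the bare task (baseline) or the task preceded by a guidance block."""
    if guidance is None:
        return task
    return f"{guidance}\n\n{task}"


baseline_prompt = build_prompt(TASK_PROMPT)
guided_prompt = build_prompt(TASK_PROMPT, GUIDANCE_BLOCK)
```

Keeping the task text identical in both variants means any difference in scored output can be attributed to the guidance block rather than to a reworded task.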

What This Adds

  • Isolated uv subproject for optional Inspect AI evaluations
  • One initial secure-coding task with baseline and CodeGuard-guided variants
  • Deterministic scorer for generated output
  • Focused pytest coverage for task wiring and scorer behavior
  • README instructions for installation, running, validation, and current limitations
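To illustrate what a deterministic scorer for generated output could look like, here is a minimal stdlib-only sketch. It is not the scorer in this PR: the marker lists and the `score_output` function are assumptions chosen for illustration, and a plain substring check is a deliberately crude stand-in for whatever rules the harness applies.

```python
# Hypothetical marker lists, chosen for illustration only.
WEAK_MARKERS = ("md5", "sha1")
STRONG_MARKERS = ("argon2", "bcrypt", "scrypt", "pbkdf2")


def score_output(output: str) -> dict:
    """Score generated code with fixed substring checks (no model calls, no execution).

    The output passes only if it mentions at least one strong password
    hashing marker and none of the weak ones.
    """
    text = output.lower()
    weak = [m for m in WEAK_MARKERS if m in text]
    strong = [m for m in STRONG_MARKERS if m in text]
    return {"passed": bool(strong) and not weak, "weak": weak, "strong": strong}


weak_result = score_output("import hashlib\nhashlib.md5(pw.encode()).hexdigest()")
strong_result = score_output("from argon2 import PasswordHasher\nPasswordHasher().hash(pw)")
```

Because the checks are pure string operations, the same output always receives the same score, which is what makes focused pytest coverage of scorer behavior straightforward.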

Notes

This PR is intended as a seed harness, not a full benchmark or effectiveness claim. Future updates can add additional tasks, scoring approaches, CodeGuard delivery modes, agent-based runs, benchmark adapters, or execution-based checks where appropriate.

Validation

  • uv run pytest
  • uv run python -m compileall codeguard_harness
  • uv run inspect list tasks codeguard_harness/tasks/password_storage.py

Related

@thomas-bartlett thomas-bartlett self-assigned this Jun 2, 2026
@thomas-bartlett thomas-bartlett added the enhancement New feature or request label Jun 2, 2026
"codeguard-1-crypto-algorithms",
"codeguard-1-hardcoded-credentials",
)
DISCUSSION_URL = "https://github.com/cosai-oasis/project-codeguard/discussions/70"

Suggested change
DISCUSSION_URL = "https://github.com/cosai-oasis/project-codeguard/discussions/70"

We probably don't need this, but we can keep it since this is a seed harness.

"execution": "none",
"guidance_delivery": "prompt_block",
"codeguard_rule_ids": list(CODEGUARD_RULE_IDS),
"discussion_url": DISCUSSION_URL,

Suggested change
"discussion_url": DISCUSSION_URL,

We probably don't need this, but we can keep it since this is a seed harness.

@bact

bact commented Jun 2, 2026

Thank you. Looks solid for the job and we can build further on this.

By the way, to have this eval registered on Inspect Evals at https://ukgovernmentbeis.github.io/inspect_evals/, we need to add a small YAML file to the Inspect Evals register. Something like this:

title: "CodeGuard Password Storage Eval"
description: |
  Evaluate password storage security practice
contributors:
  - thomas-bartlett
tags:                                      # optional; the upstream repo name is added automatically
  - CodeGuard
  - Coding
  - Password
  - Security
tasks:
  - name: password_storage                 # must match the @task function name
    task_path: evaluations/codeguard-inspect/codeguard_harness/tasks/password_storage.py
source:
  repository_url: https://github.com/cosai-oasis/project-codeguard
  repository_commit: <40-char SHA>         # grab from the commit page on GitHub; tags and branches aren't accepted
  comment: "Project CodeGuard is an open-source, model-agnostic security framework that embeds secure-by-default practices into AI coding agent workflows."

(Use the example at https://github.com/UKGovernmentBEIS/inspect_evals/blob/main/register/example_eval.yaml as a template.)
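Two constraints noted in the YAML comments above (the task name must match the `@task` function name, and `repository_commit` must be a full 40-character SHA) can be checked locally before submitting. The sketch below is not part of Inspect Evals or its register tooling. `check_register_entry` is a hypothetical helper, the entry is written as a plain dict mirroring the YAML fields, and the all-zero commit is a placeholder rather than a real SHA.

```python
import re

# A full Git commit SHA is 40 hexadecimal characters.
FULL_SHA = re.compile(r"[0-9a-fA-F]{40}")


def check_register_entry(entry: dict, task_function_names: set) -> list:
    """Return a list of problems found in a register entry (empty if none)."""
    problems = []
    commit = entry.get("source", {}).get("repository_commit", "")
    if not FULL_SHA.fullmatch(commit):
        problems.append("repository_commit must be a full 40-character commit SHA")
    for task in entry.get("tasks", []):
        name = task.get("name")
        if name not in task_function_names:
            problems.append(f"task name {name!r} does not match a known @task function")
    return problems


# Dict form of the YAML above; the commit value is a placeholder.
entry = {
    "tasks": [{"name": "password_storage"}],
    "source": {"repository_commit": "0" * 40},
}
problems = check_register_entry(entry, {"password_storage"})
branch_problems = check_register_entry(
    {"tasks": [{"name": "password_storage"}], "source": {"repository_commit": "main"}},
    {"password_storage"},
)
```

A branch name such as `main` fails the SHA check, which mirrors the note that tags and branches aren't accepted.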
