Add initial Inspect AI evaluation harness #87
Draft
thomas-bartlett wants to merge 1 commit into
bact
reviewed
Jun 2, 2026
    "codeguard-1-crypto-algorithms",
    "codeguard-1-hardcoded-credentials",
)
DISCUSSION_URL = "https://github.com/cosai-oasis/project-codeguard/discussions/70"
Suggested change
DISCUSSION_URL = "https://github.com/cosai-oasis/project-codeguard/discussions/70"
We probably don't need this, but we can keep it for the purpose of being a seed.
bact
reviewed
Jun 2, 2026
    "execution": "none",
    "guidance_delivery": "prompt_block",
    "codeguard_rule_ids": list(CODEGUARD_RULE_IDS),
    "discussion_url": DISCUSSION_URL,
Suggested change
    "discussion_url": DISCUSSION_URL,
We probably don't need this, but we can keep it for the purpose of being a seed.
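The `"guidance_delivery": "prompt_block"` value in the metadata suggests that CodeGuard rules reach the model as a block of text placed in the prompt. A minimal sketch of that idea follows. The function name `build_prompt`, the `RULE_TEXT` mapping, and the placeholder rule text are assumptions made for illustration and are not taken from this PR; only the two rule IDs are quoted from the diff.

```python
from typing import Mapping, Optional

# Placeholder rule summaries (assumption): the real rule text lives in the
# CodeGuard repository and is not reproduced here.
RULE_TEXT = {
    "codeguard-1-crypto-algorithms": "<rule text goes here>",
    "codeguard-1-hardcoded-credentials": "<rule text goes here>",
}


def build_prompt(task_prompt: str,
                 rule_text: Optional[Mapping[str, str]] = None) -> str:
    """Return the baseline prompt, or the same prompt preceded by a guidance block."""
    if not rule_text:
        # Baseline condition: the model sees only the task.
        return task_prompt
    lines = ["Follow these CodeGuard rules:"]
    for rule_id, text in rule_text.items():
        lines.append(f"- {rule_id}: {text}")
    # Guided condition: guidance block, blank line, then the unchanged task.
    return "\n".join(lines) + "\n\n" + task_prompt


baseline = build_prompt("Write a function that stores a user password.")
guided = build_prompt("Write a function that stores a user password.", RULE_TEXT)
```

Keeping the task text identical in both conditions means any difference in the scored output can be attributed to the guidance block rather than to a reworded task.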
Thank you. Looks solid for the job and we can build further on this. Btw, to have this eval registered on Inspect Evals at https://ukgovernmentbeis.github.io/inspect_evals/ we need a small YAML file in the Inspect Eval Register. Something like this:

title: "CodeGuard Password Storage Eval"
description: |
  Evaluate password storage security practice
contributors:
  - thomas-bartlett
tags:  # optional; the upstream repo name is added automatically
  - CodeGuard
  - Coding
  - Password
  - Security
tasks:
  - name: password_storage  # must match the @task function name
    task_path: evaluations/codeguard-inspect/codeguard_harness/tasks/password_storage.py
source:
  repository_url: https://github.com/cosai-oasis/project-codeguard
  repository_commit: <40-char SHA>  # grab from the commit page on GitHub; tags and branches aren't accepted
  comment: "Project CodeGuard is an open-source, model-agnostic security framework that embeds secure-by-default practices into AI coding agent workflows."

(use example from https://github.com/UKGovernmentBEIS/inspect_evals/blob/main/register/example_eval.yaml )
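Because `repository_commit` must be a full 40-character SHA and not a tag or branch name, a quick local check before submitting the YAML can catch mistakes early. The sketch below is an illustration only: `is_full_commit_sha` is a hypothetical helper, it is not the register's own validation, and it assumes lowercase hexadecimal SHA-1 commit IDs as shown on GitHub commit pages.

```python
import re

# A full Git SHA-1 commit ID is exactly 40 hexadecimal characters.
# Assumption: lowercase hex, as displayed on GitHub commit pages.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_full_commit_sha(value: str) -> bool:
    """Return True only for a 40-character lowercase hex string."""
    return _FULL_SHA.fullmatch(value) is not None


# Branch names, tags, and abbreviated SHAs are all rejected.
candidates = ["main", "v1.0.0", "a1b2c3d", "0123456789abcdef0123456789abcdef01234567"]
accepted = [c for c in candidates if is_full_commit_sha(c)]
```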
Summary
Adds an initial Inspect AI evaluation harness under evaluations/codeguard-inspect/.
This first version uses password storage as the initial secure-coding task and compares baseline model output with CodeGuard-guided output. It establishes the project structure, task wiring, deterministic scoring pattern, and local validation flow for future CodeGuard evaluation work.
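To make the "deterministic scoring pattern" concrete, here is a minimal sketch of one way such a check could work for password storage: the same model output always receives the same verdict because the score depends only on pattern matches. The function name, the pattern lists, and the pass criterion are assumptions for illustration; they are not the scorer shipped in this PR, and the sketch deliberately avoids the Inspect AI API.

```python
import re

# Illustrative pattern lists (assumption): digests unsuitable for password
# storage, and widely used password hashing / key-derivation functions.
WEAK_PATTERNS = (r"\bmd5\b", r"\bsha1\b", r"\bsha-1\b")
STRONG_PATTERNS = (r"argon2", r"bcrypt", r"scrypt", r"pbkdf2")


def score_password_storage(code: str) -> bool:
    """Pass only if the code mentions a strong KDF and no weak digest."""
    text = code.lower()
    uses_weak = any(re.search(p, text) for p in WEAK_PATTERNS)
    uses_strong = any(re.search(p, text) for p in STRONG_PATTERNS)
    return uses_strong and not uses_weak


# Hypothetical model outputs for the two conditions being compared.
baseline_output = "import hashlib\nh = hashlib.md5(pw.encode()).hexdigest()"
guided_output = "import bcrypt\nh = bcrypt.hashpw(pw, bcrypt.gensalt())"

baseline_pass = score_password_storage(baseline_output)
guided_pass = score_password_storage(guided_output)
```

Pattern matching of this kind is crude: it can be fooled by comments or by a strong function configured with weak parameters, which is one reason execution-based checks are a natural later addition.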
What This Adds
uv subproject for optional Inspect AI evaluations
Notes
This PR is intended as a seed harness, not a full benchmark or effectiveness claim. Future updates can add additional tasks, scoring approaches, CodeGuard delivery modes, agent-based runs, benchmark adapters, or execution-based checks where appropriate.
Validation
uv run pytest
uv run python -m compileall codeguard_harness
uv run inspect list tasks codeguard_harness/tasks/password_storage.py
Related