Commit 840a966

Merge pull request trustyai-explainability#139 from trustyai-explainability/bug-bash-improvements
Improve AI Bug Automation Readiness
2 parents c204d82 + c01ca3e commit 840a966


54 files changed: +3011, -2327 lines

.github/CODEOWNERS

Lines changed: 2 additions & 0 deletions
```
# Default owners for the entire repository
* @trustyai-explainability/developers
```
Lines changed: 88 additions & 0 deletions
```yaml
name: Bug Report
description: Report a bug in llama-stack-provider-trustyai-garak
labels: ["bug"]
body:
  - type: markdown
    attributes:
      value: |
        Thank you for reporting a bug. Please fill out the sections below
        to help us reproduce and fix the issue.

  - type: textarea
    id: description
    attributes:
      label: Bug Description
      description: A clear and concise description of the bug.
    validations:
      required: true

  - type: textarea
    id: reproduction
    attributes:
      label: Steps to Reproduce
      description: Minimal steps to reproduce the behavior.
      placeholder: |
        1. Register benchmark with config...
        2. Run eval with...
        3. Observe error...
    validations:
      required: true

  - type: textarea
    id: expected
    attributes:
      label: Expected Behavior
      description: What you expected to happen.
    validations:
      required: true

  - type: textarea
    id: actual
    attributes:
      label: Actual Behavior
      description: What actually happened, including any error messages.
    validations:
      required: true

  - type: textarea
    id: logs
    attributes:
      label: Error Logs
      description: Paste relevant logs or stack traces.
      render: text

  - type: dropdown
    id: execution-mode
    attributes:
      label: Execution Mode
      options:
        - Llama Stack Inline (local garak)
        - Llama Stack Remote (KFP pipelines)
        - Llama Stack (all modes)
        - Eval-Hub Simple (direct pod execution)
        - Eval-Hub KFP (KFP pipeline execution)
        - Eval-Hub (all modes)
    validations:
      required: true

  - type: textarea
    id: environment
    attributes:
      label: Environment
      description: Provide environment details.
      placeholder: |
        - Provider version:
        - Python version:
        - Garak version:
        - Llama Stack version:
        - OS / Platform:
        - Kubernetes version (if remote):
    validations:
      required: true

  - type: textarea
    id: config
    attributes:
      label: Benchmark / Garak Config
      description: Paste relevant benchmark config or garak_config if applicable.
      render: yaml
```

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 21 additions & 0 deletions
```markdown
## Summary

<!-- Brief description of what this PR does and why. -->

## Changes

<!-- List the key changes. -->

-

## Testing Checklist

- [ ] Unit tests pass (`make test`)
- [ ] Linting passes (`make lint`)
- [ ] New/changed code has test coverage
- [ ] No breaking changes to existing benchmark configs
- [ ] Documentation updated (if applicable)

## Related Issues

<!-- Link any related issues: Fixes #123, Relates to #456 -->
```

.github/workflows/lint.yml

Lines changed: 33 additions & 0 deletions
```yaml
name: Lint

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  ruff:
    name: Ruff Lint & Format Check
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.12'

      - name: Install tools
        run: pip install ruff mypy

      - name: Ruff check
        run: ruff check src/ tests/

      - name: Ruff format check
        run: ruff format --check src/ tests/

      - name: Mypy type check
        run: mypy src/
```

.pre-commit-config.yaml

Lines changed: 7 additions & 0 deletions
```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.11.4
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix]
      - id: ruff-format
```

AGENTS.md

Lines changed: 113 additions & 0 deletions
# AI Agent Context

## What This Repo Is

This repo contains **two independent integrations** for running
[Garak](https://github.com/NVIDIA/garak) LLM red-teaming scans. They share
core logic but serve different orchestration surfaces:

1. **Llama Stack Provider** — an out-of-tree eval provider for the
   [Llama Stack](https://llamastack.github.io/) framework. Exposes garak
   through the Llama Stack `benchmarks.register` / `eval.run_eval` API.

2. **Eval-Hub Adapter** — a `FrameworkAdapter` for the eval-hub SDK.
   Completely independent of Llama Stack. Used by the RHOAI evaluation
   platform to orchestrate garak scans via K8s jobs.
## Four Execution Modes

```
     Llama Stack               Eval-Hub
  (Llama Stack API)         (eval-hub SDK)
┌────────┬────────┐      ┌────────┬────────┐
│ Inline │ Remote │      │ Simple │  KFP   │
│        │  KFP   │      │ (pod)  │ (pod + │
│        │        │      │        │  KFP)  │
└────────┴────────┘      └────────┴────────┘
  local    KFP             in-pod   K8s job
  garak    pipelines       garak    submits to
                                    KFP, polls
```

| Mode | Code Location | How Garak Runs | Intents Support |
|------|---------------|----------------|-----------------|
| **Llama Stack Inline** | `inline/` | Locally in the Llama Stack server process | No |
| **Llama Stack Remote KFP** | `remote/` | As KFP pipeline steps on Kubernetes | **Yes** |
| **Eval-Hub Simple** | `evalhub/` (simple mode) | Directly in the eval-hub K8s job pod | No |
| **Eval-Hub KFP** | `evalhub/` (KFP mode) | K8s job submits to KFP, polls status, pulls artifacts via S3 | **Yes** |

**Intents** is a key upcoming feature — it uses SDG (synthetic data generation),
TAPIntent probes, and MulticlassJudge detectors to test model behavior against
policy taxonomies. Only the two KFP-based modes support it because it requires
the six-step pipeline (`core/pipeline_steps.py`) running as KFP components.
## Code Layout

```
src/llama_stack_provider_trustyai_garak/
├── core/                     # Shared logic used by ALL modes
│   ├── config_resolution.py  # Deep-merge user overrides onto benchmark profiles
│   ├── command_builder.py    # Build garak CLI args for OpenAI-compatible endpoints
│   ├── garak_runner.py       # Subprocess runner for garak CLI
│   └── pipeline_steps.py     # Six-step pipeline (validate→taxonomy→SDG→prompts→scan→parse)
│
├── inline/                   # Llama Stack Inline mode
│   ├── garak_eval.py         # Async adapter wrapping garak subprocess
│   └── provider.py           # Provider spec with pip dependencies
│
├── remote/                   # Llama Stack Remote KFP mode
│   ├── garak_remote_eval.py  # Async adapter managing KFP job lifecycle
│   └── kfp_utils/            # KFP pipeline DAG and @dsl.component steps
│
├── evalhub/                  # Eval-Hub integration (NO Llama Stack dependency)
│   ├── garak_adapter.py      # FrameworkAdapter: benchmark resolution, intents overlay, callbacks
│   ├── kfp_adapter.py        # KFP-specific adapter (forces KFP execution mode)
│   ├── kfp_pipeline.py       # Eval-hub KFP pipeline with S3 artifact flow
│   └── s3_utils.py           # S3/Data Connection client
│
├── base_eval.py              # Shared Llama Stack eval lifecycle (NOT used by eval-hub)
├── garak_command_config.py   # Pydantic models for garak YAML config
├── intents.py                # Policy taxonomy dataset loading (SDG/intents flows)
├── sdg.py                    # Synthetic data generation via sdg-hub
├── result_utils.py           # Parse garak outputs, TBSA scoring, HTML reports
└── resources/                # Jinja2 templates and Vega chart specs
```
## Key Conventions

- **Config merging**: User overrides are deep-merged onto benchmark profiles via
  `deep_merge_dicts` in `core/config_resolution.py`. Only leaf values are replaced.
- **Intents model overlay**: When `intents_models` is provided, model endpoints
  are applied using the `x.get("key") or default` pattern — it fills empty slots but
  preserves user-configured values. `api_key` is always forced to `__FROM_ENV__`
  (K8s Secret injection).
- **Benchmark profiles**: Predefined configs live in `base_eval.py` (Llama Stack)
  and `evalhub/garak_adapter.py` (eval-hub). The `intents` profile is the most
  complex — it includes TAPIntent, MulticlassJudge, and SDG configuration.
- **Provider specs**: `inline/provider.py` and `remote/provider.py` define Llama
  Stack provider specs. `pip_packages` is auto-populated from `get_garak_version()`.
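The merge and overlay conventions above can be sketched in Python. This is illustrative only: `deep_merge_dicts` does live in `core/config_resolution.py`, but its exact implementation may differ, and `overlay_model` is a hypothetical name for the intents-overlay step.

```python
def deep_merge_dicts(base: dict, override: dict) -> dict:
    """Recursively merge override onto base; only leaf values are replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge_dicts(merged[key], value)
        else:
            merged[key] = value
    return merged

def overlay_model(slot: dict, default_endpoint: str) -> dict:
    """Fill empty slots but preserve user-configured values;
    api_key is always forced to the K8s Secret placeholder."""
    return {
        # `x.get("key") or default`: empty string / None falls back to default
        "endpoint": slot.get("endpoint") or default_endpoint,
        "api_key": "__FROM_ENV__",
    }

profile = {"scan": {"probes": ["tap"], "timeout": 600}}
user = {"scan": {"timeout": 1200}}
print(deep_merge_dicts(profile, user))
# only the `timeout` leaf is replaced; `probes` is preserved
```

Note that because only leaf values are replaced, a user override never wipes out sibling keys it does not mention.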
## Build & Install

```bash
pip install -e .             # Core (Llama Stack remote mode)
pip install -e ".[inline]"   # With garak for local scans
pip install -e ".[dev]"      # Dev (tests + ruff + pre-commit)
```

## Running Tests

```bash
make test       # All tests (no cluster/GPU/network needed)
make coverage   # With coverage report
make lint       # ruff check
```

Tests are 100% unit tests. Garak is mocked — it does not need to be installed.

## Debugging

- `GARAK_SCAN_DIR` — controls where scan artifacts land
- `LOG_LEVEL=DEBUG` — verbose eval-hub adapter logging
- `scan.log` in the scan directory — garak subprocess output
- `__FROM_ENV__` in configs — placeholder for K8s Secret api_key injection
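As a sketch, the first two debugging knobs might be consumed like this (hypothetical helper names; the actual handling in the adapters may differ):

```python
import logging
import os
import tempfile

def resolve_scan_dir() -> str:
    # GARAK_SCAN_DIR controls where scan artifacts land; fall back to a temp dir.
    return os.environ.get("GARAK_SCAN_DIR") or os.path.join(
        tempfile.gettempdir(), "garak_scans"
    )

def resolve_log_level() -> int:
    # LOG_LEVEL=DEBUG enables verbose eval-hub adapter logging.
    name = os.environ.get("LOG_LEVEL", "INFO").upper()
    return getattr(logging, name, logging.INFO)

os.environ["GARAK_SCAN_DIR"] = "/tmp/my_scans"
os.environ["LOG_LEVEL"] = "DEBUG"
print(resolve_scan_dir())                    # /tmp/my_scans
print(resolve_log_level() == logging.DEBUG)  # True
```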
