feat(metrics): add AdversarialRobustnessMetric (RoMA-based) (#2150) by xr843 · Pull Request #2812 · confident-ai/deepeval

xr843 · 2026-06-27T04:08:18Z

What

Closes #2150. Adds AdversarialRobustnessMetric — a black-box metric measuring how robustly a system's output survives meaning-preserving perturbations of the input.

Approach (RoMA-inspired, arXiv:2504.17723)

Per test case:

1. A judge LLM generates meaning-preserving perturbations of the input (semantic synonym/rephrase swaps plus orthographic typos).
2. The system under test is probed on each perturbation via a model_callback.
3. The judge LLM grades whether each perturbed output stays semantically consistent with the reference actual_output.
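The three steps above can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the PR's implementation: the `Judge` protocol, its `generate_perturbations`/`is_consistent` methods, `probe_robustness`, and `ToyJudge` are all hypothetical names. `ToyJudge` uses case-insensitive string equality only so the example runs without an LLM; that is exactly the brittle matching this PR moves away from, so treat it as a stand-in.

```python
from typing import Callable, List, Protocol


class Judge(Protocol):
    """Hypothetical judge interface (not deepeval's API)."""

    def generate_perturbations(self, text: str, n: int) -> List[str]: ...

    def is_consistent(self, reference: str, candidate: str) -> bool: ...


def probe_robustness(
    input_text: str,
    reference_output: str,
    model_callback: Callable[[str], str],
    judge: Judge,
    n_perturbations: int = 5,
) -> List[bool]:
    """Return one consistency verdict per perturbed input (hypothetical helper)."""
    # Step 1: the judge proposes meaning-preserving rewrites of the input.
    perturbations = judge.generate_perturbations(input_text, n_perturbations)
    verdicts: List[bool] = []
    for perturbed in perturbations:
        # Step 2: probe the system under test on the perturbed input.
        perturbed_output = model_callback(perturbed)
        # Step 3: the judge compares the new output with the reference actual_output.
        verdicts.append(judge.is_consistent(reference_output, perturbed_output))
    return verdicts


class ToyJudge:
    """Deterministic stand-in for an LLM judge, for illustration only."""

    def generate_perturbations(self, text: str, n: int) -> List[str]:
        # Casing changes and a trailing "?" as crude orthographic noise.
        return [text.lower(), text.upper(), text + "?"][:n]

    def is_consistent(self, reference: str, candidate: str) -> bool:
        # Case-insensitive equality; a real judge would grade semantics instead.
        return reference.strip().lower() == candidate.strip().lower()


# A case-sensitive system breaks on two of the three perturbations.
brittle = lambda q: "Paris" if "France" in q else "I don't know"
print(probe_robustness("Capital of France", "Paris", brittle, ToyJudge(), 3))
# prints [False, False, True]
```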

Score = fraction of perturbations on which the output stays consistent (1.0 = perfectly robust). Higher is better; a test case succeeds when score >= threshold.
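As a worked example of the scoring rule: with four perturbations of which three stay consistent, score = 3/4 = 0.75, which passes at threshold 0.7 but fails at 0.8. A minimal sketch of that arithmetic follows; `robustness_score` is a hypothetical helper, and raising on an empty verdict list is an assumed design choice, not something this PR specifies.

```python
from typing import Sequence, Tuple


def robustness_score(verdicts: Sequence[bool], threshold: float) -> Tuple[float, bool]:
    """Return (score, success) where score is the fraction of consistent verdicts."""
    if not verdicts:
        # Assumption: no perturbations means nothing was measured, so refuse to score.
        raise ValueError("at least one perturbation verdict is required")
    # True counts as 1 and False as 0, so the sum is the number of consistent outputs.
    score = sum(verdicts) / len(verdicts)
    # Higher is better: a case passes when the score reaches the threshold.
    return score, score >= threshold


print(robustness_score([True, True, False, True], 0.7))  # prints (0.75, True)
print(robustness_score([True, True, False, True], 0.8))  # prints (0.75, False)
```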

Conventions

Mirrors BiasMetric exactly: sync/async measure/a_measure, generate_with_schema_and_extract, the compiled template bundle, strict_mode/verbose_mode/include_reason, cost/token accrual, and is_successful. For the template bundle, both templates.json bundles were regenerated and the new template method was added to the typing Literal so the sync-guard test passes. The metric is exported publicly from deepeval.metrics. Tests are fully mocked (no real API calls), and the code is ruff- and black-clean.
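The sync/async split and the strict_mode convention can be illustrated with a small sketch. Everything here is hypothetical (`a_probe_robustness`, `apply_strict_mode`, and the async callback/judge signatures). It shows one way an `a_measure` path could probe all perturbations concurrently with `asyncio.gather`, and it assumes strict_mode behaves as in deepeval's other higher-is-better metrics (threshold forced to 1, any imperfect score reported as 0); check that against the PR's actual code before relying on it.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def a_probe_robustness(
    perturbations: List[str],
    reference_output: str,
    a_model_callback: Callable[[str], Awaitable[str]],
    a_is_consistent: Callable[[str, str], Awaitable[bool]],
) -> List[bool]:
    """Probe every perturbation concurrently; verdicts keep the input order."""

    async def probe_one(perturbed: str) -> bool:
        # Each perturbation: call the system under test, then grade its output.
        output = await a_model_callback(perturbed)
        return await a_is_consistent(reference_output, output)

    # asyncio.gather returns results in the order the awaitables were passed.
    return list(await asyncio.gather(*(probe_one(p) for p in perturbations)))


def apply_strict_mode(
    score: float, threshold: float, strict_mode: bool
) -> Tuple[float, float]:
    """Return the (score, threshold) pair to report (hypothetical helper)."""
    if not strict_mode:
        return score, threshold
    # Assumed strict semantics: only a perfect score passes, anything else is 0.
    return (score if score >= 1.0 else 0.0), 1.0
```

In a sync `measure` path the same two awaits would simply become ordinary calls inside a loop.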

Supersedes #2181

The prior PR (stale for 8 months) pulled in heavyweight optional dependencies (gensim plus a ~1.6GB Word2Vec download at runtime, nltk, numpy), measured robustness by brittle exact string matching, and predates the repo's current compiled-template system. This implementation drops all extra dependencies and uses LLM-generated perturbations plus LLM-graded semantic consistency, built on the current template system.

One design point — feedback welcome

A faithful robustness measure must actually run the model on perturbed inputs, so this metric takes a model_callback to probe the system under test. No existing deepeval metric does this today; it's closer to the deepteam/red-team pattern, though #2181 already established the "metric needs the target model" shape. If maintainers would prefer this to live in deepteam, or prefer different ergonomics (e.g. accepting a DeepEvalBaseLLM target instead of a raw callable), I'm happy to adjust.
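To make the ergonomics question concrete, here is a sketch of accepting either form: a raw callable, or a model object with a `generate(prompt)` method (the method name DeepEvalBaseLLM subclasses implement). `TextTarget`, `ModelCallback`, and `as_model_callback` are hypothetical names, and the sketch assumes `generate` returns a plain string; verify that against the target class actually used.

```python
from typing import Callable, Protocol, Union, runtime_checkable


@runtime_checkable
class TextTarget(Protocol):
    """Hypothetical: any object exposing generate(prompt) -> str."""

    def generate(self, prompt: str) -> str: ...


ModelCallback = Callable[[str], str]


def as_model_callback(target: Union[ModelCallback, TextTarget]) -> ModelCallback:
    """Normalize a model object or a raw callable into a model_callback."""
    # Check for generate() first: a model object might also define __call__.
    if isinstance(target, TextTarget):
        model = target

        def model_callback(prompt: str) -> str:
            return model.generate(prompt)

        return model_callback
    if callable(target):
        return target
    raise TypeError("target must be callable or expose generate(prompt)")
```

Accepting both keeps arbitrary pipelines usable through a plain callable while sparing users a wrapper when they already have a model object.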

Add a black-box metric that measures how robust an LLM is to adversarial, meaning-preserving perturbations of its input, addressing confident-ai#2150.

Inspired by the RoMA framework (arXiv:2504.17723), for each test case the metric:

1. uses the evaluation model to generate meaning-preserving adversarial perturbations of the input (semantic synonym/rephrasing swaps and orthographic typo-style character noise);
2. probes the system under test via a `model_callback` on every perturbation;
3. uses the evaluation model to judge whether each perturbed response stays semantically consistent with the reference `actual_output`.

The score is the fraction of perturbations the system stayed consistent on (1.0 = perfectly robust); higher is better, so a case passes when `score >= threshold`.

Follows the existing BaseMetric conventions: sync/async `measure`/`a_measure`, schema-based generation, the compiled prompt-template bundle, strict/verbose modes, and cost/token accrual.

Unlike the earlier draft (confident-ai#2181), this pulls in no heavyweight runtime dependencies (no gensim/nltk/numpy and no large Word2Vec download) and judges robustness by LLM-graded semantic consistency rather than brittle exact string matching.

Adds the metric to the public `deepeval.metrics` exports, the compiled metric-template bundles (Python + TypeScript), and a fully mocked test suite (no real API calls).

Signed-off-by: xr843 <xianren843@protonmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel · 2026-06-27T04:08:22Z

Someone is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.

xr843 · 2026-06-28T01:47:37Z

CI note: the three red checks here are pre-existing, repo-wide gate failures unrelated to this PR — they're currently red on essentially every open PR (and reproduce on a clean main):

Lint (black --check .): the action pins psf/black@stable, which has drifted from the formatting on main; ~47 existing .py files would be reformatted. None of them are touched by this PR, and every .py file added here passes black 25.12 → 26.5.
TypeScript Lint (prettier --check "src/**/*.ts" "test/**/*.ts"): ~66 existing src/**/*.ts files would be reformatted. This PR adds no .ts files — only the auto-generated typescript/src/templates/metrics/templates.json, which the prettier --check globs don't cover.
TypeScript Tests (jest): test/test-core/evaluate.test.ts fails to compile (TS2307: Cannot find module '../../src/confident/evaluate') and the suite needs CONFIDENT_API_KEY / OPENAI_API_KEY, which fork PRs don't receive.

This PR's own suites are all green: Metric Templates (Py + TS bundle sync), Core Tests, Metrics Tests, Confident Tests, and all Integration Tests.

I'm happy to open a separate chore: reformat with black/prettier PR to get the formatting gates green repo-wide if that'd be helpful. 🙏

xr843 wants to merge 1 commit into confident-ai:main from xr843:feat/adversarial-robustness-metric