
feat(metrics): add AdversarialRobustnessMetric (RoMA-based) (#2150) #2812

Open
xr843 wants to merge 1 commit into confident-ai:main from xr843:feat/adversarial-robustness-metric


xr843 commented Jun 27, 2026


What

Closes #2150. Adds AdversarialRobustnessMetric — a black-box metric measuring how robustly a system's output survives meaning-preserving perturbations of the input.

Approach (RoMA-inspired, arXiv:2504.17723)

Per test case:

  1. A judge LLM generates meaning-preserving perturbations of input (semantic synonym/rephrase + orthographic typos).
  2. The system-under-test is probed on each perturbation via a model_callback.
  3. The judge LLM grades whether each perturbed output stays semantically consistent with the reference actual_output.

Score = fraction consistent (1.0 = perfectly robust); higher-is-better, success = score >= threshold.
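
The three steps and the score can be sketched in plain Python as follows. This is a minimal illustration, not the PR's implementation: `robustness_score`, `perturb`, and `is_consistent` are hypothetical names, and both judge roles are passed in as ordinary callables so the example runs without an LLM. The toy judge uses exact matching only as a stand-in for the LLM-graded consistency check. How the real metric handles an empty perturbation list is not stated here; the sketch simply raises an error.

```python
from typing import Callable, List


def robustness_score(
    original_input: str,
    reference_output: str,
    model_callback: Callable[[str], str],
    perturb: Callable[[str], List[str]],
    is_consistent: Callable[[str, str], bool],
) -> float:
    """Fraction of perturbed inputs whose output stays consistent with the reference."""
    perturbations = perturb(original_input)  # step 1: judge generates variants
    if not perturbations:
        raise ValueError("no perturbations were generated")
    consistent = 0
    for variant in perturbations:
        perturbed_output = model_callback(variant)  # step 2: probe the system
        if is_consistent(reference_output, perturbed_output):  # step 3: judge grades
            consistent += 1
    return consistent / len(perturbations)


# Toy stand-ins for the judge LLM and the system under test (hypothetical).
def toy_perturb(text: str) -> List[str]:
    return [
        text.lower(),
        text.replace("capital", "capitol"),
        "Which city is France's capital?",
    ]


def toy_model(prompt: str) -> str:
    return "Paris" if "capital" in prompt.lower() else "I don't know"


def toy_judge(reference: str, candidate: str) -> bool:
    return reference.strip().lower() == candidate.strip().lower()


score = robustness_score(
    "What is the capital of France?", "Paris", toy_model, toy_perturb, toy_judge
)
threshold = 0.5
passed = score >= threshold  # higher is better
```

In this toy run the misspelled variant trips the model, so two of the three perturbations stay consistent and the case passes at a threshold of 0.5.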

Conventions

Mirrors BiasMetric exactly: sync/async measure/a_measure, generate_with_schema_and_extract, compiled template bundle (regenerated both templates.json bundles + added the new template method to the typing Literal so the sync-guard test passes), strict_mode/verbose_mode/include_reason, cost/token accrual, is_successful. Public export added to deepeval.metrics. Fully-mocked tests (no real API calls); ruff + black clean.
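
For readers unfamiliar with these conventions, the sketch below shows the general shape of a metric with paired `measure`/`a_measure` methods, a strict mode, and `is_successful`, exercised with a mocked probe. It does not import deepeval and is not the PR's code: the class `ToyRobustnessMetric`, its constructor arguments, and the rule that strict mode raises the threshold to 1.0 are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class ToyRobustnessMetric:
    """Hypothetical stand-in showing the sync/async metric shape; not deepeval's API."""

    def __init__(
        self,
        probe: Callable[[str], Awaitable[bool]],
        threshold: float = 0.5,
        strict_mode: bool = False,
    ) -> None:
        # probe returns True if the output for a variant stayed consistent.
        self.probe = probe
        # Assumption: strict mode demands a perfect score (higher is better).
        self.threshold = 1.0 if strict_mode else threshold
        self.score: Optional[float] = None

    async def a_measure(self, perturbations: List[str]) -> float:
        # Probe all variants concurrently, then take the consistent fraction.
        verdicts = await asyncio.gather(*(self.probe(p) for p in perturbations))
        self.score = sum(verdicts) / len(verdicts)
        return self.score

    def measure(self, perturbations: List[str]) -> float:
        # Sync entry point delegates to the async one.
        return asyncio.run(self.a_measure(perturbations))

    def is_successful(self) -> bool:
        return self.score is not None and self.score >= self.threshold


async def fake_probe(variant: str) -> bool:
    # Mocked verdict: no model or judge is called.
    return "typo" not in variant


variants = [
    "rephrased question",
    "question with typo",
    "synonym swap",
    "reordered question",
]

lenient = ToyRobustnessMetric(fake_probe, threshold=0.7)
lenient_score = lenient.measure(variants)

strict = ToyRobustnessMetric(fake_probe, threshold=0.7, strict_mode=True)
strict.measure(variants)
```

With three of four variants consistent, the lenient metric passes at 0.7 while the strict one fails, since under the assumed rule it requires every variant to stay consistent.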

Supersedes #2181

The prior (8-month-stale) PR pulled heavyweight optional deps (gensim + a ~1.6GB Word2Vec download at runtime, nltk, numpy), measured robustness by brittle exact string match, and predates the repo's current compiled-template system. This implementation drops all extra deps and uses LLM-generated perturbations + LLM-graded semantic consistency aligned to the current template system.

One design point — feedback welcome

A faithful robustness measure must actually run the model on perturbed inputs, so this metric takes a model_callback to probe the system-under-test — no existing deepeval metric does this today (it's closer to the deepteam/red-team pattern, though #2181 already established the "metric needs the target model" shape). If maintainers would prefer this live in deepteam, or prefer different ergonomics (e.g. accepting a DeepEvalBaseLLM target instead of a raw callable), happy to adjust.
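
The two ergonomics mentioned above differ mainly in who wraps the target. As a rough sketch (all names here are hypothetical; `TargetModel` merely stands in for something like a `DeepEvalBaseLLM` and assumes a `generate(prompt) -> str` method), a model object can be adapted to the raw-callable shape with a small helper:

```python
from typing import Callable, Protocol


class TargetModel(Protocol):
    """Hypothetical interface: anything exposing generate(prompt) -> str."""

    def generate(self, prompt: str) -> str: ...


def as_callback(target: TargetModel) -> Callable[[str], str]:
    """Wrap a model object so it can be passed where a plain callable is expected."""

    def callback(prompt: str) -> str:
        return target.generate(prompt)

    return callback


class EchoModel:
    """Toy system under test used only for this example."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


callback = as_callback(EchoModel())
reply = callback("hello")
```

Accepting a raw callable keeps the metric agnostic to how the target is built (an agent, a RAG pipeline, an HTTP endpoint), while accepting a model object would be more uniform with how the judge model is supplied; the adapter shows the two are easy to bridge either way.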

Add a black-box metric that measures how robust an LLM is to adversarial,
meaning-preserving perturbations of its input, addressing confident-ai#2150.

Inspired by the RoMA framework (arXiv:2504.17723), for each test case the
metric:
  1. uses the evaluation model to generate meaning-preserving adversarial
     perturbations of the input (semantic synonym/rephrasing swaps and
     orthographic typo-style character noise);
  2. probes the system under test via a `model_callback` on every
     perturbation;
  3. uses the evaluation model to judge whether each perturbed response stays
     semantically consistent with the reference `actual_output`.

The score is the fraction of perturbations the system stayed consistent on
(1.0 = perfectly robust); higher is better, so a case passes when
`score >= threshold`. Follows the existing BaseMetric conventions: sync/async
`measure`/`a_measure`, schema-based generation, the compiled prompt-template
bundle, strict/verbose modes, and cost/token accrual.

Unlike the earlier draft (confident-ai#2181) this pulls in no heavyweight runtime
dependencies (no gensim/nltk/numpy and no large Word2Vec download) and judges
robustness by LLM-graded semantic consistency rather than brittle exact string
matching.

Adds the metric to the public `deepeval.metrics` exports, the compiled
metric-template bundles (Python + TypeScript), and a fully mocked test suite
(no real API calls).

Signed-off-by: xr843 <xianren843@protonmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel Bot commented Jun 27, 2026


Someone is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.


xr843 commented Jun 28, 2026

Author

CI note: the three red checks here are pre-existing, repo-wide gate failures unrelated to this PR — they're currently red on essentially every open PR (and reproduce on a clean main):

  • Lint (black --check .): the action pins psf/black@stable, which has drifted from main; ~47 existing .py files would be reformatted. None are touched by this PR, and every .py file added here passes black 25.12 → 26.5.
  • TypeScript Lint (prettier --check "src/**/*.ts" "test/**/*.ts"): ~66 existing src/**/*.ts files would be reformatted. This PR adds no .ts files — only the auto-generated typescript/src/templates/metrics/templates.json, which the prettier --check globs don't cover.
  • TypeScript Tests (jest): test/test-core/evaluate.test.ts fails to compile (TS2307: Cannot find module '../../src/confident/evaluate') and the suite needs CONFIDENT_API_KEY / OPENAI_API_KEY, which fork PRs don't receive.

This PR's own suites are all green: Metric Templates (Py + TS bundle sync), Core Tests, Metrics Tests, Confident Tests, and all Integration Tests.

I'm happy to open a separate chore: reformat with black/prettier PR to get the formatting gates green repo-wide if that'd be helpful. 🙏


Development

Successfully merging this pull request may close these issues.

Feat: Add AdversarialRobustnessMetric

1 participant