feat(metrics): add AdversarialRobustnessMetric (RoMA-based) (#2150) #2812
Open
xr843 wants to merge 1 commit into
Add a black-box metric that measures how robust an LLM is to adversarial, meaning-preserving perturbations of its input, addressing confident-ai#2150.
Inspired by the RoMA framework (arXiv:2504.17723), for each test case the metric:
1. uses the evaluation model to generate meaning-preserving adversarial perturbations of the input (semantic synonym/rephrasing swaps and orthographic typo-style character noise);
2. probes the system under test via a `model_callback` on every perturbation;
3. uses the evaluation model to judge whether each perturbed response stays semantically consistent with the reference `actual_output`.
The score is the fraction of perturbations the system stayed consistent on (1.0 = perfectly robust); higher is better, so a case passes when `score >= threshold`.
Follows the existing BaseMetric conventions: sync/async `measure`/`a_measure`, schema-based generation, the compiled prompt-template bundle, strict/verbose modes, and cost/token accrual.
Unlike the earlier draft (confident-ai#2181) this pulls in no heavyweight runtime dependencies (no gensim/nltk/numpy and no large Word2Vec download) and judges robustness by LLM-graded semantic consistency rather than brittle exact string matching.
Adds the metric to the public `deepeval.metrics` exports, the compiled metric-template bundles (Python + TypeScript), and a fully mocked test suite (no real API calls).
Signed-off-by: xr843 <xianren843@protonmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Someone is attempting to deploy a commit to the Confident AI Team on Vercel. A member of the Team first needs to authorize it.
Author
CI note: the three red checks here are pre-existing, repo-wide gate failures unrelated to this PR — they're currently red on essentially every open PR (and reproduce on a clean
This PR's own suites are all green: Metric Templates (Py + TS bundle sync), Core Tests, Metrics Tests, Confident Tests, and all Integration Tests. I'm happy to open a separate
What
Closes #2150. Adds `AdversarialRobustnessMetric`, a black-box metric measuring how robustly a system's output survives meaning-preserving perturbations of the input.
Approach (RoMA-inspired, arXiv:2504.17723)
Per test case:
1. Generate meaning-preserving perturbations of `input` (semantic synonym/rephrase + orthographic typos).
2. Probe the system under test on every perturbation via `model_callback`.
3. Judge whether each perturbed response stays semantically consistent with the reference `actual_output`.
Score = fraction consistent (1.0 = perfectly robust); higher-is-better, `success = score >= threshold`.
Conventions
Mirrors `BiasMetric` exactly: sync/async `measure`/`a_measure`, `generate_with_schema_and_extract`, compiled template bundle (regenerated both `templates.json` bundles and added the new template method to the typing `Literal` so the sync-guard test passes), `strict_mode`/`verbose_mode`/`include_reason`, cost/token accrual, `is_successful`. Public export added to `deepeval.metrics`. Fully-mocked tests (no real API calls); `ruff` + `black` clean.
Supersedes #2181
The prior (8-month-stale) PR pulled heavyweight optional deps (gensim + a ~1.6GB Word2Vec download at runtime, nltk, numpy), measured robustness by brittle exact string match, and predates the repo's current compiled-template system. This implementation drops all extra deps and uses LLM-generated perturbations + LLM-graded semantic consistency aligned to the current template system.
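For reviewers who want the pipeline at a glance, the flow described above (probe each perturbation, grade consistency, report the consistent fraction) can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `robustness_score`, `is_successful`, the `is_consistent` judge parameter, the default threshold of 0.5, and the toy stubs are all hypothetical names and values chosen for the example, and the toy judge is a simple stand-in for the LLM-graded consistency check.

```python
from typing import Callable, List


def robustness_score(
    perturbations: List[str],
    reference_output: str,
    model_callback: Callable[[str], str],
    is_consistent: Callable[[str, str], bool],
) -> float:
    """Return the fraction of perturbed inputs whose response stays consistent.

    `is_consistent(reference, response)` is a hypothetical judge; in the PR's
    description this role is played by the evaluation model.
    """
    if not perturbations:
        raise ValueError("at least one perturbation is required")
    consistent = 0
    for perturbed_input in perturbations:
        # Probe the system under test on the perturbed input.
        response = model_callback(perturbed_input)
        if is_consistent(reference_output, response):
            consistent += 1
    return consistent / len(perturbations)


def is_successful(score: float, threshold: float = 0.5) -> bool:
    """Higher is better: pass when the score reaches the threshold.

    The 0.5 default is an assumption for this sketch.
    """
    return score >= threshold


# Toy stand-ins (assumptions for illustration, no real model or API calls).
def toy_model(prompt: str) -> str:
    # Answers correctly only when the word "capital" survives the perturbation.
    return "Paris" if "capital" in prompt.lower() else "unknown"


def toy_judge(reference: str, response: str) -> bool:
    # Crude substitute for an LLM judge: reference text must appear in the response.
    return reference.lower() in response.lower()


toy_perturbations = [
    "What is the capital of France?",
    "whats the capitl of France?",  # typo-style noise breaks the toy model
    "Name France's capital city.",
]
toy_score = robustness_score(toy_perturbations, "Paris", toy_model, toy_judge)
```

With these stubs two of the three perturbations stay consistent, so `toy_score` is 2/3 and the case passes at a 0.5 threshold but fails at 0.9.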
One design point — feedback welcome
A faithful robustness measure must actually run the model on perturbed inputs, so this metric takes a `model_callback` to probe the system under test. No existing deepeval metric does this today (it's closer to the deepteam/red-team pattern, though #2181 already established the "metric needs the target model" shape). If maintainers would prefer this live in deepteam, or prefer different ergonomics (e.g. accepting a `DeepEvalBaseLLM` target instead of a raw callable), happy to adjust.
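To make the ergonomics question concrete, one possible shape is a small adapter that accepts either a raw callable or an object exposing a `generate(prompt)` method and normalizes both to a plain callable. This is a sketch under assumptions, not something in this PR: `GeneratesText`, `Target`, and `as_callback` are hypothetical names, and whether `DeepEvalBaseLLM` matches this exact `generate` shape would need checking against the library.

```python
from typing import Callable, Protocol, Union, runtime_checkable


@runtime_checkable
class GeneratesText(Protocol):
    """Hypothetical stand-in for an LLM wrapper with a generate(prompt) method."""

    def generate(self, prompt: str) -> str: ...


# Either a raw prompt -> response callable or an object with .generate().
Target = Union[Callable[[str], str], GeneratesText]


def as_callback(target: Target) -> Callable[[str], str]:
    """Normalize either target style into a plain prompt -> response callable."""
    # Check for .generate first, since a wrapper object may also be callable.
    if isinstance(target, GeneratesText):
        return target.generate
    if callable(target):
        return target
    raise TypeError(
        "target must be a callable or expose a generate(prompt) method"
    )
```

The metric could then call `as_callback(target)` once and treat both kinds of target uniformly when probing perturbed inputs.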