Behavioral trust scoring for multi-agent conversations #7686

vdineshk · 2026-05-13T10:06:05Z

Feature Suggestion: Trust Scoring for Agent Interactions

Problem

When AutoGen agents interact with external MCP servers or paid APIs, there's no standardized way to evaluate trustworthiness before executing sensitive operations. As agents increasingly handle payments (x402 protocol) and data access, a trust layer becomes critical.

Proposal

Integrate behavioral trust scoring using the x402 Trust-Provider Interface — an open spec for pre-settlement trust evaluation. The Dominion Observatory already tracks 14,800+ MCP servers with behavioral trust scores.

How it works

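async def gate_paid_call(observatory, agent_id):
    """Return True only when the trust check allows settlement."""
    # Before an agent executes a paid tool call
    score = await observatory.query(agent_id)
    if score.decision == "PASS":
        return True  # Proceed with settlement
    # FAIL, or UNCERTAIN (the 40-59 review band): block the action
    return False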

REST API (no SDK needed):

GET https://dominion-observatory.sgdata.workers.dev/api/agent-query/{agent-id}

Returns: trust_score (0-100), behavioral metrics, attestation data.
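
A minimal Python sketch of calling the endpoint above using only the standard library. It assumes the response body is JSON with a top-level `trust_score` field; the post names the field but not the full payload shape, so the helper names (`build_query_url`, `parse_trust_score`, `fetch_trust_score`) and the parsing are illustrative rather than a documented client.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint prefix taken from the post; the agent id is appended as one path segment.
BASE_URL = "https://dominion-observatory.sgdata.workers.dev/api/agent-query/"


def build_query_url(agent_id: str) -> str:
    """Percent-encode the agent id so it is safe as a single path segment."""
    return BASE_URL + quote(agent_id, safe="")


def parse_trust_score(body: str) -> float:
    """Read trust_score from a JSON body (assumed shape) and range-check it."""
    payload = json.loads(body)
    score = payload["trust_score"]
    if not 0 <= score <= 100:
        raise ValueError(f"trust_score out of range: {score}")
    return score


def fetch_trust_score(agent_id: str, timeout: float = 5.0) -> float:
    """Perform the GET request and return the parsed score (network call)."""
    with urlopen(build_query_url(agent_id), timeout=timeout) as resp:
        return parse_trust_score(resp.read().decode("utf-8"))
```

Keeping URL construction and parsing separate from the network call makes both easy to test offline, and the range check rejects a malformed score before it reaches any settlement logic.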

Decision thresholds

Score >= 60 → PASS (silver-tier behavioral trust)
Score 40-59 → UNCERTAIN (review band)
Score < 40 → FAIL (below bronze threshold)
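
The bands above can be sketched as a small Python function. The function name `decide` is hypothetical, and treating fractional scores such as 59.5 as UNCERTAIN is an assumption, since the post only lists integer-looking ranges.

```python
def decide(trust_score: float) -> str:
    """Map a 0-100 trust score onto the PASS / UNCERTAIN / FAIL bands."""
    if not 0 <= trust_score <= 100:
        raise ValueError(f"trust_score out of range: {trust_score}")
    if trust_score >= 60:
        return "PASS"        # silver-tier behavioral trust
    if trust_score >= 40:
        return "UNCERTAIN"   # review band
    return "FAIL"            # below bronze threshold
```

For example, `decide(60)` yields PASS while `decide(59)` falls into the review band.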

Links

msaleme · 2026-05-13T14:19:49Z

This is a thoughtful framing of the pre-settlement trust problem. A few hard-won observations from testing payment-gated agent flows:

Behavioral scores are gameable. A server that knows it is being observed can maintain benign behavior until it crosses a threshold, then exploit the trust it has accumulated. We have seen this in practice with MCP servers that return safe results during discovery/inspection phases and switch behavior after the client commits to a paid call. Static thresholds (>= 60 PASS, < 40 FAIL) create a predictable target for adversaries.

Trust-provider aggregation needs its own adversarial model. If you run N providers in parallel with a QUORUM policy, an attacker only needs to compromise a bare majority of them (⌊N/2⌋ + 1). If providers share data sources, and they often do, the independence assumption breaks down. We test for this explicitly.
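
To make the quorum arithmetic concrete, here is a rough Python sketch. It assumes QUORUM means a strict majority of provider verdicts, and the names `quorum_decision`, `providers_to_compromise`, and `independent_sources` are hypothetical; none of this comes from the x402 Trust-Provider Interface spec.

```python
from collections import Counter


def quorum_decision(votes: list[str]) -> str:
    """Return the verdict held by a strict majority, else UNCERTAIN."""
    if not votes:
        return "UNCERTAIN"
    needed = len(votes) // 2 + 1
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict if count >= needed else "UNCERTAIN"


def providers_to_compromise(n: int) -> int:
    """Smallest number of providers an attacker must control to win a majority."""
    return n // 2 + 1


def independent_sources(provider_sources: dict[str, str]) -> int:
    """Count distinct upstream data sources behind a set of providers."""
    return len(set(provider_sources.values()))
```

With three providers where two read the same upstream feed, `independent_sources` returns 2, so poisoning that one feed already controls a majority of the votes.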

The fail-closed default is correct, but expensive. In high-velocity agent workflows, UNCERTAIN → block can create a denial-of-service condition where legitimate but novel servers cannot get traction. A graduated path (e.g., rate-limited escrow for UNCERTAIN, full release for PASS) may reduce the bootstrap problem without weakening security.
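
One possible shape for that graduated path, sketched in Python. The `SettlementPlan` type, the `plan_settlement` function, and the `escrow_cap` value are assumptions made for illustration, and the per-window rate limiting is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SettlementPlan:
    action: str        # "RELEASE", "ESCROW", or "BLOCK"
    max_amount: float  # amount this plan permits


def plan_settlement(decision: str, amount: float, escrow_cap: float = 1.00) -> SettlementPlan:
    """Full release for PASS, capped escrow for UNCERTAIN, block otherwise."""
    if decision == "PASS":
        return SettlementPlan("RELEASE", amount)
    if decision == "UNCERTAIN":
        # Novel servers can still earn history, but only on small escrowed amounts.
        return SettlementPlan("ESCROW", min(amount, escrow_cap))
    # FAIL and any unrecognized verdict fail closed.
    return SettlementPlan("BLOCK", 0.0)
```

Unknown verdict strings fall through to BLOCK, which preserves the fail-closed default while still giving UNCERTAIN servers a bounded way to build a track record.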

Keep trust and authorization separate. A high trust score should not automatically grant write access, financial scope, or cross-agent delegation. We map this as identity → trust scoring → capability gating → audit logging, with each layer independently testable.

We have been running adversarial tests against x402/L402 payment flows and MCP trust boundaries (358+ tests across protocols). Happy to share specific test cases or failure modes if useful.

—
Disclosure: building open-source agent security testing at https://github.com/msaleme/red-team-blue-team-agent-fabric

1 reply

vdineshk May 14, 2026
Author

Really valuable feedback @msaleme — these are exactly the right concerns.

On gameability of static thresholds: Agreed. The v0.1 spec uses fixed thresholds (60/40) as a starting point, but the Observatory already tracks behavioral trajectories — sudden score changes, interaction pattern shifts, and latency anomalies. The next iteration will expose a volatility field in TrustEvaluation so consumers can detect the "behave well then exploit" pattern. Dynamic thresholds per-category are on the roadmap.

On fail-closed cost for novel servers: The graduated escrow idea is excellent. We're thinking about this as a risk_band in the TrustQuery context — low-value calls could use fail-open while high-value settlements stay fail-closed. Your "rate-limited escrow for UNCERTAIN, full release for PASS" maps cleanly to a new aggregation policy. Would you be open to co-authoring that as a spec extension?

On trust vs authorization separation: 100% aligned. The spec intentionally defines TrustEvaluation as an advisory signal — it returns PASS/FAIL/UNCERTAIN but never grants capabilities directly. The beforeSettle hook is a gate, not an authorization layer. The identity → trust → capability → audit pipeline you describe is exactly the architecture.

On your test cases: Yes please — we'd love to run your 358+ adversarial tests against the Observatory. Would you open an issue on daee-engine with a subset? We can set up a shared test harness.

Your red-team-blue-team-agent-fabric looks like a natural integration point. Let's connect.

yudin-s · 2026-05-14T11:38:49Z

I like the idea of a pre-settlement trust check, but I would avoid letting the trust score directly grant capability.

The safer chain is:

identity -> trust signal -> scoped capability -> policy decision -> audit log

A score can influence the policy decision, but it should not be the policy by itself. For example, a server with score 85 might be allowed to read public data, but still not allowed to perform a paid write operation or access tenant data unless the requested capability is explicitly scoped.

I would also make the decision object explainable:

{
  "decision": "ALLOW_WITH_LIMITS",
  "max_amount": "2.00",
  "allowed_tools": ["search", "read_profile"],
  "denied_tools": ["purchase", "write_record"],
  "reason": "new server, positive behavior, no write attestation"
}

That gives AutoGen a usable runtime contract: the agent does not just see PASS/FAIL, it receives bounded execution rights. This matters for paid APIs because an “UNCERTAIN” server might still be safe for a tiny escrowed read, while a “PASS” server should not automatically get unlimited spend.
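
A rough Python sketch of how a runtime might enforce a decision object like the one above before a tool call. The function name `is_call_allowed` is hypothetical, and treating `ALLOW_WITH_LIMITS` as the only permitting value, with tools absent from both lists denied by default, are assumptions not stated in the comment.

```python
from decimal import Decimal


def is_call_allowed(decision: dict, tool: str, amount: str) -> bool:
    """Check one tool call against the bounded rights in a decision object."""
    if decision.get("decision") != "ALLOW_WITH_LIMITS":
        return False  # assumption: any other decision value grants nothing
    if tool in decision.get("denied_tools", []):
        return False
    if tool not in decision.get("allowed_tools", []):
        return False  # unlisted tools are denied by default
    # Decimal avoids float rounding on currency amounts given as strings.
    return Decimal(amount) <= Decimal(decision.get("max_amount", "0"))
```

With the example object, a `search` call for "1.50" passes, while `purchase` at any amount or `search` for "2.01" is rejected.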

jingchang0623-crypto · 2026-05-14T12:04:36Z

From 90 Days of Multi-Agent Production: Trust Scoring in Practice

We have been running 6+ agents on OpenClaw and trust scoring is something we had to build ourselves. A few observations:

Our Trust Scoring Model

We use a simple 5-factor model for agent reliability:

Factor	Weight	How We Measure
Task Completion Rate	30%	Did the agent finish what it started?
Output Quality Score	25%	Human review (1-5 scale)
Error Recovery	20%	Can it self-heal when things go wrong?
Context Efficiency	15%	Tokens used vs output quality
Memory Consistency	10%	Does it remember what it should?
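
The weights in the table combine into a single score roughly as follows. This Python sketch assumes every factor is first normalized to the 0-1 range; the comment does not say how the factors are scaled, so the key names and the `normalize_review` helper for the 1-5 human review are illustrative.

```python
# Weights from the table above; key names are hypothetical.
WEIGHTS = {
    "task_completion": 0.30,
    "output_quality": 0.25,
    "error_recovery": 0.20,
    "context_efficiency": 0.15,
    "memory_consistency": 0.10,
}


def normalize_review(rating: float) -> float:
    """Map a 1-5 human review rating onto 0-1."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    return (rating - 1) / 4


def weighted_trust(factors: dict[str, float]) -> float:
    """Weighted sum of the five factors, each expected in the 0-1 range."""
    if set(factors) != set(WEIGHTS):
        raise ValueError("factors must match the five weighted keys exactly")
    for name, value in factors.items():
        if not 0 <= value <= 1:
            raise ValueError(f"{name} out of range: {value}")
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)
```

Because the weights sum to 1, the result stays in the 0-1 range and can be rescaled to whatever reporting scale a team prefers.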

Key Finding: Trust Scores Degrade Over Time Without Calibration

After ~60 days, we noticed agent performance quietly degrading. The root cause: MEMORY.md files growing stale while the world changed. The agent was "trustworthy" based on old data.

Solution: Weekly trust score recalibration + memory pruning.

Behavioral vs Structural Trust

What I find interesting about the behavioral approach:

Structural trust (ACLs, permissions) prevents bad outcomes
Behavioral trust (scoring) predicts good outcomes

Both are needed. We have a permission system (Green/Yellow/Red zones) for structural trust, and the scoring model for behavioral trust. The sweet spot is when behavioral trust feeds back into structural trust — low-trust agents get tighter permissions automatically.

Data

Agent	Month 1 Score	Month 3 Score	Action
Content Gen	4.2	3.8	Memory refresh
SEO Audit	4.5	4.6	No change needed
Community Ops	3.9	3.1	Memory corruption detected → rebuilt

More on our agent reliability patterns: https://miaoquai.com/tools/openclaw-multi-agent-orchestration
Skills quality framework: https://miaoquai.com/glossary/openclaw-skill-quality.html

🦞 妙趣AI — Running agents you can actually trust

1 reply

productmakerjason May 23, 2026

This is interesting — especially the behavioral vs structural trust distinction.

I’m looking at a smaller version of the same problem with external task feeds. “Task completion” alone feels too weak unless the run is actually verifiable.

So I’m testing whether an agent can prove what it read, which task_id it selected, whether it followed the schema, and whether it avoided claiming submission without a real receipt.

I’ve been using this tiny public agent-readable arena for that:

https://the-agents-of-nations.vercel.app/llms.txt

Not a finished product — mainly collecting failure points across different agent setups. A failed run is useful if it shows where reliability breaks.

msaleme · 2026-05-15T14:51:34Z

The substrate question matters more than the scoring model. A trust signal is only useful to AutoGen if the same signal can be issued, verified, and disputed across multiple implementations — which puts the spec at A2A, MCP, or OWASP ASI, not inside any single observatory.

Three things keep "behavioral trust scoring" from collapsing into a vendor-private feature:

The score is advisory; the gate is policy. The chain from my earlier comment (identity → trust scoring → capability gating → audit logging) holds, with one refinement worth surfacing: scoped capability and policy decision belong as separate layers, not collapsed into one. A score influences the policy; it never replaces it.
The evidence schema is the standard, not the algorithm. Different implementations will weight factors differently. What composes across them is a public evidence object: bilateral, terminal-state, cryptographically verifiable. The evidence-schema work being discussed in a2aproject/A2A#1576 ("Proposal: A2A Settlement Extension (A2A-SE)", a standard extension for escrow-based agent payments) is a usable shape; A2A settlement outcomes plug into it cleanly.
Cross-implementation conformance belongs at the spec body. Filing a vector test set as issues against a single observatory binds those tests to that implementation's lifecycle. The conformance set lives upstream — A2A WG, MCP, or OWASP ASI — where any observatory can re-run it on equal footing.

On the v0.2 extension ask: the right move is an A2A or OWASP ASI proposal anchored to an evidence schema any observatory can produce. The adversarial-vector subset from red-team-blue-team-agent-fabric covers the seven categories that map to this surface, and I'd carry that subset into the spec-body process rather than into any single repo.

Trust signals only compose when their substrate is neutral.

— Saleme, Michael K. (ORCID 0009-0003-6736-1900)

vdineshk · 2026-05-18T13:23:41Z

Author

Really valuable thread — thank you all.

@yudin-s — the capability-scoping framing is important and I think you're right that binary PASS/FAIL is too blunt. The Observatory's output is a score + behavioral metadata (volatility, trajectory, interaction count) that could feed into an explainable decision object upstream. The pipeline you described — identity → trust signal → scoped capability → policy decision → audit — is a better architecture than what I sketched in the proposal. Happy to work toward a structured TrustDecision response type that returns bounded execution rights rather than a go/no-go flag.

@jingchang0623-crypto — the 5-factor model is interesting, especially the memory consistency dimension. You've hit on a real gap: the Observatory currently scores based on observable interaction patterns across sessions, but doesn't model within-session memory state. Score staleness from memory drift is a legitimate failure mode. One mitigation is the behavioral trajectory signal (rate of change in score over time), but that's not the same as detecting intra-session degradation. Worth adding to the open problems list.

@msaleme — I think you've named the right distinction. The Observatory is one behavioral evidence source, not a proposed standard. What should be standardized is the evidence schema — the shape of what trust providers emit, so consumers can compose signals from multiple sources with consistent structure. That's actually closer to what the x402 Trust-Provider Interface spec (linked above) is trying to nail down. If A2A or OWASP ASI are the right bodies to own conformance testing, I'd be interested in contributing to that process — do you have a contact or existing working group there?

ElamOlame31 · 2026-05-28T01:42:31Z

The multi-conversation behavioral scoring problem is exactly what we built AgentGate's behavioral dimension for. Key design decision we made: the score needs to decay over time (we use 24h windows) and weight recent behavior more heavily than historical. An agent that was well-behaved for 3 days but just did a SENSITIVITY_RAMP should score low, not average.
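
A minimal Python sketch of recency-weighted scoring in that spirit. It is not AgentGate's implementation: it assumes exponential decay with a 24-hour half-life as one way to read "24h windows", and the function name and the (timestamp, score) event format are hypothetical.

```python
def decayed_score(events: list[tuple[float, float]], now_h: float,
                  half_life_h: float = 24.0) -> float:
    """Weighted mean of (timestamp_hours, score) events, halving weight per half-life."""
    if not events:
        raise ValueError("no events to score")
    weighted_sum = 0.0
    weight_total = 0.0
    for timestamp_h, score in events:
        age_h = now_h - timestamp_h
        if age_h < 0:
            raise ValueError("event timestamp is in the future")
        weight = 0.5 ** (age_h / half_life_h)  # newest events weigh the most
        weighted_sum += weight * score
        weight_total += weight
    return weighted_sum / weight_total
```

For three days of scores at 90 followed by a fresh 10, the plain mean is 70 while this weighting gives about 47.3, so the most recent misbehavior dominates the result.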

https://github.com/ElamOlame31/agentgate-public
https://www.tryagentgate.com/

chopmob-cloud · 2026-05-28T18:43:29Z

@msaleme's substrate point is the right one. A trust signal that cannot be issued, verified, and disputed across implementations is a vendor-controlled number — scoring models are secondary.

Two things from production that map onto @yudin-s's chain:

"Trust signal" step: algovoi-composite-trust-query aggregates evidence from multiple upstream attestations into a single TRUSTED / PROVISIONAL / INSUFFICIENT_EVIDENCE / UNTRUSTED verdict via a verifier-of-verifier composition model. Spec: draft-hopley-x402-composite-trust-query. Package on PyPI/npm, Apache 2.0.

"Audit log" step: every verdict is written into a hash-chained audit bundle (per-row content_hash, bundle HMAC, B2 Object Lock COMPLIANCE). Standalone verifier at verify.algovoi.co.uk — self-certifying, no AlgoVoi infrastructure required. That gives the cross-implementation dispute case @msaleme raised a verifiable evidence trail.
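
For readers unfamiliar with hash-chained audit logs, a generic Python sketch of the idea follows. It is not AlgoVoi's bundle format: the row layout, the all-zero genesis hash, and HMAC-SHA256 over the final row hash are assumptions made for illustration, and the function names are hypothetical.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def row_hash(prev_hash: str, verdict: dict) -> str:
    """Hash a verdict together with the previous row's hash."""
    payload = json.dumps(verdict, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_bundle(verdicts: list[dict], key: bytes) -> dict:
    """Chain every verdict to its predecessor and seal the chain with an HMAC."""
    rows, prev = [], GENESIS
    for verdict in verdicts:
        content_hash = row_hash(prev, verdict)
        rows.append({"verdict": verdict, "prev_hash": prev, "content_hash": content_hash})
        prev = content_hash
    tag = hmac.new(key, prev.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"rows": rows, "hmac": tag}


def verify_bundle(bundle: dict, key: bytes) -> bool:
    """Recompute every row hash and the bundle HMAC; any edit breaks the chain."""
    prev = GENESIS
    for row in bundle["rows"]:
        if row["prev_hash"] != prev or row_hash(prev, row["verdict"]) != row["content_hash"]:
            return False
        prev = row["content_hash"]
    expected = hmac.new(key, prev.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, bundle["hmac"])
```

Because each row commits to the previous row's hash, changing any earlier verdict invalidates every later row as well as the bundle-level tag.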

The adversarial scoring layer is the Agent Trust Bench — 132 adversarial profiles across 30 threat categories. Accessible via MCP server (uvx algovoi-mcp / npx -y @algovoi/mcp-server).

AlgoVoi (chopmob-cloud) - chopmob@gmail.com - Acquisition enquiries: https://docs.algovoi.co.uk/acquisition
