
Scoring Formula

Daniel Babjak edited this page May 9, 2026 · 1 revision

This page defines the final scoring model for Agent Bench.

The formula must be deterministic, inspectable, and explainable.

Score Axes

| Axis | Meaning |
| --- | --- |
| Seniority | Proven maturity, history, and technical depth |
| Relevance | Fit for the selected evaluation context |
| Confidence | Trustworthiness and completeness of the evidence |

The headline score is derived from seniority and relevance. Confidence caps the displayed tier.

Headline Formula

headlineScore = seniority * 0.55 + relevance * 0.45

Tier caps:

| Confidence | Max tier |
| --- | --- |
| High | S |
| Medium | A |
| Low | B |
| Public-read | A |
| Mock | no production tier |
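
The headline formula and the tier caps can be sketched together as follows. This is a minimal illustration, not the product code: the page does not define how a headline score maps to a raw tier, so the raw tier is taken as an input; the tier `C` and the lower-case confidence keys are hypothetical additions for the example.

```python
from typing import Optional

# Tier order, best first. "S", "A", and "B" come from the tier-cap table;
# "C" is a hypothetical lower tier so the example has something below B.
TIER_ORDER = ["S", "A", "B", "C"]

# Maximum displayable tier per confidence state (from the tier-cap table).
# None stands for "no production tier".
MAX_TIER = {
    "high": "S",
    "medium": "A",
    "low": "B",
    "public-read": "A",
    "mock": None,
}


def headline_score(seniority: float, relevance: float) -> float:
    """Blend the two 0-100 axis scores with the 0.55 / 0.45 weights."""
    return seniority * 0.55 + relevance * 0.45


def cap_tier(raw_tier: str, confidence: str) -> Optional[str]:
    """Return the tier to display after applying the confidence cap."""
    max_tier = MAX_TIER[confidence]
    if max_tier is None:
        return None
    # A larger index means a lower tier; never display better than the cap.
    capped_index = max(TIER_ORDER.index(raw_tier), TIER_ORDER.index(max_tier))
    return TIER_ORDER[capped_index]
```

With these definitions, `cap_tier("S", "medium")` returns `"A"`, while `cap_tier("B", "high")` leaves the tier at `"B"` because the cap only ever lowers a tier.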

Seniority Components

| Component | Weight |
| --- | --- |
| Sourcify verified-code maturity | 30 |
| GitHub engineering maturity | 25 |
| On-chain operational history | 20 |
| ENS identity maturity | 10 |
| Portfolio evidence depth | 15 |
| Total | 100 |

Sourcify Verified-Code Maturity

Weight: 30.

| Attribute | Weight | Normalization |
| --- | --- | --- |
| verifiedContractsCount | 6 | min(count / 5, 1) |
| exactMatchRatio | 6 | exactMatches / totalContracts |
| metadataCompleteness | 4 | avg(ABI, source, compiler, storageLayout presence) |
| storageHygieneScore | 6 | 1 if no incompatible changes, 0.5 if unknown, 0 on collision |
| proxyResolutionCoverage | 4 | resolved proxies / proxy-like contracts |
| riskySelectorPenalty | 4 | 1 - min(riskySelectors / 3, 1) |
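
As a rough sketch of how the Sourcify normalizations might be coded, the example below maps raw evidence to values in [0, 1] and sums the weighted result. The input field names, the storage state labels, and the choice to score an empty denominator as 0 are assumptions made for illustration; the page does not specify them.

```python
# Weights from the Sourcify table; they sum to the component weight of 30.
SOURCIFY_WEIGHTS = {
    "verifiedContractsCount": 6,
    "exactMatchRatio": 6,
    "metadataCompleteness": 4,
    "storageHygieneScore": 6,
    "proxyResolutionCoverage": 4,
    "riskySelectorPenalty": 4,
}

# Storage hygiene states mapped to the table's 1 / 0.5 / 0 values.
# The state names are hypothetical labels for this sketch.
STORAGE_HYGIENE = {"compatible": 1.0, "unknown": 0.5, "collision": 0.0}


def ratio(numerator: float, denominator: float, empty: float = 0.0) -> float:
    """Ratio clamped to [0, 1]; returns `empty` when the denominator is 0.

    Scoring an empty denominator as 0.0 is an assumption of this sketch.
    """
    if denominator <= 0:
        return empty
    return max(0.0, min(numerator / denominator, 1.0))


def normalize_sourcify(raw: dict) -> dict:
    """Map raw Sourcify evidence to the table's normalized values."""
    presence = raw["metadataPresence"]  # ABI, source, compiler, storageLayout flags
    return {
        "verifiedContractsCount": min(raw["verifiedContracts"] / 5, 1.0),
        "exactMatchRatio": ratio(raw["exactMatches"], raw["totalContracts"]),
        "metadataCompleteness": sum(presence) / len(presence),
        "storageHygieneScore": STORAGE_HYGIENE[raw["storageState"]],
        "proxyResolutionCoverage": ratio(raw["resolvedProxies"], raw["proxyLikeContracts"]),
        "riskySelectorPenalty": 1.0 - min(raw["riskySelectors"] / 3, 1.0),
    }


def sourcify_points(raw: dict) -> float:
    """Weighted sum of normalized values, before any trust multiplier."""
    normalized = normalize_sourcify(raw)
    return sum(SOURCIFY_WEIGHTS[name] * normalized[name] for name in SOURCIFY_WEIGHTS)


example = {
    "verifiedContracts": 3,
    "exactMatches": 2,
    "totalContracts": 4,
    "metadataPresence": [True, True, True, False],
    "storageState": "unknown",
    "resolvedProxies": 1,
    "proxyLikeContracts": 2,
    "riskySelectors": 1,
}
print(round(sourcify_points(example), 2))  # roughly 17.27 of a possible 30
```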

GitHub Engineering Maturity

Weight: 25.

| Attribute | Weight | Normalization |
| --- | --- | --- |
| repoCount | 3 | min(repos / 5, 1) |
| repoAgeMonths | 4 | min(ageMonths / 18, 1) |
| recentCommits90d | 4 | min(commits / 60, 1) |
| ciPassRate | 4 | successful runs / total recent runs |
| testPresence | 3 | 1 if tests found, 0.5 if partial, 0 if none |
| releaseCount | 2 | min(releases / 5, 1) |
| issueHygiene | 3 | closed/resolved ratio with stale issue penalty |
| securityHygiene | 2 | avg(SECURITY, dependabot, lockfiles) |

Apply the GitHub trust multiplier (see Trust Multipliers) after component normalization.
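
One way to read this ordering: normalize and weight the attributes first, then scale the result by the trust multiplier of the GitHub evidence. The sketch below assumes a single trust state for the whole GitHub component, in which case multiplying per row or once at the end gives the same total. The trust state keys are hypothetical labels for the "Cross-signed GitHub" and "Public but unverified GitHub claim" rows of the trust table.

```python
# Weights from the GitHub table; they sum to the component weight of 25.
GITHUB_WEIGHTS = {
    "repoCount": 3,
    "repoAgeMonths": 4,
    "recentCommits90d": 4,
    "ciPassRate": 4,
    "testPresence": 3,
    "releaseCount": 2,
    "issueHygiene": 3,
    "securityHygiene": 2,
}

# Multipliers taken from the trust table; the keys are hypothetical.
GITHUB_TRUST = {"cross-signed": 1.0, "public-unverified": 0.6}


def github_points(normalized: dict, trust_state: str) -> float:
    """Weight the already-normalized attributes, then apply the trust multiplier."""
    pre_trust = sum(GITHUB_WEIGHTS[name] * normalized[name] for name in GITHUB_WEIGHTS)
    return pre_trust * GITHUB_TRUST[trust_state]


# A profile that maxes out every attribute but is not cross-signed
# earns about 15 of the 25 available points.
perfect = {name: 1.0 for name in GITHUB_WEIGHTS}
print(github_points(perfect, "public-unverified"))
```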

On-Chain Operational History

Weight: 20.

| Attribute | Weight | Normalization |
| --- | --- | --- |
| firstSeenAgeDays | 4 | min(days / 365, 1) |
| txCountTotal | 4 | log-scaled to avoid spam dominance |
| txCountRecent90d | 4 | log-scaled recent activity |
| contractsDeployedCount | 3 | min(count / 5, 1) |
| uniqueInteractors | 3 | log-scaled |
| activityContinuity | 2 | active months / observed months |

Spam rule:

A high transaction count without source diversity does not produce high seniority.
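
The table does not fix a base or a saturation point for the log-scaled attributes. One possible reading, shown here purely as an assumption, is to divide `log1p` of the raw count by `log1p` of a chosen saturation value and clamp at 1. The `saturation` parameter and its example value are hypothetical. Note that this only dampens large counts; it does not by itself check source diversity.

```python
import math


def log_scaled(value: float, saturation: float) -> float:
    """Map a raw count to [0, 1] on a log scale.

    `saturation` is the (hypothetical) count at which the score reaches 1.
    """
    if saturation <= 0:
        raise ValueError("saturation must be positive")
    if value <= 0:
        return 0.0
    return min(math.log1p(value) / math.log1p(saturation), 1.0)


# Going from 100 to 1,000 transactions adds far less than a linear scale would,
# and anything beyond the saturation point adds nothing.
for tx_count in (10, 100, 1_000, 1_000_000):
    print(tx_count, round(log_scaled(tx_count, saturation=1_000), 3))
```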

ENS Identity Maturity

Weight: 10.

| Attribute | Weight | Normalization |
| --- | --- | --- |
| ensNameAgeDays | 2 | min(days / 365, 1) |
| manifestPresent | 2 | 1 if present, 0 if absent |
| recordsCompleteness | 2 | present expected records / expected records |
| ownerConsistency | 2 | owner/operator/signers align |
| endpointPresence | 1 | web/context records present |
| subnameSignal | 1 | capped subname count |

Portfolio Evidence Depth

Weight: 15.

| Attribute | Weight | Normalization |
| --- | --- | --- |
| portfolioItemCount | 3 | min(items / 6, 1) |
| portfolioSourceDiversity | 3 | source kinds present / 5 |
| verifiedClaimRatio | 4 | verified claims / total claims |
| contractBackedItemRatio | 2 | contract-backed items / applicable items |
| evidenceLinkCoverage | 3 | claims with usable evidence / total claims |

Relevance Components

Default context:

agentic venture due diligence

| Component | Weight |
| --- | --- |
| Category fit | 25 |
| Recent activity | 20 |
| Portfolio alignment | 20 |
| Public-good / Umia fit | 15 |
| Claim credibility | 10 |
| Demo readiness | 10 |
| Total | 100 |

Category Fit

Weight: 25.

Signals:

  • manifest context tags,
  • repository topics,
  • README keywords,
  • portfolio item roles,
  • claims taxonomy.

Example contexts:

  • public goods,
  • grants,
  • audit/safety,
  • governance,
  • data extraction,
  • developer tooling,
  • trading,
  • research.

Recent Activity

Weight: 20.

Signals:

  • GitHub commits last 90 days,
  • releases last 180 days,
  • on-chain activity last 90 days,
  • recent verified contract update,
  • manifest update freshness.

Portfolio Alignment

Weight: 20.

Signals:

  • portfolio items match claimed category,
  • contracts/repos support stated capability,
  • evidence exists for core claim,
  • no major contradiction between sources.

Public-Good / Umia Fit

Weight: 15.

Signals:

  • due-diligence utility,
  • agentic venture applicability,
  • civic/public-good usefulness,
  • non-extractive framing,
  • compatibility with launch platform review.

Claim Credibility

Weight: 10.

Signals:

  • verified claims,
  • signed claims,
  • discounted claims,
  • missing evidence.

Demo Readiness

Weight: 10.

Signals:

  • public endpoint works,
  • report is shareable,
  • portfolio has inspectable items,
  • clear story for reviewer.

Confidence Formula

Confidence is computed separately from the headline score. It caps the displayed tier and feeds the report outcomes.

confidence =
  sourceCoverage * 0.25 +
  verificationCoverage * 0.30 +
  identityBinding * 0.20 +
  freshness * 0.15 +
  errorHealth * 0.10

| Component | Meaning |
| --- | --- |
| sourceCoverage | Required source cards returned usable data |
| verificationCoverage | Inputs are verified/signed rather than self-asserted |
| identityBinding | ENS, GitHub, contracts, and operator connect cleanly |
| freshness | Data is within TTL |
| errorHealth | Few failed collectors |
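
A minimal sketch of the confidence formula, assuming each component is already expressed as a value in [0, 1]. The page does not say how the numeric result maps to the High / Medium / Low levels, so that step is left out.

```python
# Weights from the confidence formula; they sum to 1.0.
CONFIDENCE_WEIGHTS = {
    "sourceCoverage": 0.25,
    "verificationCoverage": 0.30,
    "identityBinding": 0.20,
    "freshness": 0.15,
    "errorHealth": 0.10,
}


def confidence(components: dict) -> float:
    """Weighted sum of the five confidence components."""
    for name in CONFIDENCE_WEIGHTS:
        value = components[name]  # a missing component raises KeyError
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(weight * components[name] for name, weight in CONFIDENCE_WEIGHTS.items())


example_components = {
    "sourceCoverage": 1.0,
    "verificationCoverage": 0.5,
    "identityBinding": 0.8,
    "freshness": 1.0,
    "errorHealth": 0.9,
}
print(round(confidence(example_components), 2))  # about 0.80
```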

Trust Multipliers

| Trust state | Multiplier |
| --- | --- |
| Sourcify verified exact match | 1.0 |
| On-chain direct read | 1.0 |
| ENS direct record | 1.0 |
| Cross-signed GitHub | 1.0 |
| ENS-signed claim | 0.85 |
| Public but unverified GitHub claim | 0.6 |
| Public URL without ownership proof | 0.6 |
| Self-asserted manifest claim | 0.35 |
| Missing evidence | 0 |
| Mock | 0 for production score; demo-only |

Score Row Calculation

For every row:

contribution = normalizedValue * weight * trustMultiplier

The UI must show all three factors: normalized value, weight, and trust multiplier.
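
The row calculation could be modeled as a small record that keeps the three factors alongside the product, so an explanation string can always be rebuilt from stored values. The class name, field names, and output format below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRow:
    attribute: str
    normalized_value: float  # in [0, 1]
    weight: float            # attribute weight from the component tables
    trust_multiplier: float  # from the trust multiplier table

    @property
    def contribution(self) -> float:
        """contribution = normalizedValue * weight * trustMultiplier"""
        return self.normalized_value * self.weight * self.trust_multiplier

    def explain(self) -> str:
        """Render all three factors and their product for display."""
        return (
            f"{self.attribute}: {self.normalized_value:.2f} x {self.weight:g} "
            f"x {self.trust_multiplier:.2f} = {self.contribution:.2f}"
        )


row = ScoreRow("exactMatchRatio", normalized_value=0.5, weight=6, trust_multiplier=1.0)
print(row.explain())  # exactMatchRatio: 0.50 x 6 x 1.00 = 3.00
```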

Report Outcomes

| Outcome | Suggested thresholds |
| --- | --- |
| fast-track | headline >= 75 and confidence high/medium |
| emerging-review | relevance >= 70 and seniority < 60 |
| evidence-required | confidence low or verifiedClaimRatio < 0.4 |
| manual-security-review | any portfolio contract returns SIREN |
| reject-or-redirect | relevance < 45 or evidence count too low |

These are reviewer routing labels, not investment advice.
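
The suggested thresholds can be sketched as a function that returns every label whose condition holds. The page does not define a priority among outcomes or what counts as an evidence count that is "too low", so this sketch returns all matches and takes `min_evidence_count` as a hypothetical parameter. The SIREN result is reduced to a boolean flag here.

```python
from typing import List


def route_outcomes(
    *,
    headline: float,
    seniority: float,
    relevance: float,
    confidence_level: str,       # "high", "medium", or "low"
    verified_claim_ratio: float,
    any_siren: bool,             # True if any portfolio contract returned SIREN
    evidence_count: int,
    min_evidence_count: int = 3,  # hypothetical cutoff for "too low"
) -> List[str]:
    """Return every routing label whose suggested threshold is met."""
    labels: List[str] = []
    if headline >= 75 and confidence_level in ("high", "medium"):
        labels.append("fast-track")
    if relevance >= 70 and seniority < 60:
        labels.append("emerging-review")
    if confidence_level == "low" or verified_claim_ratio < 0.4:
        labels.append("evidence-required")
    if any_siren:
        labels.append("manual-security-review")
    if relevance < 45 or evidence_count < min_evidence_count:
        labels.append("reject-or-redirect")
    return labels
```

Returning a list rather than a single label keeps conflicts visible, for example a report that qualifies for fast-track but also has a contract flagged for manual security review.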

Formula Governance

The formula is public.

Any weight change must update:

  • this page,
  • product constants,
  • report JSON schema version if output changes,
  • demo fixture expected scores,
  • explanation copy.

Hidden score changes are not allowed.
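
One way to catch an unsynchronized weight change early is a consistency check over the product constants. The sketch below is an illustration only: it copies the weights from this page into hypothetical constants and verifies that attribute weights sum to their component weight and that both axes total 100.

```python
from typing import Dict, List, Tuple

# (component weight, attribute weights) as listed on this page.
SENIORITY: Dict[str, Tuple[int, List[int]]] = {
    "Sourcify verified-code maturity": (30, [6, 6, 4, 6, 4, 4]),
    "GitHub engineering maturity": (25, [3, 4, 4, 4, 3, 2, 3, 2]),
    "On-chain operational history": (20, [4, 4, 4, 3, 3, 2]),
    "ENS identity maturity": (10, [2, 2, 2, 2, 1, 1]),
    "Portfolio evidence depth": (15, [3, 3, 4, 2, 3]),
}
RELEVANCE_WEIGHTS: List[int] = [25, 20, 20, 15, 10, 10]


def weight_problems(
    seniority: Dict[str, Tuple[int, List[int]]],
    relevance_weights: List[int],
) -> List[str]:
    """Return a description of every weight total that does not add up."""
    problems: List[str] = []
    for name, (component_weight, attribute_weights) in seniority.items():
        if sum(attribute_weights) != component_weight:
            problems.append(
                f"{name}: attributes sum to {sum(attribute_weights)}, expected {component_weight}"
            )
    seniority_total = sum(weight for weight, _ in seniority.values())
    if seniority_total != 100:
        problems.append(f"seniority components sum to {seniority_total}, expected 100")
    if sum(relevance_weights) != 100:
        problems.append(f"relevance components sum to {sum(relevance_weights)}, expected 100")
    return problems


print(weight_problems(SENIORITY, RELEVANCE_WEIGHTS))  # [] for the weights on this page
```

A check like this could run alongside the demo fixture tests, so a change to one table fails loudly until the other constants are updated.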
