Scoring Formula
Daniel Babjak edited this page May 9, 2026 · 1 revision
This page defines the final scoring model for Agent Bench.
The formula must be deterministic, inspectable, and explainable.
| Axis | Meaning |
|---|---|
| Seniority | Proven maturity, history, and technical depth |
| Relevance | Fit for the selected evaluation context |
| Confidence | Trustworthiness and completeness of the evidence |
The headline score is derived from seniority and relevance. Confidence caps the displayed tier.
headlineScore = seniority * 0.55 + relevance * 0.45
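As a minimal sketch, the headline formula can be written as a small function. It assumes both axes are already on a 0 to 100 scale, which matches the component totals on this page; the function name and the range check are illustrative, not part of the product code.

```python
SENIORITY_WEIGHT = 0.55
RELEVANCE_WEIGHT = 0.45


def headline_score(seniority: float, relevance: float) -> float:
    """Combine the two axes. Inputs are assumed to be on a 0-100 scale."""
    for name, value in (("seniority", seniority), ("relevance", relevance)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be within 0..100, got {value}")
    return seniority * SENIORITY_WEIGHT + relevance * RELEVANCE_WEIGHT


if __name__ == "__main__":
    # 80 * 0.55 + 60 * 0.45, rounded for display
    print(round(headline_score(80, 60), 2))
```

Confidence is deliberately absent from this function: it only limits the displayed tier.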
Tier caps:
| Confidence | Max tier |
|---|---|
| High | S |
| Medium | A |
| Low | B |
| Public-read | A |
| Mock | no production tier |
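The cap table above can be sketched as a lookup plus a comparison. This page does not list the full tier ladder or the score thresholds for each tier, so `TIER_ORDER` (including the placeholder tier `"C"`) and the lowercase confidence keys are assumptions made for illustration.

```python
from typing import Optional

# Best to worst. Tiers below B are not listed on this page; "C" is a placeholder.
TIER_ORDER = ["S", "A", "B", "C"]

MAX_TIER_BY_CONFIDENCE = {
    "high": "S",
    "medium": "A",
    "low": "B",
    "public-read": "A",
    "mock": None,  # no production tier
}


def displayed_tier(raw_tier: str, confidence: str) -> Optional[str]:
    """Return the tier to display after applying the confidence cap."""
    cap = MAX_TIER_BY_CONFIDENCE[confidence]
    if cap is None:
        return None
    # A larger index is a worse tier, so the cap can only push a tier down.
    index = max(TIER_ORDER.index(raw_tier), TIER_ORDER.index(cap))
    return TIER_ORDER[index]
```

For example, a raw tier of S with medium confidence is displayed as A, while a raw tier of B with high confidence stays at B.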
Seniority components:

| Component | Weight |
|---|---|
| Sourcify verified-code maturity | 30 |
| GitHub engineering maturity | 25 |
| On-chain operational history | 20 |
| ENS identity maturity | 10 |
| Portfolio evidence depth | 15 |
| Total | 100 |
Sourcify verified-code maturity. Weight: 30.

| Attribute | Weight | Normalization |
|---|---|---|
| `verifiedContractsCount` | 6 | min(count / 5, 1) |
| `exactMatchRatio` | 6 | exactMatches / totalContracts |
| `metadataCompleteness` | 4 | avg(ABI, source, compiler, storageLayout presence) |
| `storageHygieneScore` | 6 | 1 if no incompatible changes, 0.5 if unknown, 0 on collision |
| `proxyResolutionCoverage` | 4 | resolved proxies / proxy-like contracts |
| `riskySelectorPenalty` | 4 | 1 - min(riskySelectors / 3, 1) |
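The Sourcify normalizations can be sketched as one function that returns the component's points out of 30, before any trust multiplier is applied. The page does not say what happens when a denominator is zero, so the `if_empty` defaults below are assumptions, as are the function and parameter names.

```python
from typing import Sequence


def ratio(numerator: float, denominator: float, if_empty: float) -> float:
    """numerator / denominator clamped to 0..1, or if_empty when denominator is 0."""
    if denominator == 0:
        return if_empty
    return max(0.0, min(1.0, numerator / denominator))


def sourcify_points(
    verified_contracts: int,
    exact_matches: int,
    total_contracts: int,
    metadata_flags: Sequence[bool],  # ABI, source, compiler, storageLayout presence
    storage_hygiene: float,          # 1.0, 0.5, or 0.0 per the table
    resolved_proxies: int,
    proxy_like_contracts: int,
    risky_selectors: int,
) -> float:
    """Sum of weight * normalized value over the six Sourcify attributes."""
    rows = [
        (6, min(verified_contracts / 5, 1)),
        # Assumption: no contracts at all means no exact-match credit.
        (6, ratio(exact_matches, total_contracts, if_empty=0.0)),
        (4, sum(metadata_flags) / len(metadata_flags)),
        (6, storage_hygiene),
        # Assumption: an agent with no proxy-like contracts is not penalized.
        (4, ratio(resolved_proxies, proxy_like_contracts, if_empty=1.0)),
        (4, 1 - min(risky_selectors / 3, 1)),
    ]
    return sum(weight * normalized for weight, normalized in rows)
```

`metadata_flags` is expected to hold the four presence flags named in the table; an empty sequence is not handled in this sketch.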
GitHub engineering maturity. Weight: 25.

| Attribute | Weight | Normalization |
|---|---|---|
| `repoCount` | 3 | min(repos / 5, 1) |
| `repoAgeMonths` | 4 | min(ageMonths / 18, 1) |
| `recentCommits90d` | 4 | min(commits / 60, 1) |
| `ciPassRate` | 4 | successful runs / total recent runs |
| `testPresence` | 3 | 1 if tests found, 0.5 if partial, 0 if none |
| `releaseCount` | 2 | min(releases / 5, 1) |
| `issueHygiene` | 3 | closed/resolved ratio with stale issue penalty |
| `securityHygiene` | 2 | avg(SECURITY, dependabot, lockfiles) |

Apply the GitHub trust multiplier after component normalization.
On-chain operational history. Weight: 20.

| Attribute | Weight | Normalization |
|---|---|---|
| `firstSeenAgeDays` | 4 | min(days / 365, 1) |
| `txCountTotal` | 4 | log-scaled to avoid spam dominance |
| `txCountRecent90d` | 4 | log-scaled recent activity |
| `contractsDeployedCount` | 3 | min(count / 5, 1) |
| `uniqueInteractors` | 3 | log-scaled |
| `activityContinuity` | 2 | active months / observed months |
Spam rule:
High transaction count without source diversity does not produce high seniority.
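This page does not give the exact log-scaling formula or the mechanism behind the spam rule, so the following is only one possible reading. It assumes a `log1p` ratio against a saturation point, and it caps the transaction score by the interactor score so that volume without diversity cannot score high. The saturation values and function names are hypothetical.

```python
import math


def log_scaled(count: int, saturation: int) -> float:
    """Map a raw count to 0..1 on a log scale, reaching 1.0 at `saturation`."""
    if count <= 0:
        return 0.0
    return min(math.log1p(count) / math.log1p(saturation), 1.0)


def tx_count_with_spam_guard(
    tx_count: int,
    unique_interactors: int,
    tx_saturation: int = 10_000,        # hypothetical saturation point
    interactor_saturation: int = 500,   # hypothetical saturation point
) -> float:
    """Normalized transaction score, capped by counterparty diversity."""
    tx_norm = log_scaled(tx_count, tx_saturation)
    diversity_norm = log_scaled(unique_interactors, interactor_saturation)
    # Assumption: the spam rule is enforced as a cap, not as a separate penalty.
    return min(tx_norm, diversity_norm)
```

With these placeholder values, an address with 100,000 transactions but only two distinct interactors ends up with a low normalized score, while one that reaches both saturation points scores 1.0.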
ENS identity maturity. Weight: 10.

| Attribute | Weight | Normalization |
|---|---|---|
| `ensNameAgeDays` | 2 | min(days / 365, 1) |
| `manifestPresent` | 2 | 1 if present, 0 if absent |
| `recordsCompleteness` | 2 | present expected records / expected records |
| `ownerConsistency` | 2 | owner/operator/signers align |
| `endpointPresence` | 1 | web/context records present |
| `subnameSignal` | 1 | capped subname count |
Portfolio evidence depth. Weight: 15.

| Attribute | Weight | Normalization |
|---|---|---|
| `portfolioItemCount` | 3 | min(items / 6, 1) |
| `portfolioSourceDiversity` | 3 | source kinds present / 5 |
| `verifiedClaimRatio` | 4 | verified claims / total claims |
| `contractBackedItemRatio` | 2 | contract-backed items / applicable items |
| `evidenceLinkCoverage` | 3 | claims with usable evidence / total claims |
Relevance components. Default context: agentic venture due diligence.

| Component | Weight |
|---|---|
| Category fit | 25 |
| Recent activity | 20 |
| Portfolio alignment | 20 |
| Public-good / Umia fit | 15 |
| Claim credibility | 10 |
| Demo readiness | 10 |
| Total | 100 |
Category fit. Weight: 25.
Signals:
- manifest context tags,
- repository topics,
- README keywords,
- portfolio item roles,
- claims taxonomy.
Example contexts:
- public goods,
- grants,
- audit/safety,
- governance,
- data extraction,
- developer tooling,
- trading,
- research.
Recent activity. Weight: 20.
Signals:
- GitHub commits last 90 days,
- releases last 180 days,
- on-chain activity last 90 days,
- recent verified contract update,
- manifest update freshness.
Portfolio alignment. Weight: 20.
Signals:
- portfolio items match claimed category,
- contracts/repos support stated capability,
- evidence exists for core claim,
- no major contradiction between sources.
Public-good / Umia fit. Weight: 15.
Signals:
- due-diligence utility,
- agentic venture applicability,
- civic/public-good usefulness,
- non-extractive framing,
- compatibility with launch platform review.
Claim credibility. Weight: 10.
Signals:
- verified claims,
- signed claims,
- discounted claims,
- missing evidence.
Demo readiness. Weight: 10.
Signals:
- public endpoint works,
- report is shareable,
- portfolio has inspectable items,
- clear story for reviewer.
Confidence is computed separately and does not feed into the headline score.
confidence =
sourceCoverage * 0.25 +
verificationCoverage * 0.30 +
identityBinding * 0.20 +
freshness * 0.15 +
errorHealth * 0.10
| Component | Meaning |
|---|---|
| sourceCoverage | Required source cards returned usable data |
| verificationCoverage | Inputs are verified/signed rather than self-asserted |
| identityBinding | ENS, GitHub, contracts, and operator connect cleanly |
| freshness | Data is within TTL |
| errorHealth | Few failed collectors |
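The confidence formula can be sketched as a weighted sum over the five components. It assumes each component has already been normalized to 0..1. How the numeric result maps to the High, Medium, and Low bands is not defined on this page, so that step is left out.

```python
from typing import Dict

CONFIDENCE_WEIGHTS: Dict[str, float] = {
    "sourceCoverage": 0.25,
    "verificationCoverage": 0.30,
    "identityBinding": 0.20,
    "freshness": 0.15,
    "errorHealth": 0.10,
}


def confidence(components: Dict[str, float]) -> float:
    """Weighted sum of the five confidence components (each assumed 0..1)."""
    missing = set(CONFIDENCE_WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing confidence components: {sorted(missing)}")
    return sum(
        components[name] * weight for name, weight in CONFIDENCE_WEIGHTS.items()
    )
```

Raising on a missing component, rather than silently treating it as zero, keeps the result inspectable; that choice is an assumption of this sketch.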
Trust multipliers:

| Trust state | Multiplier |
|---|---|
| Sourcify verified exact match | 1.0 |
| On-chain direct read | 1.0 |
| ENS direct record | 1.0 |
| Cross-signed GitHub | 1.0 |
| ENS-signed claim | 0.85 |
| Public but unverified GitHub claim | 0.6 |
| Public URL without ownership proof | 0.6 |
| Self-asserted manifest claim | 0.35 |
| Missing evidence | 0 |
| Mock | 0 for production score; demo-only |
For every row:
contribution = normalizedValue * weight * trustMultiplier
The UI must show all three factors.
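A minimal sketch of a row that keeps all three factors alongside the result, so the UI can show how each contribution was reached. The `ScoreRow` class and the `explain` helper are illustrative names, not part of the product code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ScoreRow:
    attribute: str
    normalized_value: float   # 0..1, from the attribute's normalization rule
    weight: float             # attribute weight from the tables on this page
    trust_multiplier: float   # from the trust multiplier table

    @property
    def contribution(self) -> float:
        return self.normalized_value * self.weight * self.trust_multiplier


def explain(rows: Iterable[ScoreRow]) -> List[str]:
    """Render each row with its three factors and the resulting contribution."""
    return [
        f"{r.attribute}: {r.normalized_value:.2f} x {r.weight:g} "
        f"x {r.trust_multiplier:.2f} = {r.contribution:.2f}"
        for r in rows
    ]


if __name__ == "__main__":
    rows = [
        ScoreRow("exactMatchRatio", 0.8, 6, 1.0),      # Sourcify exact match
        ScoreRow("portfolioItemCount", 1.0, 3, 0.35),  # self-asserted claim
    ]
    print("\n".join(explain(rows)))
```

The second row shows the effect of the multiplier: a fully satisfied attribute worth 3 points contributes only 1.05 when the evidence is a self-asserted manifest claim.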
| Outcome | Suggested thresholds |
|---|---|
| fast-track | headline >= 75 and confidence high/medium |
| emerging-review | relevance >= 70 and seniority < 60 |
| evidence-required | confidence low or verifiedClaimRatio < 0.4 |
| manual-security-review | any portfolio contract returns SIREN |
| reject-or-redirect | relevance < 45 or evidence count too low |
These are reviewer routing labels, not investment advice.
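The suggested thresholds can be sketched as a function that returns every label whose condition matches. The page does not define precedence between outcomes, so this sketch returns all matches in table order instead of picking one. The "evidence count too low" threshold is also unspecified; `min_evidence_count` is a hypothetical parameter, and `siren_flagged` stands in for "any portfolio contract returns SIREN".

```python
from typing import List


def routing_labels(
    *,
    seniority: float,
    relevance: float,
    confidence: str,              # "high", "medium", or "low"
    verified_claim_ratio: float,
    siren_flagged: bool,
    evidence_count: int,
    min_evidence_count: int = 3,  # hypothetical; not defined on this page
) -> List[str]:
    """Return all routing labels that apply, in the order of the table."""
    headline = seniority * 0.55 + relevance * 0.45
    labels: List[str] = []
    if headline >= 75 and confidence in ("high", "medium"):
        labels.append("fast-track")
    if relevance >= 70 and seniority < 60:
        labels.append("emerging-review")
    if confidence == "low" or verified_claim_ratio < 0.4:
        labels.append("evidence-required")
    if siren_flagged:
        labels.append("manual-security-review")
    if relevance < 45 or evidence_count < min_evidence_count:
        labels.append("reject-or-redirect")
    return labels
```

Returning a list makes conflicts visible, for example a high-scoring agent that also carries a SIREN flag, and leaves the final choice to the reviewer.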
The formula is public.
Any weight change must update:
- this page,
- product constants,
- report JSON schema version if output changes,
- demo fixture expected scores,
- explanation copy.
Hidden score changes are not allowed.
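One way to catch an inconsistent weight change early is a check that attribute weights add up to their component weight and that component weights add up to 100. The sketch below copies the seniority weights from this page; the data layout and the `weight_problems` helper are assumptions, not the product's actual constants.

```python
from typing import Dict, List, Tuple

# component name -> (component weight, attribute weights in table order)
SENIORITY_WEIGHTS: Dict[str, Tuple[int, List[int]]] = {
    "Sourcify verified-code maturity": (30, [6, 6, 4, 6, 4, 4]),
    "GitHub engineering maturity": (25, [3, 4, 4, 4, 3, 2, 3, 2]),
    "On-chain operational history": (20, [4, 4, 4, 3, 3, 2]),
    "ENS identity maturity": (10, [2, 2, 2, 2, 1, 1]),
    "Portfolio evidence depth": (15, [3, 3, 4, 2, 3]),
}


def weight_problems(
    components: Dict[str, Tuple[int, List[int]]], expected_total: int = 100
) -> List[str]:
    """Return a list of human-readable inconsistencies; empty means consistent."""
    problems: List[str] = []
    total = sum(weight for weight, _ in components.values())
    if total != expected_total:
        problems.append(f"component weights sum to {total}, expected {expected_total}")
    for name, (weight, attribute_weights) in components.items():
        attribute_total = sum(attribute_weights)
        if attribute_total != weight:
            problems.append(
                f"{name}: attributes sum to {attribute_total}, expected {weight}"
            )
    return problems
```

Run as part of the test suite, a check like this fails when a weight is edited in one place but not the others.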