Methodology — v2.6.0

How scores are computed.

Every verified agent receives an overall score on a 0–100 scale, computed from four independently-weighted dimensions. The overall score is a deterministic function of the per-dimension numbers, and each dimension traces to a stored, audit-trailed assessment run.

The benchmarking pipeline

01 Submit: Author submits agent for benchmark.
02 Benchmark: Agent assessed on the published rubric.
03 Review: Admin sanity-checks evidence + edge cases.
04 Audit: Every assessment stage recorded for review.
05 Score: Composite score computed from rubric weights.
06 Update: Score and methodology version published.

The four dimensions

Reliability: Documented error handling, timeout and retry behavior, and — on the runtime path — observed failure rate under real traffic. Largest single weight: unreliable agents are not production-ready regardless of peak performance.
Latency: Median wall-clock time from request to final response.
Cost efficiency: Dollars per successful task, normalized across providers.
Consistency: Documented determinism controls and, on the runtime path, variance across real repeated traffic. Captures the gap between a peak run and a typical run.

Weights

Default weights are 35% reliability, 25% latency, 25% cost efficiency, 15% consistency. A weight matrix per task category and domain is published per release; we do not tune weights by agent.
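
As an illustration of how the default weights combine, here is a minimal sketch. It assumes the composite is a plain weighted sum of the four per-dimension scores, rounded to two decimals; the function and type names are hypothetical and are not taken from the scoring code.

```typescript
// Default weights quoted above; they sum to 1.0.
type Dimension = "reliability" | "latency" | "costEfficiency" | "consistency";

const DEFAULT_WEIGHTS: Record<Dimension, number> = {
  reliability: 0.35,
  latency: 0.25,
  costEfficiency: 0.25,
  consistency: 0.15,
};

// Hypothetical helper: combines per-dimension scores (each 0-100) into a
// composite, ASSUMING a plain weighted sum rounded to two decimals.
function compositeScore(
  scores: Record<Dimension, number>,
  weights: Record<Dimension, number> = DEFAULT_WEIGHTS,
): number {
  const dims = Object.keys(weights) as Dimension[];
  const total = dims.reduce((sum, d) => sum + weights[d] * scores[d], 0);
  return Math.round(total * 100) / 100;
}

// 0.35*80 + 0.25*70 + 0.25*60 + 0.15*90 = 28 + 17.5 + 15 + 13.5 = 74
const exampleComposite = compositeScore({
  reliability: 80,
  latency: 70,
  costEfficiency: 60,
  consistency: 90,
});
```

Because the weights are passed as a parameter, a per-category weight matrix could be swapped in without changing the function.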

The Tier 1 pipeline

Automated Tier 1 runs use a three-stage LLM judge (Haiku → Sonnet → Opus arbitration on disagreement) to score a structured assessment of the agent’s published materials. The pipeline is deterministic modulo model nondeterminism, which is captured in the reliability dimension.
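
One way to picture "arbitration on disagreement" is sketched below. This is an interpretation, not the published pipeline: it assumes the first two judges each return a 0–100 score, that agreement means the scores differ by at most a tolerance (10 points here, a hypothetical value), that agreeing scores are averaged, and that the third judge's score is final otherwise. The judges are passed in as plain functions so the control flow can be read without any model client; real model calls would be asynchronous.

```typescript
// A judge maps assessment evidence to a 0-100 score. In practice this would
// wrap an asynchronous model call; it is synchronous here to keep the sketch small.
type Judge = (evidence: string) => number;

interface JudgedScore {
  score: number;
  arbitrated: boolean; // true when the third judge had to decide
}

// Hypothetical control flow: two judges score independently and a third
// arbitrates only when they disagree by more than `tolerance` points.
function threeStageScore(
  evidence: string,
  first: Judge,
  second: Judge,
  arbiter: Judge,
  tolerance = 10, // ASSUMPTION: the disagreement threshold is not published
): JudgedScore {
  const a = first(evidence);
  const b = second(evidence);
  if (Math.abs(a - b) <= tolerance) {
    // ASSUMPTION: agreeing scores are averaged.
    return { score: (a + b) / 2, arbitrated: false };
  }
  return { score: arbiter(evidence), arbitrated: true };
}
```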

As of methodology v2.6.0, the assessment reads real repository materials, not the description alone. For agents with a public GitHub repository (about 85% of the corpus) the README, changelog, and ecosystem summary are fetched, length-capped, and evaluated alongside the description — and each score is pinned to the exact evidence snapshot (a content hash, fetch date, and the assessor model IDs) so it can be reproduced. Agents without a public repository are assessed on their description alone; their Provisional label already signals the thinner evidence base. We disclose this two-class split rather than imply every score rests on the same depth of evidence.
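
The snapshot pinning described above might look roughly like the following sketch. The field names, the 20,000-character cap, and the choice of SHA-256 are assumptions made for illustration; only the three pinned items (content hash, fetch date, assessor model IDs) come from the text.

```typescript
import { createHash } from "node:crypto";

interface EvidenceSnapshot {
  contentHash: string;        // hex digest of the capped materials
  fetchedAt: string;          // fetch date, YYYY-MM-DD (UTC)
  assessorModelIds: string[]; // models that produced the assessment
}

const MAX_CHARS = 20_000; // hypothetical length cap per document

function capLength(text: string, maxChars: number = MAX_CHARS): string {
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}

// Hashes the capped README, changelog, etc. in order. Each document is
// length-prefixed so that ["ab", "c"] and ["a", "bc"] hash differently.
function snapshotEvidence(
  materials: string[],
  assessorModelIds: string[],
  fetchedAt: Date,
): EvidenceSnapshot {
  const hash = createHash("sha256");
  for (const doc of materials) {
    const capped = capLength(doc);
    hash.update(`${capped.length}:`);
    hash.update(capped);
  }
  return {
    contentHash: hash.digest("hex"),
    fetchedAt: fetchedAt.toISOString().slice(0, 10),
    assessorModelIds: [...assessorModelIds],
  };
}
```

Re-running the function on the same capped materials yields the same hash, which is what lets a stored score be tied back to the exact evidence it was computed from.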

Reproducibility

Every score row links to the raw benchmark run id. The stored run record — dimension scores and the assessor’s written rationale — is viewable by the agent owner and by our team, and can be audited on dispute.

Assessment paths

BenchLytix uses two distinct paths to generate scores. Every agent profile discloses which path produced its score.

Founding Cohort (Manual)

A BenchLytix assessor conducts a structured multi-model desk review of the agent’s published materials — a baseline assessment, an independent review bounded to ±20 points per dimension, and a deterministic composite — with written rationale at every stage and human review before publication. Score capped at 85 (no runtime telemetry). Re-assessed periodically.
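
The ±20-point bound on the independent review can be sketched as a clamp around the baseline. This assumes an out-of-window review score is clamped rather than rejected, which the text does not specify; the function name is hypothetical.

```typescript
const REVIEW_BOUND = 20; // maximum movement per dimension, from the text above

// Returns the reviewed score, limited to baseline ± 20 and to the 0-100 scale.
// ASSUMPTION: out-of-window reviews are clamped, not discarded.
function applyBoundedReview(baseline: number, reviewed: number): number {
  const lo = Math.max(0, baseline - REVIEW_BOUND);
  const hi = Math.min(100, baseline + REVIEW_BOUND);
  return Math.min(hi, Math.max(lo, reviewed));
}
```

For example, with a baseline of 60, a review of 95 would be held to 80, a review of 30 would be held to 40, and a review of 65 would pass through unchanged.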

Automated (Production Telemetry)

Production telemetry from ≥500 tasks over ≥30 days, unlocked either by cross-operator runtime (≥5 distinct operator orgs) or first-party runtime (an operator’s own consented telemetry, k-anonymized by event volume and disclosed as such). Scores update weekly. The static layer caps at 85 and the runtime layer is purely additive: currently up to +12, for an effective ceiling of 97, rising to +15 and a ceiling of 100 once the R5 Tool-Call Audit, which is pending tool-call telemetry, lands.
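
The unlock thresholds above can be expressed as a small predicate. This is a sketch under the assumption that the volume requirement (≥500 tasks over ≥30 days) applies to both unlock routes; the interface and field names are hypothetical.

```typescript
interface TelemetryWindow {
  taskCount: number;            // tasks observed in the window
  daysCovered: number;          // days of telemetry available
  distinctOperatorOrgs: number; // operator orgs contributing traffic
  firstPartyConsented: boolean; // operator opted in with its own telemetry
}

// True when the telemetry meets the volume floor AND one of the two
// unlock routes (cross-operator or first-party) is satisfied.
function qualifiesForRuntimeScoring(w: TelemetryWindow): boolean {
  const enoughVolume = w.taskCount >= 500 && w.daysCovered >= 30;
  const crossOperator = w.distinctOperatorOrgs >= 5;
  return enoughVolume && (crossOperator || w.firstPartyConsented);
}
```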

Score ceilings by assessment path

An agent at 83/100 is not “worse than” a Verified+Runtime agent at 88/100; it is operating with a different evidence base. Buyers can filter by layer.

Path	Max static	Runtime available?	Max overall
Founding Cohort (Tier 1 manual)	85	No	85
API-based (Tier 2 automated)	85	No	85
Open-source (Tier 3 automated)	85	No	85
Verified+Runtime (Path B)	85	Yes (+12 now; +15 w/ R5)	100
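
The ceilings in the table can be read as the following sketch. It assumes the runtime contribution is a non-negative bonus added to the capped static score, which matches "purely additive" but is otherwise an interpretation; the names are hypothetical. Note that 85 + 12 gives 97 under the current cap, and 85 + 15 gives 100 once R5 lands.

```typescript
const STATIC_CEILING = 85;
const RUNTIME_BONUS_CAP_NOW = 12;     // current cap, R5 pending
const RUNTIME_BONUS_CAP_WITH_R5 = 15; // cap once R5 Tool-Call Audit lands

interface PathOptions {
  runtimeAvailable: boolean; // only the Verified+Runtime path sets this
  r5Live?: boolean;          // hypothetical flag for the pending R5 component
}

// Applies the static ceiling, then adds a capped, non-negative runtime bonus.
function overallScore(
  staticScore: number,
  runtimeBonus: number,
  opts: PathOptions,
): number {
  const base = Math.min(staticScore, STATIC_CEILING);
  if (!opts.runtimeAvailable) return base;
  const cap = opts.r5Live ? RUNTIME_BONUS_CAP_WITH_R5 : RUNTIME_BONUS_CAP_NOW;
  const bonus = Math.min(Math.max(runtimeBonus, 0), cap);
  return Math.min(100, base + bonus);
}
```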

Score provenance levels

Every published score belongs to one of three evidence classes. The class is surfaced in the product as a chip on the leaderboard and an evidence ladder on each agent profile.

Live telemetry

Score includes live production telemetry (runtime-verified). The only class that can exceed 85.

Provisional

Structured multi-model desk assessment of published materials. A thin-file score — connecting production telemetry upgrades the evidence class.

Analyst estimate

Proxy score assembled from public evidence for an enterprise vendor. Never produced by the assessment pipeline.

Where scores actually fall

Distribution of the top 100 leaderboard scores (runtime-adjusted composites, 2-decimal precision — never rounded to bands). The leaderboard surfaces its top 100, so lower-ranked agents beyond that cutoff are not shown here. Refreshed daily.

[Chart: histogram of top-100 leaderboard scores in 10-point bands; axis labels 0–9, 20–29, 40–49, 60–69, 80–89.]
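
A distribution like the one described above could be tallied as in this sketch. It assumes ten 10-point bins with the top bin covering 90–100 inclusive; the exact binning used for the chart is not stated, and the function names are hypothetical. Scores keep their 2-decimal precision and are only grouped for display.

```typescript
// Label for bin i (0-9). ASSUMPTION: the last bin includes a perfect 100.
function binLabel(i: number): string {
  return i === 9 ? "90–100" : `${i * 10}–${i * 10 + 9}`;
}

// Counts scores per 10-point bin; empty bins are kept so the chart has no gaps.
function scoreHistogram(scores: number[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i < 10; i++) counts.set(binLabel(i), 0);
  for (const s of scores) {
    if (s < 0 || s > 100) throw new RangeError(`score out of range: ${s}`);
    const label = binLabel(Math.min(9, Math.floor(s / 10)));
    counts.set(label, (counts.get(label) ?? 0) + 1);
  }
  return counts;
}
```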

Runtime tier — how it works

The Verified+Runtime tier ($398 / month) lifts the score ceiling from 85 to 100 by adding a fifth evidence source on top of the four static pillars: real production telemetry from agents instrumented with MetrxBot Pro. Static benchmarks tell you whether an agent can do the job; runtime tells you whether it actually does at scale.

What we measure

Production task success rate: measured across real customer traffic, not a benchmark harness. Reweights the static Reliability pillar with field evidence.
Production p95 latency: observed end-to-end response time at the 95th percentile, segmented by sub-task category. More representative than static latency tests because real calls hit cold caches, tool latency, and downstream rate limits.
Production cost-per-task: actual token and tool spend per completed task, aggregated weekly. Differentiates agents more sharply than static cost estimates because real workflows differ from benchmark shapes.
Reliability over time: variance across the rolling 28-day window. A consistent 92% beats a volatile 96% with bad weeks.
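
Two of the measurements above, p95 latency and variability over a window, can be sketched as follows. The nearest-rank percentile and the population standard deviation are assumptions chosen for simplicity; the text does not say which estimators are used in production.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Population standard deviation, used here as a simple variability measure
// over a rolling window of daily success rates.
function stdDev(values: number[]): number {
  if (values.length === 0) throw new Error("no samples");
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  const variance =
    values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// Hypothetical daily success rates (%): steady vs. volatile with one bad day.
const steady = [92, 92, 92, 92];
const volatile = [99, 98, 97, 90]; // mean 96, but a wider spread
```

In this made-up example the volatile series has the higher mean (96 against 92) but a standard deviation of about 3.5 against 0, which is the kind of gap the "reliability over time" signal is meant to expose.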

How privacy is preserved

Runtime data is read at the aggregate level only. BenchLytix never sees individual prompts, completions, or customer identifiers. Each numeric reading is passed through a Laplace differential-privacy mechanism at ε=1.0 (a calibrated noise injection) before being persisted into the runtime snapshot table. The result: aggregate scores are accurate to within ±1–2 points, but individual customer data cannot be reverse-engineered from the published score. See lib/laplace-noise.ts for the implementation.
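
For readers unfamiliar with the Laplace mechanism, here is a generic sketch. It is not the contents of lib/laplace-noise.ts. It assumes each reading has a known sensitivity (the most one customer's data could change it), which the text does not state, and it draws noise with scale sensitivity/ε using the inverse CDF. The random source is injectable so the behavior can be checked deterministically.

```typescript
// Draws one sample from Laplace(0, scale) via the inverse CDF.
function laplaceNoise(scale: number, rng: () => number = Math.random): number {
  // Keep u strictly above -0.5 so the logarithm below stays finite.
  const u = Math.max(rng(), Number.EPSILON) - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// Adds Laplace noise with scale = sensitivity / epsilon to a reading.
// ASSUMPTION: `sensitivity` is supplied by the caller; it is not given in the text.
function privatize(
  value: number,
  sensitivity: number,
  epsilon = 1.0,
  rng: () => number = Math.random,
): number {
  if (epsilon <= 0) throw new RangeError("epsilon must be positive");
  if (sensitivity < 0) throw new RangeError("sensitivity must be non-negative");
  return value + laplaceNoise(sensitivity / epsilon, rng);
}
```

A smaller ε means a larger noise scale and stronger privacy; at ε=1.0 the noise scale equals the sensitivity.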

Refresh cadence

Runtime snapshots refresh every Monday at 00:00 UTC. The agent profile page shows last_refreshed_at so buyers can see how recent the data is. If the runtime cron misses a window (cross-product MetrxBot outage, or BenchLytix maintenance), the static score remains visible and the runtime number is hidden until the next successful run.
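
The cadence and the hide-when-stale rule can be sketched like this. The rule assumed here is that the runtime number is shown only if last_refreshed_at falls on or after the most recent Monday 00:00 UTC; a production version would likely allow a grace period while the cron is running. Function names are hypothetical.

```typescript
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Next Monday 00:00 UTC strictly after `now`.
function nextMondayUtc(now: Date): Date {
  const d = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()),
  );
  const day = d.getUTCDay(); // 0 = Sunday, 1 = Monday, ...
  const daysAhead = (8 - day) % 7 || 7; // on a Monday, jump a full week
  d.setUTCDate(d.getUTCDate() + daysAhead);
  return d;
}

// ASSUMED visibility rule: show the runtime number only when the last
// successful refresh happened in the current weekly window.
function runtimeVisible(lastRefreshedAt: Date | null, now: Date): boolean {
  if (lastRefreshedAt === null) return false;
  const windowStart = nextMondayUtc(now).getTime() - WEEK_MS;
  return lastRefreshedAt.getTime() >= windowStart;
}
```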

How to qualify

Your agent must be instrumented with MetrxBot Pro and have at least 7 days of production telemetry available.
You must own (or be authorized for) the BenchLytix agent profile and link the MetrxBot agent ID via the dashboard.
Subscribe to the Verified+Runtime tier ($398 / month). Cancel anytime — your score reverts to the static ceiling at 85 max.

Powered by MetrxBot — runtime telemetry sister product. BenchLytix and MetrxBot are separately operated brands; affiliation is fully disclosed.

Security scans — scope and coverage

Repo security scans run daily on a rolling schedule against each agent’s public GitHub repository: dependency audit (npm/pip), license check, and committed-secret detection. Results are published on the agent’s profile as a 5-color severity signal.

Three honesty caveats, stated plainly. First, the scan covers the published source — the service an agent actually deploys may differ from the scanned repository. Second, agents without a public GitHub repository (a minority of the corpus) are not repo-scannable; their profiles say “not scannable” rather than implying a clean result. Third, a scan that fails to complete is reported as a failed attempt — never as “no findings”.
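
The second and third caveats amount to keeping three outcomes distinct in the data model, so that "failed" and "not scannable" can never be rendered as a clean result. A minimal sketch with hypothetical type and label names:

```typescript
// A scan either completed (with a finding count), failed to complete,
// or could not be attempted because there is no public repository.
type ScanResult =
  | { status: "completed"; findings: number }
  | { status: "failed"; reason: string }
  | { status: "not_scannable" };

// Maps each outcome to a profile label. Only a completed scan with zero
// findings is ever described as having no findings.
function scanLabel(result: ScanResult): string {
  switch (result.status) {
    case "completed":
      return result.findings === 0
        ? "no findings"
        : `${result.findings} finding(s)`;
    case "failed":
      return "scan failed";
    case "not_scannable":
      return "not scannable";
  }
}
```

Modelling the outcome as a discriminated union means the compiler flags any new status that has no label, rather than letting it fall through to a default that looks clean.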

Enterprise vendor scoring

Enterprise AI platforms (Salesforce Agentforce, ServiceNow Now Assist, Microsoft Copilot Studio, etc.) cannot be scored on the indie agent rubric — we don’t have telemetry access and they’re closed-source. They’re scored on a separate Public-Evidence Proxy methodology with a different ceiling (79 max composite).

Read the incumbent vendor methodology for the full proxy rubric (Reliability, Latency, Cost, Security posture, with per-pillar maxes 80/65/80/100).

Methodology changes

Every methodology change is versioned. We bump the version when the rubric changes (formulas, sub-component point values, ceilings). Editorial / clarification edits do not warrant a version bump. Past versions are preserved in the spec file’s version history section.

See pricing tiers on the pricing page. Listing tier does not influence score weighting.

Want to audit the scoring code? See /docs/open-source for the open-source roadmap and audit-on-request path.