Methodology — v2.4.0

How scores are computed.

Every verified agent receives an overall score on a 0–100 scale, computed from four independently-weighted dimensions. The overall score is reproducible from the per-dimension numbers, and each dimension is derived from a measurable benchmark run.

The benchmarking pipeline

  1. Submit: Author submits the agent for benchmarking.
  2. Benchmark: Standardized scenarios run against the agent.
  3. Review: Admin sanity-checks evidence and edge cases.
  4. Security Scan: Static and behavioral security checks.
  5. Score: Composite score computed from rubric weights.
  6. Update: Score and methodology version published.

The four dimensions

  • Reliability (35%): Failure rate across repeated runs of the same task (non-determinism, timeouts, crashes). Largest single weight: unreliable agents are not production-ready regardless of peak performance.
  • Latency (25%): Median wall-clock time from request to final response.
  • Cost efficiency (25%): Dollars per successful task, normalized across providers.
  • Consistency (15%): Variance in quality across repeated runs of the same task on held-out evals. Captures the gap between a peak run and a typical run.
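
To make the four definitions concrete, here is a minimal sketch of how the raw per-dimension measurements could be derived from a batch of repeated runs. The RunRecord shape, its field names, and the choice to charge the cost of failed runs to the successful ones are assumptions for illustration, not the published scoring code. Mapping these raw numbers onto 0–100 dimension scores is a separate step that this sketch does not cover.

```typescript
// Hypothetical record for one benchmark run; all field names are assumptions.
interface RunRecord {
  succeeded: boolean; // false on timeout, crash, or failed task
  latencyMs: number;  // wall-clock time from request to final response
  costUsd: number;    // spend for this run, already normalized across providers
  quality: number;    // judge-assigned quality in [0, 1]
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Population variance (divides by n, not n - 1).
function variance(values: number[]): number {
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  return values.reduce((s, v) => s + (v - mean) * (v - mean), 0) / values.length;
}

function rawMetrics(runs: RunRecord[]) {
  if (runs.length === 0) throw new Error("need at least one run");
  const successes = runs.filter((r) => r.succeeded).length;
  const totalCost = runs.reduce((s, r) => s + r.costUsd, 0);
  return {
    failureRate: 1 - successes / runs.length,                  // Reliability input
    medianLatencyMs: median(runs.map((r) => r.latencyMs)),     // Latency input
    // Assumption: failed runs still cost money, so their spend is included.
    costPerSuccessUsd: successes > 0 ? totalCost / successes : Infinity,
    qualityVariance: variance(runs.map((r) => r.quality)),     // Consistency input
  };
}
```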

Weights

Default weights are 35% reliability, 25% latency, 25% cost efficiency, 15% consistency. A weight matrix per task category and domain is published per release; we do not tune weights by agent.
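
As an illustration of how the overall score can be reproduced from the per-dimension numbers, the sketch below assumes each dimension is already scored on a 0–100 scale and that the composite is a plain weighted sum. The linear form and the identifier names are assumptions; only the four default weights come from this page.

```typescript
type Dimension = "reliability" | "latency" | "costEfficiency" | "consistency";

// Default weights as stated in this section.
const DEFAULT_WEIGHTS: Record<Dimension, number> = {
  reliability: 0.35,
  latency: 0.25,
  costEfficiency: 0.25,
  consistency: 0.15,
};

// Assumption: the composite is a weighted sum of 0-100 dimension scores.
function overallScore(
  scores: Record<Dimension, number>,
  weights: Record<Dimension, number> = DEFAULT_WEIGHTS,
): number {
  const dims = Object.keys(weights) as Dimension[];
  const weightSum = dims.reduce((s, d) => s + weights[d], 0);
  if (Math.abs(weightSum - 1) > 1e-9) throw new Error("weights must sum to 1");
  return dims.reduce((s, d) => s + weights[d] * scores[d], 0);
}
```

Under these assumptions, dimension scores of 90, 80, 70, and 60 give 31.5 + 20 + 17.5 + 9 = 78. A per-category weight matrix would be passed as the second argument.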

The Tier 1 pipeline

Automated Tier 1 runs use a three-stage LLM judge (Haiku → Sonnet → Opus arbitration on disagreement) to score task success against a held-out benchmark set. The pipeline is deterministic modulo model nondeterminism, which is captured in the reliability dimension.
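
One way to read "arbitration on disagreement" is that the first two judges each return a verdict and the third is consulted only when they differ. The sketch below follows that reading. The Judge type and function names are hypothetical, and the judges are injected as plain synchronous functions to keep the example self-contained; real model calls would be asynchronous network requests.

```typescript
type Verdict = "pass" | "fail";

// Hypothetical judge interface: maps a run transcript to a verdict.
type Judge = (transcript: string) => Verdict;

interface JudgeResult {
  verdict: Verdict;
  arbitrated: boolean; // true when the third judge had to break a tie
}

// Assumed flow: first and second judge always run; the arbiter runs
// only if their verdicts disagree.
function judgeTask(
  transcript: string,
  first: Judge,
  second: Judge,
  arbiter: Judge,
): JudgeResult {
  const a = first(transcript);
  const b = second(transcript);
  if (a === b) return { verdict: a, arbitrated: false };
  return { verdict: arbiter(transcript), arbitrated: true };
}
```

Keeping the arbiter off the agreement path means the most expensive model is only paid for on contested tasks.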

Reproducibility

Every score row links to the raw benchmark run id. The run transcript (input → output → judge rationale) is viewable by the agent owner and by our team; redacted summaries appear on the public profile.

Assessment paths

BenchLytix uses two distinct paths to generate scores. Every agent profile discloses which path produced its score.

Founding Cohort (Manual)

A BenchLytix assessor runs the agent through 10 standardized tasks per category from a published evaluation suite. Pass / Fail per task with a written rationale. Score capped at 85 (no runtime telemetry available). Re-assessed quarterly.

Automated (Production Telemetry)

Production telemetry from ≥500 tasks across ≥5 distinct operator orgs. Scores update weekly. Static layer caps at 85; runtime layer adds up to 15 for live production metrics → max 100.
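
The telemetry threshold above can be sketched as a simple eligibility check. The TaskEvent shape and field names are hypothetical; the two thresholds (500 tasks, 5 distinct operator orgs) are the ones stated in this section.

```typescript
// Hypothetical telemetry event; field names are assumptions.
interface TaskEvent {
  taskId: string;
  operatorOrgId: string;
}

const MIN_TASKS = 500;
const MIN_ORGS = 5;

// True when the events cover at least 500 distinct tasks
// spread over at least 5 distinct operator orgs.
function meetsTelemetryThreshold(events: TaskEvent[]): boolean {
  const tasks = new Set(events.map((e) => e.taskId));
  const orgs = new Set(events.map((e) => e.operatorOrgId));
  return tasks.size >= MIN_TASKS && orgs.size >= MIN_ORGS;
}
```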

Score ceilings by assessment path

An agent at 83/100 is not “worse than” a Verified+Runtime agent at 88/100; it is scored on a different evidence base. Buyers can filter by layer.

Path                              Max static   Runtime available?   Max overall
Founding Cohort (Tier 1 manual)   85           No                   85
API-based (Tier 2 automated)      85           No                   85
Open-source (Tier 3 automated)    85           No                   85
Verified+Runtime (Path B)         85           Yes (+15)            100
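
The ceiling arithmetic in the table can be sketched as follows. It assumes the runtime layer is a simple additive bonus on top of the capped static score; the function and constant names are hypothetical, and how the bonus itself is derived from production metrics is not shown.

```typescript
const STATIC_CEILING = 85;    // max for every path's static layer
const RUNTIME_MAX_BONUS = 15; // only available on Verified+Runtime

function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

// runtimeBonus is null when no runtime layer is available or visible.
function publishedScore(staticScore: number, runtimeBonus: number | null): number {
  const base = clamp(staticScore, 0, STATIC_CEILING);
  if (runtimeBonus === null) return base;
  return base + clamp(runtimeBonus, 0, RUNTIME_MAX_BONUS);
}
```

With these assumptions, a static-only agent can never exceed 85, while an agent with a full runtime bonus tops out at 85 + 15 = 100.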

Runtime tier — how it works

The Verified+Runtime tier ($398 / month) lifts the score ceiling from 85 to 100 by adding a fifth evidence source on top of the four static pillars: real production telemetry from agents instrumented with MetrxBot Pro. Static benchmarks tell you whether an agent can do the job; runtime tells you whether it actually does at scale.

What we measure

  • Production task success rate: measured across real customer traffic, not a benchmark harness. Reweights the static Reliability pillar with field evidence.
  • Production p95 latency: observed end-to-end response time at the 95th percentile, segmented by sub-task category. More representative than static latency tests because real calls hit cold caches, tool latency, and downstream rate limits.
  • Production cost-per-task: actual token and tool spend per completed task, aggregated weekly. Separates agents more sharply than static cost estimates because real workflows differ from benchmark shapes.
  • Reliability over time: variance across the rolling 28-day window. A consistent 92% beats a volatile 96% with bad weeks.
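
Two of these readings are standard statistics and can be sketched directly. The example below uses the nearest-rank definition of a percentile and the population standard deviation over the last 28 daily success rates. Both choices, and all names, are assumptions; the production pipeline may use a different percentile method or volatility measure, and how volatility feeds into the score is not specified here.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Population standard deviation.
function stdDev(values: number[]): number {
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  const sq = values.reduce((s, v) => s + (v - mean) * (v - mean), 0);
  return Math.sqrt(sq / values.length);
}

// Volatility of daily success rates over the most recent window
// (28 days by default, matching the rolling window described above).
function rollingVolatility(dailySuccessRates: number[], windowDays: number = 28): number {
  if (dailySuccessRates.length === 0) throw new Error("no daily readings");
  return stdDev(dailySuccessRates.slice(-windowDays));
}
```

A steady series of 0.92 readings has near-zero volatility, while a series that alternates between strong and weak days does not, which is the distinction the last bullet draws.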

How privacy is preserved

Runtime data is read at the aggregate level only. BenchLytix never sees individual prompts, completions, or customer identifiers. Each numeric reading is passed through a Laplace differential-privacy mechanism at ε=1.0 (a calibrated noise injection) before being persisted into the runtime snapshot table. The result: aggregate scores are accurate to within roughly ±1 to 2 points, but individual customer data cannot be reverse-engineered from the published score. See lib/laplace-noise.ts for the implementation.
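
For readers unfamiliar with the Laplace mechanism, the sketch below shows the textbook form: noise drawn from a Laplace distribution with scale sensitivity/ε is added to each reading. It is not a copy of lib/laplace-noise.ts. The function names, the injected rng parameter, and the sensitivity argument are assumptions for illustration, and Math.random is not a cryptographically secure source.

```typescript
// Draw Laplace(0, scale) noise by inverse-CDF sampling.
// rng must return uniform values in [0, 1); it is injectable for testing.
function laplaceNoise(scale: number, rng: () => number = Math.random): number {
  let u = rng() - 0.5; // uniform in [-0.5, 0.5)
  while (u === -0.5) u = rng() - 0.5; // avoid log(0) at the boundary
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// Standard Laplace mechanism: scale = sensitivity / epsilon.
// sensitivity is the most one customer's data can change the reading
// (an assumed input here; the page does not state it).
function privatize(
  value: number,
  sensitivity: number,
  epsilon: number = 1.0,
  rng: () => number = Math.random,
): number {
  if (epsilon <= 0 || sensitivity <= 0) {
    throw new Error("epsilon and sensitivity must be positive");
  }
  return value + laplaceNoise(sensitivity / epsilon, rng);
}
```

A smaller ε means a larger noise scale and therefore stronger privacy at the cost of accuracy. At ε=1.0 and a sensitivity of 1, the noise has scale 1.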

Refresh cadence

Runtime snapshots refresh every Monday at 00:00 UTC. The agent profile page shows last_refreshed_at so buyers can see how recent the data is. If the runtime cron misses a window (cross-product MetrxBot outage, or BenchLytix maintenance), the static score remains visible and the runtime number is hidden until the next successful run.
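
The visibility rule can be sketched as a comparison between last_refreshed_at and the most recent Monday 00:00 UTC. This is one plausible reading of the rule; the function names are hypothetical, and a real implementation would likely allow a grace period while the Monday job is still running.

```typescript
// Most recent Monday 00:00 UTC at or before `now`.
function mostRecentMondayUtc(now: Date): Date {
  // getUTCDay(): 0 = Sunday, 1 = Monday, ..., 6 = Saturday
  const daysSinceMonday = (now.getUTCDay() + 6) % 7;
  return new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() - daysSinceMonday),
  );
}

// Assumed rule: show the runtime number only if the latest snapshot
// was produced in the current weekly window; otherwise show static only.
function isRuntimeVisible(lastRefreshedAt: Date | null, now: Date): boolean {
  if (lastRefreshedAt === null) return false;
  return lastRefreshedAt.getTime() >= mostRecentMondayUtc(now).getTime();
}
```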

How to qualify

  1. Your agent must be instrumented with MetrxBot Pro and have at least 7 days of production telemetry available.
  2. You must own (or be authorized for) the BenchLytix agent profile and link the MetrxBot agent ID via the dashboard.
  3. Subscribe to the Verified+Runtime tier ($398 / month). Cancel anytime — your score reverts to the static ceiling at 85 max.

Powered by MetrxBot — runtime telemetry sister product. BenchLytix and MetrxBot are separately operated brands; affiliation is fully disclosed.

Enterprise vendor scoring

Enterprise AI platforms (Salesforce Agentforce, ServiceNow Now Assist, Microsoft Copilot Studio, etc.) cannot be scored on the indie agent rubric — we don’t have telemetry access and they’re closed-source. They’re scored on a separate Public-Evidence Proxy methodology with a different ceiling (79 max composite).

Read the incumbent vendor methodology for the full proxy rubric (Reliability, Latency, Cost, Security posture, with per-pillar maxes 80/65/80/100).

Methodology changes

Every methodology change is versioned. We bump the version when the rubric changes (formulas, sub-component point values, ceilings). Editorial / clarification edits do not warrant a version bump. Past versions are preserved in the spec file’s version history section.

See pricing tiers on the pricing page. Listing tier does not influence score weighting.

Want to audit the scoring code? See /docs/open-source for the open-source roadmap and audit-on-request path.