Open source

BenchLytix is committed to open-sourcing the scoring stack so any vendor or buyer can independently audit how a score is computed. Methodology is public today (indie v2.4.0 and enterprise v1.1). The benchmark suite is already live under Apache 2.0. The TypeScript SDK, Python SDK, and MCP server are code-complete with passing tests and a brand-separation regression gate; they ship under MIT once the pre-publish review pass clears (see operator runbook). The canonical incumbent scorer remains audit-on-request until it moves to a public repo.

Why open-source the scorer

Two reasons. First, methodology drift between the spec and the implementation has been a real risk during early development — published code makes drift externally observable, not just internally. Second, vendors disputing a score should be able to re-run the scorer against their own evidence and reproduce the result before opening a takedown thread. Open-source code closes that loop.

Repos

All licenses are permissive so vendors and integrators can fork, modify, and redistribute without legal friction. The benchmark-suite shipped under Apache 2.0; the SDK family (TypeScript, Python, MCP server) ships under MIT to match the thin-API-client shape and minimize legal review for downstream users.

benchmark-suite Published

The Tier 2 API benchmarking harness — test cases, judge rubrics, runner, and supporting scripts. Public today; clone it, run it against any agent with a public API, and reproduce a leaderboard score from public inputs. A hedged sketch of a local run follows the contents list.

Repository: github.com/ckpark123/benchmark-suite
License: Apache 2.0

Contents:

  • cases/ — task-specific test cases
  • rubrics/ — judge rubrics by dimension
  • src/ — runner + scoring code
  • scripts/ — helpers for local execution
  • README.md — install + run instructions
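
For orientation, here is a hedged sketch of what a local run reduces to. The names below (TestCase, RubricRow, judge) are illustrative stand-ins, not the repo's actual exports; follow README.md for the real entry points.

  // Illustrative harness shape only: real cases live in cases/, rubrics in
  // rubrics/, and the runner in src/. All names here are hypothetical.
  interface TestCase { id: string; prompt: string; expected: string }
  interface RubricRow { dimension: string; maxPoints: number }

  // Stand-in judge: full points when the agent's answer contains the
  // expected string, else zero. Real rubrics are richer (see rubrics/).
  function judge(answer: string, testCase: TestCase, row: RubricRow): number {
    return answer.includes(testCase.expected) ? row.maxPoints : 0;
  }

  const testCase: TestCase = { id: "t1", prompt: "What is 2 + 2?", expected: "4" };
  const rubric: RubricRow[] = [{ dimension: "accuracy", maxPoints: 10 }];
  const agentAnswer = "4"; // the real runner fetches this from the agent's public API
  const total = rubric.reduce((sum, row) => sum + judge(agentAnswer, testCase, row), 0);
  console.log(total); // 10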

@benchlytix/sdk (TypeScript) Pre-publish review

Official TypeScript SDK for the BenchLytix Machine API. Code-complete with 17 passing tests + brand-separation regression test; npm publish pending the public-repo extraction (see operator runbook). A quickstart sketch follows the contents list.

Internal path (today): packages/sdk-ts/
License (on publish): MIT

Contents:

  • src/client.ts — BenchLytix class with leaderboard/agent/verifyStatus methods
  • src/types.ts — request/response interfaces matching the v1 API
  • src/errors.ts — typed error hierarchy (Auth/NotFound/RateLimit/Server)
  • README.md — install + 3-method quickstart
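
A minimal quickstart sketch. The class and method names (BenchLytix, leaderboard, agent, verifyStatus) and the error hierarchy are as documented above; the constructor options, parameters, and response fields are assumptions until the npm publish lands.

  // The import resolves only once @benchlytix/sdk is published to npm.
  import { BenchLytix, NotFoundError } from "@benchlytix/sdk";

  async function main() {
    // Assumed constructor shape; check the published README for the real one.
    const client = new BenchLytix({ apiKey: process.env.BENCHLYTIX_API_KEY });

    const board = await client.leaderboard();      // published rankings
    const agent = await client.agent("some-slug"); // one agent's profile

    try {
      const status = await client.verifyStatus("some-slug");
      console.log(board, agent, status);
    } catch (err) {
      // Typed errors per src/errors.ts (Auth/NotFound/RateLimit/Server).
      if (err instanceof NotFoundError) console.error("unknown agent slug");
      else throw err;
    }
  }

  main();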

benchlytix (Python) Pre-publish review

Official Python SDK for the BenchLytix Machine API. Sync + async clients, pydantic types, 18 passing tests including the brand-separation gate. PyPI publish pending the public-repo extraction. A quickstart sketch follows the contents list.

Internal path (today): packages/sdk-python/
License (on publish): MIT

Contents:

  • src/benchlytix/client.py — BenchLytix and AsyncBenchLytix classes
  • src/benchlytix/types.py — pydantic response models
  • src/benchlytix/errors.py — typed error hierarchy
  • README.md — install + sync/async quickstart
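
The Python surface mirrors the TypeScript one. A hedged quickstart sketch: the class names (BenchLytix, AsyncBenchLytix) are as documented above; the method names and constructor options are assumptions that mirror the TypeScript client until the PyPI publish lands.

  # Imports resolve only once benchlytix is published to PyPI.
  import asyncio
  import os

  from benchlytix import AsyncBenchLytix, BenchLytix

  def sync_example() -> None:
      # Assumed constructor shape; check the published README for the real one.
      client = BenchLytix(api_key=os.environ["BENCHLYTIX_API_KEY"])
      board = client.leaderboard()   # pydantic model per types.py
      agent = client.agent("some-slug")
      print(board, agent)

  async def async_example() -> None:
      client = AsyncBenchLytix(api_key=os.environ["BENCHLYTIX_API_KEY"])
      board = await client.leaderboard()
      print(board)

  if __name__ == "__main__":
      sync_example()
      asyncio.run(async_example())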

benchlytix-mcp-server Pre-publish review

Model Context Protocol server wrapping the public API. Exposes get_leaderboard, get_agent_score, verify_agent, get_categories, compare_agents, get_methodology as tools to any MCP-compatible client (Claude Desktop, Cursor, VS Code). 16 passing tests. A tool-definition sketch follows the contents list.

Internal path (today): packages/mcp-server/
License (on publish): MIT

Contents:

  • src/index.ts — MCP server entrypoint (stdio transport)
  • src/tools.ts — 6 tool handlers + JSON-schema definitions
  • README.md — install + Claude Desktop config example
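
For a feel of the tool surface, here is a hedged sketch of one entry in src/tools.ts. The tool name get_agent_score is from the list above; the description wording and the input-schema field (slug) are assumptions.

  // Hedged sketch: tool name per the list above; schema fields are assumed.
  const getAgentScore = {
    name: "get_agent_score",
    description: "Fetch the published BenchLytix score for one agent",
    inputSchema: {
      type: "object",
      properties: {
        slug: { type: "string", description: "Agent slug, as in /agents/<slug>" },
      },
      required: ["slug"],
    },
  };

  // The real handler wraps the public API; an MCP client such as Claude
  // Desktop calls the tool by name with { slug } as arguments.
  console.log(JSON.stringify(getAgentScore, null, 2));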

benchlytix-incumbent-scorer Audit on request

The canonical 4-pillar Public-Evidence Proxy scorer (v1.0) used to rank Salesforce Agentforce, ServiceNow Now Assist, Microsoft Copilot Studio, and the rest of the enterprise vendor cohort. Pure function — no I/O, no LLM calls, deterministic output. A sketch of the scorer's shape follows the contents list.

Internal path (today): lib/incumbent/
License (on publish): MIT

Contents:

  • canonical-scorer.ts — pure function applying the rubric
  • types.ts — VendorEvidence shape (one field per addendum sub-component)
  • evidence/ — per-vendor evidence objects with citation URLs
  • README.md — boundary-case policies + calibration baseline
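
A hedged sketch of the scorer's shape as described above: a pure, deterministic function from evidence to pillar scores. The sub-component fields and the scoring rule below are illustrative stand-ins; the real one-field-per-sub-component shape lives in types.ts and the real rules in the published rubric.

  // Illustrative only: real sub-component fields live in types.ts, and each
  // evidence item carries a citation URL per evidence/.
  interface EvidenceItem { present: boolean; citationUrl?: string }
  interface VendorEvidence { [subComponent: string]: EvidenceItem }
  interface PillarScores { R: number; L: number; C: number; S: number }

  // Pure function: no I/O, no LLM calls; the same input always yields the
  // same output, which is why the code can be audited by reading it.
  function scorePillars(evidence: VendorEvidence): PillarScores {
    // Hypothetical rule: 10 points per present sub-component, grouped by a
    // made-up pillar prefix. The real scorer applies rubric rows + point caps.
    const points = (prefix: string): number =>
      Object.keys(evidence).filter(
        (key) => key.startsWith(prefix) && evidence[key].present
      ).length * 10;
    return { R: points("r_"), L: points("l_"), C: points("c_"), S: points("s_") };
  }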

How to audit the scorer today

The canonical incumbent scorer (v1.0) is pure-function and stable. Until the public repo is live, audit access is available on request:

  1. Email legal@benchlytix.com with the subject "Source audit request — incumbent scorer".
  2. Identify yourself (vendor name + your role, or buyer organization). The NDA is mutual and non-burdensome.
  3. We share the scorer source + per-vendor evidence files within 5 business days. The methodology page (/docs/incumbent-methodology) maps every line of code to a published rubric row, so review is a few hours of reading rather than reverse-engineering.

How to verify a score against your own data

Incumbent vendors in particular can replicate our scoring locally, even before the repo is public; a worked example of the composite arithmetic follows the steps below:

  1. Read the rubric at /docs/incumbent-methodology. Every sub-component, point cap, and pillar formula is listed with the exact wording used in the scorer's code comments.
  2. Score yourself against each sub-component using only the publicly available documentation we cite on your profile (link: /agents/<your-slug>, "Evidence" section).
  3. Apply the composite formula: ROUND((R + L + C + S) / 325 * 79), where R, L, C, and S are the four pillar totals. The result should match the published score within ±1 (rounding).
  4. If your manual reproduction differs from our published score by more than ±1, file a factual dispute (Section B of the vendor response template) — that's a real bug we want to fix.
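
The composite arithmetic from step 3, as a self-contained check; the formula and the ±1 tolerance come from the steps above, while the example pillar totals are made up.

  // Step 3's formula: ROUND((R + L + C + S) / 325 * 79).
  function composite(R: number, L: number, C: number, S: number): number {
    return Math.round(((R + L + C + S) / 325) * 79);
  }

  // Step 4's dispute threshold: a manual reproduction within ±1 matches.
  function matchesPublished(manual: number, published: number): boolean {
    return Math.abs(manual - published) <= 1;
  }

  // Made-up pillar totals: 80 + 70 + 90 + 60 = 300, and ROUND(300 / 325 * 79) = 73.
  console.log(composite(80, 70, 90, 60)); // 73
  console.log(matchesPublished(73, 74));  // true: within rounding
  console.log(matchesPublished(73, 76));  // false: file a factual dispute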

Publication timeline

We are not committing to a public repo date because the extraction work is not the bottleneck; the bottleneck is the review pass that ensures no internal references (cron URLs, operator email addresses, internal ticket links) leak into the OSS bundle. Repos will be published when that review is complete. Status and progress are tracked on this page; the next update lands when the first repo (incumbent-scorer) ships.

Public API alongside

The public API is a separate, parallel transparency lever — a verified buyer can fetch any agent's published score without running our code. The API and the open-source scorer serve different audiences (integrators vs auditors) and ship on independent timelines.
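
For contrast with the SDKs above, a hedged sketch of that lever. The endpoint path, auth header, and response field below are hypothetical illustrations, not the documented routes; the Machine API v1 docs are authoritative, and the SDKs wrap whatever they specify.

  // Hypothetical route and field names: illustrative only.
  async function fetchPublishedScore(slug: string): Promise<number> {
    const res = await fetch(`https://api.benchlytix.com/v1/agents/${slug}`, {
      headers: { Authorization: `Bearer ${process.env.BENCHLYTIX_API_KEY}` },
    });
    if (!res.ok) throw new Error(`API error: ${res.status}`);
    const body = await res.json();
    return body.score; // assumed field name
  }

  fetchPublishedScore("some-slug").then(console.log);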