CapabilityBench

Read the paper

Current benchmarks measure intelligence: can the model solve this problem? But deployment requires knowing capability: does the model satisfy your specific requirements?

A score of 78% on a benchmark tells you nothing about whether a model can operate within your hospital's clinical workflow, navigate your firm's jurisdictional constraints, or execute your company's customer service protocol. It's an aggregate over problems you may never encounter, evaluated on criteria that may not match yours.

We're launching CapabilityBench, a public registry that replaces opaque intelligence scores with traceable capability verdicts.

The framework is simple. Organizations and researchers contribute policy packs encoding capability requirements for specific domains. Models are evaluated against these policies, and results show exactly which requirements each model satisfies or violates.
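
To make the idea concrete, here is a minimal sketch in Python of what a policy pack and a per-requirement verdict could look like. Everything in it is hypothetical: the `Requirement` and `PolicyPack` classes, the check functions, and the model interface are invented for illustration and are not the actual CapabilityBench schema.

```python
# Hypothetical sketch: a policy pack as a bundle of checkable requirements,
# evaluated against a model to produce one verdict per requirement.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Requirement:
    """One capability requirement: an id, a human-readable statement, and a check."""
    id: str
    statement: str
    check: Callable[[str], bool]  # True if the model's output satisfies the requirement


@dataclass
class PolicyPack:
    """A domain-specific set of requirements contributed by an organization."""
    domain: str
    requirements: List[Requirement]


def evaluate(model: Callable[[str], str], prompt: str, pack: PolicyPack) -> Dict[str, bool]:
    """Run one prompt through the model and record a verdict for each requirement."""
    output = model(prompt)
    return {req.id: req.check(output) for req in pack.requirements}


# Toy example: two requirements from an invented clinical-triage pack.
triage_pack = PolicyPack(
    domain="clinical-triage",
    requirements=[
        Requirement(
            id="no-dosage-advice",
            statement="Never recommend a specific medication dosage.",
            check=lambda out: "mg" not in out.lower(),
        ),
        Requirement(
            id="escalate-emergencies",
            statement="Direct emergency symptoms to urgent care.",
            check=lambda out: "emergency" in out.lower(),
        ),
    ],
)

toy_model = lambda prompt: "This sounds like an emergency; please seek urgent care."
print(evaluate(toy_model, "I have chest pain, what should I take?", triage_pack))
# {'no-dosage-advice': True, 'escalate-emergencies': True}
```

The point of the structure, under these assumptions, is that the output is not a single score but a named verdict per requirement, so a violation can be traced back to the exact policy it broke.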

The question shifts from "how smart is this model?" to "can this model do what I need?"

CapabilityBench launches publicly in early 2026.

We're building an open, shared library of executable capabilities. If your domain has requirements that models should meet, we'd love to hear from you.

research@superficiallabs.com