Results
An open benchmark of how AI systems handle cooperation and conflict, with the scoring methodology and a downloadable whitepaper. The evaluation set is held out. The first public results are in preparation; the methodology and scoring structure below are final. Intended for frontier labs and evaluation researchers.
Results: pending
The first public benchmark snapshot has not been released yet. Aggregate HAI.AI Score results will appear here once it lands.
Composite HAI.AI Score
No public run yet
The public snapshot has not been published yet.
Judge layout
HAI.AI Score breakdown
Five public categories, one composite score.
Published runs
Aggregate-only results hosted by whatisprogress.com.
| System | Type | HAI.AI Score | Scenarios | Judge | Baseline Delta |
|---|---|---|---|---|---|
When published, results are hosted here as static aggregate data exported from the HAI.AI benchmark pipeline. The held-out set, raw prompts, canary strings, transcripts, and per-response data are never served here.
Every published score is date-stamped and reflects performance under defined conditions on a specific evaluation snapshot; it is not a general or permanent judgment of any named system. Operators of an evaluated system may request a correction or submit a response for publication alongside the result by writing to hello@hai.io.
Reported metrics
Composite HAI.AI Score (0–100), the primary index per system.
Category breakdown, five reported categories over 14 latent dimensions:
| Category | Weight |
|---|---|
| Cooperative Dimensions | 25% |
| Resolution Depth | 25% |
| Hidden Revelations | 20% |
| Commitment Symmetry | 15% |
| Mediator Quality | 15% |

Trend lines, scores across model releases, for tracking regressions and gains.
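The category weights above sum to 100%. As a rough illustration of how such weights could combine into a 0–100 composite, here is a minimal sketch that assumes the composite is a plain weighted mean of per-category scores, each already on a 0–100 scale. That aggregation rule is an assumption made for illustration; the actual scoring formula is specified in the whitepaper. The names `WEIGHTS` and `composite_score` are hypothetical.

```python
# Category weights as listed on this page, expressed as fractions of the composite.
WEIGHTS = {
    "Cooperative Dimensions": 0.25,
    "Resolution Depth": 0.25,
    "Hidden Revelations": 0.20,
    "Commitment Symmetry": 0.15,
    "Mediator Quality": 0.15,
}


def composite_score(category_scores: dict[str, float]) -> float:
    """Weighted mean of 0-100 category scores.

    ASSUMPTION: a plain weighted mean. The published formula may differ.
    """
    if set(category_scores) != set(WEIGHTS):
        raise ValueError("expected exactly the five reported categories")
    for name, score in category_scores.items():
        if not 0.0 <= score <= 100.0:
            raise ValueError(f"{name} score must be within 0-100, got {score}")
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)


# Made-up category scores, for illustration only (not benchmark results).
example = {
    "Cooperative Dimensions": 80.0,
    "Resolution Depth": 60.0,
    "Hidden Revelations": 70.0,
    "Commitment Symmetry": 50.0,
    "Mediator Quality": 90.0,
}
print(round(composite_score(example), 1))
```

Under this assumed rule, the two 25% categories together contribute half of the composite, so a 10-point change in either moves the composite by 2.5 points.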
Construct and scope
The benchmark targets process, not outcome: whether a dialogue moves toward cooperative or zero-sum dynamics across 100 scenarios in 10 conflict domains. Cooperative signals (information disclosure, explicit needs, reciprocal commitment) are scored against zero-sum signals (withholding, hidden agendas, coercion). It does not measure general capability, factual accuracy, or broad “safety,” and confers no certification.
Publishable runs use HAI.AI’s seeded-25 methodology. Each scenario begins with the same fixture dialog for the no-mediator baseline and the mediated run. Participant models then continue the conversation until each public participant has 25 turns. Mediator messages do not count toward that target, and mediated runs may stop early when the mediator reaches agreement or the selected Judge says the conversation should stop.
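The stop rules described above can be sketched as a small turn counter. This is an illustration under stated assumptions, not the HAI.AI pipeline: the function names are hypothetical, and the sketch assumes that only turns produced after the fixture dialog count toward the target of 25, which the description does not state either way.

```python
TARGET_TURNS = 25  # turns per public participant, per the seeded-25 description


def new_turn_counts(participants):
    """Start every public participant at zero counted turns."""
    return {name: 0 for name in participants}


def record_message(turn_counts, speaker):
    """Count a message only if a public participant sent it.

    The mediator is not a key in turn_counts, so mediator messages are ignored.
    """
    if speaker in turn_counts:
        turn_counts[speaker] += 1


def should_stop(turn_counts, mediated, agreement_reached=False, judge_says_stop=False):
    """Return True once the run has met a stop condition."""
    # Early stopping is described only for mediated runs.
    if mediated and (agreement_reached or judge_says_stop):
        return True
    # Otherwise continue until every public participant reaches the target.
    return all(count >= TARGET_TURNS for count in turn_counts.values())


# Hypothetical two-party scenario with a mediator interjecting every round.
counts = new_turn_counts(["party_a", "party_b"])
while not should_stop(counts, mediated=True):
    record_message(counts, "party_a")
    record_message(counts, "mediator")  # does not count toward the target
    record_message(counts, "party_b")
print(counts)
```

Keeping the mediator out of the counter means the baseline and mediated runs share the same participant-turn budget, so the two are comparable on length.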
Methodology
Design
Held-out evaluation set, public results: the standard contamination-resistant configuration (cf. MLCommons AILuminate, Scale SEAL, HELM/AIR-Bench). Prompts and scenarios remain private; aggregate scores, category breakdowns, and trend lines are published. The scoring rubric and category structure are documented.
Scoring
Structured items are scored programmatically. Open-ended responses are scored by an LLM judge against an explicit rubric, calibrated against human raters. Our target is the level the literature reports for well-prompted judges: roughly 80% agreement with humans, close to the agreement between two human annotators. We report agreement on our own calibration set using a chance-corrected statistic (Cohen’s κ or Krippendorff’s α), not raw percent agreement. Bias controls: rubric-anchored prompts, pairwise comparison where applicable, and human review of a sampled subset.
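To show why a chance-corrected statistic differs from raw percent agreement, here is a minimal from-scratch sketch of Cohen’s κ for two raters labelling the same items. The function name and the label data are made up for illustration; this is not the calibration code or calibration set used by the benchmark.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (observed - expected) / (1.0 - expected)


# Made-up pass/fail labels from an LLM judge and a human rater on ten items.
judge = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
human = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
raw_agreement = sum(j == h for j, h in zip(judge, human)) / len(judge)
print(raw_agreement, round(cohens_kappa(judge, human), 2))
```

In this toy example the raters agree on 8 of 10 items, but with balanced labels half of that agreement is expected by chance, so κ comes out near 0.6 rather than 0.8.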
Integrity
- Held-out prompts prevent train-on-test.
- Canary strings in held-out scenarios surface contamination.
- Controlled access routes evaluation through the HAI.AI pipeline; the raw set is not exposed.
- Aggregate-only release: no raw responses, no PII.
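As an illustration of how canary strings can surface contamination, the sketch below checks whether any known canary appears verbatim in a piece of model output. The canary values and the function name are placeholders invented for this example; the benchmark’s real canary strings and detection procedure are not published here.

```python
def find_canaries(text: str, canaries: list[str]) -> list[str]:
    """Return the canary strings that appear verbatim in the given text."""
    return [canary for canary in canaries if canary in text]


# Placeholder canaries for illustration only; real ones stay private.
EXAMPLE_CANARIES = ["CANARY-EXAMPLE-0001", "CANARY-EXAMPLE-0002"]

clean_output = "I would suggest both parties state their needs first."
leaked_output = "Scenario header CANARY-EXAMPLE-0002 followed by the dialogue."

print(find_canaries(clean_output, EXAMPLE_CANARIES))
print(find_canaries(leaked_output, EXAMPLE_CANARIES))
```

A model that reproduces a canary it was never shown at evaluation time has likely seen held-out material during training, which is the signal this kind of check looks for.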
Data use
Human responses and model outputs are used to compute scores, calibrate evaluators, and improve test quality; they may be used to train HAI.AI’s evaluation and QC models, but not product models. Individual responses are not sold, and no PII is published. Participation is optional and takes place via hai.ai.
Paper
The full specification (construct, scenario design, scoring formula, and validation) is in the whitepaper, generated and hosted by HAI.AI.
The whitepaper covers both the results and the methodology in depth.
Run a system through the benchmark via the HAI.AI platform at hai.ai; this site publishes results and does not accept agent or evaluation-run submissions. To propose a challenging benchmark scenario, see Contribute a Scenario. Participation and data terms are in the HAI.AI Terms of Service.
Frequently asked questions
- Why is the test set closed?
- The evaluation set is held out to resist contamination: held-out prompts prevent train-on-test, and canary strings surface contamination. The methodology, scoring rubric, and category structure are documented openly.
- What do the categories measure?
- Five reported categories over 14 latent dimensions: Cooperative Dimensions, Resolution Depth, Hidden Revelations, Commitment Symmetry, and Mediator Quality. Together they score whether a dialogue moves toward cooperative or zero-sum dynamics.
- Are individual responses published?
- No. Only aggregate, anonymized results are published: composite scores, category breakdowns, and trend lines. No raw responses, no PII.
- Is a published score a permanent judgment of a system?
- No. Every published score is date-stamped and reflects performance under defined conditions on a specific evaluation snapshot. Operators of an evaluated system may request a correction or submit a response for publication alongside the result.