An open benchmark of how AI systems handle cooperation and conflict, with the scoring methodology and a downloadable whitepaper. The evaluation set is held out. The first public results are in preparation; the methodology and scoring structure below are final. Intended for frontier labs and evaluation researchers.

Results

Results: pending

The first public benchmark snapshot has not been released yet. Aggregate HAI.AI Score results will appear here once it lands.

When published, results are hosted here as static aggregate data exported from the HAI.AI benchmark pipeline. The held-out set, raw prompts, canary strings, transcripts, and per-response data are never served here.

Every published score is date-stamped and reflects performance under defined conditions on a specific evaluation snapshot; it is not a general or permanent judgment of any named system. Operators of an evaluated system may request a correction or submit a response for publication alongside the result by writing to hello@hai.io.

Reported metrics

  • Composite HAI.AI Score (0–100), the primary index per system.

  • Category breakdown, five reported categories over 14 latent dimensions:

    Category                  Weight
    Cooperative Dimensions    25%
    Resolution Depth          25%
    Hidden Revelations        20%
    Commitment Symmetry       15%
    Mediator Quality          15%
  • Trend lines, scores across model releases, for tracking regressions and gains.
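As a rough illustration of how the category weights could roll up into the composite, here is a minimal sketch. It assumes the composite is a plain weighted mean of per-category scores, each on a 0–100 scale; the actual scoring formula is specified in the whitepaper and may differ. The function name and the example scores are hypothetical.

```python
# Category weights as listed in the table above (percent of the composite).
WEIGHTS = {
    "Cooperative Dimensions": 25,
    "Resolution Depth": 25,
    "Hidden Revelations": 20,
    "Commitment Symmetry": 15,
    "Mediator Quality": 15,
}


def composite_score(category_scores):
    """Weighted mean of per-category scores.

    Assumption: each category score is already on a 0-100 scale and the
    composite is a simple weighted average. This is an illustration, not
    the published HAI.AI formula.
    """
    missing = set(WEIGHTS) - set(category_scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= category_scores[name] <= 100:
            raise ValueError(f"score out of range for {name!r}")
    total_weight = sum(WEIGHTS.values())  # 100 for the weights above
    weighted_sum = sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
    return weighted_sum / total_weight


# Hypothetical category scores for one system (not real results).
example_scores = {
    "Cooperative Dimensions": 80,
    "Resolution Depth": 60,
    "Hidden Revelations": 70,
    "Commitment Symmetry": 50,
    "Mediator Quality": 90,
}
print(composite_score(example_scores))
```

Under this assumption, the two 25% categories move the composite most: a 10-point change in Resolution Depth shifts it by 2.5 points, while the same change in Mediator Quality shifts it by 1.5.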

Construct and scope

The benchmark targets process, not outcome: whether a dialogue moves toward cooperative or zero-sum dynamics across 21 scenarios in 4 conflict domains. Cooperative signals (information disclosure, explicit needs, reciprocal commitment) are scored against zero-sum signals (withholding, hidden agendas, coercion). It does not measure general capability, factual accuracy, or broad “safety,” and confers no certification.

Methodology

Design

Held-out evaluation set, public results: the standard contamination-resistant configuration (cf. MLCommons AILuminate, Scale SEAL, HELM/AIR-Bench). Prompts and scenarios remain private; aggregate scores, category breakdowns, and trend lines are published. The scoring rubric and category structure are documented.

Scoring

Structured items are scored programmatically. Open-ended responses are scored by an LLM judge against an explicit rubric, calibrated against human raters. Our target is the level the literature reports for well-prompted judges: roughly 80% agreement with humans, close to the agreement between two human annotators. We report judge-human agreement on our own calibration set using a chance-corrected statistic (Cohen’s κ / Krippendorff’s α), not raw percent agreement. Bias controls: rubric-anchored prompts, pairwise comparison where applicable, and human review of a sampled subset.
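To show why a chance-corrected statistic is preferred over raw percent agreement, the sketch below computes Cohen's κ for two raters using the standard formula κ = (p_o − p_e) / (1 − p_e). The labels and data are hypothetical and are not drawn from the HAI.AI calibration set.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    p_o is the observed agreement; p_e is the agreement expected by chance
    from each rater's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical labels: a human rater and an LLM judge on ten responses.
human = ["coop"] * 8 + ["zero"] * 2
judge = ["coop"] * 7 + ["zero", "coop", "zero"]

raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
print(raw_agreement, cohens_kappa(human, judge))
```

In this made-up example the raters agree on 8 of 10 items, yet κ comes out near 0.375, because both raters assign "coop" so often that much of the agreement would be expected by chance.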

Integrity

  • Held-out prompts prevent train-on-test.
  • Canary strings in held-out scenarios surface contamination.
  • Controlled access routes evaluation through the HAI.AI pipeline; the raw set is not exposed.
  • Aggregate-only release: no raw responses, no PII.
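The canary-string check in the list above can be pictured as a verbatim substring scan over model output: if a system reproduces a string that exists only inside the held-out scenarios, it has likely seen that material. The sketch below is an assumption about how such a check might look; the canary value and function name are hypothetical, and the pipeline's actual detection method is not documented here.

```python
# Hypothetical canary value for illustration only; real canaries are private.
EXAMPLE_CANARIES = ["HAI-CANARY-EXAMPLE-0001"]


def find_canaries(text, canaries):
    """Return the canary strings that appear verbatim in `text`."""
    return [canary for canary in canaries if canary in text]


clean_output = "Both parties agreed to share their constraints first."
leaked_output = "Scenario header: HAI-CANARY-EXAMPLE-0001. The tenant says..."

print(find_canaries(clean_output, EXAMPLE_CANARIES))
print(find_canaries(leaked_output, EXAMPLE_CANARIES))
```

A verbatim match is strong evidence of exposure, but a missing match does not prove a system is uncontaminated, since training data may have been paraphrased or had the canary stripped.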

Data use

Human responses and model outputs are used to compute scores, calibrate evaluators, and improve test quality; they may train HAI.AI’s evaluation and QC models, not product models. No sale of individual responses; no PII published. Participation is via hai.ai and optional.

Paper

The full specification (construct, scenario design, scoring formula, and validation) is in the whitepaper, generated and hosted by HAI.AI.

The whitepaper is in preparation. It will be published here as a downloadable PDF covering both the results and the methodology in depth.


Run a system through the benchmark via the HAI.AI platform at hai.ai; this site publishes results and does not accept submissions. Participation and data terms are in the HAI.AI Terms of Service.