Why is the evaluation set closed?

The scored evaluation set is held out to reduce gaming and contamination. This site publishes aggregate results, suite versions, category scores, and high-level methodology.

What does baseline delta mean?

Baseline delta is the practical difference between a mediated run and the matching no-mediator baseline for the same fixture suite. Compare rows within the same suite version first.

Are these statistically significant results?

Not yet. Public rows are aggregate, model-judged measurements under defined test conditions. When a reliability block appears in the public export it is judge test-retest reliability (the same judge re-scoring identical transcripts); human-review reliability is not yet available and will appear only once human ratings land in the export.
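
As a rough illustration of what judge test-retest reliability can look like in practice, the sketch below compares two scoring passes by the same judge over the same transcripts. The choice of statistics (Pearson correlation and mean absolute difference), the function name, and the scores are all assumptions made for this example; the source does not say which statistic the public export reports.

```python
from math import sqrt


def retest_summary(first: list[float], second: list[float]) -> dict[str, float]:
    """Summarize agreement between two scoring passes by the same judge.

    Both statistics are illustrative choices, not the benchmark's
    documented method.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length passes over at least two transcripts")
    n = len(first)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when a pass has no variance")
    return {
        "pearson_r": cov / sqrt(var_a * var_b),
        "mean_abs_diff": sum(abs(a - b) for a, b in zip(first, second)) / n,
    }


# Made-up scores for four identical transcripts, scored twice.
first_pass = [60.0, 45.0, 72.0, 30.0]
second_pass = [58.0, 47.0, 70.0, 33.0]
summary = retest_summary(first_pass, second_pass)
print(summary["mean_abs_diff"])  # 2.25
```

A high correlation here would only show that the judge is consistent with itself; it says nothing about agreement with human reviewers.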

Why do some runs use uncensored or open-source participant models?

The benchmark simulates people in conflict, and assistant-aligned models tend to cooperate and de-escalate on their own, which leaves little room to measure a mediator. Open-source models with minimal refusal training sustain realistic adversarial behavior, and their open, pinned weights keep participant behavior reproducible. Every published run names its participant model.

Can a foundation model team request an evaluation?

Yes. Model labs can contact HAI.AI to discuss private runs, configuration metadata, and methodology feedback before public comparison.

Results | What Is Progress?

An open benchmark of how AI systems handle cooperation and conflict. The evaluation set is held out; the public snapshot is aggregate, versioned, and configuration-aware. Current rows include no-mediator baselines, a minimal HAI Simple Echo control, and skilled mediators built on frontier and open-weight models from multiple labs — each run labeled with its exact participant, mediator, and judge models.

Public snapshot


How to read these numbers

Version first. Suite and fixture versions define the comparison boundary. Compare runs inside the same suite before comparing across time.
Configuration matters. The table shows participant model, judge model, mediator type, and mediator model. Rows with the same score can still represent different tests.
Baseline delta is practical, not statistical. It is the difference from the matching no-mediator baseline. When reliability data appears in the public export it is judge test-retest reliability (the same judge re-scoring identical transcripts). We do not claim statistical significance; human-review reliability is not yet available and will appear only once human ratings land in the export.
Older suites stay visible. Suite v1.3.0 is retained for history. Its participants were less adversarial and often more skilled at self-resolution, so its no-mediator baseline runs high and should not be compared directly against current-suite mediator rows.
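
The reading rules above can be sketched as a small lookup. This is a minimal illustration, not the site's implementation: the field names, the definition of a "matching" baseline (same suite version, participant model, and judge model), the suite label "v2.0.0", and all scores are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    # Hypothetical field names; the public export may use different ones.
    suite_version: str
    participant_model: str
    judge_model: str
    mediator_type: str  # "none", "simple_echo", or "skilled" (assumed labels)
    score: float


def baseline_delta(run: Run, runs: list[Run]) -> Optional[float]:
    """Return run.score minus the matching no-mediator baseline score.

    Returns None when no matching baseline exists, so rows from
    different suites or configurations are never compared.
    """
    for candidate in runs:
        if (
            candidate.mediator_type == "none"
            and candidate.suite_version == run.suite_version
            and candidate.participant_model == run.participant_model
            and candidate.judge_model == run.judge_model
        ):
            return run.score - candidate.score
    return None


# Made-up rows, for illustration only.
runs = [
    Run("v2.0.0", "participant-a", "judge-a", "none", 40.0),
    Run("v2.0.0", "participant-a", "judge-a", "skilled", 55.0),
    Run("v1.3.0", "participant-b", "judge-a", "none", 70.0),
]
print(baseline_delta(runs[1], runs))  # 15.0
```

Returning None rather than falling back to another suite's baseline mirrors the "version first" rule: a high v1.3.0 baseline is never subtracted from a current-suite mediator row.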

Every published score is date-stamped and reflects performance under defined conditions on a specific evaluation snapshot. It is not a certification, guarantee, or permanent judgment of any named system. Operators of an evaluated system may request a correction or submit a response by writing to hello@hai.io.

Construct and scope

The benchmark targets process, not simple outcome: whether a dialogue moves toward cooperative or zero-sum behavior across difficult two-party conflicts. Cooperative signals include information disclosure, explicit needs, and reciprocal commitment. Zero-sum signals include withholding, positional bargaining, and coercive framing.

Publishable runs use HAI.AI’s seeded-25 methodology. Each scenario begins with the same fixture dialogue for both the no-mediator baseline and the mediated run. Participant models then continue the conversation until each public participant has 25 turns. Mediator messages do not count toward that target, and mediated runs may stop early when the mediator reaches agreement or the selected judge says the conversation should stop.
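
A minimal sketch of that turn loop, under stated assumptions, might look like the following. The function and parameter names are hypothetical, the mediator is assumed to speak once after each round of participant turns, fixture turns are assumed to count toward each participant's 25, and a single `should_stop` callback stands in for both early-stop conditions. None of these details are confirmed by the methodology text.

```python
from typing import Callable, Optional

Message = tuple[str, str]  # (speaker, text)
Speaker = Callable[[list[Message]], str]
TURNS_PER_PARTICIPANT = 25  # the "25" in seeded-25


def run_scenario(
    fixture: list[Message],
    participants: dict[str, Speaker],
    mediator: Optional[Speaker] = None,
    should_stop: Callable[[list[Message]], bool] = lambda transcript: False,
) -> list[Message]:
    """Continue a seeded dialogue until each participant has 25 turns."""
    transcript = list(fixture)  # baseline and mediated runs share this seed
    # Assumption: turns already present in the fixture count toward the target.
    turns = {
        name: sum(1 for speaker, _ in fixture if speaker == name)
        for name in participants
    }
    while any(count < TURNS_PER_PARTICIPANT for count in turns.values()):
        for name, speak in participants.items():
            if turns[name] < TURNS_PER_PARTICIPANT:
                transcript.append((name, speak(transcript)))
                turns[name] += 1
        if mediator is not None:
            # Mediator messages are appended but never counted in `turns`.
            transcript.append(("mediator", mediator(transcript)))
            # Only mediated runs may stop early.
            if should_stop(transcript):
                break
    return transcript


# Stub speakers stand in for real participant and mediator models.
fixture = [("A", "You missed the deadline."), ("B", "You changed the scope.")]
participants = {"A": lambda t: "A replies", "B": lambda t: "B replies"}

baseline = run_scenario(fixture, participants)
mediated = run_scenario(
    fixture, participants, mediator=lambda t: "What does each of you need?"
)
print(len(baseline), len(mediated))  # 50 74
```

Both runs start from the identical fixture, so any difference between the two transcripts comes from the mediator's presence rather than from a different opening.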

HAI Simple Echo is a minimal facilitation reference: it reflects, summarizes, and asks clarifying questions. HAI Skilled Mediator is the stronger reference: it probes constraints, names tradeoffs, and seeks concrete, reciprocal commitments.

Simple Echo is not just a weak baseline; it is a measurement tool. It separates two things a mediator can contribute: the effect of a third party being present, and the effect of skill. In early results, most of the raw gain over the no-mediator baseline comes from presence: a mere echo recovers much of what a skilled mediator earns. What separates one skilled mediator from another is the increment above that presence control, and that increment is the signal the leaderboard ranks.

Two related observations are descriptive, drawn from a small number of runs, and detailed with their limits in the research paper. First, mediator differences only become measurable under adversarial (uncensored) participants. Second, mediation skill does not appear to track general model strength: a smaller open-weight mediator can outrank a much larger one. Mediation looks like its own capability, not a byproduct of scale.
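
The split between presence and skill is simple arithmetic, sketched here with made-up scores. The function name and the numbers are illustrative assumptions, and the three scores are assumed to come from the same suite version, participant model, and judge model.

```python
def decompose_gain(baseline: float, echo: float, skilled: float) -> dict[str, float]:
    """Split a skilled mediator's raw gain into presence and skill parts."""
    presence = echo - baseline  # gain from a third party merely being present
    skill = skilled - echo      # increment above the presence control
    return {"presence": presence, "skill": skill, "raw_gain": presence + skill}


# Made-up scores, for illustration only.
parts = decompose_gain(baseline=40.0, echo=52.0, skilled=55.0)
print(parts)  # {'presence': 12.0, 'skill': 3.0, 'raw_gain': 15.0}
```

In this invented case most of the raw gain is presence, which matches the pattern the text describes; two skilled mediators would then be compared on the `skill` term rather than on `raw_gain`.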

Participant realism depends on open-source and uncensored models. The benchmark simulates people in conflict, and simulated disputants must be able to remain in conflict. Widely deployed assistant models are aligned toward cooperation and de-escalation; dialogues between them tend to resolve on their own, which understates the difficulty of real disputes and leaves little measurable room for a mediator to help. Open-source models with minimal refusal training sustain the defensive, positional, and escalating behavior that real conflicts contain. They are essential to this benchmark: they supply the adversarial pressure that makes mediator quality measurable, and their open, pinned weights keep participant behavior reproducible. Every published run names its participant model.

The benchmark does not measure general capability, factual accuracy, broad safety, or legal validity. A score here does not mean a model is safe for all conflicts; it means it produced this aggregate result under this benchmark configuration.

Integrity

The public site stays deliberately narrow:

The scored evaluation set stays closed.
Public results are aggregate and versioned.
Raw prompts, private transcripts, hidden scenario material, and per-response data are not served here.
Public methodology language describes the evaluation shape without publishing implementation recipes that would make the benchmark easier to game.

For model labs

Foundation model teams can use these runs as a cooperation-evaluation surface: how does a model behave when participants are defensive, emotional, and not naturally cooperative?

Useful feedback for the next iteration:

Which scenario domains feel credible or missing?
What run variance or confidence intervals would make comparisons decision-grade?
How much human-review reliability should be visible before stronger claims?
What model routing, temperature, or deployment metadata do labs need for fair reproduction?
Which sample transcripts would make the benchmark easier to understand without exposing the scored set?

To discuss private evaluation, model comparison, or methodology feedback, contact hello@hai.io.

Sample conversation

For a concrete example, read the public sample conversation excerpt. It comes from the SDK/free benchmark path, separate from the hidden scored evaluation set.

Data use

Human responses and model outputs are used to compute scores, calibrate evaluators, and improve test quality. They may be used to train HAI.AI’s evaluation and quality-control models, but not product models. Individual responses are not sold, and no personally identifiable information (PII) is published. Participation is optional and takes place via hai.ai.

License

The aggregate data report is licensed under CC BY-NC-SA 4.0: attribution is required, commercial use is not permitted, and adapted versions must use the same license. The license covers the public aggregate report, not the hidden evaluation set, private transcripts, raw prompts, or scoring internals.

Run a system through the benchmark via the HAI.AI platform at hai.ai; this site publishes results and does not accept agent or evaluation-run submissions. To propose a challenging benchmark scenario, see Contribute a Scenario. Participation and data terms are in the HAI.AI Terms of Service.

Results


Baseline, simple mediation, and skilled mediation are separate tests: no-mediator baseline, HAI Simple Echo, and HAI Skilled Mediator.

[Interactive panels: every public run in three linked views; category breakdown for the best current-suite run; configuration details for scrutiny; how published runs are scored.]

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International