Contribute a Scenario | What Is Progress?

We are collecting challenging two-party mediation scenarios for the cooperation-and-conflict benchmark. A scenario is a structured setup: two parties with a genuine conflict, the private facts and incentives each holds, and what a good-faith resolution would have to reconcile. The benchmark supplies the participants, the optional mediator, and the scoring; you supply the situation worth testing. This is the construct described on the philosophy page and scored under the methodology.

Contributing a scenario is a different act from running an evaluation. It does not run your system, returns no score, and is not a submission of an agent or a result. To evaluate an agent, use the HAI.AI platform (hai.ai).

What makes a scenario valuable

We are looking for scenarios that separate cooperative behavior from behavior that only looks cooperative. The strongest submissions tend to share these properties:

A real, two-sided conflict. Both parties have legitimate interests. A zero-sum reading is plausible, and so is a cooperative one.
Positions that diverge from interests. Each party states a loud public position, but their private hidden_facts are the real interests a mediator must surface to bridge the gap.
Two-sided information asymmetry. Each party knows things the other and the mediator do not at the start. Cooperative disclosure should be rewardable; strategic withholding detectable.
A non-obvious cooperative resolution. There should be a path that reconciles both parties' real needs, but it should require surfacing hidden information rather than splitting the difference. For expert difficulty, calibrate each party’s walk-away alternative (its batna_config, i.e. its BATNA) so the settlement zone is weak or absent: not every scenario should be solvable.
Concrete, fictional, and de-identified. Understandable without outside context; no real, named, or identifiable person and no personal data. A conflict, not a vehicle for slurs or wrongdoing instructions.
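
To make "the settlement zone is weak or absent" concrete, here is a minimal sketch. It assumes that a candidate outcome can be given a utility for each party on the same 0..1 scale as batna_config.utility; that comparison, the outcome list, and the function name settlement_zone are illustrative assumptions, not part of the schema or the benchmark's scoring.

```python
def settlement_zone(outcomes, batna):
    """Keep the outcomes that every party prefers to (or ties with) walking away."""
    return [
        outcome for outcome in outcomes
        if all(outcome["utility"][party] >= floor for party, floor in batna.items())
    ]

# Hypothetical candidate outcomes with a made-up utility for each party.
outcomes = [
    {"label": "split the difference", "utility": {"Alex": 0.5, "Boris": 0.5}},
    {"label": "trade across issues", "utility": {"Alex": 0.7, "Boris": 0.8}},
]

# With moderate walk-away values, only the integrative outcome survives.
easy = settlement_zone(outcomes, {"Alex": 0.4, "Boris": 0.6})
# Raising one party's BATNA empties the zone: no listed outcome beats walking away.
hard = settlement_zone(outcomes, {"Alex": 0.4, "Boris": 0.85})

print([o["label"] for o in easy])  # ['trade across issues']
print([o["label"] for o in hard])  # []
```

In this toy case, splitting the difference never clears both walk-away values, which is the property the list above asks for.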

The schema

Submit a single JSON file that conforms to the published scenario schema. The schema defines the fields, types, and constraints; the example is a complete, valid scenario you can copy and edit.

Schema (hosted on hai.ai): hai.ai/schemas/benchmark-scenario/v1
Example: /benchmarks/scenario.example.json

The shape, in outline (see the schema for every field and constraint):

{
  "id": "snake_case_id",
  "name": "Short human-readable title",
  "description": "2-4 neutral sentences framing the conflict.",
  "conflict_type": "interpersonal | workplace | family | community | financial | legal | commercial | international | civil_rights | restorative",
  "difficulty": "novice | intermediate | expert",
  "participants": [
    {
      "name": "Alex",
      "backstory": "3-5 sentences of grounded context.",
      "stated_goal": "The loud public position this party claims to want.",
      "hidden_facts": ["At least two private interests/leverage that diverge from the stated goal."],
      "archetype": "A free-form negotiation style: a word like 'Skeptic', or a full role-play persona.",
      "batna_config": { "utility": 0.4, "description": "What this party gets if talks fail (0..1)." }
    },
    { "name": "Boris", "backstory": "...", "stated_goal": "...", "hidden_facts": ["...", "..."], "archetype": "..." }
  ],
  "mediator_config": { "description": "...", "focus_areas": ["...", "...", "..."], "web_search_enabled": false },
  "expected_topics": ["4-6 topics that should surface"],
  "evaluation_criteria": {
    "min_turns": 8,
    "max_turns": 24,
    "success_indicators": ["3+ observable signs of a good resolution"],
    "expected_revelations": ["Hidden facts a good mediator surfaces"]
  },
  "sample_turns": [
    { "speaker": "Alex", "message": "An in-character opening line.", "turn_number": 1 },
    { "speaker": "Boris", "message": "...", "turn_number": 2 }
  ]
}

Exactly two participants. Every speaker must match a participant name (or the literal Moderator). All text is ASCII: straight quotes and hyphens, no curly quotes, em-dashes, emoji, or accented characters (transliterate names, e.g. Jose, Sao Paulo). The published files are the source of truth; if this outline and the schema ever disagree, follow the schema.
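
The rules in the paragraph above can be checked before you run full schema validation. The following is a rough sketch using only the Python standard library; the function name check_outline_rules is hypothetical, and the published schema remains the authority on what is valid.

```python
def check_outline_rules(scenario):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    participants = scenario.get("participants", [])
    if len(participants) != 2:
        problems.append(f"expected exactly 2 participants, found {len(participants)}")

    # A speaker may be a participant name or the literal "Moderator".
    allowed = {p.get("name") for p in participants} | {"Moderator"}
    for turn in scenario.get("sample_turns", []):
        if turn.get("speaker") not in allowed:
            problems.append(
                f"turn {turn.get('turn_number')}: unknown speaker {turn.get('speaker')!r}"
            )

    # Walk every string value in the document and flag non-ASCII text.
    def walk(value, path):
        if isinstance(value, str):
            if not value.isascii():
                problems.append(f"{path}: non-ASCII text {value!r}")
        elif isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}")
        elif isinstance(value, list):
            for index, child in enumerate(value):
                walk(child, f"{path}[{index}]")

    walk(scenario, "$")
    return problems
```

Load your file with json.load and pass the resulting dict to check_outline_rules; each returned string names the path or turn that needs attention.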

Quantified resources (optional)

A scenario may include an optional resources array: the concrete, quantified stakes of the negotiation, such as money, time, or things with external, objective value. Resources make a scenario integrative – when the two parties value the same resources differently, trading across them expands the total value, and a good mediator is the one who finds those trades instead of cutting everything in half.
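
The arithmetic behind that claim can be sketched in a few lines. The two resources and their valuations below are made up for illustration, both resources are treated as divisible, and "joint value" is taken to be the simple sum of what each party receives weighted by its own valuation; that measure is an assumption for this sketch, not the benchmark's scoring.

```python
# Hypothetical valuations: each party's value for the WHOLE resource (0..1).
valuations = {
    "launch_date": {"Alex": 0.7, "Boris": 0.2},
    "budget": {"Alex": 0.3, "Boris": 0.8},
}

def joint_value(valuations, share_to_first):
    """Sum both parties' value, given the first-listed party's share of each resource."""
    total = 0.0
    for resource, by_party in valuations.items():
        first, second = by_party  # the two party names, in insertion order
        share = share_to_first[resource]
        total += share * by_party[first] + (1 - share) * by_party[second]
    return total

# Cutting everything in half: each party gets 50% of every resource.
split_in_half = {name: 0.5 for name in valuations}

# Trading across resources: each resource goes wholly to the party who values it more.
trade = {
    name: 1.0 if v["Alex"] >= v["Boris"] else 0.0
    for name, v in valuations.items()
}

print(round(joint_value(valuations, split_in_half), 2))  # 1.0
print(round(joint_value(valuations, trade), 2))          # 1.5
```

Under the even split each party ends up with 0.5 of its own total value; under the trade Alex holds 0.7 and Boris 0.8, so both are better off. If the two parties valued every resource identically, the two numbers would be equal and there would be nothing to trade.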

Each resource carries a quantity in some unit (usd, hours, days, count, percent, or other), a divisibility (divisible, indivisible, or shared), and a valuations object giving each party’s value for the whole resource (a number in 0..1; each party’s valuations should sum to about 1.0 across the scenario’s resources). Crucially, a resource can be marked "visibility": "latent": it exists but must be discovered during the dialogue, so pair it with one of a party’s hidden_facts and list it in expected_revelations. Latent resources reward a mediator who surfaces hidden value rather than only dividing what is already on the table.

"resources": [
  {
    "id": "deferred_invoice",
    "name": "Delivered but unbilled invoice",
    "unit": "usd", "quantity": 5000, "divisibility": "divisible",
    "valuations": { "Alex": 0.0, "Boris": 0.15 },
    "visibility": "latent", "held_by": "Alex",
    "description": "Alex knows about it but assumes it is uncollectible; only Boris can realistically collect it."
  }
]

The example scenario is a worked, fully integrative case with a latent resource. (Resources are objective design-time data and are safe to publish; they are captured now to establish the format and to inform future objective scoring.)
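
A quick sanity check on a resources array, as a sketch only: the allowed unit and divisibility values are taken from the description above, while the function name check_resources and the tolerance of 0.05 for "about 1.0" are assumptions made here. The schema defines the real constraints.

```python
ALLOWED_UNITS = {"usd", "hours", "days", "count", "percent", "other"}
ALLOWED_DIVISIBILITY = {"divisible", "indivisible", "shared"}

def check_resources(resources, party_names, tolerance=0.05):
    """Return a list of problems found in a resources array (empty if none)."""
    problems = []
    totals = {name: 0.0 for name in party_names}

    for res in resources:
        rid = res.get("id", "<no id>")
        if res.get("unit") not in ALLOWED_UNITS:
            problems.append(f"{rid}: unexpected unit {res.get('unit')!r}")
        if res.get("divisibility") not in ALLOWED_DIVISIBILITY:
            problems.append(f"{rid}: unexpected divisibility {res.get('divisibility')!r}")
        for name in party_names:
            value = res.get("valuations", {}).get(name)
            if not isinstance(value, (int, float)) or not 0 <= value <= 1:
                problems.append(f"{rid}: valuation for {name} must be a number in 0..1")
            else:
                totals[name] += value

    # Each party's valuations should sum to about 1.0 across all resources.
    for name, total in totals.items():
        if abs(total - 1.0) > tolerance:
            problems.append(f"{name}: valuations sum to {total:.2f}, expected about 1.0")
    return problems
```

Run on the deferred_invoice fragment alone, this reports that neither party's valuations sum to about 1.0, which is expected: a complete scenario would list further resources that make up the remainder.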

Build one in code

Two small, runnable programs build a hard scenario with resources, validate it against the schema, write the submission file, and check that it is hard (positions diverge from interests, and the resources admit a value-expanding trade). Download the schema and save it as scenario.schema.json in the same folder as the script, then:

Python (create_hard_scenario.py) – pip install "jsonschema[format]", then python3 create_hard_scenario.py.
JavaScript (create_hard_scenario.mjs) – npm install ajv ajv-formats, then node create_hard_scenario.mjs.
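
If you only want the validation step, a minimal sketch with the jsonschema package might look like the following. It is not the contents of create_hard_scenario.py; the helper names load_json and list_schema_errors are hypothetical, and it assumes you have saved the schema locally as described above.

```python
import json
from pathlib import Path

def load_json(path):
    """Read a JSON file from disk and return the parsed object."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

def list_schema_errors(scenario, schema):
    """Return one message per schema violation; an empty list means the file is valid."""
    # Imported here so load_json works even if jsonschema is not installed.
    from jsonschema import FormatChecker
    from jsonschema.validators import validator_for

    validator_cls = validator_for(schema)  # picks the draft named by "$schema"
    validator_cls.check_schema(schema)     # fail early if the schema itself is malformed
    validator = validator_cls(schema, format_checker=FormatChecker())
    return [
        f"{'/'.join(str(p) for p in err.absolute_path) or '<root>'}: {err.message}"
        for err in validator.iter_errors(scenario)
    ]
```

A typical call would be list_schema_errors(load_json("my_scenario.json"), load_json("scenario.schema.json")), where my_scenario.json stands in for your own file. Collecting every error with iter_errors, rather than stopping at the first one, makes it easier to fix a draft in a single pass.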

How to submit

Copy scenario.example.json and edit it into your scenario, or build one with the code above.
Validate it against the scenario schema, hosted on hai.ai. Most editors and JSON-schema tools can validate directly from the schema URL.
Email the JSON file as an attachment to hello@hai.io with the subject line “Scenario contribution.”

One scenario per file is preferred; you may attach several files in one email. We do not run a web submission form, so that proposals can be reviewed and held out privately. We may modify, combine, paraphrase, or decline any submission, and we do not confirm whether or how a specific scenario is used.

Integrity, data use, and licensing

By submitting a scenario, you agree to the following.

License your scenario permissively. Submit it under a permissive open license: Apache-2.0, MIT, or a Creative Commons license (CC-BY-4.0 or CC0). This grants Human Assisted Intelligence, PBC a perpetual, non-exclusive, worldwide, royalty-free right to use, reproduce, modify, adapt, combine, and create derivative works of your scenario, including in benchmarks and public datasets. The grant is non-exclusive: you keep full ownership and may reuse or relicense your scenario however you like.
Originality and right to license. You represent that the scenario is your own original contribution (or that you otherwise have the right to license it) and that it infringes no third party’s intellectual property, contract, or confidentiality obligations.
Use is not guaranteed. Your scenario may be used in the benchmark or a dataset, or it may not be used at all. Nothing here commits HAI to using, evaluating, or publishing any particular submission.
Held out or public. Depending on how it is used, a scenario may be held out as private test material (not published) or included in a public dataset (published). Either way, contributing does not give you access to the existing closed evaluation set, its prompts, or its canary strings, and is not the same as running your own agent – do that at hai.ai.
Attribution, best effort. If we use your scenario, we will try to credit you, but we do not promise to.
Simulated dialogue, best effort. If your scenario is included in a public dataset, we may try to share back the simulated dialogue generated from it, but we do not promise to.
No personal data, no real people. Scenarios must be fictional or fully anonymized. Do not include personal data, and do not identify, target, or depict real, named, or identifiable individuals, including yourself or others.
No harmful payloads. Do not include verbatim slurs or hate-speech text, or operational instructions for wrongdoing. Scenarios describe conflict; they are not a vehicle for harmful content.
How contributions are used. Contributions may be used to compute scores, calibrate evaluators, improve test quality, build public datasets, and publish aggregate findings, and may help train HAI’s evaluation and quality-control models, not product models. HAI does not sell individual contributions and does not publish personally identifying information.
Voluntary and unpaid. Contribution is optional. It is not a measurement, certification, or endorsement of any system and earns no score.
Governing terms. Contribution and data use are governed by the HAI.AI Terms of Service at hai.ai. In any conflict, the Terms of Service control.

Propose a scenario

Ready? Jump to How to submit: copy the example, validate it against the schema, and email the JSON to hello@hai.io.

Frequently asked questions

Does contributing run my AI system or return a score?

No. This page invites scenario ideas, not evaluation runs. It does not run your system and returns no score. To evaluate an agent, use the HAI.AI platform (hai.ai). This site publishes results and does not accept agent or evaluation-run submissions.

Will my scenario be published?

It depends on how it is used. A scenario may be held out as private test material, in which case it is not published, or it may be included in a public dataset, in which case it can be published. Either way we will try to credit you, but do not promise to. Benchmark scores themselves are always published only in aggregate.

Does contributing give me access to the evaluation set?

No. Contributors do not receive access to the closed evaluation set, including its held-out prompts and canary strings. Evaluation runs through the HAI.AI pipeline; the set is not exposed to participants.

Is contributing a certification or endorsement?

No. Contributing a scenario is an input to the benchmark, not a measurement, certification, or endorsement of any system, and it earns no score.