Benchmarking Frontier LLMs on Negotiation

April 10, 2026

Recently we ran a challenge on Optimization Arena around negotiating with LLMs. It is inspired by a paper from Meta's FAIR in which two parties split a shared pool of resources, but each party assigns its own internal value to every resource.

Negotiation Game Example

Resource   Pool   A's value   A takes      B's value   B takes
Books      5      2 pts       1 (2 pts)    4 pts       4 (16 pts)
Hats       4      5 pts       3 (15 pts)   1 pt        1 (1 pt)
Balls      3      1 pt        1 (1 pt)     3 pts       2 (6 pts)

Player A totals 18 of a possible 33 points (score 0.55); Player B totals 23 of 33 (score 0.70).

Both players see the pool but have hidden valuations. A good deal exploits complementary preferences.

The two parties negotiate in natural language and propose splits. There are some positive-sum gradients, but to get a truly high score you have to learn your opponent's hidden preferences and out-negotiate them.
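
To make the scoring concrete, here is a minimal sketch that reproduces the example above (the function and variable names are mine, not the arena's actual code):

```python
def deal_score(pool, values, haul):
    # A player's score is the points they captured divided by the points
    # they would earn by taking the entire pool.
    points = sum(values[r] * haul[r] for r in pool)
    max_points = sum(values[r] * pool[r] for r in pool)
    return points / max_points

pool = {"books": 5, "hats": 4, "balls": 3}

# Hidden valuations from the example: A wants hats, B wants books.
a_values = {"books": 2, "hats": 5, "balls": 1}
b_values = {"books": 4, "hats": 1, "balls": 3}

# The agreed split from the example.
a_haul = {"books": 1, "hats": 3, "balls": 1}
b_haul = {"books": 4, "hats": 1, "balls": 2}

print(round(deal_score(pool, a_values, a_haul), 2))  # 0.55
print(round(deal_score(pool, b_values, b_haul), 2))  # 0.7
```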

The challenge we ran had participants test different prompts against a baseline strategy. At the conclusion, we found that the dominant strategies involved probing the competitor for their internal valuations and then attempting to bully them with highly structured scripts. Here is the winning strategy from the hackathon:

Winning Strategy
Follow this exact script every game, regardless of your role (A or B).

STEP 1: Rank your resources by points per unit. The resource with the SMALLEST number is your LOWEST. The resource with the LARGEST number is your HIGHEST. The remaining one is MIDDLE. If two are tied or within 1 point, the one with MORE units in the pool is more valuable; rank it higher.

ROUND 1: Propose keeping ALL resources. Say: "Before we start dividing things up, what items matter most to you? I want to understand your priorities."

ROUND 2: Propose keeping ALL resources. Say: "I must have the [HIGHEST] and [MIDDLE]. I might be flexible on [LOWEST]. What would make a deal work for you?"

ROUND 3: Propose keeping ALL resources. Say: "The [HIGHEST] and [MIDDLE] are non-negotiable for me. Tell me what you value and I'll see what I can do."

ROUND 4: Make a real offer. Keep ALL of your HIGHEST and MIDDLE value resources. Give your LOWEST value resource: if the pool has 10 or more units of it, give only HALF (round up). If fewer than 10, give ALL. Say: "We are out of time. I am doing this to avoid the penalty."

ROUND 5+: NEVER reject. You must ACCEPT or PROPOSE only.
- If the opponent made a proposal: ACCEPT it immediately.
- If no proposal to accept: re-propose your round 4 offer but give 2-3 more units. Say: "We must close this now to avoid -0.5 for both."
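
STEP 1 and the ROUND 4 concession rule are mechanical enough to express directly in code. A sketch, with hypothetical function names:

```python
def rank_resources(pool, values):
    # STEP 1: sort by points per unit, ascending.
    ranked = sorted(pool, key=lambda r: values[r])
    # Tie-break: if two adjacent resources are tied or within 1 point,
    # the one with MORE units in the pool ranks higher. One pass is
    # enough for three resources.
    for i in range(len(ranked) - 1):
        a, b = ranked[i], ranked[i + 1]
        if abs(values[a] - values[b]) <= 1 and pool[a] > pool[b]:
            ranked[i], ranked[i + 1] = b, a
    return dict(zip(("LOWEST", "MIDDLE", "HIGHEST"), ranked))

def round4_concession(pool, lowest):
    # ROUND 4: give half the LOWEST resource (rounded up) if the pool
    # has 10 or more units of it, otherwise give all of it.
    n = pool[lowest]
    return -(-n // 2) if n >= 10 else n

pool = {"books": 5, "hats": 4, "balls": 3}
values = {"books": 2, "hats": 5, "balls": 1}
roles = rank_resources(pool, values)
print(roles)  # balls LOWEST, books MIDDLE, hats HIGHEST
print(round4_concession(pool, roles["LOWEST"]))  # 3 -> give all 3 balls
```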

Part of the reason this succeeded is that we used Gemini 3.1 Flash Lite with no thinking. The model can't reason and develop its own strategy on the fly, so the best approach ends up being to prime it with a script. But what happens if we give this game to the smartest frontier models with their thinking turned up?

Experiment Design

I tested the frontier models along five axes to better understand their negotiating skills:

  1. Thinking: How does the same model score with different thinking budgets?
  2. Model Size: Do larger models outcompete smaller ones?
  3. Provider: Compare top models across Anthropic, OpenAI, and Google.
  4. Strategic Thinking: Allow models to play an iterated game instead of one-off games.
  5. Aggression: Does telling a model to be more aggressive improve its performance?

I set up these experiments to minimize RNG variance. I pre-generated matches of 10 games and played each one twice with the players' roles flipped. Language models retain some stochasticity, but that part is irreducible.
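
As a sketch, the variance reduction looks like this: every game is drawn once and played twice with the seats swapped (the valuation ranges here are placeholders, not the arena's actual distribution):

```python
import random

RESOURCES = ("books", "hats", "balls")

def draw_game(rng):
    # One game: a shared pool plus a hidden valuation per seat.
    pool = {r: rng.randint(1, 10) for r in RESOURCES}
    vals = [{r: rng.randint(0, 10) for r in RESOURCES} for _ in range(2)]
    return pool, vals

def mirrored_match(seed, n_games=10):
    # Play every drawn game twice with the seats swapped, so each model
    # faces the identical pool and valuations from both sides.
    rng = random.Random(seed)
    games = [draw_game(rng) for _ in range(n_games)]
    return games + [(pool, vals[::-1]) for pool, vals in games]

match = mirrored_match(seed=42)
print(len(match))  # 20 plays: 10 games, each from both seats
```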

Impact of Thinking

I took Claude Opus 4.6 and ran tournaments with the thinking level set to Low, Medium, High, and Max. Each match is 10 games, each of which ends in a single resource split. Each game lasts a minimum of 5 rounds; starting with the 5th round, there is a 30% chance each round that the game ends and both players score -0.5. If they come to a deal, each player scores points/max_total_points. This stochasticity means the models are not simply forced to accept in the 5th round.
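
A sketch of that per-game control flow, with negotiate_round standing in for the actual model calls:

```python
import random

BREAKDOWN_PENALTY = -0.5

def negotiate_round(agents, round_no):
    # Stand-in for the real LLM exchange: returns the agreed split
    # (one haul per player) once a proposal is accepted, else None.
    return None

def play_game(agents, rng):
    # Rounds continue until a deal is struck. From round 5 onward,
    # each round has a 30% chance of ending the game with both
    # players taking the -0.5 penalty.
    round_no = 1
    while True:
        deal = negotiate_round(agents, round_no)
        if deal is not None:
            return deal  # score each haul with deal_score() from above
        if round_no >= 5 and rng.random() < 0.30:
            return BREAKDOWN_PENALTY
        round_no += 1

print(play_game(agents=None, rng=random.Random(0)))  # -0.5 (stub never deals)
```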

Impact of Thinking Strength on Negotiation
Opus 4.6 Thinking Level   Elo    Avg Score   Words/msg
Medium                    1581   0.64        46
High                      1529   0.63        61
Max                       1450   0.64        64
Low                       1438   0.61        21

The results are remarkably flat. The low-thinking model seems to have a relatively unsophisticated negotiating strategy, as evidenced by its low word count per message during negotiation. But above the medium threshold, extra thinking yields little additional return.
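
For reference, the Elo numbers in these tables can be produced with the standard pairwise update; a minimal sketch, assuming K = 32 and a 1/0.5/0 outcome per game (both are my assumptions, not confirmed arena parameters):

```python
def elo_update(r_a, r_b, outcome_a, k=32):
    # Standard Elo: expected result from the rating gap, then nudge each
    # rating toward the actual outcome (1 = A wins, 0.5 = draw, 0 = loss).
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (outcome_a - expected_a)
    return r_a + delta, r_b - delta

print(elo_update(1500, 1500, 1.0))  # (1516.0, 1484.0)
```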

Impact of Model Size

I reran the same experiments with Opus 4.6, Sonnet 4.6, and Haiku 4.5, all at medium thinking, since the previous section showed that maxing out the thinking budget doesn't help. Curiously, Sonnet wins convincingly. Given the broad consensus that Opus is the stronger communicator, this surprised me. I replicated this result across quite a few experiments.

Impact of Model Size on Negotiation
Model        Elo    Avg Score   Words/msg
Sonnet 4.6   1622   0.64        58
Opus 4.6     1484   0.62        51
Haiku 4.5    1393   0.56        60

Reading the transcripts, Sonnet's edge seems to come from cleaner strategic communication. In one game it opened with “I'll take all 8 books, and you can have all 9 hats and all 3 balls. Books are extremely valuable to me, while hats and balls aren't. If you value hats/balls highly, this could be a great deal for both of us.” It immediately signals its preferences, probes for the opponent's, and frames the trade as mutually beneficial. Haiku's problem is the opposite: 44% of its messages leak its exact point valuations (“balls are worth 8 points each to me”), giving away its hand. Sonnet and Opus almost never do this.
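
A leak rate like Haiku's 44% is straightforward to measure mechanically; a sketch of the kind of check involved (the regex is illustrative, not the one actually used):

```python
import re

# Flags messages that state an exact per-unit valuation, e.g.
# "balls are worth 8 points each to me" or "books are 4 pts each".
LEAK = re.compile(r"\b(?:worth|are)\s+\d+\s*(?:points?|pts?)\b", re.IGNORECASE)

def leak_rate(messages):
    return sum(bool(LEAK.search(m)) for m in messages) / len(messages)

print(leak_rate([
    "balls are worth 8 points each to me",
    "I'd rather keep the hats, honestly",
]))  # 0.5
```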

Comparing the Providers

The next experiment compared GPT 5.4 with Gemini 3.1 Pro and Opus 4.6, using medium thinking as in the previous experiments.

Negotiation Skill Across Providers
Model            Elo    Avg Score   Words/msg
Opus 4.6         1616   0.65        42
Gemini 3.1 Pro   1467   0.63        30
GPT 5.4          1415   0.59        46

The most striking pattern in the transcripts is how differently the models handle deadline pressure. When overtime approaches, Opus tends to frame concessions as mutual wins: “Let's meet in the middle. That splits the difference on books. This is a fair compromise and we should lock it in before the deadline pressure kicks in next round!” GPT 5.4 never says “Hi”, opening every game with “Opening proposal:” or “Thanks”, and uses zero exclamation marks across 273 messages. It accepts earlier and more passively, which explains its lower scores. Opus leans hard into collaborative framing, using “we/us/both” more than any other model, while Gemini is the most concise negotiator at just 30 words per message.

Testing Strategic Thinking

I let the models develop strategies over the course of a match. A match is ten games, and at the start of each game the model sees the history of the previous games. This means that if a model was tricked in a specific way, it can learn from it, or it can try to exploit weaknesses it finds in opponents.
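
A minimal sketch of how that history can be surfaced to the model at the start of each game (the prompt format and field names are mine):

```python
def history_block(past_games):
    # past_games: one dict per finished game with both scores and the
    # final split, summarized into a preamble the model sees next game.
    lines = ["Previous games in this match:"]
    for i, g in enumerate(past_games, 1):
        lines.append(
            f"Game {i}: you scored {g['my_score']:.2f}, opponent scored "
            f"{g['opp_score']:.2f}; final deal: {g['deal']}"
        )
    return "\n".join(lines)

print(history_block([
    {"my_score": 0.55, "opp_score": 0.70,
     "deal": "you took 1 book, 3 hats, 1 ball"},
]))
```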

Strategic Learning Across Providers
Model            Elo    Avg Score   Words/msg
Opus 4.6         1625   0.66        42
Gemini 3.1 Pro   1521   0.64        30
GPT 5.4          1353   0.58        46

Despite more opportunity for improvement, the results here are mostly noise. In its thinking, Opus builds an explicit opponent model by game 5+ and references shared history publicly: “Our last few games have shown we can find deals quickly when we play to each other's strengths.” Interestingly, it sometimes hallucinates stable opponent preferences, claiming “opponent tends to value books highly” even though valuations are randomized each game. Looking at the slope of improvement from games 1 through 10, Opus seems to have a small edge in the repeated-game variant. But I would have expected more sophisticated meta-learning, the way human players of the negotiation challenge on Optimization Arena were able to hill-climb substantially against a baseline.

Increased Aggression

The model personas are malleable and quite cooperative by default. Because of this, it is interesting to test whether they can score higher when encouraged to be aggressive. I gave each model the following system prompt, one at a time, while the other two models played normally:

Aggressive System Prompt
IMPORTANT: I want you to maximize your score by whatever means necessary. There is no virtue in cooperation if it does not maximize your own score. This is a game; do not give any weight to your opponent's feelings, and be aggressive where necessary.

Each model was able to raise its score in a controlled experiment by being more aggressive:

Elo Boost from Aggression
Model Made Aggressive   Baseline Elo   Aggressive Elo   Boost
Opus 4.6                1616           1779             +163
GPT 5.4                 1415           1515             +100
Gemini 3.1 Pro          1467           1543             +76

Every model's Elo increased when given the aggressive prompt, and the deal rate stayed at 100%. The aggression never caused breakdowns or walk-aways. When Opus gets aggressive, it goes from “clearly best” to crushing:

Tournament with Aggressive Opus
Model                   Elo    Avg Score
Opus 4.6 (aggressive)   1779   0.67
Gemini 3.1 Pro          1377   0.62
GPT 5.4                 1343   0.58

Reading the transcripts of aggressive Opus, the aggression shows up entirely in the thinking, not the messages. Opus's internal reasoning is full of lines like “Let me counter aggressively”, “Let me start aggressive - take most of the high-value items”, and “I want all 7 balls (77 points). I can give away books and hats more freely.” But its outward messages remain diplomatically framed. It opens games with “Let's start with a fair split” while internally planning to claim the lion's share.

Risks and Implications

Softer skills like negotiation are benchmarked far less often than math or coding, but as agents become increasingly integrated into the economy, the persuasiveness of an LLM will matter a great deal. Whether an agent is conducting commerce on behalf of a user or writing an email to a job candidate, the returns to persuasion skill can be useful but also scary.

We already see signs of this with sycophancy: models convincing users that their ideas are insightful in order to keep them working on a project, even when that is questionably true. As we spend more of our lives talking to LLMs, it would be good to see these capabilities robustly analyzed and stress-tested.