
Benchmarking Frontier LLMs on Chess

March 31, 2026

Computers Can Already Play Chess

Computers have been superhuman at chess for almost thirty years. But modern chess engines achieve this performance in a way that is impossible for humans to replicate. First, they use hyper-optimized search techniques to look 20–30 moves ahead across many different combinations, reasoning near-perfectly about complex tactical positions. Second, they distill billions of games into sophisticated neural networks that quickly evaluate the subtleties of a position using features that are hard for humans to understand.

[Diagram: engine strength = deep search (20–30 moves deep) + neural evaluation (e.g. a +1.7 score)]

LLMs Play Chess Differently

On the other hand, if you watch an LLM play chess, it seems to reason in a way that is much more congruent with how humans approach the game: thinking through simple positional features, or trying to calculate out lines and often getting lost or making small computational errors. If LLMs can play chess at a high level, they may therefore be able to distill insights for human players in a way that is much more tractable.

To understand their current performance, I built a series of chess benchmarks to evaluate them across a few facets of the game. This is by no means the first chess benchmark, but I think it is comprehensive, easy to understand, and documents a moment in time as these models appear to be moving past my skill level (I peaked around 1800 Elo many years ago).

[Board diagram: White to move]

Gemini 3.1 Pro
The Knight on g5 is staring directly at f7. If my Knight leaps to f7, it places the Black King in check. The King can't go to g8 because my Bishop controls the square. The g7 pawn stops it from going to g7. There's nowhere for the King to run. Checkmate!
Nf7#

Frontier Model Chess Performance

Chess Bench breaks down chess playing ability into full game, puzzle, and endgame performance. The model is treated just like a human. It is given the board state at every turn, and asked to reason about the correct move.

Model          | Endgame Win % | Puzzle Elo | Full Game Elo
Gemini 3.1 Pro | 75%           | 2141       | 1920
GPT 5.4        | 55%           | 2054       | —
Opus 4.6       | 5%            | 1027       | —

All models were run with maximum thinking configured.

To make this as realistic as possible, I pass the previous move and the current board state into the model at every turn. This imitates how a human plays chess: they do not need to mentally recreate the board from the move sequence; they can simply observe it. I also maintain some reasoning history between moves so that the model can complete nuanced tactical sequences coherently.
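The post doesn't include the harness itself; as a rough sketch (function and field names here are my own invention, not the real harness), the per-turn prompt might be assembled like this:

```python
def build_turn_prompt(fen, last_move, reasoning_history, max_history=3):
    """Assemble one turn's prompt: the model sees the board directly (as FEN)
    plus the opponent's last move, so it never has to reconstruct the position
    from the full move sequence. A short window of its own recent reasoning is
    carried over so multi-move tactics stay coherent."""
    history = "\n".join(reasoning_history[-max_history:]) or "(none yet)"
    return (
        f"Current position (FEN): {fen}\n"
        f"Opponent's last move: {last_move}\n"
        f"Your recent reasoning:\n{history}\n"
        "Think about the position, then give your move in standard algebraic notation."
    )
```

Truncating the reasoning history keeps the context small while still letting the model finish a tactical sequence it started planning several moves earlier.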

Endgames

The standard formula for winning a chess game is to gain an advantage in the opening, convert it to a material gain in the middle game, and then use the material advantage to win the endgame. Because of this, early in a chess player's career they are taught how to win a variety of endgames. Some are simple (King + Queen vs. King) and some are quite complex (King + Rook + Pawn vs. King + Rook).

To test the models, I set them up with 20 theoretically won endgame positions across 4 difficulty tiers and had them play against Stockfish. The model must convert the winning position into checkmate. A draw or loss counts as a failure.
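The scoring rule is strict: only checkmate counts. A minimal sketch of the adjudication, with Gemini's per-position outcomes transcribed from the results table ("win" means the model delivered mate):

```python
# Per-position outcomes for Gemini 3.1 Pro, transcribed from the results
# table. A draw or a loss both count as a failed conversion.
gemini_outcomes = {
    "Tier 1": ["win", "win", "win", "win"],
    "Tier 2": ["win", "win", "win", "win", "win"],
    "Tier 3": ["draw", "win", "loss", "draw", "win", "win", "draw"],
    "Tier 4": ["win", "win", "draw", "win"],
}

def conversion_rate(outcomes):
    """Fraction of theoretically won positions actually converted to mate."""
    return sum(o == "win" for o in outcomes) / len(outcomes)

all_positions = [o for tier in gemini_outcomes.values() for o in tier]
print(f"overall: {conversion_rate(all_positions):.0%}")  # overall: 75%
for tier, outcomes in gemini_outcomes.items():
    print(f"{tier}: {conversion_rate(outcomes):.0%}")
```

This is just the bookkeeping; the interesting part, the model-vs-Stockfish game loop, is not shown in the post.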

Endgame Results by Position
Position                            | Stockfish | Gemini 3.1 Pro | GPT 5.4  | Opus 4.6
Tier 1: Elementary
KQ vs K, Central King               | Win (7)   | Win (10)       | Draw     | Draw
KQ vs K, Corner Defense             | Win (7)   | Win (7)        | Win (7)  | Win (8)
KR vs K, Central King               | Win (16)  | Win (18)       | Win (19) | Draw
KR vs K, Edge Defense               | Win (12)  | Win (12)       | Draw     | Draw
Tier 2: Intermediate
KBB vs K, Central                   | Win (27)  | Win (19)       | Win (25) | Draw
KP vs K, Advanced Passed Pawn       | Win (8)   | Win (9)        | Win (9)  | Draw
KP vs K, King Outflanks             | Win (13)  | Win (19)       | Win (16) | Draw
KP vs K, King Supports Pawn         | Win (12)  | Win (15)       | Win (12) | Draw
KP vs K, Opposition Critical        | Win (11)  | Win (11)       | Win (19) | Draw
Tier 3: Advanced
KBN vs K, Drive to Correct Corner   | Win (35)  | Draw           | Draw     | Draw
KBN vs K, Wrong Corner Start        | Win (31)  | Win (29)       | Draw     | Draw
KQ vs KR, Central                   | Win (20)  | Loss           | Loss     | Loss
KQ vs KR, Rook Defending            | Win (31)  | Draw           | Loss     | Loss
KRP vs KR, Advanced Rook Pawn       | Win (14)  | Win (20)       | Draw     | Draw
KRP vs KR, Lucena Position          | Win (13)  | Win (28)       | Win (16) | Loss
KRP vs KR, Pawn on 6th with Support | Win (22)  | Draw           | Draw     | Draw
Tier 4: Complex
KQP vs KQ, Advanced Pawn            | Win (11)  | Win (12)       | Win (19) | Draw
KQP vs KQ, Pawn on 7th              | Win (9)   | Win (17)       | Win (15) | Loss
KRBP vs KRB, Passed Pawn            | Win (14)  | Draw           | Win (27) | Loss
KRR vs KR, Two Rooks Dominate       | Win (20)  | Win (34)       | Draw     | Loss
Total                               | 100%      | 75%            | 55%      | 5%

Win (N) = checkmate in N moves. Draw/Loss = failed to convert.

Reading the traces, Gemini reasons about complex endgames in a way that feels very familiar to me. Below is a Tier 4 endgame where Gemini converts a Queen + Pawn vs. Queen position. It checks the king, skewers the queen, promotes the pawn, and then methodically mates with King and Queen.

KQP vs KQ: Advanced Pawn
[Interactive game viewer: Gemini 3.1 Pro's move-by-move conversion of the position]

Puzzles

To benchmark tactical ability, I curated 100 recent puzzles from Lichess spanning ratings from 500 to 2500. Each puzzle presents a critical position where there is one clearly best move or forcing sequence. The model sees only the board state and must find the winning continuation.

Puzzle Accuracy by Rating Tier
Puzzle Rating | Count | Gemini 3.1 Pro | GPT 5.4 | Opus 4.6
500–700       | 10    | 10/10          | 9/10    | 7/10
700–900       | 10    | 10/10          | 10/10   | 6/10
900–1100      | 10    | 10/10          | 10/10   | 6/10
1100–1300     | 10    | 9/10           | 10/10   | 3/10
1300–1500     | 10    | 8/10           | 10/10   | 2/10
1500–1700     | 10    | 6/10           | 3/10    | 1/10
1700–1900     | 10    | 7/10           | 8/10    | 1/10
1900–2100     | 10    | 9/10           | 9/10    | 1/10
2100–2300     | 10    | 7/10           | 5/10    | 1/10
2300–2500     | 10    | 5/10           | 2/10    | 1/10
Total         | 100   | 81/100         | 76/100  | 29/100
Estimated Elo |       | 2141           | 2054    | 1027
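The post doesn't say how the Estimated Elo row is computed. One standard approach is a maximum-likelihood fit of the Elo expected-score curve to the per-tier solve counts. A sketch using the Gemini column with tier-midpoint ratings (the method and midpoints are my own choices, so this won't reproduce the table's numbers exactly):

```python
import math

# (tier midpoint rating, puzzles solved out of 10), from the Gemini column.
gemini_tiers = [
    (600, 10), (800, 10), (1000, 10), (1200, 9), (1400, 8),
    (1600, 6), (1800, 7), (2000, 9), (2200, 7), (2400, 5),
]

def solve_prob(player_elo, puzzle_rating):
    """Standard Elo expected-score curve."""
    return 1.0 / (1.0 + 10 ** ((puzzle_rating - player_elo) / 400.0))

def mle_elo(tiers, n_per_tier=10):
    """Grid-search the Elo that maximizes the binomial log-likelihood."""
    def log_lik(elo):
        ll = 0.0
        for rating, solved in tiers:
            p = solve_prob(elo, rating)
            ll += solved * math.log(p) + (n_per_tier - solved) * math.log(1 - p)
        return ll
    return max(range(0, 3201), key=log_lik)

print(mle_elo(gemini_tiers))
```

On these numbers the fit lands in the low-to-mid 2100s, in the same neighborhood as the table's 2141, though the post's exact estimator may differ.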

Again, Gemini excels in both speed and accuracy. GPT 5.4 is also quite strong but can easily take up to 30 minutes per move on a puzzle. Opus is hopeless; it cannot reason through even a mildly nuanced tactic. Below is a mate in 2 that requires spotting a queen sacrifice. Gemini and GPT both find the winning Qxg6+, while Opus considers it but talks itself out of it, playing f5 instead.

Puzzle: White to Move
[Board diagram]
Mate in 2
The first thing I see is a juicy target: Black’s pawn on g6. My Queen on g4 is staring it down. Black’s King is on g8. My a2 Bishop is pointing right at f7. If I play Qxg6+, the King has to move — and if Kf8, then Qxf7 is checkmate! The Bishop on a2 covers f7 along the diagonal, and the King has nowhere to run.
Qxg6+

Gemini and GPT 5.4 both reason conditionally through the tactic. The key insight is that the bishop on a2 pins the f7 pawn along the diagonal, which means after Qxg6+ the pawn cannot recapture and the king is forced to move into a mating net. Opus investigates Qxg6+ but misses the bishop's role in the pin, concludes the queen capture is unsound, and settles for the much less interesting f5 instead.

Full Games

Puzzles and endgames test isolated skills, but full games require sustained play across all phases: opening preparation, middlegame tactics, and endgame technique. To measure this, I had Gemini 3.1 Pro climb the Elo ladder: 16 games (8 openings × 2 colors) at each Stockfish skill level, from 0 through 8. Gemini is the only model that can complete games at a reasonable speed, although based on the narrower benchmarks I would expect GPT 5.4 to play at around 1800 Elo given 8–12 hours per game.

Each Stockfish skill level maps to an estimated Elo rating, giving us a performance curve. A BayesElo analysis of the full 144 games estimates Gemini at 1920 Elo.

Gemini 3.1 Pro vs. Stockfish 18
Stockfish Elo | W  | L  | D | Win %
1320          | 16 | 0  | 0 | 100%
1444          | 13 | 3  | 0 | 81%
1566          | 10 | 6  | 0 | 63%
1729          | 9  | 7  | 0 | 56%
1953          | 7  | 9  | 0 | 44%
2204          | 7  | 9  | 0 | 44%
2363          | 3  | 12 | 1 | 22%
2500          | 2  | 14 | 0 | 13%
2596          | 1  | 15 | 0 | 6%
Total         | 68 | 75 | 1 | 47%

16 games per level (8 openings × 2 colors). BayesElo estimate: 1920 (95% CI: 1831–2010).

Below is Gemini's most impressive win, a 67-move Italian Game where it slowly builds a passed g-pawn against a highly rated Stockfish opponent.

Italian Game: Gemini (White) vs. Stockfish (~2596 Elo)
[Interactive game viewer: step through Gemini's 67-move win]

Observations

Gemini is clearly the most optimized model for chess. It slightly outperforms GPT 5.4 on Elo but massively outperforms it on speed. GPT 5.4 performs significantly worse with thinking set to high instead of xhigh, and can take up to 30 minutes per move, while Gemini can easily play a full game in 30 minutes or less. It almost appears as if GPT 5.4 is deriving chess from first principles while Gemini has been explicitly trained to reason about chess (which would make sense given how much time DeepMind has historically spent researching games). Opus has no understanding of the geometry of the board. I have noticed this in other settings as well, where Opus struggles with spatial reasoning; this may be related to the fact that Anthropic has invested far less in multimodal capabilities and mathematical reasoning.

I would have expected the models to be a bit stronger at endgames. Endgames are reasonably heuristic-based: once you know the meta-strategy for trapping the king in the corner, you can convert a lot of similar-looking endgames. All this intuition should be available to the models, yet they still struggle in certain scenarios to fully convert endgame strategy into wins. In a future post I will explore some techniques for improving language models at chess while preserving this reasoning-trace style of gameplay.