March 31, 2026
Computers have been superhuman at chess for almost thirty years. But modern chess engines achieve this performance in a way that is impossible for humans to replicate. First, they use hyper-optimized search techniques to look 20-30 moves in advance across many different combinations to reason near perfectly about complex tactical positions. Second, they distill billions of games into sophisticated neural networks to quickly evaluate the subtleties of positions with features that are hard for humans to understand.
On the other hand, if you watch an LLM play chess, it seems to reason in a way that is much closer to how humans approach the game: thinking through simple positional features, or trying to calculate lines and often getting lost or making small calculation errors. Therefore, if LLMs can play chess at a high level, they may be able to convey insights to human players in a far more digestible way than an engine can.
To understand their current performance, I built a series of chess benchmarks to evaluate them across a few facets of the game. This is by no means the first chess benchmark, but I think it is comprehensive, easy to understand, and documents a moment in time as these models appear to be moving past my skill level (I peaked around 1800 Elo many years ago).
Chess Bench breaks down chess playing ability into full game, puzzle, and endgame performance. The model is treated just like a human. It is given the board state at every turn, and asked to reason about the correct move.
| Model | Endgame Win % | Puzzle Elo | Full Game Elo |
|---|---|---|---|
| Gemini 3.1 Pro | 75% | 2141 | 1920 |
| GPT 5.4 | 55% | 2054 | — |
| Opus 4.6 | 5% | 1027 | — |
All models were run with maximum thinking configured.
To make this as realistic as possible, I pass the previous move and the current board state into the model at every turn. This imitates how a human plays chess: they do not need to mentally recreate the board from the move sequence, because they can simply look at it. I also maintain some reasoning history between moves so that the model can complete nuanced tactical sequences in a coherent manner.
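As a rough illustration of what one turn's input could look like, here is a minimal sketch that renders a FEN string as a text board and bundles it with the previous move and a few carried-over notes. The function names, the prompt wording, and the choice to keep only the last three notes are all assumptions for illustration, not the benchmark's actual harness.

```python
from typing import List, Optional


def fen_to_ascii(fen: str) -> str:
    """Render the piece-placement field of a FEN string as an 8x8 text board."""
    placement = fen.split()[0]
    rows = []
    for rank_index, rank in enumerate(placement.split("/")):
        squares = []
        for ch in rank:
            if ch.isdigit():
                squares.extend("." * int(ch))  # a digit means that many empty squares
            else:
                squares.append(ch)  # uppercase = White piece, lowercase = Black
        rows.append(f"{8 - rank_index} " + " ".join(squares))
    rows.append("  a b c d e f g h")
    return "\n".join(rows)


def build_turn_prompt(fen: str, previous_move: Optional[str], notes: List[str]) -> str:
    """Assemble one turn's prompt: last move, current board, recent reasoning notes."""
    side = "White" if fen.split()[1] == "w" else "Black"
    parts = [
        f"Previous move: {previous_move or 'none (game start)'}",
        f"{side} to move. Current board:",
        fen_to_ascii(fen),
        f"FEN: {fen}",
    ]
    if notes:
        parts.append("Your notes from earlier moves:")
        # Hypothetical policy: carry only the three most recent notes forward.
        parts.extend(f"- {note}" for note in notes[-3:])
    parts.append("Reason about the position, then give your move in SAN.")
    return "\n".join(parts)


if __name__ == "__main__":
    after_e4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1"
    print(build_turn_prompt(after_e4, "e4", ["Aim for a solid central setup."]))
```

The point of the design is that the model never has to replay the move list to know where the pieces are. The board is stated outright every turn, and only the reasoning history is carried forward.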
The standard formula for winning a chess game is to gain an advantage in the opening, convert it to a material gain in the middle game, and then use the material advantage to win the endgame. Because of this, early in a chess player's career they are taught how to win a variety of endgames. Some are simple (King + Queen vs. King) and some are quite complex (King + Rook + Pawn vs. King + Rook).
To test the models, I set them up with 20 theoretically won endgame positions across 4 difficulty tiers and had them play against Stockfish. The model must convert the winning position into checkmate. A draw or loss counts as a failure.
| Position | Stockfish | Gemini 3.1 Pro | GPT 5.4 | Opus 4.6 |
|---|---|---|---|---|
| Tier 1: Elementary | | | | |
| KQ vs K, Central King | Win (7) | Win (10) | Draw | Draw |
| KQ vs K, Corner Defense | Win (7) | Win (7) | Win (7) | Win (8) |
| KR vs K, Central King | Win (16) | Win (18) | Win (19) | Draw |
| KR vs K, Edge Defense | Win (12) | Win (12) | Draw | Draw |
| Tier 2: Intermediate | | | | |
| KBB vs K, Central | Win (27) | Win (19) | Win (25) | Draw |
| KP vs K, Advanced Passed Pawn | Win (8) | Win (9) | Win (9) | Draw |
| KP vs K, King Outflanks | Win (13) | Win (19) | Win (16) | Draw |
| KP vs K, King Supports Pawn | Win (12) | Win (15) | Win (12) | Draw |
| KP vs K, Opposition Critical | Win (11) | Win (11) | Win (19) | Draw |
| Tier 3: Advanced | | | | |
| KBN vs K, Drive to Correct Corner | Win (35) | Draw | Draw | Draw |
| KBN vs K, Wrong Corner Start | Win (31) | Win (29) | Draw | Draw |
| KQ vs KR, Central | Win (20) | Loss | Loss | Loss |
| KQ vs KR, Rook Defending | Win (31) | Draw | Loss | Loss |
| KRP vs KR, Advanced Rook Pawn | Win (14) | Win (20) | Draw | Draw |
| KRP vs KR, Lucena Position | Win (13) | Win (28) | Win (16) | Loss |
| KRP vs KR, Pawn on 6th with Support | Win (22) | Draw | Draw | Draw |
| Tier 4: Complex | | | | |
| KQP vs KQ, Advanced Pawn | Win (11) | Win (12) | Win (19) | Draw |
| KQP vs KQ, Pawn on 7th | Win (9) | Win (17) | Win (15) | Loss |
| KRBP vs KRB, Passed Pawn | Win (14) | Draw | Win (27) | Loss |
| KRR vs KR, Two Rooks Dominate | Win (20) | Win (34) | Draw | Loss |
| Total | 100% | 75% | 55% | 5% |
Win (N) = checkmate in N moves. Draw/Loss = failed to convert.
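The scoring rule is strict: only a win counts, and a draw is scored the same as a loss. A small sketch of that arithmetic, using Gemini 3.1 Pro's outcomes transcribed from the table in row order, is below. The data layout and function names are illustrative assumptions.

```python
from collections import defaultdict

# (tier, outcome) for Gemini 3.1 Pro, transcribed from the table in row order.
GEMINI_RESULTS = [
    (1, "win"), (1, "win"), (1, "win"), (1, "win"),
    (2, "win"), (2, "win"), (2, "win"), (2, "win"), (2, "win"),
    (3, "draw"), (3, "win"), (3, "loss"), (3, "draw"),
    (3, "win"), (3, "win"), (3, "draw"),
    (4, "win"), (4, "win"), (4, "draw"), (4, "win"),
]


def conversion_rate(results):
    """Fraction of positions converted; draws and losses both count as failures."""
    wins = sum(1 for _, outcome in results if outcome == "win")
    return wins / len(results)


def wins_by_tier(results):
    """Map each tier to (wins, positions attempted)."""
    tally = defaultdict(lambda: [0, 0])
    for tier, outcome in results:
        tally[tier][1] += 1
        if outcome == "win":
            tally[tier][0] += 1
    return {tier: tuple(counts) for tier, counts in sorted(tally.items())}


if __name__ == "__main__":
    print(f"Overall: {conversion_rate(GEMINI_RESULTS):.0%}")
    for tier, (won, total) in wins_by_tier(GEMINI_RESULTS).items():
        print(f"Tier {tier}: {won}/{total}")
```

Broken out this way, Gemini's misses are concentrated in Tier 3, where it converts 3 of 7 positions.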
Reading the traces, Gemini reasons about complex endgames in a way that feels very familiar to me. Below is a Tier 4 endgame where Gemini converts a Queen + Pawn vs. Queen position. It checks the king, skewers the queen, promotes the pawn, and then methodically mates with King and Queen.
To benchmark tactical ability, I curated 100 recent puzzles from Lichess spanning ratings from 500 to 2500. Each puzzle presents a critical position where there is one clearly best move or forcing sequence. The model sees only the board state and must find the winning continuation.
| Puzzle Rating | Count | Gemini 3.1 Pro | GPT 5.4 | Opus 4.6 |
|---|---|---|---|---|
| 500–700 | 10 | 10/10 | 9/10 | 7/10 |
| 700–900 | 10 | 10/10 | 10/10 | 6/10 |
| 900–1100 | 10 | 10/10 | 10/10 | 6/10 |
| 1100–1300 | 10 | 9/10 | 10/10 | 3/10 |
| 1300–1500 | 10 | 8/10 | 10/10 | 2/10 |
| 1500–1700 | 10 | 6/10 | 3/10 | 1/10 |
| 1700–1900 | 10 | 7/10 | 8/10 | 1/10 |
| 1900–2100 | 10 | 9/10 | 9/10 | 1/10 |
| 2100–2300 | 10 | 7/10 | 5/10 | 1/10 |
| 2300–2500 | 10 | 5/10 | 2/10 | 1/10 |
| Total | 100 | 81/100 | 76/100 | 29/100 |
| Estimated Elo | | 2141 | 2054 | 1027 |
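One plausible way to turn a solve table like this into a single rating is to treat every puzzle as an opponent with its own rating and find the player rating at which the expected number of solves, under the standard logistic Elo curve, equals the observed count. The sketch below does that by bisection. It is an assumption about the method, not necessarily how the estimates above were computed, and because it substitutes bucket midpoints for the individual puzzle ratings (which are not listed here), its outputs will not match the reported figures exactly.

```python
# Bucket midpoints stand in for the unlisted individual puzzle ratings (assumption).
MIDPOINTS = [600, 800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400]
PUZZLES_PER_BUCKET = 10

# Solve counts per bucket, transcribed from the table above.
SOLVED = {
    "Gemini 3.1 Pro": [10, 10, 10, 9, 8, 6, 7, 9, 7, 5],
    "GPT 5.4": [9, 10, 10, 10, 10, 3, 8, 9, 5, 2],
    "Opus 4.6": [7, 6, 6, 3, 2, 1, 1, 1, 1, 1],
}


def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expectation of scoring against an opponent (here, a puzzle)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def expected_solves(rating: float) -> float:
    """Expected number of puzzles solved across all buckets at a given rating."""
    return sum(PUZZLES_PER_BUCKET * expected_score(rating, m) for m in MIDPOINTS)


def estimate_rating(solved_per_bucket, lo: float = 0.0, hi: float = 4000.0) -> float:
    """Bisect for the rating whose expected solve count equals the observed one."""
    target = sum(solved_per_bucket)
    for _ in range(60):
        mid = (lo + hi) / 2
        if expected_solves(mid) < target:
            lo = mid  # too weak to explain this many solves; search higher
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for model, solved in SOLVED.items():
        print(f"{model}: ~{estimate_rating(solved):.0f}")
```

Under this model the estimate depends only on the total number of solves, so it cannot distinguish a model that drops easy puzzles from one that drops hard ones. A fit against per-puzzle ratings would not have that limitation.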
Again, Gemini excels both on speed and accuracy. GPT 5.4 is also quite strong but can easily take up to 30 minutes per move for a puzzle. Opus is hopeless; it cannot reason about even a mildly nuanced tactic. Below is a mate in 2 that requires spotting a queen sacrifice. Gemini and GPT both find the winning Qxg6+, while Opus considers it but talks itself out of it, playing f5 instead.
Gemini and GPT 5.4 both reason conditionally through the tactic. The key insight is that the bishop on a2 pins the f7 pawn along the diagonal, which means after Qxg6+ the pawn cannot recapture and the king is forced to move into a mating net. Opus investigates Qxg6+ but misses the bishop's role in the pin, concludes the queen capture is unsound, and settles for the much less interesting f5 instead.
Puzzles and endgames test isolated skills, but full games require sustained play across all phases: opening preparation, middlegame tactics, and endgame technique. To measure this, I had Gemini 3.1 Pro climb the Elo ladder: 16 games (8 openings × 2 colors) at each Stockfish skill level, from 0 through 8. Gemini is the only model that can complete games at a reasonable speed, although based on the narrower benchmarks, I would expect GPT 5.4 could play around 1800 Elo given 8–12 hours per game.
Each Stockfish skill level maps to an estimated Elo rating, giving us a performance curve. A BayesElo analysis of the full 144 games estimates Gemini at 1920 Elo.
| Stockfish Elo | W | L | D | Win % |
|---|---|---|---|---|
| 1320 | 16 | 0 | 0 | 100% |
| 1444 | 13 | 3 | 0 | 81% |
| 1566 | 10 | 6 | 0 | 63% |
| 1729 | 9 | 7 | 0 | 56% |
| 1953 | 7 | 9 | 0 | 44% |
| 2204 | 7 | 9 | 0 | 44% |
| 2363 | 3 | 12 | 1 | 22% |
| 2500 | 2 | 14 | 0 | 13% |
| 2596 | 1 | 15 | 0 | 6% |
| Total | 68 | 75 | 1 | 47% |
16 games per level (8 openings × 2 colors). BayesElo estimate: 1920 (95% CI: 1831–2010).
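To see how the single 1920 figure relates to the rung-by-rung results, it helps to compare the observed score at each Stockfish level with what the plain logistic Elo formula would predict for a 1920-rated player. The sketch below is that sanity check under stated assumptions: draws count as half a point, and the simple Elo expectation is used rather than BayesElo's actual model, which also accounts for draws and a prior.

```python
# (Stockfish Elo, wins, losses, draws), transcribed from the table above.
LADDER = [
    (1320, 16, 0, 0),
    (1444, 13, 3, 0),
    (1566, 10, 6, 0),
    (1729, 9, 7, 0),
    (1953, 7, 9, 0),
    (2204, 7, 9, 0),
    (2363, 3, 12, 1),
    (2500, 2, 14, 0),
    (2596, 1, 15, 0),
]
REPORTED_RATING = 1920  # the BayesElo estimate quoted in the text


def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score against a single opponent."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def compare(rating: float = REPORTED_RATING):
    """Return (opponent Elo, observed score, expected score) for each rung."""
    rows = []
    for opponent, wins, losses, draws in LADDER:
        games = wins + losses + draws
        observed = (wins + 0.5 * draws) / games  # a draw is worth half a point
        rows.append((opponent, observed, expected_score(rating, opponent)))
    return rows


if __name__ == "__main__":
    total_expected = 0.0
    for opponent, observed, expected in compare():
        total_expected += 16 * expected
        print(f"{opponent}: observed {observed:.0%}, expected {expected:.0%}")
    print(f"Expected total points at {REPORTED_RATING}: {total_expected:.1f}")
```

Under these assumptions the expected total at 1920 comes out close to the 68.5 points actually scored (68 wins plus one draw), so the simple model and the BayesElo figure agree in aggregate. Individual rungs still deviate: the observed curve is flatter than the logistic one, with weaker results than predicted at the low rungs and stronger results at 2204.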
Below is Gemini's most impressive win, a 67-move Italian Game where it slowly builds a passed g-pawn against a highly rated Stockfish opponent.
Gemini is clearly the most optimized model for chess. It slightly outperforms GPT 5.4 on Elo but massively outperforms it on speed: GPT 5.4 can take up to 30 minutes per move, while Gemini can easily play a full game in 30 minutes or less. GPT 5.4 also performs significantly worse with thinking set to high instead of xhigh. It almost appears as if GPT 5.4 is deriving chess from first principles while Gemini is explicitly trained to reason about chess (which would make sense given how much time DeepMind has historically spent researching games). Opus has no understanding of the geometry of the board. I have noticed this in other settings as well, where Opus struggles with spatial reasoning. This may be related to the fact that Anthropic has invested far less in multimodal capabilities and mathematical reasoning.
I would have expected the models to be a bit stronger at endgames. Endgames are largely heuristic-based: once you know the general strategy for trapping the king in the corner, you can convert a lot of similar-looking positions. All of this intuition should be available to the models, but in certain scenarios they still struggle to turn endgame strategy into wins. In a future post I will explore some techniques for improving language models at chess while preserving this reasoning-trace style of gameplay.