
Analyzing Chess Input Modalities

April 16, 2026

In a previous post I benchmarked the current crop of frontier LLMs on chess ability and found that Gemini 3.1 Pro and GPT 5.4 played at nearly expert level. But as I read more about earlier attempts to benchmark chess, I kept running into debates about how the input modality affects performance. I had noticed this anecdotally too: I have a pet project integrating a robotic arm with a vision-action model to play a chess move end to end.

Meridian office robotic arm trying to play chess

I've noticed that even though large foundation models have sophisticated chess understanding and strong vision capabilities, they still struggle to reason spatially about chess boards. To understand this better I built a series of diagnostic tests that have models solve chess puzzles using everything from move-notation inputs to photos of the board on my desk.

LLMs Struggle with 3D Vision Tasks

I picked 30 puzzles that varied from crowded middlegames down to simple endgames. They are all one-move puzzles that someone around 1000 Elo should be able to solve. I then rendered each puzzle in five different input modalities: UCI move history, PGN notation, a FEN string, a 2D PNG of the board, and a pair of photos of a real chess board on my desk captured from two angles. Below is an example of all five inputs for a middlegame puzzle.

UCI Move History
e2e4 c7c5 g1f3 b8c6 d2d4 c5d4 f3d4 d8b6 d4b3 g8f6 c1e3 b6b4 b1d2 f6e4 c2c3 b4a4 d1g4 e4c3 f1c4 d7d5 g4h4 c3b5 b3c5 a4b4 a2a3 b4b2 a1b1 b2a3 c4b5 e7e6 c5b3 a3b4 b5c6 b7c6 h4b4 f8b4 e3c5 a7a5 c5b4 a5b4 b3c5 a8b8 d2b3 e8e7 e1d2 e7d6 b1a1 e6e5 a1a7 c8e6 h1a1 e5e4 d2e3 g7g5 e3d4 h8c8 b3d2 b8b5 d2e4 d5e4
PGN
1. e4 c5 2. Nf3 Nc6 3. d4 cxd4 4. Nxd4 Qb6 5. Nb3 Nf6 6. Be3 Qb4+ 7. N1d2 Nxe4 8. c3 Qa4 9. Qg4 Nxc3 10. Bc4 d5 11. Qh4 Nb5 12. Nc5 Qb4 13. a3 Qxb2 14. Rb1 Qxa3 15. Bxb5 e6 16. Ncb3 Qb4 17. Bxc6+ bxc6 18. Qxb4 Bxb4 19. Bc5 a5 20. Bxb4 axb4 21. Nc5 Rb8 22. Ndb3 Ke7 23. Kd2 Kd6 24. Ra1 e5 25. Ra7 Be6 26. Rha1 e4 27. Ke3 g5 28. Kd4 Rhc8 29. Nd2 Rb5 30. Ndxe4+ dxe4
FEN
2r5/R4p1p/2pkb3/1rN3p1/1p1Kp3/8/5PPP/R7 w - - 0 31
PNG Render
2D rendered board for puzzle 15, white to move
Real-Board Photos
Overhead photo of puzzle 15 on a physical chess board
overhead
45 degree angle photo of puzzle 15 on a physical chess board
45° side
17 pieces, rating 1229, white to move. Solution: Nxe4#.
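
For reference, here's roughly how the notation modalities can be derived from a single game prefix with python-chess. This is an illustrative sketch rather than my exact pipeline; the PGN string is truncated, and the PNG step (rasterizing the SVG, e.g. with cairosvg) is left out.

```python
import io

import chess
import chess.pgn
import chess.svg

# Illustrative only: derive the UCI history, FEN, and a 2D render for one
# puzzle from its PGN prefix. The PGN here is truncated for brevity.
pgn_text = "1. e4 c5 2. Nf3 Nc6 3. d4 cxd4"

game = chess.pgn.read_game(io.StringIO(pgn_text))
board = game.board()
moves = list(game.mainline_moves())
for move in moves:
    board.push(move)

uci_history = " ".join(m.uci() for m in moves)  # "e2e4 c7c5 g1f3 b8c6 ..."
fen = board.fen()                               # position after the prefix
svg = chess.svg.board(board)                    # 2D render as SVG; rasterize
                                                # to PNG with e.g. cairosvg
```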

Each (model, modality) pair was graded on 30 single-shot attempts with no retries on illegal moves. The solve rates across the five test models are shown below.

Puzzle Accuracy by Model and Modality

| Model | UCI | PGN | FEN | PNG Image | Photos |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 97% | 100% | 100% | 77% | 73% |
| GPT 5.4 | 93% | 100% | 100% | 93% | 40% |
| Opus 4.7 | 53% | 50% | 50% | 57% | 20% |
| Opus 4.6 | 33% | 33% | 40% | 20% | 17% |
| Qwen 3.5 27B | 13% | 7% | 23% | 37% | 17% |

30 one-move puzzles per cell, thinking effort set to “low” for every model.
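
The grading itself is easy to make mechanical. Below is a minimal sketch of the check, assuming the model's move has already been extracted from its response (it accepts either SAN or UCI):

```python
import chess

def grade_attempt(fen: str, solution_uci: str, answer: str) -> bool:
    """Single-shot grading: unparseable, illegal, or wrong moves all miss."""
    board = chess.Board(fen)
    try:
        move = board.parse_san(answer)          # try SAN first, e.g. "Nxe4#"
    except ValueError:
        try:
            move = chess.Move.from_uci(answer)  # fall back to UCI, e.g. "c5e4"
        except ValueError:
            return False                        # unparseable output
    if move not in board.legal_moves:
        return False                            # no retries on illegal moves
    return move.uci() == solution_uci
```

For puzzle 15 above, both `grade_attempt(fen, "c5e4", "Nxe4#")` and `grade_attempt(fen, "c5e4", "c5e4")` come back True.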

The models previously found to excel at chess have a nearly perfect grasp of notation-based game input. Interestingly, GPT 5.4 does quite well on the 2D PNG task but poorly on the 3D photos. Gemini is the only model with a credible attempt at the 3D photos, which matches my finding from earlier in the week that Gemini models can reconstruct the board from 3D images far better than any other model.

Qwen is the only model that seems to do better with vision inputs than with text. This is quite surprising to me. I don't know the exact reason, but I explore it a bit later on.

Can Gemini See the Board?

To separate vision from solving, I asked Gemini to transcribe each position to FEN and compared the transcription to the actual values.

Gemini 3.1 Pro: Vision vs Solving

| Metric | PNG Image | Photos |
|---|---|---|
| FEN read exactly right | 93% | 33% |
| Puzzle solved | 77% | 73% |
| Mean squares correct (of 64) | 63.9 | 61.0 |

30 puzzles per modality.
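
The per-square metric is just a cell-by-cell comparison of the transcribed board against the ground truth. A minimal version, again with python-chess, assuming malformed FENs are handled upstream:

```python
import chess

def squares_correct(true_fen: str, predicted_fen: str) -> int:
    """Count squares (of 64) where the transcription matches ground truth.
    Two empty squares count as a match; a malformed FEN raises ValueError."""
    truth = chess.Board(true_fen)
    pred = chess.Board(predicted_fen)
    return sum(truth.piece_at(sq) == pred.piece_at(sq) for sq in chess.SQUARES)
```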

Interestingly, Gemini struggles with the two image tasks for different reasons. For PNG images, it reads the board almost perfectly but is often unable to convert that reading into a puzzle solve with only limited thinking. This is somewhat strange given that it solves every one of these puzzles from a raw FEN, so presumably the intermediate decode step eats into the thinking budget.

For the 3D photos, Gemini is often off by one or two piece locations. But when those misread pieces are immaterial to the solve, it can still piece together the correct solution.

Looking Inside Qwen’s Activations

I generated 2,000 synthetic positions for four of the five input modalities (2,000 photos was too much work), ran them through Qwen 3.5 27B, and pulled the residual-stream activations at every layer. I measured similarity with Centered Kernel Alignment, a standard way to check whether two sets of activations encode the same information.
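
For concreteness, the linear variant of CKA is just a normalized Frobenius inner product between centered activation matrices. A numpy sketch, computed per layer with X and Y of shape (n_positions, d_model):

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_positions, d)."""
    x = x - x.mean(axis=0)  # center each feature
    y = y - y.mean(axis=0)
    # ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(x.T @ y, ord="fro") ** 2
    denominator = np.linalg.norm(x.T @ x, ord="fro") * np.linalg.norm(y.T @ y, ord="fro")
    return float(numerator / denominator)
```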

There’s one issue with raw CKA values though. Two activation streams can score high just because they share the same prompt structure and length. To work around this I built a stricter null: instead of comparing raw values, the chart shows how much each pair beats a position-pairing shuffle within tight piece-count and ply buckets. The null absorbs everything that’s not actually about the chess content (the system prompt, the template structure, the prompt length, the position complexity), so what’s left above zero is the content-driven alignment in the residual stream.
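
And the shuffle null: within each (piece count, ply) bucket, re-pair the positions at random and recompute CKA; the chart plots observed CKA minus the mean of that null. A sketch reusing `linear_cka` from above; the bucket widths here are an assumption, not my exact values:

```python
from collections import defaultdict

import numpy as np

def cka_delta_over_null(x, y, piece_counts, plies, n_shuffles=100, seed=0):
    """Observed CKA minus a null where the position pairing is shuffled
    within (piece count, ply bucket) groups. Assumes x[i] and y[i] encode
    the same position; bucket granularity is illustrative."""
    rng = np.random.default_rng(seed)
    buckets = defaultdict(list)
    for i, (pc, ply) in enumerate(zip(piece_counts, plies)):
        buckets[(pc, ply // 10)].append(i)

    nulls = []
    for _ in range(n_shuffles):
        perm = np.arange(len(x))
        for idx in buckets.values():
            idx = np.array(idx)
            perm[idx] = idx[rng.permutation(len(idx))]  # shuffle within bucket
        nulls.append(linear_cka(x, y[perm]))
    return linear_cka(x, y) - float(np.mean(nulls))
```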

Qwen Cross-Modality Subspace Alignment

[Chart: CKA Δ over null (y-axis, 0.00 to 0.75) by layer (x-axis, 1 to 64) for four modality pairs: FEN ↔ PNG Image, UCI ↔ FEN, UCI ↔ PNG Image, UCI ↔ PGN.]

Two clusters jump out in the late layers. FEN, PNG, and UCI all converge into the same shared board representation: FEN↔PNG peaks at +0.68 above the null, with UCI↔FEN and UCI↔PNG close behind around +0.54. UCI↔PGN, on the other hand, only gets to about +0.29.

I'm not really sure why this happens. But empirically, PGN is Qwen's weakest solving modality (7%), so perhaps the chart is picking up an underlying board representation that PGN inputs fail to activate.