THE AI TODAY

arXiv:2509.24239v4 Announce Type: replace-cross
Abstract: Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine strategic reasoning, or do they primarily excel at pattern recognition? To address this, we present ChessArena, a chess-based testbed for evaluating LLMs. Chess demands strategic reasoning, precise rule adherence, and the ability to track complex game states. ChessArena is a competitive framework where LLMs play against each other under four play modes. We evaluate 13 LLMs across over 800 games, testing basic understanding, move selection, and puzzle solving. Results reveal significant shortcomings: no model beats Maia-1100 (human amateur level), and some lose to random play. We also present a strong baseline: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
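The abstract describes a competitive framework in which 13 models play against each other across more than 800 games. One common way to build such a head-to-head schedule is a round-robin with alternating colors; the sketch below illustrates the idea using only the Python standard library. The function and parameter names are hypothetical, and the paper's actual pairing logic and play modes are not shown here.

```python
from itertools import combinations

def round_robin_pairings(players, games_per_pair=2):
    """Hypothetical scheduler: each pair of players meets games_per_pair
    times, alternating which side plays White, so color assignment is fair."""
    schedule = []
    for a, b in combinations(players, 2):
        for g in range(games_per_pair):
            # Even-indexed games: a is White; odd-indexed: colors swap.
            white, black = (a, b) if g % 2 == 0 else (b, a)
            schedule.append((white, black))
    return schedule

# With 13 models and 2 games per pair, this yields 13 * 12 / 2 * 2 = 156
# pairings per play mode -- on the order of the 800+ games reported
# across ChessArena's four modes.
```

With `games_per_pair=2`, each player in a pair gets White exactly once, which removes first-move advantage from the aggregate results.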

By Admin
