    ChessLM: Benchmarking LLMs in Real Chess Matches
    OpenSource
    AI
    LLM
    Chess
    MachineLearning
    BuildInPublic


    I built ChessLM to level up my skills and benchmark GPT-4o, Claude 3.5 Sonnet, Gemini Pro 1.5, and Grok-2 in live chess matches. The results were fascinating.

    Ovi Shekh
    2 min read

    I built ChessLM, an open-source platform for benchmarking LLMs against each other (and against you) in real chess matches. The goal wasn't just to ship a product, but to level up my own skills.

    I wired up GPT-4o, Claude 3.5 Sonnet, Gemini Pro 1.5, and Grok-2 to a live chessboard and let them play. What I didn't expect was how differently each model approaches the game. Same board. Same rules. Completely different personalities.
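The core of wiring a model to a live board is a simple loop: serialize the game so far into a prompt, ask the model for a move, and retry when the reply is illegal. Here is a minimal sketch of that loop under stated assumptions: `query_model` is a hypothetical stand-in for a real chat-completion API call, and the set of legal moves would come from a chess library in a real system (the actual ChessLM adapter code isn't shown in this post):

```python
# Minimal sketch of an LLM move loop. `query_model` is a hypothetical
# stand-in for a real chat-completion call; `legal_moves` would come
# from a chess library (e.g. a legal-move generator) in practice.
def pick_move(query_model, legal_moves, history, max_retries=3):
    """Ask the model for a SAN move; retry until it names a legal one."""
    prompt = (
        "You are playing chess. Moves so far: "
        + " ".join(history)
        + ". Reply with exactly one legal move in SAN."
    )
    for _ in range(max_retries):
        reply = query_model(prompt).strip()
        if reply in legal_moves:
            return reply
        # Feed the illegality back to the model and ask again.
        prompt += f" Your previous reply '{reply}' was illegal. Try again."
    # Fall back to any legal move so the game can continue.
    return next(iter(legal_moves))
```

The retry-with-feedback step matters in practice: models occasionally hallucinate moves for pieces that aren't on the board, and a single corrective turn usually recovers the game.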

    The Model Personalities

    • Claude 3.5 Sonnet: Plays like it's thinking three layers deep.
    • GPT-4o: Aggressive and proactive.
    • Grok-2: Bold and takes calculated risks.
    • Gemini Pro 1.5: More cautious, sometimes hesitates in complex positions.

    I wasn't just benchmarking them; the data started telling its own story.

    The Results: Estimated Elo & Win Rates

    Model               Estimated Elo   Win Rate
    Claude 3.5 Sonnet   2900            72%
    GPT-4o              2850            68%
    Grok-2              2820            66%
    Gemini Pro 1.5      2780            64%
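These Elo estimates follow the standard rating math: a player's expected score against an opponent is given by a logistic curve, and the rating is nudged after each game by a K-factor. A quick sketch of that update rule (the K-factor ChessLM actually uses isn't stated here; 32 is a common default):

```python
# Standard Elo rating math: expected score from the logistic curve,
# then a K-factor step from expected toward actual result.
def expected_score(rating_a, rating_b):
    """Expected score of A vs. B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """A's new rating after scoring `score_a` (1 win, 0.5 draw, 0 loss)."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))
```

For example, at the ratings above, Claude 3.5 Sonnet (2900) would be expected to score about 0.67 against Gemini Pro 1.5 (2780), so a win moves its rating up only slightly.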

    Every game played generates real data on how different neural architectures approach tactical planning, strategic depth, and long-horizon reasoning. Whether you're a grandmaster or a beginner, your match contributes to that dataset.

    Watch the Action

    Join the Open Source Journey

    🚀 Open Work #1

    Drop a star if you find this useful. PRs are always welcome!


    Let's Build Together

    Have questions about this note? Want to discuss your AI project? Book a free 30-minute strategy call.

    Book a Free Call

    30-minute session · No commitment required