A new research project has introduced VideoGameBench, an AI benchmark designed to test whether state-of-the-art vision-language models can play and beat a suite of 20 popular video games using only what they see on the screen.
The researchers behind the project found that even the most advanced vision-language models, including GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, struggle to play classic first-person shooters like Doom because of high inference latency. By the time the model responds with an action, the game state has already changed significantly, making the action irrelevant.
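To get a rough feel for why latency matters, the short Python sketch below counts how many frames a game advances while a model is still deciding on its action. The function name `stale_frames`, the 30 frames-per-second rate, and the latency values are all illustrative assumptions, not figures reported by the researchers.

```python
import math


def stale_frames(latency_s: float, fps: float) -> int:
    """Whole frames the game advances while the model is still responding.

    Hypothetical helper for illustration; it is not part of VideoGameBench.
    """
    if latency_s < 0 or fps <= 0:
        raise ValueError("latency must be non-negative and fps must be positive")
    return math.floor(latency_s * fps)


if __name__ == "__main__":
    ASSUMED_FPS = 30  # assumed frame rate, chosen only for illustration
    # Assumed response times in seconds, not measured values.
    for latency in (0.5, 2.0, 5.0):
        frames = stale_frames(latency, ASSUMED_FPS)
        print(f"{latency:.1f}s latency -> action is {frames} frames out of date")
```

Under these assumptions, even a two-second response leaves the chosen action dozens of frames behind what is actually on screen, which is a long time in a first-person shooter.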
Challenges in Gaming Environments
The researchers used classic Game Boy and MS-DOS games for their simpler visuals and diverse input styles, which better test a vision-language model’s spatial reasoning capabilities. The suite includes classics like Warcraft II, Age of Empires, and Prince of Persia. In their tests, Claude 3.7 Sonnet progressed the furthest in Doom, reaching the blue room.
Key Findings
- Latency Issues: Delayed responses are particularly problematic in fast-paced games like first-person shooters.
- Action Understanding: Models often failed to perform basic in-game actions or understand how their actions would translate on-screen.
- Mouse Control: The most consistent failure across all tested models was an inability to reliably control the mouse in games requiring precise movements.
The researchers emphasized that evaluating AI systems in dynamic environments like video games is crucial for understanding their limitations. Compared with tasks such as proving unsolved math problems, playing video games is considered a more accessible challenge for AI, yet current models still struggle.
VideoGameBench and its associated agent are open source, allowing developers to test vision-language models themselves. The release comes as AI continues to face challenges in gaming environments, a domain where even non-AI entities such as lawnmowers and human gut bacteria have been pitted against classic games.