New AI Benchmark Tests Vision-Language Models' Ability to Play Classic Video Games

A new research project has introduced VideoGameBench, an AI benchmark designed to test whether state-of-the-art vision-language models can play and beat a suite of 20 popular video games using only what they see on the screen.

The researchers behind the project found that even the most advanced vision-language models, including GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, struggle to play classic first-person shooters such as Doom because of high inference latency. By the time the model responds with an action, the game state has already changed significantly, so the action is no longer relevant.
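To make the latency problem concrete, the short Python sketch below estimates how many game frames pass while a model is still deciding on an action. It is only an illustration and not part of VideoGameBench: the function names, the 35 frames-per-second rate, the latency values, and the 10-frame staleness threshold are all assumptions chosen for the example.

```python
def frames_elapsed(latency_seconds: float, frames_per_second: float) -> int:
    """Whole game frames that pass while the model is producing its action."""
    return int(latency_seconds * frames_per_second)


def is_action_stale(latency_seconds: float, frames_per_second: float,
                    max_stale_frames: int) -> bool:
    """True if the game advanced more than max_stale_frames during inference."""
    return frames_elapsed(latency_seconds, frames_per_second) > max_stale_frames


if __name__ == "__main__":
    FPS = 35.0          # assumed frame rate, for illustration only
    THRESHOLD = 10      # hypothetical limit on how old a screenshot may be

    # Hypothetical inference latencies in seconds
    for latency in (0.1, 2.0, 5.0):
        frames = frames_elapsed(latency, FPS)
        stale = is_action_stale(latency, FPS, THRESHOLD)
        print(f"latency={latency}s frames_elapsed={frames} stale={stale}")
```

Under these assumed numbers, a two-second response means the action is based on a screenshot that is 70 frames old, which is the kind of gap that makes an action irrelevant in a fast-paced shooter.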

Challenges in Gaming Environments

The researchers chose classic Game Boy and MS-DOS games for their simpler visuals and diverse input styles, which better test a vision-language model’s spatial reasoning capabilities. The suite includes classics such as Warcraft II, Age of Empires, and Prince of Persia. In their tests, Claude 3.7 Sonnet progressed the furthest in Doom, reaching the blue room.

Key Findings

Latency Issues: Delayed responses are particularly problematic in fast-paced games like first-person shooters.
Action Understanding: Models often failed to perform basic in-game actions or understand how their actions would translate on-screen.
Mouse Control: The most consistent failure across all tested models was an inability to reliably control the mouse in games requiring precise movements.

The researchers emphasized that evaluating AI systems in dynamic environments like video games is crucial for understanding their limitations. Compared with tasks such as proving unsolved math problems, playing video games is considered a more accessible challenge for AI, yet current models still struggle.

VideoGameBench and its associated agent are open-source, allowing developers to test vision-language models themselves. This development comes as AI continues to face challenges in gaming environments, a domain where even non-AI entities like lawnmowers and human gut bacteria have been tested against classic game challenges.
