Researchers have developed a new way to assess artificial intelligence (AI) capabilities: measuring the length of challenging tasks, in terms of how long they take skilled humans, that AI systems can complete. While AI generally outperforms humans at text prediction and knowledge tests, it struggles with more complex projects that require many steps. A recent study posted on the arXiv preprint server proposes evaluating AI models by task duration, benchmarking each task against the time it takes human professionals to finish it.
The researchers found that AI models excel at tasks that take humans under four minutes, completing them with a near-100% success rate. That success rate drops to about 10% for tasks that take humans more than four hours. The study tested a range of AI models, including Claude 3.7 Sonnet, GPT-4, and Claude 3 Opus, on tasks spanning simple Wikipedia lookups to complex programming work such as writing CUDA kernels or debugging PyTorch code.
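To make the idea concrete, the sketch below (illustrative only; the success data, variable names, and fitting procedure are assumptions, not the study's code) fits a simple logistic curve to model success versus the logarithm of human completion time and reads off a "50% time horizon", the task length at which the model succeeds about half the time.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical evaluation records: how many minutes a skilled human needs for each
# task, and whether the AI model completed it (1) or not (0). Real data would come
# from suites such as HCAST and RE-Bench.
human_minutes = np.array([1, 2, 3, 5, 8, 15, 30, 60, 120, 240, 480, 960], dtype=float)
model_success = np.array([1, 1, 1, 1, 1, 1,  0,  1,   0,   0,   1,   0], dtype=float)

def success_curve(log_minutes, log_horizon, slope):
    """Logistic curve: success probability falls as log task length grows."""
    return 1.0 / (1.0 + np.exp(slope * (log_minutes - log_horizon)))

# Fit in log-time, so the "50% time horizon" is the task duration at which the
# fitted success probability crosses 0.5.
params, _ = curve_fit(success_curve, np.log(human_minutes), model_success,
                      p0=[np.log(60.0), 1.0], maxfev=10_000)
horizon_minutes = np.exp(params[0])
print(f"Estimated 50% time horizon: {horizon_minutes:.0f} human-minutes")
```

Fitting in log-time reflects the fact that the tasks span several orders of magnitude in length, from a minute or two up to many hours.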
To assess AI capabilities, the researchers used testing suites such as HCAST and RE-Bench. HCAST contains 189 software tasks that evaluate autonomous AI agent capabilities in machine learning, cybersecurity, and software engineering, while RE-Bench comprises seven challenging, open-ended machine-learning research engineering tasks benchmarked against human experts. The tasks were also rated for “messiness” to gauge their complexity and real-world applicability.
The study’s findings suggest that AI’s “attention span”, the length of task it can carry through to completion, is growing rapidly. Extrapolating this trend, the researchers predict that AI could automate a month’s worth of human software development by 2032. A benchmark of this kind could help ground assessments of AI’s actual intelligence and capabilities, offering a meaningful interpretation of absolute performance rather than only performance relative to other models.
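The extrapolation itself is straightforward compound growth. The sketch below uses placeholder inputs (a one-hour current horizon and a roughly seven-month doubling time, figures in the general ballpark the study discusses, but not its exact fit) to show how one would project the time until the horizon reaches a month of human work.

```python
import math

# Illustrative compound-growth projection, not the study's exact calculation.
# Both inputs below are assumed placeholders.
current_horizon_hours = 1.0    # assumed current 50% time horizon (~1 hour of human work)
doubling_time_months = 7.0     # assumed doubling interval for the horizon
target_hours = 167.0           # roughly one month of full-time human work

doublings_needed = math.log2(target_hours / current_horizon_hours)
years_until_target = doublings_needed * doubling_time_months / 12.0
print(f"{doublings_needed:.1f} doublings needed, about {years_until_target:.1f} years at this rate")
```

Because the growth compounds, small changes in the assumed doubling time shift the projected date by years, which is why such projections are best read as rough extrapolations.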
Experts in the field, such as Sohrob Kazerounian and Eleanor Watson, agree that measuring AI performance by task duration is valuable and intuitive: it directly reflects real-world complexity and captures AI’s ability to maintain coherent, goal-directed behavior over time. The metric could also be used to track progress in AI development, particularly on tasks where AI is expected to solve complex human problems.
The study’s implications extend beyond a new benchmark metric. It highlights the rapid advancement of AI systems and their increasing ability to handle lengthy tasks. Eleanor Watson predicts that by 2026 we’ll see AI becoming more general, handling varied tasks across entire days or weeks rather than short, narrowly defined assignments. This could lead to AI taking on substantial portions of professional workloads, reducing costs and improving efficiency while freeing humans to focus on more creative and strategic work.
For consumers, AI is expected to evolve from simple assistants to dependable personal managers capable of handling complex life tasks over extended periods with minimal oversight. The emergence of powerful generalist AI agents will likely reshape daily life and professional practices fundamentally.