New Study Accuses Popular AI Benchmark of Favoring Big Tech Companies

A recent paper from researchers at Cohere, Stanford, MIT, and Ai2 has accused LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of allowing certain major AI companies to gain an unfair advantage in leaderboard rankings. The study alleges that companies like Meta, OpenAI, Google, and Amazon were permitted to privately test multiple variants of their AI models on the platform, then selectively publish only the highest-scoring models.

The Controversy

Chatbot Arena, created in 2023 as an academic research project out of UC Berkeley, has become a significant benchmark for AI companies. It operates by pitting answers from two different AI models against each other in a “battle,” with users voting on which response is better. The votes contribute to a model’s score and its placement on the leaderboard. The researchers analyzed over 2.8 million Chatbot Arena “battles” between November 2024 and March 2025 and report finding evidence that certain major AI companies had their models appear in more “battles” than others, allowing them to collect more data.
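
The article does not say how votes are turned into scores. As a rough illustration only, the sketch below assumes an Elo-style update, in which the winner of each battle takes rating points from the loser. The function names, the starting rating of 1000, and the step size `k` are all hypothetical and are not taken from LM Arena.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A wins under an Elo-style model (an assumption)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, winner: str, k: float = 32.0):
    """Return new (rating_a, rating_b) after one vote.

    winner is "a", "b", or "tie"; k is a hypothetical step size.
    """
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    a, b = 1000.0, 1000.0
    for vote in ["a", "a", "b", "tie"]:
        a, b = update_ratings(a, b, vote)
    print(round(a, 1), round(b, 1))
```

Under any scheme of this kind, a model that appears in more battles accumulates more votes, which is why the study treats battle counts as a form of data access.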

A chart pulled from the study showing the distribution of model performances

The authors claim this preferential access gave these companies a significant advantage. For instance, Meta was allegedly able to privately test 27 model variants between January and March leading up to its Llama 4 release, ultimately publishing only the score of a single top-performing model. Sara Hooker, Cohere’s VP of AI research and co-author of the study, described this practice as “gamification.”
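
To see why publishing only the best of many private variants can inflate a score, consider a small simulation. It is a sketch under stated assumptions, not the study's method: every variant is given the same true skill, each measured score adds Gaussian noise, and the provider publishes only the maximum. The rating of 1200, the noise level of 15 points, and all function names are hypothetical.

```python
import random
import statistics


def published_score(n_variants: int, rng: random.Random,
                    true_score: float = 1200.0, noise_sd: float = 15.0) -> float:
    """Score a provider would publish if it kept only the best of n noisy measurements."""
    # Every variant has the same true skill; only measurement noise differs.
    return max(rng.gauss(true_score, noise_sd) for _ in range(n_variants))


def average_published(n_variants: int, trials: int = 2000, seed: int = 0) -> float:
    """Average published score over many simulated release cycles."""
    rng = random.Random(seed)
    return statistics.fmean([published_score(n_variants, rng) for _ in range(trials)])


if __name__ == "__main__":
    for n in (1, 5, 27):
        print(n, round(average_published(n), 1))
```

Even though no variant is actually better than another in this setup, the average published score rises with the number of private variants, because the maximum of many noisy measurements tends to land above the true value.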

Response from LM Arena

LM Arena co-founder and UC Berkeley professor Ion Stoica disputed the findings, calling the study “full of inaccuracies” and “questionable analysis.” The organization maintained its commitment to “fair, community-driven evaluations” and invited all model providers to submit more models for testing. LM Arena argued that allowing model providers to choose how many tests to submit does not constitute unfair treatment.

Implications and Recommendations

The researchers recommend several changes to make Chatbot Arena fairer, including setting a clear limit on the number of private tests a provider can run and publicly disclosing the scores from those tests. They also suggest adjusting the sampling rate so that all models appear in an equal number of “battles.” While LM Arena has rejected some of these suggestions, it has indicated plans to implement a new sampling algorithm.
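
The equal-sampling recommendation can be sketched in a few lines. The article does not describe LM Arena's current or planned sampling algorithm, so the scheme below is purely hypothetical: for each battle, pick the two models that have appeared least often so far, breaking ties at random.

```python
import random
from collections import Counter


def pick_pair(models, battle_counts, rng):
    """Pick the two least-sampled models; ties are broken randomly."""
    ranked = sorted(models, key=lambda m: (battle_counts[m], rng.random()))
    return ranked[0], ranked[1]


def simulate(models, n_battles, seed=0):
    """Run n_battles pairings and return how often each model appeared."""
    rng = random.Random(seed)
    counts = Counter({m: 0 for m in models})
    for _ in range(n_battles):
        a, b = pick_pair(models, counts, rng)
        counts[a] += 1
        counts[b] += 1
    return counts


if __name__ == "__main__":
    print(simulate(["model-1", "model-2", "model-3"], n_battles=100))
```

With this rule, the appearance counts of any two models never differ by more than one. A real system would likely need to weigh other goals as well, such as sampling newly added models more often so their scores stabilize.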

The controversy comes shortly after Meta was caught optimizing a Llama 4 model for “conversationality” to achieve a high score on Chatbot Arena, without ever releasing the optimized model. This incident, combined with the new study, increases scrutiny of private benchmark organizations and their potential susceptibility to corporate influence.
