Vals AI, a US-based company specializing in genAI performance testing, has just published its first study evaluating the performance of AI tools from several legal tech companies. The tests, designed by major law firms including Reed Smith and Fisher Phillips, assessed the tools’ capabilities across a variety of tasks. The companies whose results were shared include Harvey, Thomson Reuters (CoCounsel), Vecflow, and vLex. The human lawyers used for comparison were provided by the ALSP Cognia. Vals used its proprietary ‘auto-evaluation framework platform’ to produce a blind assessment of the submitted responses.
Results of the study provide an overview of the performance of these AI tools across seven key legal tasks: Data Extraction, Document Q&A, Document Summarization, Redlining, Transcript Analysis, Chronology Generation, and EDGAR Research. These tasks represent a range of functions frequently undertaken by legal professionals. The tools assessed were CoCounsel (Thomson Reuters), Vincent AI (vLex), Harvey Assistant (Harvey), and Oliver (Vecflow). Notably, Lexis+ AI (LexisNexis) initially participated but withdrew from the tasks covered in the report.

The findings, presented as percentage scores reflecting accuracy or performance against predefined evaluation criteria, highlight some key takeaways:
- Harvey participated in six out of seven tasks, securing the top score among AI tools on five tasks and placing second on one. It outperformed the lawyer baseline in four tasks.
- CoCounsel was the only other AI tool to take the top score in a task. It ranked consistently among the top performers across the four tasks it was evaluated on, with scores ranging from 73.2% to 89.6%.
- The lawyer baseline outperformed the AI tools on two tasks and matched the best-performing tool on one. In the other four tasks, at least one AI tool surpassed the baseline.
A more in-depth analysis reveals further insights into each tool’s strengths and weaknesses. Harvey Assistant either matched or exceeded the lawyer baseline in five tasks, and surpassed the other AI tools in four of the tasks evaluated. It achieved two of the three highest scores recorded across all tasks: Document Q&A (94.8%) and Chronology Generation (80.2%, matching the lawyer baseline). Thomson Reuters submitted its CoCounsel 2.0 product and received high scores on all four tasks it was evaluated on, especially Document Q&A (89.6%) and Document Summarization (77.2%). Across these four tasks, CoCounsel’s average score was 79.5%, exceeding the lawyer baseline by over 10 points.
Collectively, the AI tools surpassed the lawyer baseline on four tasks related to document analysis, information retrieval, and data extraction, and matched on one (Chronology Generation). However, none of the AI tools outperformed the lawyer baseline on EDGAR research, which may suggest that these tools still have work to do on more challenging legal research.
The lawyer baseline surpassed the AI tools in Redlining and EDGAR Research, and recorded its single highest score in Chronology Generation (80.2%), where Harvey matched it. Given the current capabilities of AI, lawyers may still be best placed to handle these tasks. The lawyer baseline’s scores were reasonably high for Data Extraction (71.1%) and Document Q&A (70.1%), yet certain AI tools still surpassed these benchmarks. All AI tools outperformed the lawyer baseline in Document Summarization and Transcript Analysis. Document Q&A, with an average score of 80.2%, was the highest-scoring task overall, indicating some of the greatest potential for legal generative AI tools. EDGAR Research proved one of the most challenging tasks, with Oliver reaching 55.2% against the lawyer baseline of 70.1%. Improvements in “AI agents” and “agentic workflows” may be required to complete these multi-step research tasks with greater accuracy.
Overall, the study suggests that these legal AI tools have value for law firms, despite room for improvement in both the evaluation and the results. LexisNexis, which had been involved in the testing, withdrew, citing a significant product upgrade to Lexis+ AI that rendered the initial benchmark results invalid.
AL spoke to Vals team member Rayan Krishnan and consultant Tara Waters. Krishnan said it had been a ‘diplomatic mission’ to get everyone involved, and that he was glad a number of leading law firms also participated. He added that the hope is to include more vendors and to focus on legal research next. The project may cover other countries, with the UK a possible target, and the plan is to make the study an annual event in the legal AI sector.

Responding to a question about the potential risk of an ‘empirical fallacy’, Krishnan noted that the law firms provided questions and requirements based on their real-world needs, so the benchmark was not tilted in favor of the legal tech vendors. Waters added that the questions were not designed around what AI tools do well, but chosen to reflect the law firms’ interests.

Another issue being addressed is ‘pilot fatigue’. Krishnan sees benchmarking as a solution for law firms that would otherwise have to test many tools in detail themselves, and he hopes it will encourage further adoption, meaning greater efficiency across the legal market. Waters added that benchmarking is vital for building trust, by assessing real-world applications and bringing stakeholders together.