The advent of large language models (LLMs) has pushed a wave of enterprise AI pilots toward production deployment. But early LLMs proved unwieldy and expensive to run, prompting a shift toward smaller, more efficient models. Companies like Google, Microsoft, and Mistral have developed compact models such as Gemma, Phi, and Mistral Small 3.1, respectively, offering fast, accurate performance on specific tasks.
Enterprises can now choose smaller models tailored to particular use cases, cutting operational costs and potentially achieving a better return on investment (ROI). According to Karthik Ramgopal, distinguished engineer at LinkedIn, smaller models require less computational power and memory and deliver faster inference times, which translates directly into lower infrastructure costs. “Task-specific models have a narrower scope, making their behavior more aligned and maintainable over time without complex prompt engineering,” Ramgopal explained.
Model developers have priced their smaller models competitively. OpenAI’s o4-mini, for instance, costs $1.10 per million input tokens and $4.40 per million output tokens, far below the full o3 version at $10 and $40, respectively. The supply of small, task-specific, and distilled models has also expanded, and most flagship model families now come in a range of sizes. Anthropic’s Claude family, for example, spans Claude Opus, Claude Sonnet, and Claude Haiku, the last compact enough to run on portable devices.
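To make the price gap concrete, here is a back-of-the-envelope sketch in Python using the per-million-token prices quoted above. The monthly traffic volume (500M input and 100M output tokens) is an illustrative assumption, not a figure from either vendor.

```python
# Compare monthly spend on o4-mini vs. o3 at the quoted prices.
# USD per 1M tokens: (input, output)
PRICES = {
    "o4-mini": (1.10, 4.40),
    "o3": (10.00, 40.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one month of traffic on the given model."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# Assumed workload: 500M input tokens, 100M output tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 500_000_000, 100_000_000):,.2f}")
# Prints $990.00 for o4-mini vs. $9,000.00 for o3 at these prices.
```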
The question of ROI remains complex. Some companies count ROI as achieved once AI cuts the time a task takes; others hold out for measurable cost savings or new business. Ravi Naarla, Cognizant’s chief technologist, recommends identifying the benefits you expect, estimating them from historical data, and being realistic about the full cost of AI, including hiring, implementation, and maintenance.
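As a hedged illustration of that advice, the sketch below computes ROI from estimated benefits and all-in costs; every dollar figure is a placeholder assumption, not data from Cognizant.

```python
# ROI = (total benefit - total cost) / total cost, counting *all* AI costs.
# All figures below are illustrative placeholders.
costs = {
    "hiring": 250_000,          # e.g., one ML engineer for a year
    "implementation": 120_000,  # integration and fine-tuning work
    "maintenance": 60_000,      # monitoring, evals, retraining
    "inference": 12_000,        # model usage fees
}

# Benefits estimated from historical baselines, e.g., hours saved x loaded rate.
benefits = {
    "task_time_saved": 300_000,
    "new_revenue": 200_000,
}

total_cost = sum(costs.values())
total_benefit = sum(benefits.values())
roi = (total_benefit - total_cost) / total_cost
print(f"ROI: {roi:.1%}")  # (500,000 - 442,000) / 442,000 = 13.1%
```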
Small models also reduce implementation and maintenance costs, particularly when they are fine-tuned to carry more context for the task. Arijit Sengupta, CEO of Aible, noted that fine-tuning can cut token costs dramatically: Aible’s experiments showed a 100X cost reduction from post-training alone, dropping model usage costs from “single-digit millions to something like $30,000.” The caveat is that small models typically need that post-training to match a large model’s performance, and the post-training itself adds cost.
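A rough break-even sketch makes the trade-off Sengupta describes easier to weigh. The annual usage figures echo the “single-digit millions” and “$30,000” numbers quoted above; the one-time post-training spend is purely an assumed placeholder.

```python
# Break-even check: post-training adds a fixed cost but cuts per-use costs.
large_model_annual_cost = 3_000_000  # "single-digit millions" (quoted above)
tuned_small_annual_cost = 30_000     # "something like $30,000" (quoted above)
post_training_cost = 150_000         # assumed one-time fine-tuning spend

annual_savings = large_model_annual_cost - tuned_small_annual_cost
breakeven_years = post_training_cost / annual_savings
print(f"Savings/year: ${annual_savings:,}")
print(f"Break-even after {breakeven_years * 365:.0f} days")  # ~18 days here
```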
Experts emphasize the importance of right-sizing models for performance and cost. Daniel Hoske, CTO at Cresta, suggests starting with large LLMs to assess whether a use case is feasible at all before moving to smaller models. LinkedIn’s Ramgopal takes a similar approach, prototyping with general-purpose LLMs before building customized solutions.
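One common way to follow this prototype-first pattern is to keep the model name out of application code, so right-sizing later becomes a configuration change rather than a rewrite. The sketch below assumes the OpenAI Python SDK with an API key in the environment; the TASK_MODEL variable and the ticket-routing task are illustrative, not LinkedIn’s or Cresta’s actual setup.

```python
import os
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prototype with a frontier model; once evals confirm a smaller model
# matches it on this task, flip the environment variable (e.g., to o4-mini).
MODEL = os.environ.get("TASK_MODEL", "o3")

def classify_ticket(ticket_text: str) -> str:
    """Route a support ticket to a queue; the task stays fixed as the model shrinks."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system",
             "content": "Reply with exactly one queue name: billing, tech, or sales."},
            {"role": "user", "content": ticket_text},
        ],
    )
    return resp.choices[0].message.content.strip()
```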
While smaller models offer cost savings, they typically have smaller context windows and weaker instruction-following than larger models, which can push work back onto humans and raise labor costs. Rahul Pathak of AWS cautions against overusing small models, as they may not handle complex instructions effectively. Sengupta likewise warns that some distilled models can be brittle, which can erode the long-term savings.
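One simple guard against these context-window limits is to route each request to the cheapest model whose window can actually hold it, escalating to a bigger model only when needed. The model names and window sizes below are assumptions for illustration.

```python
# Context windows in tokens; both the models and the sizes are illustrative.
CONTEXT_WINDOWS = {"small-model": 8_192, "large-model": 128_000}

def pick_model(prompt_tokens: int, reserved_output: int = 1_024) -> str:
    """Choose the cheapest model whose window fits the prompt plus a reply."""
    for model in ("small-model", "large-model"):  # cheapest first
        if prompt_tokens + reserved_output <= CONTEXT_WINDOWS[model]:
            return model
    raise ValueError("Prompt exceeds every available context window")

print(pick_model(3_000))   # small-model
print(pick_model(20_000))  # large-model
```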
Industry players stress the need for flexibility in model choice. Tessa Burg, CTO at Mod Op, advises organizations to be prepared for model changes and updates. Starting with the understanding that current models will be superseded by better versions allows for more adaptable AI strategies. Smaller models have already helped Burg’s company save time and budget in researching and developing concepts.
Ultimately, the key to maximizing ROI lies in matching model size to the task and being prepared to adjust as models change. Vendors are making it easier to switch between models automatically, and users should also weigh platforms that simplify fine-tuning, so that moving to a smaller model does not introduce new costs.
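Automatic switching can be as simple as trying the small model first and escalating when the answer fails a cheap quality check. In this sketch, call_model is a hypothetical stand-in for a real inference call, and the quality check is deliberately naive; a production version would use the routing and evaluation tools your platform provides.

```python
def call_model(model: str, prompt: str) -> str:
    """Stub standing in for a real inference call; wire in your provider here."""
    return "" if model == "small-model" else "Escalated answer."

def answer(prompt: str) -> str:
    """Try the cheap model first; escalate only when its reply looks unusable."""
    draft = call_model("small-model", prompt)
    if not draft or "i don't know" in draft.lower():
        return call_model("large-model", prompt)
    return draft

print(answer("Summarize our churn drivers."))  # escalates in this stubbed demo
```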