Alibaba Cloud Launches QwQ-32B, a Compact Reasoning Model
Just two months after the DeepSeek-R1 AI model took the tech world by storm, Alibaba Cloud has introduced its own answer. QwQ-32B, an open-source large language model (LLM), is the Chinese cloud giant’s latest offering. The model is described as “a compact reasoning model” that, remarkably, uses only 32 billion parameters. Despite this relatively low parameter count, QwQ-32B aims to deliver performance on par with large language models that use far more parameters.
On its website, Alibaba Cloud has published performance benchmarks that underscore the model’s capabilities. These benchmarks suggest that QwQ-32B holds its own against AI models from DeepSeek and OpenAI. The tests include AIME 24 (mathematical reasoning), LiveCodeBench (coding proficiency), LiveBench (contamination-free general evaluation), IFEval (instruction-following ability), and BFCL (tool- and function-calling capabilities).
By leveraging continuous reinforcement learning (RL) scaling, Alibaba asserts that the QwQ-32B model demonstrates significant advancements in mathematical reasoning and coding proficiency. In a company blog post, Alibaba noted that QwQ-32B, with its 32 billion parameters, achieves performance comparable to DeepSeek-R1, which employs 671 billion parameters. This highlights the effectiveness of RL when applied to robust foundation models that have been pre-trained on extensive world knowledge.
“We have integrated agent-related capabilities into the reasoning model, enabling it to think critically while utilising tools and adapting its reasoning based on environmental feedback,” Alibaba stated in its blog post. This demonstrates the company’s commitment to advancing the capabilities of AI reasoning.
Alibaba highlights that QwQ-32B’s effectiveness stems from its use of reinforcement learning to enhance reasoning capabilities. With this approach, an RL agent perceives and interprets its environment, takes actions, and learns by trial and error from the rewards it receives. Reinforcement learning is one of several strategies developers use to train machine learning systems, and Alibaba’s use of RL has allowed the company to make its model more efficient.
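To make the trial-and-error idea concrete, the toy sketch below shows the basic reinforcement-learning loop on a simple multi-armed bandit: the agent picks an action, observes a reward, and updates its value estimates. It is a minimal illustration of the general technique, not Alibaba’s training setup.

```python
# Minimal reinforcement-learning loop: act, observe reward, update estimates.
# Toy multi-armed bandit, not Alibaba's actual QwQ-32B training pipeline.
import random

TRUE_REWARDS = [0.2, 0.5, 0.8]      # hidden payoff probability of each action
estimates = [0.0] * len(TRUE_REWARDS)
counts = [0] * len(TRUE_REWARDS)
epsilon = 0.1                        # exploration rate

for step in range(10_000):
    # Explore occasionally; otherwise exploit the best-known action.
    if random.random() < epsilon:
        action = random.randrange(len(TRUE_REWARDS))
    else:
        action = max(range(len(TRUE_REWARDS)), key=lambda a: estimates[a])

    # The environment returns a reward (1 or 0) for the chosen action.
    reward = 1 if random.random() < TRUE_REWARDS[action] else 0

    # Update the running value estimate of that action (trial and error).
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print("learned value estimates:", [round(e, 2) for e in estimates])
```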
“We have not only witnessed the immense potential of scaled RL but have also recognised the untapped possibilities within pretrained language models,” Alibaba stated. “As we work towards developing the next generation of Qwen, we are confident that combining stronger foundation models with RL powered by scaled computational resources will propel us closer to achieving Artificial General Intelligence [AGI].”
Alibaba is actively exploring the integration of agents with RL to enable what it describes as “long-horizon reasoning.” This, according to Alibaba, will eventually lead to greater intelligence through inference-time scaling. The QwQ-32B model was trained using rewards from a general reward model and rule-based verifiers, which enhances its general capabilities. According to Alibaba, these enhancements include better instruction-following, alignment with human preferences, and improved agent performance.
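The sketch below illustrates how a training reward of this kind might blend a learned reward model with a rule-based verifier, for example checking a maths answer against a reference. The function names (score_with_reward_model, extract_final_answer) and the weighting are hypothetical placeholders, not Alibaba’s actual implementation.

```python
# Hedged sketch: combine a learned reward-model score with a rule-based verifier.
# All helpers here are hypothetical stand-ins for illustration only.

def extract_final_answer(response: str) -> str:
    """Hypothetical helper: take the last line of a response as the final answer."""
    return response.strip().splitlines()[-1]

def score_with_reward_model(prompt: str, response: str) -> float:
    """Stand-in for a general learned reward model returning a score in [0, 1]."""
    return 0.5  # a real system would call a trained preference model here

def rule_based_verifier(response: str, reference_answer: str) -> float:
    """Rule-based check: 1.0 if the extracted answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(response) == reference_answer else 0.0

def combined_reward(prompt: str, response: str, reference_answer: str | None) -> float:
    """Blend the verifiable signal (when a reference exists) with the reward model."""
    model_score = score_with_reward_model(prompt, response)
    if reference_answer is not None:
        return 0.5 * model_score + 0.5 * rule_based_verifier(response, reference_answer)
    return model_score

# Example: a correct final answer lifts the combined reward above the model score alone.
print(combined_reward("What is 2 + 2?", "Step-by-step reasoning...\n4", "4"))
```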
China’s DeepSeek, which has been generally available since the start of the year, uses RL effectively, producing benchmark results comparable to those of rival US large language models. Its R1 LLM can compete with US artificial intelligence without needing the latest GPU hardware. The fact that Alibaba’s QwQ-32B model also employs RL is no coincidence. The US has restricted the export of high-end AI accelerator chips, such as the Nvidia H100 graphics processor, to China, so Chinese AI developers have had to find alternative approaches to make their models work. RL appears to deliver benchmark results comparable to those of models like OpenAI’s. That QwQ-32B achieves results comparable to DeepSeek-R1 with significantly fewer parameters is particularly interesting: it suggests the model could run on hardware with less powerful AI acceleration.
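Because the model is open source, running it on more modest hardware is plausible with standard quantisation tooling. The rough sketch below loads the weights in 4-bit precision via Hugging Face transformers and bitsandbytes; the repository identifier "Qwen/QwQ-32B" and the prompt are assumptions for illustration, so check Alibaba’s release notes for the exact model name and recommended settings.

```python
# Rough sketch: load an open-weight 32B model with 4-bit quantisation to reduce
# GPU memory requirements. Repository id "Qwen/QwQ-32B" is assumed, not verified.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/QwQ-32B"  # assumed Hugging Face identifier

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in ~4 bits each
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # spread layers across available devices
)

messages = [{"role": "user", "content": "How many prime numbers are there below 30?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```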