Is AI Truly Thinking and Reasoning – Or Just Pretending?
The rapid advancements in artificial intelligence have made it increasingly difficult to keep pace with new developments. As AI companies release updated models, each iteration raises significant questions about the nature of AI, especially regarding its ability to exhibit genuine reasoning. This is important because the answers determine how we should utilize this technology.
If you’ve used programs like ChatGPT, you’re likely familiar with their capacity to provide quick responses to your questions. However, modern “reasoning models” are designed to deliberate before answering, breaking down complex problems into smaller steps. The industry labels this approach as “chain-of-thought reasoning.”
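What that looks like in practice is easiest to see with a prompt. Below is a minimal sketch, assuming the official openai Python client; the model name is a placeholder and the step-by-step wording is just one common phrasing, not a fixed recipe.

```python
# A minimal sketch of chain-of-thought prompting, assuming the official `openai`
# Python client. The model name is a placeholder; any chat model would do.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

# Direct prompt: the model answers immediately.
direct = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought prompt: the model is nudged to work through the problem first.
step_by_step = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": question + "\nWork through the problem step by step, then give the final answer.",
    }],
)

print("Direct:", direct.choices[0].message.content)
print("Step by step:", step_by_step.choices[0].message.content)
```

Dedicated reasoning models go a step further: they are trained to produce this kind of deliberation on their own, rather than relying on the user to ask for it.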
These models achieve impressive results, solving logic puzzles, acing math tests, and writing working code on the first try. Yet they can also fail at surprisingly basic tasks, which has fueled a debate among AI experts.
Skeptics argue that these models aren’t truly reasoning. Believers counter that the models really are doing some form of reasoning and are steadily progressing toward human-like flexibility.
So, who’s right? The most likely answer is one that troubles both sides of the debate.
What Constitutes Reasoning?
What does the term reasoning actually mean? AI companies define it as a process where models break problems into smaller components, address them step by step, and arrive at better solutions. However, this definition may be too narrow.
Scientists continue to research how reasoning works in the human brain. Nonetheless, they agree that various forms of reasoning exist. These include deductive reasoning, where a general statement leads to a specific conclusion, inductive reasoning that utilizes specific observations to form broader generalizations, and analogical, causal, and common-sense reasoning.
A difficult math problem is certainly easier to crack when you take the time to work through it step by step. So while “chain-of-thought reasoning” is valuable, it is not the whole of reasoning. Melanie Mitchell, a professor at the Santa Fe Institute, and her co-authors have written that a key aspect of reasoning is the ability to “suss out a rule or pattern from limited data or experience and to apply this rule or pattern to new, unseen situations.” Even young children can grasp a concept from just a few examples.
So, can AIs generalize?

The Skeptic’s Viewpoint
Many experts are skeptical that AI models can generalize in this way; they believe something else is at work. Shannon Vallor, a philosopher of technology at the University of Edinburgh, views the behavior of advanced AI models as “meta-mimicry.” Instead of simply mimicking human-written content during training, these models imitate the processes humans use to create that content. In other words, they’re not truly reasoning.
Models like o1 can easily sound like they’re reasoning because their training data is full of examples of humans reasoning, from doctors analyzing symptoms to reach a diagnosis to judges weighing evidence before delivering a verdict.
The o1 model was built by modifying earlier systems like ChatGPT, which struggled with easy questions. So why should anyone think o1 is doing something completely new and magical, especially when it also stumbles on easy problems? For Mitchell, the model’s failures are evidence that it isn’t reasoning at all.
Mitchell was surprised both by how well o3, OpenAI’s newest reasoning model, performed on tests and by how much computation it used to solve the problems. Because OpenAI is not transparent about what the model is doing, it is difficult to understand what happens under the hood. Mitchell has run experiments in which people think out loud while solving problems; humans don’t take hours of thinking aloud, they use a few sentences and then remark that they see how a problem works because it involves a certain concept. She isn’t sure whether the model is relying on similar concepts.
Critics also worry that, despite OpenAI’s claims, these models are not genuinely breaking large problems down into steps. Researchers explored this in a paper titled “Let’s Think Dot by Dot.” Rather than instructing the model to break a problem into steps, they had it generate meaningless “filler tokens,” strings of dots. The extra tokens gave the model more computation to work with, and that alone helped it solve problems. The finding suggests that the intermediate tokens, whether a phrase like “let’s think about this step by step” or simply “….,” don’t necessarily signal human-like reasoning.
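To make the contrast concrete, here is a rough sketch of the comparison in prompt form. It is only an illustration, not the paper’s actual experimental setup, and the ask function is a hypothetical placeholder rather than a real API.

```python
# Illustration only: the same question with three kinds of "intermediate" output.
# ask() is a hypothetical placeholder for whatever chat model you would call.
def ask(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

QUESTION = "A train leaves at 3:40 pm and the trip takes 95 minutes. When does it arrive?"

PROMPTS = {
    # No intermediate tokens at all.
    "direct": QUESTION + "\nAnswer with just the arrival time.",
    # Meaningful intermediate steps: the classic chain-of-thought cue.
    "step_by_step": QUESTION + "\nThink about this step by step, then give the arrival time.",
    # Meaningless filler: extra tokens (and so extra computation) with no visible reasoning.
    "dot_by_dot": QUESTION + "\nFirst output 100 dots, then give the arrival time.",
}

for name, prompt in PROMPTS.items():
    print(f"--- {name} ---\n{prompt}\n")
```

If the filler version helps at all, the benefit comes from the extra computation those tokens buy, not from anything resembling a readable train of thought.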
Mitchell believes the models may be relying on a “bag of heuristics” rather than genuine reasoning. A heuristic is a mental shortcut that often yields a quick answer without any real thought. For example, researchers once trained an AI to analyze photos of moles for skin cancer. At first glance, the model seemed to be judging whether a mole was malignant, but because photos of malignant moles in the training data often contained a ruler, the model had simply learned to treat the presence of a ruler as a cue for malignancy.
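The shortcut is easy to reproduce in miniature. The sketch below uses made-up data and has nothing to do with the actual study: because the spurious “ruler” feature lines up perfectly with the label, a simple classifier leans on it instead of the weak signal it was supposed to learn.

```python
# Toy illustration of shortcut learning on fabricated data: the "ruler" feature is
# spuriously correlated with the label, so the classifier leans on it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

lesion_signal = rng.normal(size=n)                        # weak, noisy "medical" signal
malignant = (lesion_signal + rng.normal(scale=2.0, size=n)) > 0
ruler_present = malignant.astype(float)                   # rulers appear only in malignant photos

X = np.column_stack([lesion_signal, ruler_present])
model = LogisticRegression().fit(X, malignant)

print("lesion coefficient:", model.coef_[0][0])
print("ruler coefficient: ", model.coef_[0][1])           # much larger: the shortcut wins
```

A classifier like this scores almost perfectly on such data and would fall apart the moment a benign mole happened to be photographed next to a ruler.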
The Believer’s Perspective
Other experts take a more optimistic view of AI’s reasoning capabilities. Ryan Greenblatt, chief scientist at Redwood Research, believes these models are clearly performing some form of reasoning. He agrees that they don’t generalize as well as humans and attributes this to the models leaning more heavily on memorization than people do. Still, he argues, they are doing something: these models have solved a great many problems, and for Greenblatt the most natural explanation is that they are engaging in some level of reasoning along the way.
Greenblatt notes that heuristics can also explain the failures of earlier models. He pointed to the “man, boat, and goat” prompt that drew mockery last year: asked simply how a man could get himself and a goat across a river by boat, models produced convoluted, flawed responses, because the prompt is a stripped-down version of an ancient logic puzzle. In the classic version, a farmer with a wolf, a goat, and a cabbage must cross a river in a boat that can carry the farmer and only one item at a time. Left unattended, the wolf will eat the goat and the goat will eat the cabbage. The challenge is to get everything across without anything being eaten.
The model’s response even mentioned a cabbage, though the prompt never did. It had “recognized” the familiar puzzle and fell back on the solution pattern that had likely served it well during training.
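For contrast, working the classic puzzle through is a small, mechanical search. The sketch below is ordinary breadth-first search in plain Python, with no AI involved; the names and output format are just choices for this illustration.

```python
# The classic wolf-goat-cabbage puzzle solved by breadth-first search.
# A state is the set of things on the LEFT bank; everything else is on the right.
from collections import deque

ITEMS = frozenset({"farmer", "wolf", "goat", "cabbage"})

def safe(left):
    """A bank without the farmer must not pair wolf+goat or goat+cabbage."""
    for bank in (left, ITEMS - left):
        if "farmer" not in bank and ({"wolf", "goat"} <= bank or {"goat", "cabbage"} <= bank):
            return False
    return True

def solve():
    start, goal = ITEMS, frozenset()
    queue, seen = deque([(start, [])]), {start}
    while queue:
        left, path = queue.popleft()
        if left == goal:
            return path
        here = left if "farmer" in left else ITEMS - left
        for passenger in [None, *sorted(here - {"farmer"})]:
            moving = {"farmer"} | ({passenger} if passenger else set())
            nxt = frozenset(left - moving) if "farmer" in left else frozenset(left | moving)
            if safe(nxt) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [passenger or "(alone)"]))
    return None

print(solve())  # a shortest plan: seven crossings, starting and ending with the goat
```

The simplified “man and goat” prompt needs none of this machinery, which is exactly why a model that pattern matches against the classic version ends up dragging a cabbage into its answer.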
Greenblatt also notes that humans fail in similar ways. A person who has spent weeks cramming color theory for a quiz could be caught out by a trick question that resembles the studied material, overcomplicating their answer precisely because of their depth of knowledge.
Ajeya Cotra, a senior analyst at Open Philanthropy, agrees that the newest models are getting better at the kinds of tasks most people would describe as reasoning. When skeptics insist the models are “just” doing memorization or pattern matching, she says, “the ‘just’ part of it is the controversial part”: it implies that AI won’t have much impact on the world, or that artificial superintelligence is a long way off. Cotra is more optimistic, arguing that AI models pair memorization with reasoning, like a hardworking student who has memorized plenty of equations but still has to figure out which ones to apply.
AI systems, then, combine extremely impressive capabilities with surprising failures on simple problems.

Researchers use the term “jagged intelligence” to describe this phenomenon. Andrej Karpathy, a computer scientist, explained that state-of-the-art AI models “can both perform extremely impressive tasks (e.g., solve complex math problems) while simultaneously struggling with some very dumb problems.” Where human intelligence is a cloud with softly rounded edges, AI intelligence is a spiky cloud, with high peaks and deep valleys sitting right next to one another.
Greenblatt says that AI is “jagged” relative to what humans are good at. The best way to understand AI is not as “smarter than a human” or “dumber than a human,” but just as “different.”
Cotra believes that AI could eventually encompass all of human intelligence and much more, so she spends a lot of time thinking about the risks that come with AI systems surpassing human experts in every field. For now, though, AI systems have real limitations, and those limitations should shape how we use them.
The best use cases for AI are problems where you would struggle to come up with a solution yourself but can easily verify one once it’s in front of you, such as writing code or building a website. In other areas, particularly those without a clear right answer or where the stakes are high, you should be more cautious: use AI as a thought partner rather than an oracle in domains where judgment matters.
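As a toy example of the “easy to verify” property, suppose a model hands you a sorting function; the model_sort below is just a stand-in for that output. You can check it mechanically, on thousands of random inputs, without trusting the model at all.

```python
# Toy illustration of generate-then-verify: model_sort stands in for code an AI
# produced; the checks below verify its output without trusting the model.
import random
from collections import Counter

def model_sort(xs):
    # Pretend this implementation came back from an AI assistant.
    return sorted(xs)

for _ in range(1_000):
    xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
    out = model_sort(xs)
    assert Counter(out) == Counter(xs), "output is not a permutation of the input"
    assert all(a <= b for a, b in zip(out, out[1:])), "output is not in sorted order"

print("verification passed on 1,000 random inputs")
```

An essay, a diagnosis, or a strategic decision offers no such cheap check, which is why the thought-partner framing matters more there.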