As I continue to investigate LLMs and the implementation and use of AI, I took a step back and asked the question "What are the types of LLMs?" Part of the reason for asking is that I wanted to understand the differences between gpt-4o and the "reasoning models," like o1. That led to a lot of questions: what are the differences between these types of models, how are they made, and what tasks are they optimized for?
Where did Reasoning Models come from?
For me, the first part of this was understanding where reasoning models came from. Why do these models exist? What problem are they trying to solve?
When we look at our traditional LLMs, like gpt-4o, we see models with billions of parameters, pre-trained on enormous amounts of text, that work by using tokenization and vectorization to predict the next token in a sequence. The idea being that they take in a set of tokens, attach numeric values (embeddings) to each token, and then use measures like Euclidean distance and cosine similarity to respond with an appropriate answer. That by itself was enough to revolutionize the world of AI. But these models had their own problems: pre-training means they cannot ingest new data (you could retrain, but it quickly becomes cost prohibitive), and there was an increased need for specialization of responses and data access. To respond to that, we now have RAG (Retrieval Augmented Generation), which leverages embedding models to give models a way to ingest new data without costly training (for more details see my post on embedding models, here => [Embeddings and Vectors, explanation of how LLMs see and interpret the world.](Embeddings and Vectors, explanation of how LLMs see and interpret the world. – Code-Workbench)).
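To make that concrete, here's a minimal sketch of how "closeness" between embeddings is measured. The vectors below are made up purely for illustration (real embedding models produce vectors with hundreds or thousands of dimensions), but the cosine similarity and Euclidean distance math is the same:

```python
import numpy as np

# Toy 4-dimensional "embeddings" -- invented for illustration. Real embedding
# models produce much longer vectors, but the comparison math is identical.
king  = np.array([0.90, 0.10, 0.70, 0.30])
queen = np.array([0.85, 0.15, 0.72, 0.28])
apple = np.array([0.10, 0.90, 0.20, 0.80])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Angle-based similarity: closer to 1.0 means pointing the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance: smaller means closer."""
    return float(np.linalg.norm(a - b))

print(cosine_similarity(king, queen))   # high -- related concepts
print(cosine_similarity(king, apple))   # lower -- unrelated concepts
print(euclidean_distance(king, queen))  # small
print(euclidean_distance(king, apple))  # larger
```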
With the advent of RAG, a new problem emerged: the desire for LLMs and AI to take on more complicated tasks that require intermediate steps. If you want a prime example of this challenge, research "the strawberry problem."
What is the strawberry problem?
For as powerful as LLMs are, you can ask ChatGPT or Claude and validate that this is still a problem today. Ask the following question:
“Count the number of “r”s in the word strawberry”
If you do, you will get this answer: 2
Now the question would be why? LLMs are trained on vast amounts of text to generate human-like language, but they don't necessarily understand it. Rather, they work by breaking human language down into a math-based process involving tokens and embeddings. They are designed to recognize patterns in text and replicate those patterns, not actually understand your question. The reason behind this problem is how the model breaks the word strawberry along token lines: it sees sub-word chunks, not individual letters.
The model wasn't able to "reason" its way to the answer; it could only respond and predict based on the tokens it was given.
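If you want to see the token split for yourself, a quick way is OpenAI's tiktoken library. This is just a sketch; the exact chunks depend on which tokenizer you pick, but the point is that the model sees sub-word pieces rather than letters:

```python
# Requires: pip install tiktoken
import tiktoken

# cl100k_base is one of OpenAI's tokenizers; the exact split varies by
# tokenizer, which is exactly why counting letters trips models up.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("strawberry")
pieces = [enc.decode([t]) for t in tokens]

print(tokens)   # a short list of token IDs (integers)
print(pieces)   # the word split into sub-word chunks, not individual letters
```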
What are reasoning models?
So given these limitations of how tokens work, and the desire to have AI models handle more complicated tasks, the question becomes how? In this research, I found a lot of different definitions of a reasoning model. But the industry seems to have settled on it being a model that answers questions requiring complex, multi-step generation, with intermediate steps to get to a solution.
From a conceptual perspective, this is the difference between knowing the answer to a question (or being able to look it up) and working through a math problem.
Now, before we go further, basic LLMs are capable of basic reasoning. That's why, if you give one a simple math problem, it will generally be able to solve it.
Why don’t reasoning models replace existing LLMs?
Reasoning models are a more specialized version of an LLM, in that they apply problem solving to derive intermediate steps to get to an answer. And being specialized models, they are good for specific (but not all) use-cases. The first and primary reason reasoning models aren't replacing more "traditional" LLMs is a straightforward one…cost.
Reasoning models are significantly more compute intensive, so if you are deploying them on your own hardware or a virtual machine, you can expect the compute demand to be higher.
But if you are deploying them via PaaS services, the cost can be higher as well. AI solutions involving LLMs typically bill by the token, and that means tokens in and tokens out. Reasoning models tend to have high token costs associated with them, as they break problems down and then explain the steps in the output.
Additionally, due to the increased reasoning steps, they tend to not be very performant, and that slower response time can be problematic for many users.
How big is the cost difference for a reasoning model?
So given that I say the cost is different, what do I mean by that? Let's look at some examples. OpenAI has benchmarked their models, which helps us figure out the performance of a model and its responses. During that benchmarking, OpenAI found that o1 generated roughly 8 times the volume of tokens compared to gpt-4o.
The question being why? It's because the model breaks the original prompt down into steps and writes them out. Think about it conceptually like a math test back in elementary school.
If I tell you to solve 2 + 2, you are just going to write out “4.”
If I ask you to divide 3254 / 267, chances are you are going to be writing out the long division by hand. That process of you taking up more paper to solve it is the equivalent of the model generating more tokens to work through the problem.
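To put some rough numbers on it, here's a back-of-the-napkin cost sketch. The per-token prices are placeholders (check your provider's current pricing), and the 8x output multiplier is just the benchmarking figure mentioned above:

```python
# A rough cost comparison sketch. The prices below are placeholder assumptions,
# not real pricing, and the 8x output multiplier is the benchmark figure cited above.
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Token-based billing: (tokens in * input price) + (tokens out * output price)."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

prompt_tokens = 500
standard_output = 400                     # typical chat-style answer
reasoning_output = standard_output * 8    # reasoning models emit far more tokens

# Hypothetical prices (USD per 1K tokens) -- for illustration only.
standard_cost  = estimate_cost(prompt_tokens, standard_output,  0.0025, 0.01)
reasoning_cost = estimate_cost(prompt_tokens, reasoning_output, 0.015,  0.06)

print(f"standard model:  ${standard_cost:.4f}")
print(f"reasoning model: ${reasoning_cost:.4f}")
```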
Furthermore, according to this [article](OpenAI’s o3 Reasoning Models Are Extremely Expensive to Run | Observer), o3, which scored 87.5% on the ARC-AGI benchmark, comes at roughly 10x the cost.
When should you use reasoning models?
These models are designed for complex problem solving, beyond asking to "summarize a document." They are designed for logic problems or use-cases like code generation.
They are optimized for the following:
- Deductive or inductive reasoning
- Chain-of-thought problem solving
- Complex Decision Making
- Generalization to novel problem sets.
Where they tend to not perform well is the following:
- Fast response
- Knowledge-Based Tasks
- Simple Tasks
Now the last one on that list is very important: reasoning models are not good at simple tasks or knowledge-based tasks. The more I dug into this, the more it made sense. Ultimately it's because these models are trained to try to "solve" things, not look them up, and for simple tasks they will inevitably overcomplicate them.
How do reasoning models work?
One of the best explanations of this I've found is [OpenAI's own documentation](Reasoning models – OpenAI API). And it drives home the point that these models "think before they answer." Which is to say, they perform an internal chain of thought of token responses to work through a problem until they get to an answer. These models are usually trained with reinforcement learning (more on that later).
They achieve this by introducing a third type of token: instead of just "input" and "output" tokens, they generate "reasoning tokens." These reasoning tokens let the model break down the prompt / problem and consider multiple options before settling on a response. The model figures out which answer is the best one, converts it to output tokens, and throws away any leftover reasoning tokens.
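If you want to see those reasoning tokens show up, the OpenAI API reports them in the usage details of a response. This is a sketch based on the Python SDK at the time of writing; the model name and exact field names may differ by SDK version:

```python
# Requires: pip install openai, plus an OPENAI_API_KEY in your environment.
# Field names reflect the OpenAI Python SDK at the time of writing and may
# change between versions -- treat this as a sketch.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1",  # an o-series reasoning model
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)

usage = response.usage
print("input tokens:    ", usage.prompt_tokens)
print("output tokens:   ", usage.completion_tokens)
# Reasoning tokens are billed as output tokens but never appear in the reply text.
print("reasoning tokens:", usage.completion_tokens_details.reasoning_tokens)
```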
Reasoning tokens are what let the model, based on its training, identify the tasks and decompose the ask into specific inputs that can then lead to an output. Now, when I first read that, my question was "how does that work with context windows?" If you are looking for more on context windows, see my AI jargon post [here](Getting started with LLMs, understanding the jargon – Code-Workbench).
The answer is that this is exactly the major concern!
What is a context window? It most directly maps to "working memory" in a person. A context window is how the LLM keeps track of the context of the conversation: it maintains a window of previous prompts and responses and uses that window to inform its next responses. Conceptually, it is like if you and I sat down to talk for 3 hours; by the end, we probably don't remember how the conversation started.
The context window is a major part of the reasoning model. Unlike a "traditional" LLM, which maintains a context window of prompts and responses, a reasoning model also has to fit its own reasoning responses into that window.
The downside is that they fill the context window faster on their own, depending on the complexity of the task and the volume of reasoning they produce to reach the answer. That means multi-prompt exchanges can lead to hallucinations much faster, which is why they are optimized for complex problems, not simple ones, where they will overthink and rack up unneeded cost.
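Here's a toy simulation of that effect, following the framing above: each turn carries a reasoning overhead that eats into a fixed context window. The numbers are made up, and exactly which tokens persist between turns varies by provider, but it shows why the window fills faster:

```python
# Toy simulation: how many conversational turns fit into a fixed context window
# when each turn carries a heavy reasoning overhead. All numbers are invented
# for illustration; real token accounting varies by provider.
CONTEXT_WINDOW = 128_000  # tokens the model can "remember"

def turns_until_full(prompt_tokens: int, answer_tokens: int, reasoning_overhead: int) -> int:
    used, turns = 0, 0
    while used + prompt_tokens + answer_tokens + reasoning_overhead <= CONTEXT_WINDOW:
        used += prompt_tokens + answer_tokens + reasoning_overhead
        turns += 1
    return turns

print("traditional LLM:", turns_until_full(500, 800, 0), "turns")
print("reasoning model:", turns_until_full(500, 800, 8_000), "turns")
```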
What is reinforcement learning?
When it comes to training AI models, that's a whole other rabbit hole by itself. But given that the term "reinforcement learning" kept coming up in this investigation, I decided to dig a little deeper into it.
At its core, reinforcement learning is a form of machine learning that lets the model build a decision-making process based on feedback that is positive, negative, or neutral in nature. That feedback helps the model repeat the process under different circumstances. Conceptually, this is the reason my old math teacher, Mrs. Boyle, would push so hard for us to "show our work": the idea was to get reinforcement on the process so that it could be applied to other problems.
As it continues through this training, it accumulates knowledge of which pathways to answers are positive and which are negative. The default way of approaching this is to give the model an unlabeled data set focused on a specific outcome; every step the algorithm takes to explore it comes with feedback, training the model on the process.
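To make that feedback loop concrete, here's a classic tabular Q-learning toy: an agent learning to walk down a corridor to a goal. To be clear, this is not how reasoning LLMs are trained; it's just the simplest illustration of learning a process from positive and negative feedback:

```python
import random

# Tabular Q-learning on a tiny corridor world: states 0..4, goal at state 4.
# A toy illustration of "learn from positive/negative feedback" -- not how
# reasoning LLMs are actually trained.
N_STATES = 5
ACTIONS = [-1, +1]                       # step left, step right
q_table = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Explore sometimes, otherwise exploit the best known action.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q_table[(state, a)])

        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else -0.01  # the feedback signal

        # Nudge the estimate for (state, action) toward reward + future value.
        best_next = max(q_table[(next_state, a)] for a in ACTIONS)
        q_table[(state, action)] += alpha * (reward + gamma * best_next - q_table[(state, action)])
        state = next_state

# After training, the learned policy should consistently step right toward the goal.
print({s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)})
```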
An excellent article on the subject was published by [Michael Chen](What is Reinforcement Learning? How Does It Work?). The focus here was to train the model to leverage and develop "chain of thought," which enables it to decompose the question into the discrete tasks that need to be accomplished to solve the problem. For a great article on the process, see [Learning to reason with LLMs](Learning to reason with LLMs | OpenAI).
Conclusion:
The part that I found most interesting in this investigation was the small pivot that led to reasoning models, and understanding the key differences. Things like agent mode for GitHub Copilot are clearly designed to use reasoning models as their backend. The benefit is that you can handle more complicated tasks, like building code solutions, and get better results. But what I found most useful is that understanding how these models were trained, and how they leverage "reasoning tokens" to accomplish task decomposition, means that at this stage we still need to be aware of context windows. If you ask for too complex a task, you can start to see hallucinations or bizarre behavior because the context window is exhausted too quickly. That lines up with some of the experimentation I've done with agent mode on GitHub Copilot. At this stage, it helps inform how I use these models, and what their place and purpose are in designing AI solutions.
One use-case that occurs to me is the ability to leverage these reasoning models as orchestrators in agentic solutions, with them providing chain-of-thought type operations while "traditional" LLMs handle interpretation and summarization. It really opens my eyes to the possibilities of what you could do with this.
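As a sketch of that orchestration idea, imagine a simple router that sends complex, multi-step work to a reasoning model and simple lookups or summaries to a cheaper traditional model. The model names and the keyword heuristic here are placeholders for illustration, not a recommendation:

```python
# Sketch of the orchestration idea: route complex, multi-step work to a
# reasoning model and simple lookups/summaries to a cheaper traditional model.
# Model names and the keyword heuristic are placeholders.
REASONING_MODEL = "o1"       # placeholder reasoning model
STANDARD_MODEL = "gpt-4o"    # placeholder traditional model

COMPLEX_HINTS = ("plan", "design", "debug", "prove", "refactor", "step by step")

def pick_model(task: str) -> str:
    """Naive keyword routing; a real orchestrator might use a classifier or
    let a small model triage the request first."""
    if any(hint in task.lower() for hint in COMPLEX_HINTS):
        return REASONING_MODEL
    return STANDARD_MODEL

print(pick_model("Summarize this meeting transcript"))         # -> gpt-4o
print(pick_model("Design a migration plan for our database"))  # -> o1
```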