Selected from Ahead of AI
Author: Sebastian Raschka
Compiled by Synced
Reasoning models are currently thriving, and renowned AI technical blogger Sebastian Raschka is writing a new book about how they work, titled 'Reasoning From Scratch'. He has previously published several well-known books in the AI field, including 'Build a Large Language Model (From Scratch)', 'Machine Learning Q and AI', and 'Machine Learning with PyTorch and Scikit-Learn'.
Recently, he released the first chapter of this book on his blog, providing an introductory overview of reasoning in the LLM field, and also outlining technical approaches such as inference-time scaling and reinforcement learning.
Synced has compiled the first chapter of this book for readers.
(Note: For clarity, this article distinguishes the two terms as follows: "inference" refers to the computational process by which a model generates output from input (such as generating text), while "reasoning" refers to the model's ability to perform logical analysis, causal judgment, or problem-solving through methods such as Chain-of-Thought.)
Welcome to the next phase of large language models (LLMs): reasoning.
LLMs have transformed how we process and generate text, but their success has primarily been attributed to statistical pattern recognition. However, new progress is being made in reasoning methods, techniques that allow LLMs to tackle more complex tasks, such as solving logic puzzles or multi-step arithmetic problems. Understanding these methods is the core of this book.
This chapter will introduce:
What exactly "reasoning" means in the context of LLMs;
The fundamental difference between reasoning and pattern matching;
Traditional pre-training and post-training stages for LLMs;
Key methods for improving LLM reasoning capabilities;
Why building reasoning models from scratch can help us understand their strengths, limitations, and practical trade-offs.
1. What Exactly is "Reasoning" in LLMs?
What is LLM-based reasoning? The answer to, and discussion of, this question could fill a book on its own. However, this book takes a different approach: it aims to implement LLM reasoning methods from scratch, so it focuses on practical, hands-on programming rather than conceptual debate. Nonetheless, I think it's important to briefly define what we mean by "reasoning" in the LLM context.
Therefore, before turning to the programming parts in later chapters, I would like to use this first section of the book to define reasoning in the LLM context and clarify its relationship to pattern matching and logical deduction. This will lay the foundation for further discussion of how LLMs are currently built, how they approach reasoning tasks, and their strengths and weaknesses.
In this book, "reasoning" in the LLM context is defined as follows:
In the context of LLMs, reasoning refers to the model's ability to produce intermediate steps before providing a final answer. This process is often described as Chain-of-Thought (CoT) reasoning. In CoT reasoning, the LLM explicitly generates a sequence of structured statements or computations that illustrate its process for arriving at a conclusion.
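To make this definition concrete, here is a minimal sketch of how chain-of-thought output is commonly elicited simply through prompting. The `query_llm` helper is a hypothetical placeholder, not any particular library's API; substitute your own model or API call.

```python
# Minimal sketch: contrasting a direct prompt with a chain-of-thought (CoT) prompt.
# `query_llm` is a hypothetical stand-in for any LLM call (local model or API).

def query_llm(prompt: str) -> str:
    # Placeholder so the sketch runs; replace with a real model call.
    return f"<model response to: {prompt!r}>"

question = "If a train travels 60 km in 1.5 hours, what is its average speed?"

# Direct prompt: the model is expected to answer immediately.
direct_answer = query_llm(question)

# CoT-style prompt: the model is nudged to write out intermediate steps
# (e.g., "distance / time = 60 / 1.5 = 40 km/h") before the final answer.
cot_answer = query_llm(question + "\nPlease reason step by step, then give the final answer.")

print(direct_answer)
print(cot_answer)
```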
Figure 1 shows a simplified example of an LLM performing a multi-step (CoT) reasoning task.
Figure 1: A simplified example of an LLM processing a multi-step reasoning task. Rather than simply recalling a fact, a reasoning model combines multiple intermediate reasoning steps to arrive at the correct conclusion. Depending on the implementation, the intermediate reasoning steps may or may not be shown to the user.
As seen in Figure 1, the intermediate reasoning steps produced by the LLM look very much like a person thinking aloud. However, how similar these methods (and the resulting reasoning process) are to human reasoning remains an unanswered question, and this book will not attempt to answer it. It is even unclear whether such a question can be definitively answered.
Instead, this book focuses on explaining and implementing techniques that enhance the reasoning capabilities of LLMs, allowing them to better handle complex tasks. My hope is that by getting hands-on with these methods, you will gain a better understanding and be able to improve the reasoning methods currently under development, and perhaps even explore their similarities and differences with human reasoning.
Note: The reasoning process in LLMs might appear very similar to human thought, especially in the way intermediate steps are articulated. However, it is currently unclear whether LLM reasoning is similar to human reasoning in terms of internal cognitive processes. Human reasoning typically involves consciously manipulating concepts, intuitively understanding abstract relationships, or generalizing based on a few examples. In contrast, current LLM reasoning is primarily based on patterns learned from vast statistical correlations in the training data, rather than explicit internal cognitive structures or conscious reflection.
Therefore, while the output of reasoning-enhanced LLMs looks somewhat human-like, their underlying mechanisms are (likely) very different, and this is an active area of exploration.
2. Introduction to the LLM Training Process
This section will briefly summarize the typical way LLMs are trained so we can better understand their design and limitations. This background will also help us discuss the difference between pattern matching and logical reasoning.
Before applying any reasoning methods, traditional LLM training is usually divided into two stages: pre-training and post-training, as shown in Figure 2 below.
Figure 2: Overview of the typical LLM training process. Initially, the model is initialized with random weights and then pre-trained on a large-scale text dataset by predicting the next token to learn language patterns. Then, the model is optimized through instruction fine-tuning and preference fine-tuning to better follow human instructions and align with human preferences.
In the pre-training phase, LLMs are trained on a massive amount of unlabeled text (up to several TB), including books, websites, research papers, and many other sources. The pre-training objective of an LLM is to learn to predict the next word (or token) in these texts.
When pre-trained on terabytes of text at this scale, today's leading LLMs typically run on thousands of GPUs for months at a cost of millions of dollars, and the result is a very capable model: it can generate text that closely resembles human writing. Furthermore, to some extent, pre-trained LLMs begin to exhibit so-called emergent properties, meaning they can perform tasks they were not explicitly trained for, including translation, code generation, and more.
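As a rough illustration of this pre-training objective (a sketch under stated assumptions, not the book's actual training code), the snippet below computes the next-token prediction loss in PyTorch, assuming a GPT-style `model` that maps token IDs to logits over the vocabulary:

```python
import torch
import torch.nn.functional as F

# Sketch of the pre-training objective: predict token t+1 from tokens <= t.
# `model` is assumed to map token IDs of shape (batch, seq_len) to logits of
# shape (batch, seq_len, vocab_size); any GPT-style model fits this interface.

def next_token_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    inputs = token_ids[:, :-1]    # all tokens except the last
    targets = token_ids[:, 1:]    # the same sequence shifted by one position
    logits = model(inputs)        # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and positions
        targets.reshape(-1),                  # matching target token IDs
    )
```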
However, these pre-trained models are merely the foundation models for the post-training phase, which uses two key techniques: supervised fine-tuning (SFT, also called instruction fine-tuning) and preference fine-tuning. The purpose of post-training is to teach the LLM to respond to user queries, as shown in Figure 3 below.
Figure 3: Example responses of a language model at different training stages. In the figure, the prompt asks for a summary of the relationship between sleep and health. The pre-trained LLM gives a related but unfocused answer that doesn't directly follow the instruction. The instruction-tuned LLM generates a concise and accurate summary consistent with the prompt. The preference-tuned LLM further improves the response – using a friendly tone and more engaging language, making the answer more relevant and user-centric.
As shown in Figure 3, instruction fine-tuning improves the LLM's ability to handle personal-assistant-like tasks, such as answering questions, summarizing, and translating text. The preference fine-tuning stage then refines these capabilities and helps tailor responses to user preferences. Furthermore, preference fine-tuning is often used to make LLMs safer. (Some readers may be familiar with Reinforcement Learning from Human Feedback (RLHF), which is a specific technique for implementing preference fine-tuning.)
In short, we can view pre-training as raw language prediction (via next-token prediction), which gives the LLM its foundational capabilities and the ability to generate coherent text. The post-training phase then improves the LLM's ability to understand and follow tasks through instruction fine-tuning, and shapes the style of its answers through preference fine-tuning.
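As a small illustration of how instruction fine-tuning data is typically structured (the exact prompt template varies between projects; this Alpaca-style format is only an assumption for illustration), the model is still trained with next-token prediction during SFT, but on text formatted like this, often with the loss restricted to the response tokens:

```python
# Sketch: turning an instruction/response pair into a single training text
# for supervised (instruction) fine-tuning. The template below is an
# illustrative assumption, not a fixed standard.

def format_sft_example(instruction: str, response: str) -> str:
    return (
        "### Instruction:\n"
        f"{instruction}\n\n"
        "### Response:\n"
        f"{response}"
    )

example = format_sft_example(
    instruction="Summarize the relationship between sleep and health in one sentence.",
    response="Adequate sleep supports memory, immune function, and overall health.",
)
print(example)
```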
Readers interested in the details of LLM pre-training and post-training phases can refer to 'Build A Large Language Model (From Scratch)'. This current book on reasoning does not require knowledge of these stages – you will start with a model that has already been pre-trained and post-trained.
3. Pattern Matching: How LLMs Learn from Data
When LLMs are trained, they "read" vast amounts of text data and learn how to predict the next token based on the preceding text. They do this by discovering statistical patterns in the data, rather than truly "understanding" the content. So, even if they can write fluent and coherent sentences, they are essentially just mimicking surface-level associations, not engaging in deep thinking.
Most current LLMs (such as GPT-4o and Meta's Llama 3), unless specifically trained for reasoning, work this way: they don't perform step-by-step logical deduction like humans, but rather find the most probable answer based on the input question and the patterns in their training data. Simply put, they don't answer questions through true logical inference, but rather by "matching" input and output patterns.
Consider the following example:
Prompt: The capital of Germany is...
Answer: Berlin
When an LLM answers "Berlin", it doesn't arrive at the conclusion through logical reasoning; it has simply memorized the high-frequency pairing "Germany → Berlin" from the training data. This reaction is like a conditioned reflex, which we call "pattern matching" – the model is merely reproducing the textual patterns it has learned, without truly thinking step-by-step.
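The following toy snippet illustrates the idea with made-up numbers (a real LLM assigns probabilities over its entire vocabulary at every step):

```python
# Toy illustration of pattern matching via next-token probabilities.
# The numbers are invented for illustration; a real LLM produces a
# probability distribution over its whole vocabulary at each step.

next_token_probs = {
    "Berlin": 0.92,   # seen extremely often after "The capital of Germany is"
    "Munich": 0.03,
    "Hamburg": 0.02,
    "Paris": 0.01,
}

# Greedy decoding simply picks the most probable continuation -- no
# step-by-step deduction is involved, just a learned statistical association.
answer = max(next_token_probs, key=next_token_probs.get)
print(answer)  # Berlin
```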
But what if the problem is more complex? For example, a task that requires deducing the answer based on known facts? This is where another ability is needed: logical reasoning.
True logical reasoning refers to deriving a conclusion step-by-step based on premises, like solving a math problem. It requires intermediate thinking steps, the ability to identify contradictions, and the ability to determine causality based on established rules. This is completely different from simply "matching textual relationships."
For example:
All birds can fly. Penguins are birds. Can penguins fly?
A human (or a system that truly reasons) would immediately notice something is wrong: based on the first two sentences, it seems penguins should be able to fly, but everyone knows penguins actually can't fly. This creates a contradiction (as shown in Figure 4).
A system that reasons would immediately grasp this contradiction and realize that either the first statement is too absolute (not all birds can fly) or penguins are an exception.
Figure 4: Schematic diagram of logical conflict caused by contradictory premises. Based on the statements "All birds can fly" and "Penguins are birds", we would deduce the conclusion "Penguins can fly". But this conclusion directly conflicts with the known fact "Penguins cannot fly", which creates a contradiction.
LLMs, relying on statistical learning, do not actively identify such contradictions. They merely predict answers based on the textual patterns in the training data. If the statement "All birds can fly" appears particularly frequently in the training data, the model might confidently answer: "Yes, penguins can fly."
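For contrast, the toy sketch below shows what an explicit, rule-based check of the penguin example might look like. It is a deliberately simplistic illustration of logical deduction over stated premises, not a description of how LLMs work internally:

```python
# Toy rule-based check for the penguin example: explicit deduction from
# premises, in contrast to an LLM's statistical pattern matching.

premises = {
    ("bird", "can_fly"): True,       # "All birds can fly"
    ("penguin", "is_bird"): True,    # "Penguins are birds"
}
known_facts = {
    ("penguin", "can_fly"): False,   # known exception: penguins cannot fly
}

# Deduce: penguins are birds, and (by the first premise) all birds can fly.
deduced_penguin_flies = premises[("penguin", "is_bird")] and premises[("bird", "can_fly")]

# An explicit reasoner compares the deduction against known facts and flags
# the contradiction, instead of just predicting the most likely answer.
if deduced_penguin_flies != known_facts[("penguin", "can_fly")]:
    print("Contradiction: either 'all birds can fly' is too strong,")
    print("or penguins are an exception.")
```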
In the next section, we will look at a specific example to see how an LLM actually answers when it encounters the "All birds can fly..." problem.
4. Simulating Logical Reasoning: How LLMs Mimic Reasoning Logic Without Explicit Rules
In the previous section, we mentioned that when encountering contradictory premises (like "All birds can fly, but penguins cannot"), ordinary LLMs don't actively detect these contradictions. They just generate responses based on the textual patterns they learned during training.
Now let's look at a specific example (see Figure 5): How would a model like GPT-4o, which hasn't been specifically enhanced for reasoning capabilities, answer when encountering this "All birds can fly..." problem?
Figure 5: An example of how a language model (GPT-4o) handles contradictory premises.
From the example in Figure 5, we can see that although GPT-4o is not a specialized reasoning model (unlike OpenAI's o1 and o3, which were developed specifically for reasoning), it gives a seemingly correct answer to this question.
What's going on? Does GPT-4o really perform logical reasoning? Not really, but it does show that in familiar scenarios, GPT-4o can very convincingly "pretend" to perform logical reasoning.
Actually, GPT-4o doesn't actively check if statements are contradictory. Its answers are entirely based on the "word co-occurrence probabilities" learned from massive amounts of data.
For example: if the correct statement "Penguins cannot fly" appears frequently in the training data, the model will firmly remember the association between "penguin" and "cannot fly". As shown in Figure 5, even though GPT-4o doesn't have true logical reasoning ability, it can still provide the correct answer based on this "word probability memory".
Simply put, the model is not thinking with logical rules; it answers based on having seen something often enough to remember it. It can "detect" this contradiction only because it has encountered similar examples repeatedly during training; the ability comes entirely from textual patterns learned from massive amounts of data, much like the saying "practice makes perfect". In other words, even when a regular LLM appears to perform logical reasoning, as in Figure 5, it is not thinking step-by-step according to rules, but merely applying the textual patterns learned from its vast training data.
However, the fact that GPT-4o can answer this question correctly demonstrates an important phenomenon: when a model undergoes ultra-large-scale training, its "implicit pattern matching" ability can become very powerful. But this reliance on statistical regularities also has clear shortcomings and is prone to errors in the following situations:
Encountering completely new problem types (logical problems never seen in the training data) → Like a student who only practices past exams suddenly facing unfamiliar test questions;
The problem is too complex (requiring multiple interdependent reasoning steps) → Similar to asking a calculator to solve a math problem that requires writing out a proof;
Requires strict logical deduction (but there are no similar cases in the training data) → Like asking a student who has memorized model essays to create a completely new genre of writing on the spot.
Since rule-based systems are so reliable, why aren't they popular now? In fact, rule-based systems were very popular in the 80s and 90s, used in fields like medical diagnosis, legal judgments, and engineering design. Even today, they can still be seen in some critical areas (like medicine, law, aerospace) – after all, these occasions require clear reasoning processes and traceable decision-making basis. But these systems have a major drawback: they rely entirely on manually writing rules, which is extremely laborious to develop. In contrast, deep neural networks like LLMs, as long as they are trained on massive data, can flexibly handle various tasks and are much more versatile.
We can understand it this way: LLMs "simulate" logical reasoning by learning patterns from vast amounts of data. Although they do not internally run any rule-based logic systems, they can further enhance this simulation ability through some specialized optimization methods (such as enhancing inference computation capabilities and post-training strategies).
It's worth mentioning that the reasoning ability of LLMs is actually a gradual development process. Long before specialized reasoning models like o1 and DeepSeek-R1 appeared, ordinary LLMs were already able to exhibit reasoning-like behavior – for example, by generating intermediate steps to reach the correct conclusion. What we now call "reasoning models" are essentially the result of further strengthening and optimizing this ability, mainly achieved through two methods: 1. Using special inference computation scaling techniques, and 2. Performing targeted post-training.
The subsequent content of this book will focus on these advanced methods for improving the ability of large language models to solve complex problems, helping you gain a deeper understanding of how to enhance this "implicit" reasoning ability of large language models.
5. Enhancing LLM Reasoning Capabilities
The "reasoning capability" of large language models truly entered the public consciousness on September 12, 2024, when OpenAI released o1. In that official announcement, OpenAI specifically mentioned:
These new AI models do not reply instantly as before; instead, they spend some time thinking before they respond, much as a person would, so that their answers are more reliable.
OpenAI also specifically stated:
This enhanced thinking capability is particularly helpful for solving complex problems in fields such as science, programming, and mathematics, where reaching the answer often requires working through several non-obvious steps.
Although the specific technical details of o1 have not been disclosed, it is widely believed that it builds on earlier models such as GPT-4 and achieves its stronger reasoning capability by scaling up inference-time computation.
A few months later, in January 2025, DeepSeek released the DeepSeek-R1 model and a technical report detailing the methods for training reasoning models, causing a huge sensation. This was because:
They openly released, free of charge, a model that rivals or even surpasses the performance of o1;
They also publicly disclosed the complete plan for developing such models.
This book will guide you through implementing these methods from scratch to understand the technical principles behind enhancing AI reasoning capabilities. As shown in Figure 6, methods for enhancing large language model reasoning capabilities can currently be divided into three main categories:
Figure 6: Three main methods for enhancing the reasoning capabilities of large language models. These three methods (inference-time compute scaling, reinforcement learning, and knowledge distillation) are typically applied after the model has completed regular training. Regular training includes pre-training (which produces the base model), instruction fine-tuning, and preference fine-tuning.
As shown in Figure 6, these enhancement methods are applied to models that have already completed the aforementioned regular training stages.
Inference-Time Compute Enhancement
Inference-time compute scaling (also called inference scaling, test-time scaling, or test-time compute) covers a family of methods that improve the model's reasoning ability during the inference phase (i.e., when the user enters a prompt), without training or modifying the underlying model weights. The core idea is to trade additional computational resources for better performance, using techniques such as chain-of-thought reasoning and various sampling procedures so that a model with fixed parameters can exhibit stronger reasoning capabilities.
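One popular instance of this idea is self-consistency style sampling: draw several chain-of-thought completions and take a majority vote over their final answers. The sketch below assumes two hypothetical helpers, `sample_llm` (draws one completion at a given temperature) and `extract_final_answer` (pulls the final answer out of a completion); it is an illustrative sketch, not a specific library's API:

```python
from collections import Counter

# Sketch of inference-time scaling via repeated sampling plus majority voting
# (self-consistency). `sample_llm` and `extract_final_answer` are hypothetical
# helpers supplied by the caller.

def majority_vote_answer(prompt, sample_llm, extract_final_answer, n_samples=8):
    answers = []
    for _ in range(n_samples):
        completion = sample_llm(prompt, temperature=0.8)  # diverse reasoning paths
        answers.append(extract_final_answer(completion))
    # The answer reached by the most reasoning paths wins; more samples mean
    # more inference compute, but the model's weights never change.
    return Counter(answers).most_common(1)[0][0]
```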
Reinforcement Learning (RL)
Reinforcement Learning is a class of training methods that improve the model's reasoning ability by maximizing a reward signal. Its reward mechanism can be divided into two categories:
General rewards: such as task completion rate or heuristic scores
Precise verifiable rewards: such as correct answers to mathematical problems or pass rates on programming tasks
Unlike inference-time compute scaling, RL achieves capability enhancement through dynamic adjustment of model parameters (weights updating). This mechanism allows the model to continuously optimize its reasoning strategy through trial-and-error learning based on environmental feedback.
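As a minimal sketch of a verifiable reward signal of the kind used for math or coding tasks (the policy-gradient update that would consume this reward is omitted):

```python
# Sketch of a verifiable reward for RL-based reasoning training: the reward
# comes from an automatic correctness check, not from human preference ratings.

def math_reward(generated_answer: str, correct_answer: str) -> float:
    # Reward 1.0 if the model's final answer matches the reference, else 0.0.
    # Real systems normalize formatting (fractions, units, whitespace) first.
    return 1.0 if generated_answer.strip() == correct_answer.strip() else 0.0

# The reward would then drive a policy-gradient-style update of the LLM's weights.
print(math_reward("42", "42"))   # 1.0
print(math_reward("41", "42"))   # 0.0
```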
Note: When developing reasoning models, it is important to clearly distinguish between the pure reinforcement learning (RL) methods discussed here and the Reinforcement Learning from Human Feedback (RLHF) used for preference fine-tuning in regular LLM development (as shown in Figure 2). The core difference lies in the source of the reward signal:
RLHF generates reward signals through explicit human scoring or ranking of model outputs, directly guiding the model toward human-preferred behavior;
Pure RL relies on automated or environment-driven reward signals (such as the correctness of a mathematical proof), which are more objective but may align less closely with subjective human preferences.
A comparison of typical scenarios: in pure RL training on a mathematical proof task, the system provides rewards based solely on the correctness of the proof steps; in RLHF training, human evaluators rank different outputs by preference so that responses are optimized for human standards such as clarity of expression and logical fluency.
Supervised Fine-tuning and Model Distillation
Model distillation is a technique that transfers the complex reasoning patterns learned by a high-performance large model to a lighter model. In the LLM field, this typically means using a high-quality instruction dataset generated by the high-performance model for supervised fine-tuning (SFT) of the smaller model. In the LLM literature, this approach is often referred to simply as knowledge distillation or distillation.
Difference from traditional deep learning: in classic knowledge distillation, the "student model" learns from both the outputs and the logits of the "teacher model", whereas LLM distillation usually relies only on the teacher's generated outputs.
Note: The Supervised Fine-Tuning (SFT) technique used in this scenario is similar to SFT in regular large language model development, with the core difference being that the training samples are generated by models specifically developed for reasoning tasks (rather than general LLMs). Therefore, the training samples are more focused on reasoning tasks and typically include intermediate reasoning steps.
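A rough sketch of how such a distillation dataset might be assembled is shown below; `query_teacher_model` is a hypothetical helper standing in for a call to a strong reasoning model:

```python
# Sketch of building a distillation dataset: a stronger "teacher" reasoning
# model generates responses (including intermediate reasoning steps), and the
# resulting prompt/response pairs are used for supervised fine-tuning (SFT)
# of a smaller "student" model.

def build_distillation_dataset(prompts, query_teacher_model):
    dataset = []
    for prompt in prompts:
        response = query_teacher_model(prompt)  # reasoning steps + final answer
        dataset.append({"instruction": prompt, "response": response})
    return dataset

# The student is then fine-tuned on this dataset with ordinary next-token
# prediction -- only the teacher's generated text is used, not its logits.
```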
6. The Importance of Building Reasoning Models From Scratch
Since the release of DeepSeek-R1 in January 2025, enhancing LLM reasoning capabilities has become one of the hottest topics in the AI field. The reason is easy to understand. Stronger reasoning capabilities enable LLMs to solve more complex problems, making them more capable of addressing various tasks that users care about.
A statement by OpenAI CEO Sam Altman on February 12, 2025, also reflects this shift:
We will next release GPT-4.5, the model we have internally called Orion, as our last non-chain-of-thought model. After that, a top goal for us is to unify the o-series and GPT-series models by building a system that can use all our tools, knows when it needs to think for a long time and when it doesn't, and is broadly useful for a very wide range of tasks.
The above quote highlights the shift towards reasoning models by leading LLM providers. Here, chain-of-thought refers to a prompting technique that guides the language model to reason step-by-step to improve its reasoning ability.
Another point worth mentioning is that "knows when it needs or doesn't need to think for a long time" also implies an important design consideration: reasoning is not always necessary or desirable.
For example, reasoning models are designed to solve complex tasks, such as solving puzzles, advanced math problems, and difficult programming tasks. However, for simple tasks like summarizing, translating, or knowledge-based question answering, reasoning is not necessary. In fact, using a reasoning model for everything might be inefficient and costly. For instance, reasoning models are often more expensive to use, more verbose, and sometimes more prone to errors due to "overthinking". Furthermore, a simple rule applies here: use the right tool (or LLM type) for the specific task.
Why are reasoning models more expensive than non-reasoning models?
Primarily because they tend to produce longer outputs, due to the intermediate reasoning steps explaining how the answer was reached. As shown in Figure 7, an LLM generates text one token at a time. Each new token requires a full forward pass through the model. Therefore, if a reasoning model produces an answer twice as long as a non-reasoning model, it requires twice as many generation steps, leading to a doubling of computational cost. This also directly impacts API usage costs – billing is typically based on the number of tokens processed and generated.
Figure 7: Token-by-token generation in LLMs. At each step, the LLM takes the complete sequence generated so far and predicts the next token – which may represent a word, sub-word, or punctuation, depending on the tokenizer. The newly generated token is appended to the sequence and used as input for the next step. This iterative decoding process is used for both standard language models and reasoning-centric models.
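The sketch below mirrors this decoding loop in PyTorch, assuming a GPT-style `model` that maps a (1, seq_len) tensor of token IDs to (1, seq_len, vocab_size) logits; it makes explicit that every new token costs one full forward pass:

```python
import torch

# Sketch of greedy token-by-token decoding. Each loop iteration runs a full
# forward pass, so generation cost grows with the number of output tokens --
# which is why longer reasoning traces cost proportionally more.

@torch.no_grad()
def generate(model, token_ids: torch.Tensor, max_new_tokens: int, eos_id: int) -> torch.Tensor:
    # token_ids: shape (1, seq_len), the prompt encoded as token IDs.
    for _ in range(max_new_tokens):
        logits = model(token_ids)                                  # one forward pass per new token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # greedy pick of the next token
        token_ids = torch.cat([token_ids, next_id], dim=1)         # append and feed back in
        if next_id.item() == eos_id:                               # stop at end-of-sequence token
            break
    return token_ids
```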
This is exactly why implementing LLMs and reasoning methods from scratch matters: it is one of the best ways to understand how they work, and understanding how they work is what allows us to weigh these trade-offs.
7. Summary
Reasoning in LLMs involves using intermediate steps (Chain-of-Thought) to systematically solve multi-step tasks.
Traditional LLM training is divided into several stages: pre-training, where the model learns language patterns from vast text; instruction fine-tuning, which improves the model's response to user prompts; and preference fine-tuning, aligning model output with human preferences.
Pattern matching in LLMs relies entirely on statistical associations learned from data, which allows for fluent text generation but lacks explicit logical deduction.
LLM reasoning capabilities can be enhanced through: inference-time compute scaling, which boosts reasoning without retraining (e.g., Chain-of-Thought prompting); reinforcement learning, which explicitly trains the model using reward signals; and supervised fine-tuning and distillation, using examples from more powerful reasoning models.
Building reasoning models from scratch can provide practical insights into LLM capabilities, limitations, and computational trade-offs.
The above is the main content of the first chapter of Sebastian Raschka's new book 'Reasoning From Scratch'. It sets a good tone for the book through some basic introductions. What are your thoughts on reasoning models, and what are your expectations for this book?