
Understanding Reasoning in Large Language Models: Overview of the paper "Towards Reasoning in Large Language Models: A Survey"

Updated on May 9, 2025

Introduction
As defined by several authors (Wason and Johnson-Laird, 1972; Wason, 1968; Galotti, 1989; Fagin et al., 2004; McHugh and Way, 2018), reasoning is the act of thinking about something logically and systematically in order to draw a conclusion or make a decision. Inference, evaluation of arguments, and drawing logical conclusions are all components of reasoning. “Reasoning” is a term that appears often in the literature and in everyday conversation, yet it is a loosely defined concept whose meaning can shift with context. To help the reader grasp this notion, we summarize several widely accepted types of reasoning.

Prerequisites

  • Basic Understanding of Machine Learning (ML) and Natural Language Processing (NLP): Familiarity with ML concepts and NLP techniques, such as tokenization, embeddings, and language model architectures like Transformers.
  • Knowledge of Large Language Models (LLMs): Understanding LLMs like GPT, BERT, and their training processes, including pretraining and fine-tuning.
  • Familiarity with Reasoning and Logic Concepts: Basic concepts of reasoning (e.g., deductive, inductive, and abductive reasoning) and logical frameworks used in AI.
  • Understanding of In-Context Learning and Few-Shot Learning: Familiarity with how LLMs process prompts to adapt to tasks without extensive retraining.

Different types of reasoning

Deductive reasoning: In deductive reasoning, a conclusion is drawn from premises that are assumed to be true. Because the conclusion must follow logically from the premises, if the premises are true, the conclusion is guaranteed to be true as well.

Example:

Premise: All mammals have kidneys.

Premise: All whales are mammals.

Conclusion: All whales have kidneys.

Inductive reasoning: In inductive reasoning, a conclusion is drawn from a body of observations or supporting evidence. Based on the evidence provided, the conclusion is probably correct, but this is by no means guaranteed.

Example:

Observation: Every time we see a creature with wings, it is a bird.

Observation: We see a creature with wings.

Conclusion: The creature is likely to be a bird.

Abductive reasoning: In abductive reasoning, a conclusion is drawn by seeking the most plausible explanation for a set of observations. The conclusion represents the best explanation given the available information, but it should not be taken as certain.

For example:

Observation: The car cannot start, and there is a puddle of liquid under the engine.

Conclusion: The most likely explanation is that the car has a leak in its radiator.

Other forms of reasoning include analogical reasoning, which draws comparisons between two or more things to make inferences or reach conclusions; causal reasoning, which focuses on identifying and understanding the causes and effects of events or phenomena; and probabilistic reasoning, which involves making decisions or forming conclusions based on the likelihood or probability of specific outcomes.

Formal Reasoning vs Informal Reasoning: In mathematics and logic, “formal reasoning” refers to reasoning that is methodical and follows explicit logical rules. In daily life, we more often use “informal reasoning,” a looser style that relies on intuition, experience, and common sense to draw conclusions and solve problems. Informal reasoning is more flexible and open-ended, but its lack of structure can make it less reliable than formal reasoning.

Reasoning in Language Models: Although the concept of reasoning in language models is not new, there is no precise definition of what it entails. In the literature, the term “reasoning” is often used without specifying whether it is formal or informal, and in most cases it refers to informal reasoning (Cobbe et al., 2021; Wei et al., 2022b, among others).

Towards Reasoning in Large Language Models

As large language models (LLMs) like GPT-4, Claude, and Gemini continue to advance, researchers and developers are increasingly focusing on enhancing their reasoning capabilities — moving beyond simple text generation to more sophisticated, human-like thinking.

Traditional LLMs are excellent at pattern recognition and imitation based on massive datasets, but true reasoning — the ability to logically connect information, infer unseen facts, and solve novel problems — remains a significant challenge. Efforts towards building reasoning into LLMs can be broadly categorized into several strategies:

1. Chain-of-Thought Prompting

Instead of asking an LLM for a direct answer, chain-of-thought (CoT) prompting encourages it to break down problems into intermediate steps. This mirrors human reasoning, where arriving at a conclusion often involves multiple stages of thinking. By guiding the model to “think aloud,” CoT improves accuracy, especially in tasks involving logic, math, and complex decision-making.
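
As a rough illustration, the sketch below assembles a chain-of-thought prompt in Python. The call_llm function is a placeholder for whatever completion API you use (not a real library call), and the demonstration text is adapted from the Roger example discussed later in this article.

```python
# Minimal chain-of-thought prompt construction.
# `call_llm` is a placeholder; swap in your own completion client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your LLM provider.")

COT_DEMO = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def cot_prompt(question: str) -> str:
    # Prepend a worked demonstration so the model imitates step-by-step reasoning.
    return COT_DEMO + f"Q: {question}\nA:"

# Example usage (requires a real call_llm implementation):
# answer = call_llm(cot_prompt("A farm has 3 pens with 4 pigs each. How many pigs in total?"))
```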

2. Self-Consistency Sampling

When reasoning about difficult questions, humans often consider multiple possibilities before settling on an answer. Similarly, LLMs can generate several reasoning paths and then choose the most consistent or most frequent solution. This approach, called self-consistency, often leads to better outcomes than relying on a single generated answer.
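
A minimal sketch of self-consistency follows, assuming a sampling-enabled call_llm placeholder and the common convention that completions end with a line such as “The answer is 11.”: sample several reasoning paths, extract each final answer, and keep the most frequent one.

```python
import re
from collections import Counter

def call_llm(prompt: str, temperature: float = 0.7) -> str:
    # Placeholder for a completion call that supports temperature sampling.
    raise NotImplementedError

def extract_answer(completion: str) -> str:
    # Assumes the reasoning ends with "The answer is X." (a convention, not a guarantee).
    match = re.search(r"answer is\s*([-\d.]+)", completion, re.IGNORECASE)
    return match.group(1) if match else completion.strip()

def self_consistent_answer(prompt: str, n_samples: int = 10) -> str:
    # Sample several reasoning paths, then majority-vote over the final answers.
    answers = [extract_answer(call_llm(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```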

3. Tool-Augmented Reasoning

Large language models are being combined with external tools like calculators, search engines, code interpreters, and knowledge graphs to enhance reasoning. When an LLM recognizes its own limitations, it can delegate parts of a problem to a specialized tool and integrate the result back into its overall answer — mimicking how humans seek external help when needed.
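
One way to picture tool delegation is the sketch below: a small, safe calculator “tool” plus a hypothetical convention in which the model marks lines it wants computed with a CALC: prefix. The prefix and routing logic are illustrative assumptions, not part of any particular framework.

```python
import ast
import operator

# A tiny, safe arithmetic "tool" the model can delegate to.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str) -> float:
    def evaluate(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

def answer_with_tool(model_output: str) -> str:
    # Hypothetical convention: the model emits lines like "CALC: 17 * 23"
    # when it wants help; we run the tool and splice the result back in.
    lines = []
    for line in model_output.splitlines():
        if line.startswith("CALC:"):
            result = calculator(line.removeprefix("CALC:").strip())
            lines.append(f"{line} -> {result}")
        else:
            lines.append(line)
    return "\n".join(lines)
```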

4. Memory and Contextual Reasoning

Building memory into LLMs — the ability to recall past interactions or facts across a conversation — strengthens their ability to reason over longer contexts. Emerging architectures are trying to provide LLMs with episodic memory (short-term) and semantic memory (long-term) to improve multi-turn, context-aware reasoning.

Fully Supervised Finetuning

Fully supervised finetuning is a critical technique in training large language models (LLMs) and other AI systems to perform specific tasks with greater accuracy and reliability. It involves training a pre-existing model on a labeled dataset where input-output pairs are explicitly provided, guiding the model to learn precise mappings from input queries to correct responses.

In contrast to unsupervised learning, where models discover patterns without explicit labels, supervised finetuning anchors the model’s behavior toward the desired outcomes by continuously comparing its predictions against known ground truths and adjusting accordingly.
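
At its core, the procedure is a standard supervised training loop: compare the model’s predictions against labeled targets and update the weights. The toy PyTorch sketch below substitutes a tiny stand-in model and random token IDs for a real pretrained LLM and dataset, purely to keep the example self-contained.

```python
import torch
import torch.nn as nn

# Toy supervised fine-tuning loop: labeled (input, output) token sequences,
# cross-entropy against the known targets. A real setup would load a
# pretrained LLM; here a tiny stand-in model keeps the sketch runnable.

VOCAB = 1000
model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Fake labeled dataset: each example is (input_ids, target_ids) of equal length.
dataset = [(torch.randint(0, VOCAB, (16,)), torch.randint(0, VOCAB, (16,)))
           for _ in range(8)]

for epoch in range(2):
    for input_ids, target_ids in dataset:
        logits = model(input_ids)            # (seq_len, vocab)
        loss = loss_fn(logits, target_ids)   # compare predictions to ground truth
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```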

Fully supervised fine-tuning suffers from two serious flaws. First, it requires a dataset containing explicit reasoning, which can be challenging and time-consuming to create. Second, the model is trained on a single dataset, which restricts it to one domain and increases the likelihood that it will rely on artifacts in the training data, rather than genuine reasoning, to make predictions.

Prompting & In-Context Learning

Through in-context learning, large language models such as GPT-3 (Brown et al., 2020) have shown impressive few-shot performance on a wide range of tasks. A query and a few ⟨input, output⟩ examples are all that is needed to prompt these models to reason about how to approach a problem and find a solution, either implicitly or explicitly. Although these models have improved, they still struggle with problems that require multiple steps of reasoning to solve (Bommasani et al., 2021; Rae et al., 2021; Valmeekam et al., 2022). Recent research suggests that this may be because the full potential of these models has not yet been explored.
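
In practice, in-context learning amounts to formatting a handful of ⟨input, output⟩ pairs ahead of the query, as in this small sketch (the task, labels, and examples are made up for illustration):

```python
# Few-shot prompt construction: a handful of (input, output) pairs followed
# by the query. No weights are updated; the model adapts purely in context.

def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    demos = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{demos}\n\nInput: {query}\nOutput:"

examples = [
    ("The movie was wonderful.", "positive"),
    ("I wasted two hours of my life.", "negative"),
]
print(few_shot_prompt(examples, "The plot dragged but the acting was superb."))
```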

Chain of Thought and Its Variants

By instructing LLMs to engage in “reasoning” explicitly, we can increase the likelihood that they will reason rather than merely provide answers. Wei et al. (2022b) suggest chain-of-thought prompting as a means to this end. In this method, the demonstrations include a “chain of thought” (CoT): the intermediate reasoning steps, expressed in natural language, that lead to the answer.

Specifically, in CoT prompting, ⟨input, output⟩ demonstrations are replaced with ⟨input, chain of thought, output⟩ triples.

Examples

[input] Roger has five tennis balls. He buys two more cans of tennis balls. Each can has three tennis balls. How many tennis balls does he have now?

[chain of thought] Roger started with five balls. 2 cans of 3 tennis balls each is six tennis balls. 5 + 6 = 11.

[output] The answer is 11.


As a result, when presented with a target question, the model learns to produce clear rationales before generating the final response. In the literature, various types of chain-of-thought prompting have been developed, each in a distinct form or to tackle a particular issue.

Different Form: To elicit reasoning without the necessity for few-shot demonstrations, Kojima et al. (2022) propose Zero-shot-CoT, in which LLMs are merely prompted with the statement “Let’s think step by step” following the input. Madaan et al. (2022), Gao et al. (2022), and Chen et al. (2022) discovered that LLMs trained with code, such as Codex (Chen et al., 2021), perform better on reasoning tasks when reasoning is framed as code generation.
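
These two variants reduce to simple prompt templates. The sketch below shows a Zero-shot-CoT prompt and a prompt that frames reasoning as code generation in the spirit of PAL / Program of Thoughts; the exact instruction wording and the sandboxed execution step are assumptions for illustration.

```python
# Zero-shot CoT (Kojima et al., 2022): append the trigger phrase,
# no demonstrations needed.
def zero_shot_cot(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."

# Reasoning framed as code generation (in the spirit of PAL / Program of
# Thoughts): ask the model to emit a Python function whose return value is
# the answer. The instruction wording here is illustrative, not canonical.
def code_reasoning_prompt(question: str) -> str:
    return (
        "Write a Python function solution() that returns the answer.\n"
        f"Q: {question}\n"
        "# solution in Python:\n"
    )

# The generated code would then be executed in a sandbox and solution()
# called; that step is omitted here.
```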

Specific Problem/Setting: Prior to chain-of-thought prompting, Nye et al. (2022) used intermediate computations, dubbed “scratchpads,” to improve language models’ reasoning performance in both the finetuning and few-shot regimes, with a particular emphasis on programs. Shi et al. (2022) tackle multilingual reasoning problems using CoT in the original language of the problem, CoT in English (regardless of the problem language), and CoT in English with the problem first translated into English.

Rationale Engineering

Rationale engineering aims to improve the elicitation or use of reasoning in LLMs. This can be achieved through rationale refinement (creating more effective examples of reasoning processes), or through rationale exploration and rationale verification (exploring and evaluating the rationales produced by LLMs). Figure 2 of the survey provides an overview of rationale engineering.

Rationale refinement: The goal of rationale refinement is to create rationale examples that are better at eliciting reasoning in LLMs. Fu et al. (2022b) suggest complexity-based prompting, which uses rationales with more reasoning steps; their findings show that LLM performance improves as rationale complexity increases. Similarly, Zhou et al. (2022c) propose algorithmic prompting, showing that presenting more detailed examples of solutions can improve reasoning ability on certain simple arithmetic computations.

Rationale exploration: In addition to providing better exemplars, we can allow LLMs to fully explore multiple ways of reasoning in order to improve their performance on reasoning tasks; this process is known as rationale exploration. Wang et al. (2022c) introduce a decoding strategy called self-consistency to improve on the greedy decoding used in classic chain-of-thought prompting. It samples a diverse set of rationales, rather than only the greedy one, and selects the most consistent answer by marginalizing out the sampled rationales.

Rationale verification: The accuracy of an LLM’s predictions depends on the accuracy of the rationales used to reach them, so checking their validity is essential (Ye and Durrett, 2022). To address this, a process called rationale verification has been developed to check whether the rationales an LLM produces actually lead to correct answers. To improve LLMs’ performance on mathematical word problems, Cobbe et al. (2021) propose adding a trained verifier that scores the LLM’s generated rationales and solutions and selects the solution with the highest score.
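
A verifier-based reranking loop in the spirit of Cobbe et al. (2021) can be sketched as follows. Both call_llm and verifier_score are placeholders rather than real APIs; a real verifier would be a separately trained model that estimates the probability that a candidate solution is correct.

```python
# Sample several candidate solutions, score each with a verifier,
# and keep the highest-scoring one.

def call_llm(prompt: str, temperature: float = 0.8) -> str:
    raise NotImplementedError  # placeholder completion call

def verifier_score(question: str, solution: str) -> float:
    # Placeholder for a trained verifier estimating P(solution is correct).
    raise NotImplementedError

def verified_answer(question: str, n_candidates: int = 20) -> str:
    candidates = [call_llm(question) for _ in range(n_candidates)]
    return max(candidates, key=lambda s: verifier_score(question, s))
```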

Problem Decomposition

Despite its success in eliciting reasoning in LLMs, chain-of-thought prompting can struggle with complex problems, such as those requiring compositional generalization (Lake and Baroni, 2018; Keysers et al., 2020). For such problems, it is often advantageous to partition them into simpler subproblems, solve each subproblem, and then combine the results to solve the original problem. This approach is known as problem decomposition, or “divide and conquer.”

Least-to-most prompting: Zhou et al. (2022a) propose least-to-most prompting, which entails two steps: first, breaking the complex problem down into manageable subproblems; second, solving these subproblems in sequence, with the answer to each subproblem feeding into the solution of the next. In follow-up work, Drozdov et al. (2022) introduce dynamic least-to-most prompting, which targets more realistic semantic parsing problems by decomposing them with prompting-based syntactic parsing and then dynamically selecting exemplars based on the decomposition.
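
A rough sketch of the two least-to-most stages, assuming a call_llm placeholder and a simple one-subquestion-per-line decomposition format (both are illustrative assumptions rather than the authors' exact prompts):

```python
# Stage 1: ask the model to list easier subquestions.
# Stage 2: solve them in order, feeding earlier answers into later prompts.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder completion call

def decompose(problem: str) -> list[str]:
    reply = call_llm(f"Break this problem into simpler subquestions, one per line:\n{problem}")
    return [line.strip() for line in reply.splitlines() if line.strip()]

def least_to_most(problem: str) -> str:
    context = f"Problem: {problem}\n"
    answer = ""
    for sub in decompose(problem):
        answer = call_llm(f"{context}Subquestion: {sub}\nAnswer:")
        context += f"Subquestion: {sub}\nAnswer: {answer}\n"  # carry prior answers forward
    return answer  # the final subquestion's answer addresses the full problem
```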

Decomposed prompting: Khot et al. (2022) propose decomposed prompting, which divides a complex problem into subproblems that can be addressed by a common library of prompting-based LLMs, each specialized in a specific subproblem.

Successive prompting: Successive prompting, developed by Dua et al. (2022), is an iterative method of decomposing a complex problem into a series of simpler problems. Each subsequent subproblem prediction has access to the solutions from the prior one.

Hybrid Methods

While prompting techniques can help elicit or make better use of the reasoning already present in large language models, they do not genuinely improve the reasoning abilities of the LLMs themselves, since the model parameters remain unchanged. In contrast, the “hybrid approach” seeks both to strengthen the reasoning capabilities of LLMs and to make better use of those capabilities for solving complex problems. This strategy combines improving the LLMs’ reasoning ability with techniques, such as prompting, that make the most of those abilities.

Bootstrapping & Self-Improving

Some research has looked at the possibility of letting LLMs develop their reasoning skills through a process called bootstrapping, as opposed to fine-tuning them on pre-built datasets that already include reasoning.

One example of this is the Self-Taught Reasoner (STaR) introduced by Zelikman et al. (2022), in which an LLM is iteratively trained and refined on its own output. Specifically, the model first generates initial rationales using CoT prompting and is then finetuned on the rationales that lead to correct answers. This process can be repeated, with each iteration producing an improved model that can generate better training data, which in turn leads to further improvements.
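
At a high level, the bootstrapping loop looks like the sketch below. Here generate_rationale, final_answer, and finetune are placeholders for the model-specific pieces, and the original STaR method also includes a rationalization step (hinting the model with the correct answer for problems it failed) that is omitted for brevity.

```python
# STaR-style bootstrapping loop (Zelikman et al., 2022), sketched at a high level.
# The three helpers below are placeholders, not real APIs.

def generate_rationale(model, question: str) -> str: ...
def final_answer(rationale: str) -> str: ...
def finetune(model, examples: list[tuple[str, str]]): ...

def star_loop(model, dataset: list[tuple[str, str]], iterations: int = 3):
    for _ in range(iterations):
        kept = []
        for question, gold in dataset:
            rationale = generate_rationale(model, question)
            if final_answer(rationale) == gold:   # keep only rationales that work
                kept.append((question, rationale))
        model = finetune(model, kept)             # improved model feeds the next round
    return model
```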

Measuring Reasoning in Large Language Models

One way to evaluate the reasoning ability of LLMs is to report their performance (e.g., accuracy) on end tasks that involve reasoning. Here are some widely used benchmarks:

Arithmetic Reasoning

Arithmetic reasoning is the capacity to understand and apply mathematical concepts and principles to solve problems involving arithmetic operations. It entails applying logical reasoning and mathematical principles to decide on the best course of action when tackling mathematical problems. MATH (Hendrycks et al., 2021), MathQA (Amini et al., 2019), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2020), AQuA (Ling et al., 2017), and MAWPS (Roy and Roth, 2015) are all examples of benchmarks for mathematical reasoning.

Commonsense Reasoning

Commonsense Reasoning is the use of everyday knowledge and understanding to make judgments and predictions about new situations. It is a fundamental aspect of human intelligence that enables us to navigate our environment, understand others, and make decisions with incomplete information.

CSQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021), and ARC (Clark et al., 2018) are benchmarks that can be used to evaluate an LLM’s commonsense reasoning capabilities. For a more comprehensive overview of this field, we suggest the survey by Bhargava and Ng (2022).

Symbolic Reasoning

Symbolic reasoning is a form of reasoning in which abstract symbols, representing ideas and relationships, are manipulated according to formal rules in order to deduce or solve a problem. Wei et al. (2022b) present two symbolic reasoning benchmarks: Last Letter Concatenation and Coin Flip.
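
Both tasks are easy to generate and check programmatically, which is part of their appeal as benchmarks. A small sketch:

```python
# The two symbolic tasks from Wei et al. (2022b), generated programmatically
# so their ground truth is trivially checkable.

def last_letter_concatenation(words: list[str]) -> str:
    # e.g. ["Elon", "Musk"] -> "nk"
    return "".join(word[-1] for word in words)

def coin_flip(flips: list[bool], starts_heads: bool = True) -> str:
    # Each True means "someone flips the coin"; an even number of flips
    # leaves the coin in its starting state.
    heads = starts_heads ^ (sum(flips) % 2 == 1)
    return "heads" if heads else "tails"

print(last_letter_concatenation(["Elon", "Musk"]))  # -> "nk"
print(coin_flip([True, False, True]))               # -> "heads"
```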

Findings and Implications

Here, we provide a concise overview of the key findings and implications of research on reasoning in large language models:

Reasoning seems to be an emergent ability of LLMs: Large jumps in performance on reasoning tasks at a certain scale (e.g., around 100 billion parameters) suggest that reasoning ability emerges mainly in very large language models such as GPT-3 175B (Wei et al., 2022a,b; Suzgun et al., 2022). Using one large model for general reasoning problems may therefore be more efficient than training many small models for specialized tasks. The underlying cause of this emergent ability, however, is still not well understood. For some possible explanations, see Wei et al. (2022a) and Fu et al. (2022a).

Chain of thought elicits “reasoning” in LLMs: In experiments by Wei et al. (2022a,b) and Suzgun et al. (2022), LLMs performed better when given chain-of-thought (CoT) prompts to help them reason through a problem. Saparov and He (2022) show that LLMs can generate correct individual proof steps when presented with CoT prompts, even when the underlying ontology is fictional or counterfactual; however, when faced with a choice among multiple options, they may choose the wrong one, which can result in a flawed or incomplete proof.

Chain-of-thought prompting can lead to significant performance improvements on many reasoning tasks, including those where the performance of standard prompting rises smoothly with model size. Furthermore, compared with standard prompting or fully supervised finetuning, CoT prompting has been shown to improve the out-of-distribution robustness of LLMs (Wei et al., 2022b; Zhou et al., 2022a; Anil et al., 2022).

LLMs show human-like content effects on reasoning: Dasgupta et al. (2022) report that the reasoning tendencies of LLMs are comparable to those of humans as described in the cognitive science literature. For instance, the models’ predictions are influenced by factors such as familiarity, abstractness, and the believability of the conclusion, much as human judgments are. Although language models may not always perform well on reasoning tasks, these results imply that their failures often occur in settings that are difficult for humans as well, suggesting that language models may “reason” in a way analogous to human reasoning.

LLMs are still unskilled at complex reasoning: According to research by authors like Valmeekam et al. (2022), Han et al. (2022a), and Ruis et al. (2022), while LLMs seem to have outstanding reasoning ability, they nevertheless struggle with more complex reasoning problems or those requiring implicature.

Right task/application: More realistic and meaningful applications, such as decision making (Edwards, 1954), legal reasoning (Levi, 2013), and scientific reasoning (Zimmerman, 2000), are needed to fully understand the reasoning capabilities of LLMs. We do not want to end up having LLMs perform tasks that are easily handled by other programs. Meaningful research should always consider the task at hand and whether the proposed approach generalizes to more realistic problems and applications.

Improving reasoning capabilities of LLMs: Techniques such as chain-of-thought prompting (Wei et al., 2022b) can elicit the reasoning skills of large language models, but they cannot enable the models to solve problems beyond their existing capabilities. Substantially improving reasoning in LLMs requires training data, model architectures, and optimization objectives that are designed to encourage reasoning. Examples include finetuning a model on a dataset that contains CoT data (Chung et al., 2022) and bootstrapping a model’s own reasoning to improve its performance (Zelikman et al., 2022; Huang et al., 2022a). There is still considerable room for improvement, and we welcome future research in this field.

Conclusion

While large language models (LLMs) have achieved impressive results across natural language tasks, there is still ongoing debate about whether they truly possess reasoning abilities or simply rely on statistical pattern recognition and heuristics. Understanding and enhancing the reasoning capabilities of LLMs remains a critical research challenge. Continued exploration in this area will not only deepen our comprehension of LLM behavior but also guide the development of more reliable and intelligent AI systems.

Platforms like DigitalOcean, with scalable GPU-backed infrastructure, provide accessible and cost-effective environments for training, fine-tuning, and evaluating LLMs—empowering researchers and developers to push the boundaries of machine reasoning and deploy smarter AI applications in production.
