Reasoning Models Know When They’re Right

Ask DeepSeek R1 a simple math problem — “what’s 1 + 1” — and here’s what unfolds before it gives the answer:

Okay, so the user is asking what 1 plus 1 equals in math. Hmm, that seems pretty straightforward, right? But maybe I should double-check to make sure there's no trick here. Sometimes people ask simple questions to see if you'll give a detailed answer or just a quick one. Let me think.

First, in basic arithmetic, 1 plus 1 is definitely 2. That's like the foundation of addition. But maybe the user is coming from a different angle. Like, in binary, 1 plus 1 equals 10, right? Because binary is base-2, so when you add 1 and 1, you carry over a 1 to the next place value. But the question specifically mentions "in math," which is pretty general. So maybe they just want the standard answer.

Wait, could there be another context? In some algebraic structures, like Boolean algebra, 1 plus 1 might be 1, because it's logical OR. But in standard arithmetic, it's 2. The user didn't specify the context, so I should probably mention both possibilities but clarify the usual answer.

…… ……

(Generated by DeepSeek in response to “what 1 + 1 equals in math”)

The reasoning actually continues through several more rounds of “let me also consider…” and “another thought…”, but the point stands: AI models overthink. They often keep producing tokens after arriving at the correct answer, wasting computation. This raises an important question: can these models tell when they’ve already found the right answer, and stop accordingly?

A team of researchers from NYU and NYU Shanghai explored this question in their recent paper, “Reasoning Models Know When They’re Right: Probing Hidden States for Self-Verification,” now available as a preprint on arXiv. NYU Shanghai PhD student Chen Yulin and Assistant Professor of Computer Science Zhao Chen were among the study’s contributors.

The project examines to what extent models can evaluate the correctness of their intermediate answers during reasoning, and uses those findings to improve reasoning efficiency.

Specifically, the team developed a lightweight probe—a two-layer neural network—that examines a model’s hidden states at intermediate reasoning steps. By training the probe to recognize patterns associated with correct intermediate answers, the researchers enabled the model to “self-verify” during inference.
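A minimal sketch of what such a two-layer probe could look like, assuming hypothetical dimensions and an untrained forward pass (the paper’s actual hidden size, probe width, and training details may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, not taken from the paper: hidden_size would
# match the base model's hidden-state width; probe width is arbitrary.
HIDDEN_SIZE = 5120
PROBE_WIDTH = 256

# Two-layer probe: hidden state -> ReLU layer -> scalar logit -> sigmoid.
W1 = rng.normal(0, 0.02, (HIDDEN_SIZE, PROBE_WIDTH))
b1 = np.zeros(PROBE_WIDTH)
W2 = rng.normal(0, 0.02, (PROBE_WIDTH, 1))
b2 = np.zeros(1)

def probe_correctness(hidden_state: np.ndarray) -> float:
    """Map one hidden-state vector to P(intermediate answer is correct)."""
    h = np.maximum(hidden_state @ W1 + b1, 0.0)   # ReLU
    logit = (h @ W2 + b2).item()
    return 1.0 / (1.0 + np.exp(-logit))           # sigmoid

# With untrained weights the output hovers near 0.5; training fits
# W1, b1, W2, b2 on (hidden state, correctness label) pairs.
p = probe_correctness(rng.normal(size=HIDDEN_SIZE))
```

Because the probe is so small relative to the base model, running it at each reasoning step adds negligible overhead.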

Using models like DeepSeek-R1-Distill and QwQ-32B, the team applied their method to datasets of mathematical and logical problems. They segmented the models’ step-by-step answers into smaller parts and analyzed the intermediate reasoning trace produced at each stage. The probe was trained to determine whether each segment had reached a correct conclusion. Surprisingly, the necessary signals were often already present and easily detectable in the model’s internal state. Experiments indicate that this simple probing is both highly accurate and well calibrated (i.e., the predicted confidence aligns with the actual likelihood of correctness).
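Segmenting a long chain-of-thought trace might be sketched as follows; the delimiters here (blank lines and cue words such as “Wait” and “Alternatively”) are illustrative guesses, not the authors’ exact rule:

```python
import re

# Hypothetical segmentation rule: split a reasoning trace at blank lines
# and at common "re-thinking" cue words. These markers are assumptions
# for illustration only.
MARKERS = re.compile(r"\n\n|(?=\bWait\b)|(?=\bAlternatively\b)")

def segment_trace(trace: str) -> list[str]:
    """Split a chain-of-thought trace into candidate reasoning chunks."""
    parts = [p.strip() for p in MARKERS.split(trace)]
    return [p for p in parts if p]

trace = ("First, 1 plus 1 is definitely 2.\n\n"
         "Wait, could there be another context? In binary it is 10.\n\n"
         "Alternatively, in Boolean algebra 1 + 1 = 1.")
chunks = segment_trace(trace)
# Each chunk becomes one probing point: the hidden state at its final
# token is fed to the probe to score the intermediate answer.
```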

Incorporating the probe’s predictions into the reasoning process enabled the model to stop once it was confident in its answer, cutting down on processing time. This led to a 24% reduction in token usage while maintaining over 88% accuracy. Compared with static exit strategies, the dynamic method also improved accuracy by up to 5% using the same or fewer tokens.
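The confidence-based early exit described above can be sketched as a simple loop; the `probe` callable, the 0.9 threshold, and the per-chunk scores are all hypothetical stand-ins, not values from the paper:

```python
def generate_with_early_exit(chunks, probe, threshold=0.9):
    """Consume reasoning chunks until the probe is confident, then stop.

    `chunks` is a list of (hidden_state, text) pairs; `probe` maps a
    hidden state to a confidence in [0, 1]. Returns the chunks actually
    used and the number of chunks saved by exiting early.
    """
    used = []
    for hidden_state, chunk in chunks:
        used.append(chunk)
        if probe(hidden_state) >= threshold:   # confident: emit answer now
            break
    return used, len(chunks) - len(used)

# Toy run: made-up confidence scores stand in for real hidden states,
# and an identity lambda stands in for a trained probe.
scores = [0.3, 0.95, 0.99]
chunks = [(s, f"step {i}") for i, s in enumerate(scores)]
used, saved = generate_with_early_exit(chunks, probe=lambda s: s)
# Stops after "step 1", saving the final chunk.
```

Because the exit point adapts per question, easy problems terminate early while hard ones keep the full token budget, which is where the gain over static (fixed-length) exit strategies comes from.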

“This suggests that models already have a sense of when they’re right, but they don’t use it on their own,” Zhao explained. “Our tool helps them recognize and act on that confidence to reason more efficiently.”