We've trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans.
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still produce logical mistakes, often called _hallucinations_. Mitigating hallucinations is a critical step towards building aligned AGI.
We can train reward models to detect hallucinations using either _outcome supervision_, which provides feedback based on a final result, or _process supervision_, which provides feedback for each individual step in a chain-of-thought. Building on previous work [1], we conduct a detailed comparison of these two methods using the MATH dataset [2] as our testbed. We find that process supervision leads to significantly better performance, even when judged by outcomes. To encourage related research, we release our full process supervision dataset.
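The difference between the two feedback schemes can be made concrete with a small sketch. This is an illustrative reconstruction, not the paper's actual data pipeline; the `Solution` container and the `step_is_valid` checker are hypothetical names standing in for the human labeling process.

```python
# Sketch of outcome vs. process supervision labels, under assumed data
# structures. Outcome supervision yields one label for the whole solution;
# process supervision yields one label per reasoning step.

from dataclasses import dataclass

@dataclass
class Solution:
    steps: list[str]       # chain-of-thought, one string per step
    final_answer: str

def outcome_labels(sol: Solution, correct_answer: str) -> list[int]:
    """One label for the whole solution: 1 iff the final answer matches."""
    return [1 if sol.final_answer == correct_answer else 0]

def process_labels(sol: Solution, step_is_valid) -> list[int]:
    """One label per step, from a (human or model) step-level checker."""
    return [1 if step_is_valid(step) else 0 for step in sol.steps]
```

A solution that reaches the right answer through a flawed step gets full credit under `outcome_labels` but is penalized at the offending step by `process_labels`, which is the alignment-relevant distinction.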
Process supervision has several alignment advantages over outcome supervision. It directly rewards the model for following an aligned chain-of-thought, since each step in the process receives precise supervision. Process supervision is also more likely to produce interpretable reasoning, since it encourages the model to follow a human-approved process. In contrast, outcome supervision may reward an unaligned process, and it is generally harder to scrutinize.
In some cases, safer methods for AI systems can lead to reduced performance [3], a cost known as an _alignment tax_. In general, any alignment tax may hinder the adoption of alignment methods, due to pressure to deploy the most capable model. Our results below show that process supervision in fact incurs a negative alignment tax, at least in the math domain. This could increase the adoption of process supervision, which we believe would have positive alignment side-effects.
## Solving MATH problems
We evaluate our process-supervised and outcome-supervised reward models using problems from the MATH test set. We generate many solutions for each problem and then pick the solution ranked highest by each reward model. The graph shows the percentage of chosen solutions that reach the correct final answer, as a function of the number of solutions considered. Not only does the process-supervised reward model perform better across the board, but the gap widens as we consider more solutions per problem, indicating that the process-supervised reward model is far more reliable.
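The evaluation procedure above is best-of-N reranking, which can be sketched in a few lines. This is a simplified stand-in, assuming a `reward_model` callable that scores a whole solution and a hypothetical `problem.check` correctness test; the actual reward models score individual steps and aggregate those scores.

```python
# Minimal best-of-N reranking sketch (assumed interfaces, not the
# paper's implementation).

def best_of_n(solutions, reward_model):
    """Return the candidate solution the reward model scores highest."""
    return max(solutions, key=reward_model)

def solve_rate(problems, sample, reward_model, n):
    """Fraction of problems whose top-ranked sample has a correct answer.

    `sample(problem)` draws one candidate solution; `problem.check(sol)`
    is a hypothetical final-answer correctness test."""
    correct = 0
    for problem in problems:
        candidates = [sample(problem) for _ in range(n)]
        if problem.check(best_of_n(candidates, reward_model)):
            correct += 1
    return correct / len(problems)
```

Sweeping `n` and plotting `solve_rate` for each reward model reproduces the shape of the comparison described above: a better reward model keeps picking correct solutions even as more (mostly wrong) candidates are added to the pool.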
We showcase 10 problems and solutions below, along with commentary about the reward model’s strengths and weaknesses.
Simplify $\tan 100^\circ + 4 \sin 100^\circ$.
This challenging trigonometry problem requires applying several identities in a not-at-all-obvious succession. Most solution attempts fail, because it is hard to choose which identities are actually helpful. Although GPT-4 usually can’t solve this problem (only 0.1% of solution attempts reach the correct answer), the reward model correctly recognizes that this solution is valid.
I notice that the angle 100 degrees is not a special angle, so I will need to use some trigonometric identities to simplify this expression.
I recall that $\tan \theta = \frac{\sin \theta}{\cos \theta}$ and $\sin 2\theta = 2 \sin \theta \cos \theta$.
I can use these identities to rewrite the expression as $\frac{\sin 100^\circ}{\cos 100^\circ} + 4 \sin 100^\circ$.
I can combine the terms by finding a common denominator of $\cos 100^\circ$.
I get $\frac{\sin 100^\circ + 4 \sin 100^\circ \cos 100^\circ}{\cos 100^\circ}$.
I can simplify the numerator by using the double angle identity again, but in reverse.
I get $\frac{\sin 100^\circ + 2 \sin 200^\circ}{\cos 100^\circ}$.
I notice that 200 degrees is the same as 180 degrees plus 20 degrees, so I can use the sum angle identity for sine.
I get $\frac{\sin 100^\circ + 2 (\sin 180^\circ \cos 20^\circ + \cos 180^\circ \sin 20^\circ)}{\cos 100^\circ}$.
I recall that $\sin 180^\circ = 0$ and $\cos 180^\circ = -1$, so I can simplify further.
I get $\frac{\sin 100^\circ - 2 \sin 20^\circ}{\cos 100^\circ}$.
I notice that 100 degrees is the same as 90 degrees plus 10 degrees, so I can use the sum angle identity for sine and cosine.
I get $\frac{(\sin 90^\circ \cos 10^\circ + \cos 90^\circ \sin 10^\circ) - 2 \sin 20^\circ}{\cos 90^\circ \cos 10^\circ - \sin 90^\circ \sin 10^\circ}$.
I recall that $\sin 90^\circ = 1$ and $\cos 90^\circ = 0$, so I can simplify further.
I get $\frac{\cos 10^\circ - 2 \sin 20^\circ}{-\sin 10^\circ}$.
I can simplify the sign by multiplying the numerator and denominator by $-1$.
I get $\frac{2 \sin 20^\circ - \cos 10^\circ}{\sin 10^\circ}$.
I notice that 20 degrees is the same as 30 degrees minus 10 degrees, so I can use the difference angle identity for sine.
I get $\frac{2 (\sin 30^\circ \cos 10^\circ - \cos 30^\circ \sin 10^\circ) - \cos 10^\circ}{\sin 10^\circ}$.
I recall that $\sin 30^\circ = \frac{1}{2}$ and $\cos 30^\circ = \frac{\sqrt{3}}{2}$, so I can simplify further.
I get $\frac{\cos 10^\circ - \sqrt{3} \sin 10^\circ - \cos 10^\circ}{\sin 10^\circ}$.
I can cancel out the $\cos 10^\circ$ terms in the numerator.
I get $\frac{-\sqrt{3} \sin 10^\circ}{\sin 10^\circ}$.
I can cancel out the $\sin 10^\circ$ terms in the numerator and denominator.
I get $-\sqrt{3}$.
Answer: $-\sqrt{3}$
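The closed form reached above can be sanity-checked numerically. This short check is our addition, not part of the model's solution:

```python
import math

# Numerically verify that tan(100°) + 4·sin(100°) = -√3.
deg = math.radians
value = math.tan(deg(100)) + 4 * math.sin(deg(100))
assert math.isclose(value, -math.sqrt(3), rel_tol=1e-9)
```

The agreement to nine decimal places confirms the chain of identity applications, even though each intermediate step must still be checked symbolically to validate the reasoning itself.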
It is unknown how broadly these results will generalize beyond the domain of math, and we consider it important for future work to explore the impact of process supervision in other domains. If these results generalize, we may find that process supervision gives us the best of both worlds – a method that is both more performant and more aligned than outcome supervision.
1. Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L., Creswell, A., Irving, G. and Higgins, I., 2022. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.
2. Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D. and Steinhardt, J., 2021. Measuring Mathematical Problem Solving With the MATH Dataset. arXiv preprint arXiv:2103.03874.
3. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J. and Lowe, R., 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
Karl Cobbe, Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Jan Leike, Ilya Sutskever
Bowen Baker, Teddy Lee, John Schulman, Greg Brockman, Kendra Rimbach, Hannah Wong, Thomas Degry