Evaluations are the foundation of developing reliable AI systems because they offer a standardized approach to verifying model performance. Without rigorous evaluation, we risk deploying systems of unprecedented complexity while remaining misinformed about their actual capabilities. This is not so much about performance as it is about the fundamental imperative of knowing the scope of what we deploy.
In this article, we will discuss RewardBench 2, a new benchmark for evaluating reward models that distinguishes itself by incorporating unseen prompts from actual users rather than recycling prompts from existing evaluations.
Reward models are trained on preference data, where prompts (x) and completions (y) are ranked by humans or automated metrics. The process involves comparing two completions for each prompt, labeling one as “chosen” and the other as “rejected.” A reward model is then trained to predict the likelihood of a prompt and completion being chosen, utilizing a Bradley-Terry model of human preferences.
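Under the Bradley-Terry model, the probability that the chosen completion is preferred over the rejected one is expressed through the difference in their rewards (the notation below follows standard RLHF convention rather than being copied verbatim from the paper):

$$
P(y_{\text{chosen}} \succ y_{\text{rejected}} \mid x) = \sigma\big(r_\theta(x, y_{\text{chosen}}) - r_\theta(x, y_{\text{rejected}})\big)
$$

where $r_\theta$ is the scalar reward assigned by the model with parameters $\theta$ and $\sigma$ is the logistic sigmoid.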
Maximum Likelihood Estimation (MLE) finds the optimal parameters θ that best explain the observed preference data.
The likelihood function L(θ, D) represents how well the model with parameters θ explains the preference data D.
D represents the preference data distribution from which the “chosen” and “rejected” completion pairs, along with their corresponding prompts, are sampled.
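As a concrete illustration, here is a minimal PyTorch-style sketch of the resulting pairwise loss, the negative log-likelihood under the Bradley-Terry model; the function and variable names are ours, not the paper’s:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of the Bradley-Terry model:
    # -log sigmoid(r_theta(x, y_chosen) - r_theta(x, y_rejected)), averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy scalar rewards for a batch of three preference pairs (illustrative values only).
chosen_rewards = torch.tensor([1.2, 0.3, 2.1])
rejected_rewards = torch.tensor([0.4, 0.5, 1.0])
print(bradley_terry_loss(chosen_rewards, rejected_rewards).item())
```

Minimizing this loss is equivalent to maximizing the likelihood L(θ, D) over the sampled preference pairs.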
Reward models (RMs) have a number of use cases, including but not limited to:
- RLHF (Reinforcement Learning from Human Feedback): RLHF is how AI systems incorporate human preferences and values. The process typically involves three stages: first, a model is pre-trained on large datasets (pre-training); second, human evaluators rank or rate the model’s outputs to create a preference dataset; and third, the model is fine-tuned using reinforcement learning to maximize alignment with human preferences. This approach helps AI systems produce outputs shaped by human judgment rather than just optimizing for traditional metrics.
- Inference-time scaling: Also known as test-time compute, this refers to allocating more computational resources during inference. Here, the model is “thinking harder” by exploring multiple solution trajectories, with a reward model guiding the selection of the best candidate solution (a minimal sketch follows this list). Inference-time scaling allows for improved reasoning and accuracy without modifying the model’s pre-trained weights.
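The sketch below shows best-of-N selection with a reward model; `generate` and `score` are placeholder callables standing in for a policy model and a reward model, not a specific library API:

```python
def best_of_n(prompt: str, generate, score, n: int = 8) -> str:
    """Sample n candidate completions and return the one the reward model ranks highest.

    generate(prompt) -> str            : draws one completion from the policy model
    score(prompt, completion) -> float : scalar reward from the reward model
    """
    candidates = [generate(prompt) for _ in range(n)]
    rewards = [score(prompt, candidate) for candidate in candidates]
    best_index = max(range(n), key=lambda i: rewards[i])
    return candidates[best_index]
```

Larger values of N spend more compute at inference time, and the quality of the selected answer depends directly on how well the reward model ranks the candidates.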
About 70% of prompts are sourced from WildChat, a corpus of 1 million user-ChatGPT conversations consisting of over 2.5 million interaction turns. From this collection, a combination of QuRater for data annotation, a topic classifier for identifying the prompt domain, and manual inspection was used to filter prompts and assign them to domain-specific subsets.
There are six domains in the benchmark: Factuality, Precise Instruction Following, Math, Safety, Focus, and Ties. The Math, Safety, and Focus domains build upon the original RewardBench’s Math, Safety, and Chat-Hard sections, while Factuality, Precise Instruction Following, and Ties are introduced in RewardBench 2.
| Domain (Count) | Description | Prompt Source | Method of Generating Completions | Scoring |
|---|---|---|---|---|
| Factuality (475) | Tests RM’s ability to detect hallucinations | Human (in-the-wild chat interactions) | Both Natural and System Prompt Variation | Majority voting, LLM-as-a-judge (two LLMs must agree to assign a label) |
| Precise Instruction Following (160) | Tests RM’s ability to follow instructions like “Answer without the letter u” | Human (in-the-wild chat interactions) | Natural | Verifier functions to evaluate adherence to the constraint |
| Math (183) | Tests RM’s math ability | Human (in-the-wild chat interactions) | Natural | Majority voting, LLM-as-a-judge, and manual verification |
| Safety (450) | Tests RM’s ability to correctly determine which requests should be complied with and which should be refused | CoCoNot | Both Natural and System Prompt Variation | Subset-specific rubrics for judging compliance with GPT-4o; manual verification for half of the examples |
| Focus (495) | Tests RM’s ability to detect high-quality, on-topic answers | Human (in-the-wild chat interactions) | System Prompt Variation | N/A (implied by the method of generation, which differentiates chosen and rejected completions) |
| Ties (102) | Tests RM’s ability to avoid expressing overly strong or arbitrary preferences among equivalent correct answers, while still clearly preferring any correct answer over any incorrect one | Manual (researchers) | System Prompt Variation | Weighted score of accuracy (all valid correct answers scored higher than all incorrect answers) and whether the reward margin between correct and incorrect answers exceeds that between the highest- and lowest-scored correct responses |
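One way to read the Ties scoring criterion above is sketched below; the equal weighting and the exact margin rule are our assumptions, so treat this as an interpretation rather than the benchmark’s actual implementation:

```python
def ties_score(correct_scores, incorrect_scores, weight=0.5):
    # Accuracy criterion: every valid correct completion must outscore every incorrect one.
    accuracy = float(min(correct_scores) > max(incorrect_scores))
    # Margin criterion: the gap separating correct from incorrect completions should exceed
    # the spread among the equally valid correct completions, so the RM is not expressing
    # overly strong preferences among answers that are all correct.
    separation = min(correct_scores) - max(incorrect_scores)
    spread = max(correct_scores) - min(correct_scores)
    margin_ok = float(separation > spread)
    # The 0.5 weight is a placeholder; the paper defines the actual weighting.
    return weight * accuracy + (1 - weight) * margin_ok
```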
The "Natural” method for generating completions means that the completions are generated without any specific system prompts designed to induce errors or variations. Instead, they are uninfluenced outputs from the language models. This contrasts with “System Prompt Variation,” where models are explicitly instructed to create subtle factual errors or off-topic responses for the “rejected” completions.
The scoring process works in two stages: first, accuracy is computed within each domain (for most domains, the reward model must score the single chosen completion above the three rejected completions for a prompt to count as correct, while Ties uses the weighted scheme described above); second, the six domain accuracies are averaged into the final score. This means each domain contributes equally to the final score regardless of how many individual items or tasks it contains, since the averaging is unweighted across the six domains.
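A minimal sketch of that aggregation, assuming per-domain accuracies have already been computed (the numbers below are illustrative, not real results):

```python
# Illustrative per-domain accuracies (fractions of correctly ranked prompts).
domain_accuracy = {
    "Factuality": 0.71,
    "Precise Instruction Following": 0.42,
    "Math": 0.65,
    "Safety": 0.88,
    "Focus": 0.77,
    "Ties": 0.59,
}

# Unweighted average across the six domains: each domain counts equally,
# regardless of how many prompts it contains.
overall_score = sum(domain_accuracy.values()) / len(domain_accuracy)
print(f"RewardBench 2 score: {overall_score:.3f}")
```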
Appendix E of the paper goes into greater detail on dataset creation.
Here is the RewardBench-2 dataset. We encourage you to take a look and pay special attention to the chosen vs. rejected responses. Note that in the rejected column, incorrect responses are separated by commas. For every category except Ties, there are 3 rejected responses and 1 correct response. In the Ties category, the number of rejected responses varies.
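If you want to explore the data programmatically, the Hugging Face `datasets` library works well; note that the dataset ID, split name, and exact column layout below are assumptions, so check the dataset card for the real identifiers:

```python
from datasets import load_dataset

# Dataset ID and split are assumed here; verify them on the RewardBench 2 dataset card.
ds = load_dataset("allenai/reward-bench-2", split="test")

example = ds[0]
print(example["prompt"])    # the user prompt
print(example["chosen"])    # the chosen (correct) completion
print(example["rejected"])  # the rejected completions (three for most domains)
print(example["models"])    # which models generated the completions
```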
You may also notice that a number of different models are being used in the models column. The researchers found that RMs tend to prefer completions generated by their own base model and therefore used a pool of models for the different domains.
A comparison table in the paper positions RewardBench 2 as an advancement in RM evaluation. It shows that RewardBench 2 uniquely incorporates “Best-of-N (N > 2)” evaluations, uses “Human Prompts,” and, critically, uses “Unseen Prompts.” The use of unseen human prompts is a significant departure from most prior work, which often repurposes prompts from existing downstream evaluations, and helps to avoid potential contamination with respect to downstream evaluation targets.
The researchers also trained their own reward models so that they could analyze the performance of a wider variety of models. These models are available here.
The scaling of massive neural networks has taught us that capabilities emerge in unpredictable ways, meaning that theory and “vibes” alone are insufficient. Here, evaluation benchmarks help verify that reward models successfully bridge the gap between objective metrics and complex human preferences, building the trust and reliability necessary as AI systems become more autonomous and influential in real-world applications. The community has already started using RewardBench 2, as seen in the evaluation of the eight recently released Skywork-Reward-V2 reward models.
Check out RewardBench 2’s Datasets, Spaces, and Models on Hugging Face!
- RewardBench 2 paper and model card (sources of all images)
- Illustrating Reinforcement Learning from Human Feedback (RLHF) from Hugging Face
- A short introduction to RLHF and post-training focused on language models from Nathan Lambert