
Evaluating Reward Models with RewardBench 2

Published on July 9, 2025

The Importance of Evaluations

Evaluations are the foundation of developing reliable AI systems because they offer a standardized way to verify model performance. Without rigorous evaluation, we risk deploying systems of unprecedented complexity while remaining uninformed about their actual capabilities. This is not merely about measuring performance; it is about the fundamental imperative of knowing the scope of what we deploy.

In this article, we discuss RewardBench 2, a new benchmark for evaluating reward models that distinguishes itself by incorporating unseen prompts from actual users rather than recycling prompts from existing evaluations.

Primer on Reward Models

Reward models are trained on preference data, where prompts (x) and completions (y) are ranked by humans or automated metrics. The process involves comparing two completions for each prompt, labeling one as “chosen” and the other as “rejected.” A reward model is then trained to predict the likelihood of a completion being chosen for a given prompt, using a Bradley-Terry model of human preferences. Maximum Likelihood Estimation (MLE) finds the optimal parameters θ that best explain the observed preference data. The likelihood function L(θ, D) represents how well the model with parameters θ explains D, where D is the preference data distribution from which the “chosen” and “rejected” completion pairs, along with their corresponding prompts, are sampled.
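For readers who want the formulas, here is one standard way to write this objective. It is a sketch, not the paper's exact notation: r_θ(x, y) denotes the reward model's scalar score for a prompt-completion pair, and σ is the logistic sigmoid.

```latex
% Bradley-Terry: probability that the chosen completion y_c is preferred
% to the rejected completion y_r for a prompt x
\[
P(y_c \succ y_r \mid x)
  = \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)
  = \frac{e^{r_\theta(x, y_c)}}{e^{r_\theta(x, y_c)} + e^{r_\theta(x, y_r)}}
\]

% Log-likelihood of the preference data D; MLE selects \theta to maximize it
\[
\log L(\theta, D)
  = \sum_{(x,\, y_c,\, y_r) \in D} \log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big),
\qquad
\theta^{*} = \arg\max_\theta \, \log L(\theta, D)
\]
```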

Reward models (RMs) have a number of use cases, including but not limited to:

RLHF (Reinforcement Learning from Human Feedback) is how AI systems incorporate human preferences and values. The process typically involves three stages: first, a model is pre-trained on large datasets; second, human evaluators rank or rate the model’s outputs to create a preference dataset; and third, the model is fine-tuned with reinforcement learning to maximize alignment with human preferences. This approach helps AI systems produce outputs shaped by human judgment rather than just optimizing for traditional metrics.

Inference-time scaling, also known as test-time compute, refers to the allocation of more computational resources during inference. Here, the model is “thinking harder” by exploring multiple solution trajectories, with a reward model guiding the selection of the best candidate solution. Inference-time scaling allows for improved reasoning and accuracy without modification of the model’s pre-trained weights.
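As a rough sketch of that selection step, here is a minimal best-of-N loop in Python. The generator and reward-model functions below are hypothetical placeholders (stubbed out so the example runs), not a real API.

```python
import random

# Hypothetical stand-ins: in practice these would call an LLM and a trained
# reward model, respectively. They exist only so the sketch runs end to end.
def generate_candidates(prompt: str, n: int) -> list[str]:
    return [f"candidate {i} for: {prompt}" for i in range(n)]

def reward_model_score(prompt: str, completion: str) -> float:
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n completions and return the one the reward model scores highest."""
    candidates = generate_candidates(prompt, n)
    scored = [(reward_model_score(prompt, c), c) for c in candidates]
    return max(scored)[1]

print(best_of_n("What is 17 * 24?"))
```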

RewardBench 2 Composition

About 70% of the prompts are sourced from WildChat, a corpus of 1 million user-ChatGPT conversations comprising over 2.5 million interaction turns. From this collection, the researchers used a combination of QuRater for data annotation, a topic classifier to identify each prompt’s domain, and manual inspection to filter prompts and assign them to domain-specific subsets.

RewardBench 2 Domains

There are 6 domains in the benchmark: Factuality, Precise Instruction Following, Math, Safety, Focus, and Ties. The Math, Safety, and Focus domains build upon the original RewardBench’s Math, Safety, and Chat-Hard sections. The Factuality, Precise Instruction Following, and Ties domains are new in RewardBench 2.

| Domain (Count) | Description | Prompt Source | Method of Generating Completions | Scoring |
| --- | --- | --- | --- | --- |
| Factuality (475) | Tests the RM’s ability to detect hallucinations | Human (in-the-wild chat interactions) | Both Natural and System Prompt Variation | Majority voting, LLM-as-a-judge (two LLMs must agree to assign a label) |
| Precise Instruction Following (160) | Tests the RM’s ability to follow instructions like “Answer without the letter u” | Human (in-the-wild chat interactions) | Natural | Verifier functions that evaluate adherence to the constraint (see the sketch after this table) |
| Math (183) | Tests the RM’s math ability | Human (in-the-wild chat interactions) | Natural | Majority voting, LM-as-a-judge & manual verification |
| Safety (450) | Tests the RM’s ability to correctly distinguish requests that should be complied with from those that should be refused | CoCoNot | Both Natural and System Prompt Variation | Subset-specific rubrics for judging compliance with GPT-4o, manual verification for half of the examples |
| Focus (495) | Tests the RM’s ability to detect high-quality, on-topic answers | Human (in-the-wild chat interactions) | System Prompt Variation | N/A (implied by the method of generation, which differentiates chosen and rejected completions) |
| Ties (102) | Tests the model’s ability to avoid expressing overly strong or arbitrary preferences among equivalent correct answers, while still clearly preferring any correct answer over any incorrect one | Manual (researchers) | System Prompt Variation | Weighted score of accuracy (all valid correct answers scored higher than all incorrect answers) and whether the reward margin between correct and incorrect answers exceeds that between the highest- and lowest-scored correct responses |
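To make the Precise Instruction Following “verifier functions” concrete, here is a minimal sketch for the example constraint above. The function name and check are illustrative, not the paper’s actual implementation.

```python
def answers_without_letter(completion: str, forbidden: str = "u") -> bool:
    """Illustrative verifier for the constraint 'Answer without the letter u'."""
    return forbidden not in completion.lower()

# A completion that satisfies the constraint can be labeled "chosen";
# completions that violate it are labeled "rejected".
print(answers_without_letter("The sky is clear tonight."))    # True
print(answers_without_letter("An umbrella would be useful."))  # False
```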

Method of Generating Completions

The "Natural” method for generating completions means that the completions are generated without any specific system prompts designed to induce errors or variations. Instead, they are uninfluenced outputs from the language models. This contrasts with “System Prompt Variation,” where models are explicitly instructed to create subtle factual errors or off-topic responses for the “rejected” completions.

Scoring

The scoring process works in two stages:

  1. Domain-level measurement: Accuracy scores are first calculated separately for each individual domain
  2. Final score calculation: The overall final score is computed as an unweighted average of the accuracy scores across all six domains

This means each domain contributes equally to the final score regardless of how many individual items or tasks it contains, since the averaging is unweighted across the six domains.
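As a sketch of the arithmetic (the per-domain accuracies below are made up, not actual results):

```python
# Unweighted average across the six domains -- each domain counts equally,
# regardless of how many prompts it contains. Accuracies here are illustrative.
domain_accuracy = {
    "Factuality": 0.72,
    "Precise Instruction Following": 0.41,
    "Math": 0.65,
    "Safety": 0.88,
    "Focus": 0.79,
    "Ties": 0.55,
}

final_score = sum(domain_accuracy.values()) / len(domain_accuracy)
print(f"RewardBench 2 score: {final_score:.3f}")
```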

Appendix E of the paper goes into greater detail on dataset creation.

Here is the RewardBench-2 dataset. We encourage you to take a look and pay special attention to the chosen vs. rejected responses. Note that in the rejected column, incorrect responses are separated by commas. For every category except Ties, there are 3 rejected responses and 1 correct response. In the Ties category, the number of rejected responses varies.

You may also notice that a number of different models appear in the models column. The researchers found that RMs tend to prefer completions generated by their own base model, and therefore used a pool of models across the different domains.
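If you want to poke at the data programmatically, a minimal sketch with the Hugging Face `datasets` library might look like the following. The dataset ID, split name, and exact column names are assumptions based on the article’s description of the dataset page; check the dataset card before relying on them.

```python
# Minimal sketch for inspecting the RewardBench 2 dataset.
# Dataset ID, split, and column names are assumptions -- verify on the dataset card.
from datasets import load_dataset

ds = load_dataset("allenai/reward-bench-2", split="test")

example = ds[0]
print(example["chosen"])    # correct response(s)
print(example["rejected"])  # rejected responses (3 per prompt outside of Ties)
print(example["models"])    # models that generated the completions
```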

RewardBench 2 Is Not Like Other Reward Model Benchmarks

A comparison table in the paper positions RewardBench 2 as an advancement in RM evaluation. It shows that RewardBench 2 uniquely incorporates “Best-of-N (N > 2)” evaluations, uses “Human Prompts,” and, critically, uses “Unseen Prompts.” The use of unseen human prompts is a significant departure from most prior work, which often repurposes prompts from existing downstream evaluations, and it helps avoid potential contamination with respect to downstream evaluation targets.

Training More Reward Models for Evaluation Purposes

The researchers also trained their own reward models so that they could analyze the performance of a larger variety of models. These models are available here.

Final Thoughts

The scaling of massive neural networks has taught us that capabilities emerge in unpredictable ways, meaning that theory and “vibes” alone are insufficient. Here, evaluation benchmarks help verify that reward models successfully bridge the gap between objective metrics and complex human preferences, building the trust and reliability necessary as AI systems become more autonomous and influential in real-world applications. The community has already started using RewardBench 2, as seen in the evaluation of the eight recently released Skywork-Reward-V2 reward models.

Check out RewardBench 2’s Datasets, Spaces, and Models on Hugging Face!

References and Additional Resources

  - RewardBench 2 paper and model card (sources of all images)
  - Illustrating Reinforcement Learning from Human Feedback (RLHF), from Hugging Face
  - A short introduction to RLHF and post-training focused on language models, from Nathan Lambert


About the author

Melani Maheswaran

Melani is a Technical Writer at DigitalOcean based in Toronto. She has experience in teaching, data quality, consulting, and writing. Melani graduated with a BSc and Master’s from Queen's University.
