
Evaluating Reward Models with RewardBench 2

Published on July 9, 2025

The Importance of Evaluations

Evaluations are the foundation of developing reliable AI systems because they offer a standardized way to verify model performance. Without rigorous evaluation, we risk deploying systems of unprecedented complexity while remaining uninformed about their actual capabilities. This is not merely about measuring performance; it is about the fundamental imperative of knowing the scope of what we deploy.

In this article, we discuss RewardBench 2, a new benchmark for evaluating reward models that distinguishes itself by incorporating unseen prompts from actual users rather than recycling prompts from existing evaluations.

Primer on Reward Models

Reward models are trained on preference data, where prompts (x) and completions (y) are ranked by humans or automated metrics. The process involves comparing two completions for each prompt, labeling one as “chosen” and the other as “rejected.” A reward model is then trained to predict the likelihood of a completion being chosen for a given prompt, using a Bradley-Terry model of human preferences. Maximum Likelihood Estimation (MLE) finds the optimal parameters θ that best explain the observed preference data. The likelihood function L(θ, D) represents how well the model with parameters θ explains D, where D is the preference data distribution from which the “chosen” and “rejected” completion pairs, along with their corresponding prompts, are sampled.
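For readers who want the formulas, here is one standard way to write this objective. It is a sketch, not the paper's exact notation: r_θ(x, y) denotes the reward model's scalar score for a prompt-completion pair, and σ is the logistic sigmoid.

```latex
% Bradley-Terry: probability that the chosen completion y_c is preferred
% to the rejected completion y_r for a prompt x
\[
P(y_c \succ y_r \mid x)
  = \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)
  = \frac{e^{r_\theta(x, y_c)}}{e^{r_\theta(x, y_c)} + e^{r_\theta(x, y_r)}}
\]

% Log-likelihood of the preference data D; MLE selects \theta to maximize it
\[
\log L(\theta, D)
  = \sum_{(x,\, y_c,\, y_r) \in D} \log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big),
\qquad
\theta^{*} = \arg\max_\theta \, \log L(\theta, D)
\]
```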

Reward models (RMs) have a number of use cases, including but not limited to:

RLHF (Reinforcement Learning from Human Feedback) is how AI systems incorporate human preferences and values. The process typically involves three stages: first, a model is pre-trained on large datasets; second, human evaluators rank or rate the model’s outputs to create a preference dataset; and third, the model is fine-tuned with reinforcement learning to maximize alignment with human preferences. This approach helps AI systems produce outputs shaped by human judgment rather than just optimizing for traditional metrics.

Inference-time scaling, also known as test-time compute, refers to the allocation of more computational resources during inference. Here, the model is “thinking harder” by exploring multiple solution trajectories, with a reward model guiding the selection of the best candidate solution. Inference-time scaling allows for improved reasoning and accuracy without modification of the model’s pre-trained weights.
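As a rough sketch of that selection step, here is a minimal best-of-N loop in Python. The generator and reward-model functions below are hypothetical placeholders (stubbed out so the example runs), not a real API.

```python
import random

# Hypothetical stand-ins: in practice these would call an LLM and a trained
# reward model, respectively. They exist only so the sketch runs end to end.
def generate_candidates(prompt: str, n: int) -> list[str]:
    return [f"candidate {i} for: {prompt}" for i in range(n)]

def reward_model_score(prompt: str, completion: str) -> float:
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n completions and return the one the reward model scores highest."""
    candidates = generate_candidates(prompt, n)
    scored = [(reward_model_score(prompt, c), c) for c in candidates]
    return max(scored)[1]

print(best_of_n("What is 17 * 24?"))
```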

RewardBench 2 Composition

About 70% of the prompts are sourced from WildChat, a corpus of 1 million user-ChatGPT conversations comprising over 2.5 million interaction turns. From this collection, the researchers used a combination of QuRater for data annotation, a topic classifier to identify each prompt’s domain, and manual inspection to filter prompts and assign them to domain-specific subsets.

RewardBench 2 Domains

There are 6 domains in the benchmark: Factuality, Precise Instruction Following, Math, Safety, Focus, and Ties. The Math, Safety, and Focus domains build upon the original RewardBench’s Math, Safety, and Chat-Hard sections. The Factuality, Precise Instruction Following, and Ties domains are new in RewardBench 2.

| Domain (Count) | Description | Prompt Source | Method of Generating Completions | Scoring |
| --- | --- | --- | --- | --- |
| Factuality (475) | Tests the RM’s ability to detect hallucinations | Human (in-the-wild chat interactions) | Both Natural and System Prompt Variation | Majority voting, LLM-as-a-judge (two LLMs must agree to assign a label) |
| Precise Instruction Following (160) | Tests the RM’s ability to follow instructions like “Answer without the letter u” | Human (in-the-wild chat interactions) | Natural | Verifier functions that evaluate adherence to the constraint (see the sketch after this table) |
| Math (183) | Tests the RM’s math ability | Human (in-the-wild chat interactions) | Natural | Majority voting, LM-as-a-judge & manual verification |
| Safety (450) | Tests the RM’s ability to correctly distinguish requests that should be complied with from those that should be refused | CoCoNot | Both Natural and System Prompt Variation | Subset-specific rubrics for judging compliance with GPT-4o, manual verification for half of the examples |
| Focus (495) | Tests the RM’s ability to detect high-quality, on-topic answers | Human (in-the-wild chat interactions) | System Prompt Variation | N/A (implied by the method of generation, which differentiates chosen and rejected completions) |
| Ties (102) | Tests the model’s ability to avoid expressing overly strong or arbitrary preferences among equivalent correct answers, while still clearly preferring any correct answer over any incorrect one | Manual (researchers) | System Prompt Variation | Weighted score of accuracy (all valid correct answers scored higher than all incorrect answers) and whether the reward margin between correct and incorrect answers exceeds that between the highest- and lowest-scored correct responses |
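To make the Precise Instruction Following “verifier functions” concrete, here is a minimal sketch for the example constraint above. The function name and check are illustrative, not the paper’s actual implementation.

```python
def answers_without_letter(completion: str, forbidden: str = "u") -> bool:
    """Illustrative verifier for the constraint 'Answer without the letter u'."""
    return forbidden not in completion.lower()

# A completion that satisfies the constraint can be labeled "chosen";
# completions that violate it are labeled "rejected".
print(answers_without_letter("The sky is clear tonight."))    # True
print(answers_without_letter("An umbrella would be useful."))  # False
```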

Method of Generating Completions

The "Natural” method for generating completions means that the completions are generated without any specific system prompts designed to induce errors or variations. Instead, they are uninfluenced outputs from the language models. This contrasts with “System Prompt Variation,” where models are explicitly instructed to create subtle factual errors or off-topic responses for the “rejected” completions.

Scoring

The scoring process works in two stages:

  1. Domain-level measurement: Accuracy scores are first calculated separately for each individual domain
  2. Final score calculation: The overall final score is computed as an unweighted average of the accuracy scores across all six domains

This means each domain contributes equally to the final score regardless of how many individual items or tasks it contains, since the averaging is unweighted across the six domains.
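As a sketch of the arithmetic (the per-domain accuracies below are made up, not actual results):

```python
# Unweighted average across the six domains -- each domain counts equally,
# regardless of how many prompts it contains. Accuracies here are illustrative.
domain_accuracy = {
    "Factuality": 0.72,
    "Precise Instruction Following": 0.41,
    "Math": 0.65,
    "Safety": 0.88,
    "Focus": 0.79,
    "Ties": 0.55,
}

final_score = sum(domain_accuracy.values()) / len(domain_accuracy)
print(f"RewardBench 2 score: {final_score:.3f}")
```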

Appendix E of the paper goes into greater detail on dataset creation.

Here is the RewardBench-2 dataset. We encourage you to take a look and pay special attention to the chosen vs. rejected responses. Note that in the rejected column, incorrect responses are separated by commas. For every category except Ties, there are 3 rejected responses and 1 correct response. In the Ties category, the number of rejected responses varies.

You may also notice that a number of different models appear in the models column. The researchers found that RMs tend to prefer completions generated by their own base model, and therefore used a pool of models across the different domains.
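If you want to poke at the data programmatically, a minimal sketch with the Hugging Face `datasets` library might look like the following. The dataset ID, split name, and exact column names are assumptions based on the article’s description of the dataset page; check the dataset card before relying on them.

```python
# Minimal sketch for inspecting the RewardBench 2 dataset.
# Dataset ID, split, and column names are assumptions -- verify on the dataset card.
from datasets import load_dataset

ds = load_dataset("allenai/reward-bench-2", split="test")

example = ds[0]
print(example["chosen"])    # correct response(s)
print(example["rejected"])  # rejected responses (3 per prompt outside of Ties)
print(example["models"])    # models that generated the completions
```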

RewardBench 2 Is Not Like Other Reward Model Benchmarks

A comparison table in the paper positions RewardBench 2 as an advancement in RM evaluation. It shows that RewardBench 2 uniquely incorporates “Best-of-N (N > 2)” evaluations, uses “Human Prompts,” and, critically, uses “Unseen Prompts.” The use of unseen human prompts is a significant departure from most prior work, which often repurposes prompts from existing downstream evaluations, and it helps avoid potential contamination with respect to downstream evaluation targets.

Training More Reward Models for Evaluation Purposes

The researchers also trained their own reward models so that they could analyze the performance of a larger variety of models. These models are available here.

Final Thoughts

The scaling of massive neural networks has taught us that capabilities emerge in unpredictable ways, meaning that theory and “vibes” alone are insufficient. Here, evaluation benchmarks help verify that reward models successfully bridge the gap between objective metrics and complex human preferences, building the trust and reliability necessary as AI systems become more autonomous and influential in real-world applications. The community has already started using RewardBench 2, as seen in the evaluation of the eight recently released Skywork-Reward-V2 reward models.

Check out RewardBench 2’s Datasets, Spaces, and Models on Hugging Face!

References and Additional Resources

  - RewardBench 2 paper and model card (sources of all images)
  - Illustrating Reinforcement Learning from Human Feedback (RLHF), from Hugging Face
  - A short introduction to RLHF and post-training focused on language models, from Nathan Lambert


About the author

Melani Maheswaran

Melani is a Technical Writer at DigitalOcean based in Toronto. She has experience in teaching, data quality, consulting, and writing. Melani graduated with a BSc and Master’s from Queen's University.
