Reinforcement Learning Environments

Published on January 14, 2026

Introduction

An active area of interest for AI researchers and engineers is integrating LLMs into end-to-end autonomous systems built on multi-agent architectures. While LLMs are impressive in their own right, to truly derive value from them, we’re seeing the industry turn to Reinforcement Learning (RL) environments.

RL environments aren’t new; they predate LLMs. In fact, you really can’t talk about agents without talking about environments. Generally, in an RL context, an environment provides a reward or penalty for an action an agent takes in that environment. The agent is forced to adapt to maximize cumulative reward. This adaptation to maximize reward is the crux of reinforcement learning.
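
To make this reward-maximization loop concrete, here is a toy sketch: a two-armed bandit with a hidden payoff structure. The running value estimates and epsilon-greedy choice stand in for a real policy and learning algorithm; all names are illustrative and not tied to any framework.

import random

action_values = {"left": 0.0, "right": 0.0}   # the agent's reward estimates
counts = {"left": 0, "right": 0}

def environment(action):
    # Hidden reward structure: "right" pays off 80% of the time, "left" never does.
    return 1.0 if (action == "right" and random.random() < 0.8) else 0.0

total_reward = 0.0
for step in range(1000):
    # Epsilon-greedy: mostly exploit the best-looking action, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(list(action_values))
    else:
        action = max(action_values, key=action_values.get)
    reward = environment(action)
    total_reward += reward
    # Update the running average reward estimate for the chosen action.
    counts[action] += 1
    action_values[action] += (reward - action_values[action]) / counts[action]

print(action_values, total_reward)

After a few hundred steps the agent’s estimate for "right" dominates, so it keeps choosing it; that adaptation toward higher cumulative reward is the behaviour RL environments are built to induce.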

Now, with the rise of LLMs, the agent is often the model itself. The model’s weights are updated based on the scoring of its attempts at different tasks, allowing the model to adapt. Computer use, that is, an AI system that can navigate your computer, is a particularly interesting multi-agent task. We explored this topic in a previous article on Microsoft’s work on scaling computer-use data with multi-agent pipelines for their model, Fara-7B.

Key Takeaways

  • The industry is turning to Reinforcement Learning (RL) environments to derive true value from Large Language Models (LLMs).
  • An RL environment provides a reward or penalty for an agent’s action, forcing the agent (often an LLM/model) to adapt its weights to maximize cumulative reward.
  • Unlike subjective rewards in RLHF, RLVR utilizes objective and verifiable rewards (e.g., in math and code tasks) that are “non-gameable,” ensuring the model develops the desired reasoning.
  • Companies are using RL environments, often called “harnesses” or “UI gyms,” to train models specifically for use within their own software products (e.g., Cursor’s Composer, OpenAI’s Codex).

Reinforcement Learning from Verifiable Rewards

The recent increase in discourse and interest around RL environments can perhaps be traced back to the success of RLVR (Reinforcement Learning from Verifiable Rewards), which applies to tasks whose outcomes can be verified, such as math and code. We like the way Andrej Karpathy’s blog post, 2025 LLM Year in Review, puts RLVR in the context of the progress made last year. Karpathy explains what makes RLVR effective: rewards that are verifiable are non-gameable. A non-gameable reward function is tied directly to the successful, verifiable outcome of a task (like solving a puzzle or passing a test case), making it hard for the LLM to achieve a high reward without actually developing the desired reasoning and problem-solving strategies; in other words, it resists reward hacking.
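
As a concrete illustration, a verifiable reward for a math task can be as simple as an exact-match check against the known answer. This is a minimal sketch, assuming GSM8K-style “#### <answer>” formatting in the model output; it is not any particular library’s reward function.

import re

def math_reward(model_output: str, ground_truth: str) -> float:
    # Extract the final answer written after "####" and compare it to the ground truth.
    match = re.search(r"####\s*(-?[\d,\.]+)", model_output)
    if match is None:
        return 0.0
    answer = match.group(1).replace(",", "")
    return 1.0 if answer == ground_truth else 0.0

print(math_reward("Step 1: ... #### 42", "42"))          # 1.0
print(math_reward("I think the answer is 41", "42"))      # 0.0

Because the reward only fires on a verified-correct answer, the model cannot collect it by producing plausible-sounding but wrong reasoning.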

RL Environments for Products

Will Brown explains how models are trained for a specific product, giving the examples of Cursor’s Composer and OpenAI’s Codex, where the model is trained in a harness, essentially an RL environment that represents the product. In the same vein, companies are beginning to build environments around their software, which we’re hearing called UI gyms.

Creating an RL Environment

There are a multitude of ways of going about creating RL environments. First consider your goal. What do you want your model to achieve? Then you want to choose a framework. Depending on the framework, the environment will be defined in different ways.

Potential frameworks include Prime Intellect’s Environments Hub, SkyRL (which has reusable tools), PyTorch’s OpenEnv, and Gymnasium (the maintained fork of OpenAI’s Gym). Thinking Machines also has documentation and a cookbook on RL environments.

Regardless of the framework, you’ll typically need to think about and specify the key components of your RL environment:

  • State space: The information the agent observes. This could be pixels from a game screen, numerical sensor readings, screenshots, or representations of the world.
  • Action space: All potential actions the agent can take. Actions might be discrete (like button presses) or continuous (like motor controls).
  • Reward function: The reward function shapes what behaviours emerge. Sparse rewards (only at task completion) can be challenging to learn from, while dense rewards (frequent feedback) can reward unintended behaviours.
  • Episode termination conditions: Here, we determine when the “trial” ends. This could be after achieving a goal, exceeding a time limit, or entering a failure state.

After defining these components, you’ll implement the environment’s dynamics, that is, the rules governing how states transition based on actions.
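
To make these components concrete, here is a minimal sketch of a toy environment using the Gymnasium API mentioned above. The task (moving a counter to a target value) is purely illustrative and not one of the frameworks’ built-in tasks.

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class CounterEnv(gym.Env):
    """Toy environment: increment/decrement a counter until it hits the target."""

    def __init__(self, target=5, max_steps=20):
        super().__init__()
        # State space: the current counter value and the target value.
        self.observation_space = spaces.Box(low=-100, high=100, shape=(2,), dtype=np.float32)
        # Action space: two discrete actions, 0 = decrement, 1 = increment.
        self.action_space = spaces.Discrete(2)
        self.target = target
        self.max_steps = max_steps

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.value, self.steps = 0, 0
        return np.array([self.value, self.target], dtype=np.float32), {}

    def step(self, action):
        # Dynamics: the rules governing how the state transitions given an action.
        self.value += 1 if action == 1 else -1
        self.steps += 1
        # Reward function: sparse, paid out only when the target is reached.
        terminated = self.value == self.target
        reward = 1.0 if terminated else 0.0
        # Episode termination: goal reached (terminated) or step budget hit (truncated).
        truncated = self.steps >= self.max_steps
        obs = np.array([self.value, self.target], dtype=np.float32)
        return obs, reward, terminated, truncated, {}

An agent interacting with this environment calls env.reset() once, then env.step(action) in a loop until either terminated or truncated is True.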

Step 0: Consider what we’ll need

Begin by setting up a GPU Droplet, keeping in mind how many GPUs you’ll need; we’re going to use four H100s. You’ll also need a Weights & Biases account for experiment logging.

Step 1: Clone the repo and set up the environment


# If uv isn't already installed, you may need to:
snap install astral-uv

git clone https://github.com/NovaSky-AI/SkyRL.git
cd SkyRL
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[vllm]"   # or ".[sglang]" for an alternative inference backend
uv pip install -r requirements.txt

Step 2: Prepare a Dataset for RL Training

SkyRL expects data in Parquet format with a schema suited to instruction tuning and RL (prompt, completion, reward, etc.).
You could use one of the built-in examples, like GSM8K for math reasoning (a good starter before jumping to SWE-Bench style tasks), or prepare your own for a custom environment.

We’re going to generate GSM8K data (reasoning + tool-use style) with gsm8k_dataset.py.

cd skyrl-train
uv run examples/gsm8k/gsm8k_dataset.py --output_dir ~/data/gsm8k

This creates train.parquet and validation.parquet with fields like:

  • prompt
  • completion (or trajectories)
  • reward (for offline/hybrid setups; online RL computes rewards live)

For more agentic tasks, we could use SWE-Bench or a similar benchmark built around verifiable tasks. SkyRL also integrates the OpenHands runtime (SkyRL-OpenHands) for code-editing environments.
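
As a quick sanity check of the generated files, you can inspect the Parquet schema directly (this assumes pandas and pyarrow are installed; the exact column names depend on your SkyRL version):

import os

import pandas as pd

# Load the generated training split and inspect its schema.
path = os.path.expanduser("~/data/gsm8k/train.parquet")
df = pd.read_parquet(path)
print(df.columns.tolist())   # expect prompt/completion/reward-style fields
print(df.head(3))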

Step 3: Create a YAML config to Configure a Training Run

Let’s create a YAML file to configure a GRPO training run: examples/gsm8k/gsm8k-grpo.yaml

data:
  train_data: ["~/data/gsm8k/train.parquet"]
  val_data: ["~/data/gsm8k/validation.parquet"]

trainer:
  algorithm:
    name: grpo
    advantage_estimator: grpo
  policy:
    model:
      path: Qwen/Qwen2.5-1.5B-Instruct   # Start small; scale to 7B–32B
  epochs: 2                             # Increase for real training
  strategy: fsdp2                       # Or ddp for single node
  placement:
    colocate_all: true
    policy_num_gpus_per_node: 4         # Adjust to your hardware

inference:
  backend: vllm                         # Fast inference for rollouts

logger: wandb
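
For context, the grpo advantage estimator scores each sampled completion relative to the other completions generated for the same prompt. Here is a rough sketch of that idea (illustrative only, not SkyRL’s implementation):

import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    # GRPO-style advantage: each completion's reward is normalized against
    # the mean and standard deviation of its group (same-prompt samples).
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled completions for one prompt, two of which were verified correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

Completions that beat their group average get positive advantages and are reinforced; the rest are pushed down.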

Step 4: Train

For a single node:

uv run -m skyrl_train.entrypoints.main_base \
  --config examples/gsm8k/gsm8k-grpo.yaml \
  trainer.epochs=5 \
  data.train_data='["~/data/gsm8k/train.parquet"]'

For distributed (multi-GPU/node via Ray/SkyPilot):

sky launch skyrl_train/examples/gsm8k/gsm8k-grpo-skypilot.yaml \
  --secret WANDB_API_KEY=your_key_here

Step 5: Evaluate and Iterate

from skyrl.agent import SkyRLAgent

# Load the trained policy from a checkpoint and roll it out on a task.
agent = SkyRLAgent.from_checkpoint("path/to/checkpoint")
result = agent.run_task(
    prompt="Fix this bug in repo X: ...",
    runtime="openhands",   # Stateful code-editing environment
    max_turns=30           # Cap on agent-environment interaction turns
)
print(result.success_rate, result.trajectory)

FAQ

What are RL environments?
In the context of Reinforcement Learning (RL), an environment provides a reward or penalty for an action an agent (often an LLM/model) takes. The agent adapts its behaviour to maximize the cumulative reward.

Why is the industry turning to RL environments for LLMs?
The industry is adopting RL environments to integrate LLMs into end-to-end autonomous systems with multi-agent architectures, which is seen as the way to derive true value from these models.

What is Reinforcement Learning from Verifiable Rewards (RLVR)?
RLVR is a method that uses objective and verifiable rewards (such as those in math and code tasks). These “non-gameable” rewards ensure the model develops the desired reasoning and problem-solving strategies, unlike subjective rewards used in RLHF.

How are RL environments used for commercial products?
Companies are building environments, often called “harnesses” or “UI gyms,” around their own software to train models specifically for use within their products, such as Cursor’s Composer or OpenAI’s Codex.

How can RL environments be used for synthetic data generation?
Environments inherently know the ground truth (e.g., unit tests passing, correct spreadsheet outputs, accurate terminal data), making them useful for generating high-quality synthetic data.

Final Thoughts

Moving forward in 2026, it looks like we’re going to see better integration of AI models into tangible use cases, thanks to RL environments that allow models to be trained for specific applications. DigitalOcean seeks to make your ambitions with AI possible: Train, Infer, and Build Agents with Gradient.

References and Additional Resources

SemiAnalysis: RL Environments and RL for Science: Data Foundries and Multi-Agent Architectures
Epoch AI: An FAQ on Reinforcement Learning Environments
SkyRL: Creating a New Environment or Task — SkyRL documentation
Thinking Machines: RL Environments – Tinker API
GPU mode: RL environments mini-conference

Prime Intellect:
What are RLVR environments for LLMs? | Policy - Rollouts - Rubrics
RL Environments at Scale – Will Brown, Prime Intellect
Under the Hood: Building an RL Environment with Zapier & Prime Intellect
INTELLECT-3: Technical Report

And if you want to explore RL from the ground up, here’s the Google Drive for Richard Sutton’s CMPUT 609 course.


About the author

Melani Maheswaran is a Technical Writer at DigitalOcean based in Toronto. She has experience in teaching, data quality, consulting, and writing. Melani graduated with a BSc and Master’s from Queen's University.
