
An area of active interest for AI researchers and engineers is integrating LLMs into end-to-end autonomous systems built on multi-agent architectures. While LLMs are impressive in their own right, to truly derive value from them, the industry is turning to Reinforcement Learning (RL) environments.
RL environments aren’t new; they predate LLMs. In fact, you really can’t talk about agents without talking about environments. Generally, in an RL context, an environment provides a reward or penalty for an action an agent takes in that environment. The agent is forced to adapt to maximize cumulative reward. This adaptation to maximize reward is the crux of reinforcement learning.
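To make that loop concrete, here's a minimal sketch using Gymnasium (one of the frameworks we'll mention below): a random agent interacts with the classic CartPole environment and accumulates reward until the episode ends. A learned policy would replace the random action selection.
import gymnasium as gym

# Classic control task: balance a pole on a cart
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # Random agent; RL swaps this for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward  # Cumulative reward is what the agent learns to maximize
    done = terminated or truncated

print(f"Episode return: {total_reward}")
env.close()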
Now, with the rise of LLMs, the agent is often a model itself. The model's weights get updated based on the scoring of its attempts at different tasks, allowing the model to adapt. Computer use, that is, an AI system that can navigate your computer, is a particularly interesting multi-agent task. We explored this topic in a previous article on Microsoft's work on scaling computer-use data with multi-agent pipelines for their model, Fara-7B.
The recent increase in discourse and interest around RL environments can perhaps be traced back to the success of RLVR (Reinforcement Learning with Verifiable Rewards), where rewards come from tasks that can be objectively checked, such as math and code. We like the way Andrej Karpathy’s blog post, 2025 LLM Year in Review, puts RLVR in the context of the progress made last year. Karpathy explains what makes RLVR effective: rewards that are verifiable are non-gameable. A non-gameable reward function is tied directly to the successful, verifiable outcome of a task (like solving a puzzle or passing a test case), making it hard for the LLM to achieve a high reward through shortcuts (reward hacking) rather than by actually developing the desired reasoning and problem-solving strategies.
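To make "verifiable" concrete, here's a toy sketch of a reward function for a math task; the function name and answer-extraction convention are illustrative, not from any particular framework. The reward is 1 only if the model's final numeric answer matches the known ground truth, so there is no partial credit to game.
import re

def verifiable_math_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 only if the final numeric answer matches the known solution."""
    numbers = re.findall(r"-?\d+\.?\d*", model_output)
    if not numbers:
        return 0.0  # No answer produced, no reward
    try:
        return 1.0 if float(numbers[-1]) == float(ground_truth) else 0.0
    except ValueError:
        return 0.0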
Will Brown explains that models are increasingly trained for a specific product, giving the examples of Cursor’s Composer and OpenAI’s Codex, where the model is trained in a harness, which is essentially an RL environment that represents the product. In the same vein, companies are beginning to build environments around their software, which we’re hearing called UI gyms.
There are many ways to go about creating an RL environment. First, consider your goal: what do you want your model to achieve? Then choose a framework. Depending on the framework, the environment will be defined in different ways.
Potential frameworks include Prime Intellect’s Environments Hub, SkyRL (which has reusable tools), PyTorch’s OpenEnv, and Gymnasium (the actively maintained fork of OpenAI’s Gym). Thinking Machines also has documentation and a cookbook on RL environments.
Regardless of the framework, you’ll typically need to think about and specify the key components of your RL environment:
- the state/observation space (what the agent can see at each step)
- the action space (what the agent is allowed to do)
- the reward function (how success or failure is scored)
- termination conditions (when an episode ends)
After defining these components, you’ll implement the environment’s dynamics, that is, the rules governing how states transition based on actions.
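As a minimal sketch of what those pieces look like in code (again using Gymnasium's interface, purely for illustration), here's a toy "walk to the goal" environment: the state is a position on a short line, the actions are left/right, step() encodes the transition dynamics, and the reward is 1 for reaching the goal.
import gymnasium as gym
from gymnasium import spaces

class LineWalkEnv(gym.Env):
    """Toy environment: start at position 0, reach position 4 to earn a reward."""

    def __init__(self):
        self.observation_space = spaces.Discrete(5)  # States: positions 0..4
        self.action_space = spaces.Discrete(2)       # Actions: 0 = left, 1 = right
        self.position = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.position = 0
        return self.position, {}  # (observation, info)

    def step(self, action):
        # Transition dynamics: move along the line, clipped to its ends
        self.position = min(4, self.position + 1) if action == 1 else max(0, self.position - 1)
        terminated = self.position == 4          # Episode ends at the goal
        reward = 1.0 if terminated else 0.0      # Sparse, verifiable reward
        return self.position, reward, terminated, False, {}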
Let’s walk through a hands-on example with SkyRL. Begin by setting up a GPU Droplet, and be mindful of how many GPUs you’ll need; we’re going to use 4 H100s. You’ll also need a Weights & Biases account.
# If uv isn't already installed on your machine, you may need to: snap install astral-uv
git clone https://github.com/NovaSky-AI/SkyRL.git
cd SkyRL
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[vllm]"  # or ".[sglang]" for an alternative inference backend
uv pip install -r requirements.txt
SkyRL expects data in Parquet format, with a schema suited to instruction tuning and RL (prompt, completions, rewards, etc.).
You could use one of the built-in examples, like GSM8K for math reasoning (a good starter before jumping to SWE-Bench style tasks), or prepare your own for a custom environment.
We’re going to generate GSM8K data (reasoning + tool-use style) with gsm8k_dataset.py.
cd skyrl-train
uv run examples/gsm8k/gsm8k_dataset.py --output_dir ~/data/gsm8k
This creates train.parquet and validation.parquet with fields like:
- prompt
- completion (or trajectories)
- reward (for offline/hybrid setups; online RL computes rewards live)
For more agentic tasks, we could use SWE-Bench or a similar benchmark of verifiable tasks. SkyRL also integrates the OpenHands runtime (SkyRL-OpenHands) for code-editing environments.
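If you'd like to sanity-check what the script produced, you can peek at the Parquet files with pandas (assuming pandas and pyarrow are installed in your environment). The exact column names can vary between SkyRL versions, so treat this as a quick inspection rather than a schema reference.
import os
import pandas as pd

# Load the generated training split and look at its columns and a sample row
path = os.path.expanduser("~/data/gsm8k/train.parquet")
df = pd.read_parquet(path)
print(df.columns.tolist())
print(df.head(1).to_dict(orient="records"))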
Let’s create a YAML file to configure a GRPO training run: examples/gsm8k/gsm8k-grpo.yaml
data:
  train_data: ["~/data/gsm8k/train.parquet"]
  val_data: ["~/data/gsm8k/validation.parquet"]

trainer:
  algorithm:
    name: grpo
    advantage_estimator: grpo
  policy:
    model:
      path: Qwen/Qwen2.5-1.5B-Instruct  # Start small; scale to 7B–32B
  epochs: 2          # Increase for real training
  strategy: fsdp2    # Or ddp for single node
  placement:
    colocate_all: true
    policy_num_gpus_per_node: 4  # Adjust to your hardware

inference:
  backend: vllm  # Fast inference for rollouts

logger: wandb
For a single node:
uv run -m skyrl_train.entrypoints.main_base \
--config examples/gsm8k/gsm8k-grpo.yaml \
trainer.epochs=5 \
data.train_data='["~/data/gsm8k/train.parquet"]'
For distributed (multi-GPU/node via Ray/SkyPilot):
sky launch skyrl_train/examples/gsm8k/gsm8k-grpo-skypilot.yaml \
--secret WANDB_API_KEY=your_key_here
Once training completes, you can load the resulting checkpoint and run the agent against a task, for example with the OpenHands runtime for stateful code editing:
from skyrl.agent import SkyRLAgent

agent = SkyRLAgent.from_checkpoint("path/to/checkpoint")
result = agent.run_task(
    prompt="Fix this bug in repo X: ...",
    runtime="openhands",  # Stateful code environment
    max_turns=30,
)
print(result.success_rate, result.trajectory)
What are RL environments?
In the context of Reinforcement Learning (RL), an environment provides a reward or penalty for an action an agent (often an LLM/model) takes. The agent adapts its behaviour to maximize the cumulative reward.
Why is the industry turning to RL environments for LLMs?
The industry is adopting RL environments to integrate LLMs into end-to-end autonomous systems with multi-agent architectures, which is seen as the way to derive true value from these models.
What is Reinforcement Learning from Verifiable Rewards (RLVR)?
RLVR is a method that uses objective and verifiable rewards (such as those in math and code tasks). These “non-gameable” rewards ensure the model develops the desired reasoning and problem-solving strategies, unlike subjective rewards used in RLHF.
How are RL environments used for commercial products?
Companies are building environments, often called “harnesses” or “UI gyms,” around their own software to train models specifically for use within their products, such as Cursor’s Composer or OpenAI’s Codex.
How can RL environments be used for synthetic data generation?
Environments inherently know the ground truth (e.g., unit tests passing, correct spreadsheet outputs, accurate terminal data), making them useful for generating high-quality synthetic data.
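In practice this often looks like rejection sampling: roll a model out in the environment, keep only the trajectories the environment verifies as successful, and save those as training data. The sketch below is purely illustrative; rollout and verify are hypothetical placeholders, not a specific framework's API.
def collect_synthetic_data(env, model, n_attempts=100):
    """Keep only rollouts that the environment verifies as successful."""
    dataset = []
    for _ in range(n_attempts):
        trajectory = model.rollout(env)  # Hypothetical: the model attempts the task
        if env.verify(trajectory):       # Ground-truth check, e.g. unit tests pass
            dataset.append({
                "prompt": trajectory.prompt,
                "completion": trajectory.completion,
                "reward": 1.0,
            })
    return dataset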
Moving into 2026, it looks like we’re going to see tighter integration of AI models into tangible use cases, thanks to RL environments that allow models to be trained for specific applications. DigitalOcean seeks to make your ambitions with AI possible: Train, Infer, and Build Agents with Gradient.
SemiAnalysis: RL Environments and RL for Science: Data Foundries and Multi-Agent Architectures
Epoch AI: An FAQ on Reinforcement Learning Environments
SkyRL: Creating a New Environment or Task — SkyRL documentation
Thinking Machines: RL Environments – Tinker API
GPU mode: RL environments mini-conference
Prime Intellect:
What are RLVR environments for LLMs? | Policy - Rollouts - Rubrics
RL Environments at Scale – Will Brown, Prime Intellect
Under the Hood: Building an RL Environment with Zapier & Prime Intellect
INTELLECT-3: Technical Report
And if you want to explore RL from the ground up, here’s the Google Drive for Richard Sutton’s CMPUT 609 course.