We’re in awe of some of the open-source releases we’re seeing, such as Kimi K2, Qwen3 Coder, and GLM-4.5 (article coming soon), given the incredible progress these agentic models demonstrate in multistep reasoning, coding, and tool use.
With gpt-oss, OpenAI has made its first significant open-source model release in over five years, following GPT-2 in 2019. This family of models is released under the permissive Apache 2.0 license, which means you have broad freedom to use, modify, and distribute the software, even commercially; the main conditions are that you include the original license and copyright notices and acknowledge any modifications.
The model comes in two variants, 120B and 20B. The 120B model features 117 billion total parameters (5.1 billion active per token) across 36 layers, while the 20B model has 21 billion total parameters (3.6 billion active per token) across 24 layers. Both models employ native 4-bit (MXFP4) quantization for their MoE weights, allowing the 120B model to fit on a single 80 GB GPU and the 20B model to run in around 16 GB of memory.
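As a rough sanity check on those memory figures, here is some back-of-the-envelope arithmetic (a hedged sketch: the real footprint also depends on the runtime, activation buffers, and KV cache, and on which tensors stay in higher precision). It assumes the MoE weights make up roughly 90% of the parameters at about 4.25 bits each, with the remainder kept in bf16:

```python
def approx_weight_gib(total_params_billion, moe_fraction=0.90,
                      moe_bits=4.25, other_bits=16):
    """Very rough weight-memory estimate for an MoE model with MXFP4 experts.
    The ~90% MoE fraction and 4.25 bits/parameter are the figures quoted for
    gpt-oss; treating all remaining weights as bf16 is an assumption."""
    total = total_params_billion * 1e9
    bits = total * (moe_fraction * moe_bits + (1 - moe_fraction) * other_bits)
    return bits / 8 / 2**30

# Both estimates land in the same ballpark as the stated hardware targets
# (a single 80 GB GPU and ~16 GB of memory, respectively).
print(f"gpt-oss-120b weights ≈ {approx_weight_gib(117):.0f} GiB")
print(f"gpt-oss-20b  weights ≈ {approx_weight_gib(21):.0f} GiB")
```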
DigitalOcean is committed to making things simple for developers and businesses that are looking to scale. Learn more about using gpt-oss in our product update.
This article can be a lot to take in if you don’t have prior exposure to training large language models and model architectures. We advise readers to have an understanding of neural network fundamentals, the attention mechanism, the transformer, and data types before proceeding. We’ve listed some additional prerequisite knowledge and resources below, and the hyperlinks scattered throughout the article are there to help you fill gaps as you go. Our goal with this article isn’t to regurgitate the gpt-oss model card, but rather to use this exciting model release to flesh out the approach, tools, and frameworks that will help you with whatever you hope to learn and accomplish.
If you’re unfamiliar with quantization, A Visual Guide to Quantization by Maarten Grootendorst is great. This might provide you with the context to understand MXFP4 quantization.
The Attention and its Variants section of our inference optimization article might be worth checking out to get a preliminary understanding of attention mechanisms. We’ll discuss gpt-oss’ use of GQA and SWA.
It may be helpful to read this article alongside the gpt-oss model card, which goes into greater depth on safety testing and the model’s performance on benchmarks, neither of which we cover here. The additional resources section at the end of this article is a gold mine, so be sure to take a look at that as well. Feel free to do a quick test of the model at the official gpt-oss website.
We’re going to start with a discussion of the model architecture. We’ve summarized key specifications alongside their relevance in the table below, with the hope that this makes them easier to digest.
Spec | Relevance |
---|---|
Mixture of Experts | The Mixture of Experts (MoE) architecture in gpt-oss employs sparse Feedforward Neural Network (FFN) layers, known as experts, along with a gating mechanism (router) that routes each token to its top-4 experts, so only a subset of the parameters is activated per token (see the routing sketch after this table). This makes MoE a compute-efficient alternative to dense models. |
Gated SwiGLU activation function | Activation functions introduce non-linearity, enabling the network to learn and model complex patterns in data. The MoE blocks in gpt-oss use a gated SwiGLU activation function; SwiGLU is the current standard in modern LLMs. The gpt-oss model card notes that the SwiGLU implementation is unconventional in that it includes clamping and a residual connection, modifications that likely lead to smoother optimization and faster convergence, especially in large-scale transformer architectures (a hedged sketch follows this table). Residual connections, or skip connections, create a “shortcut” path that allows the input of a layer to be added directly to its output, bypassing one or more intermediate layers. |
Grouped Query Attention (GQA) & Sliding Window Attention (SWA) | The model card describes this as dense (fully dense) and locally banded sparse attention patterns in alternating transformer layers, which means the attention blocks alternate between full attention and sliding window attention, with grouped query attention (8 key-value heads) used throughout. Each attention head also has a learned bias in the denominator of the softmax, similar to off-by-one attention. |
Rotary Position Embeddings | RoPE encodes position by rotating the query and key vectors according to each token’s position. Encoding position is critical since attention is otherwise order-blind to the input tokens. |
Context length of dense layers = 131,072 | The context length of the gpt-oss dense layers is extended to 131,072 tokens using YaRN (Yet another RoPE extensioN method), a compute-efficient technique for extending the context window of transformer-based models. |
Attention Sinks | Attention sinks are tokens placed at the start of a sequence to stabilize attention, which is especially useful in long-context scenarios. |
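To make the routing row above more concrete, here is a minimal sketch of top-k expert routing. The dimensions, the single-matrix router, and the softmax over the selected experts’ logits are illustrative assumptions rather than gpt-oss’ exact implementation:

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, k=4):
    """Toy top-k MoE routing: score every expert for every token, keep the
    top-k, and normalize their scores into mixing weights."""
    # hidden: (num_tokens, d_model); router_weight: (num_experts, d_model)
    logits = hidden @ router_weight.T                 # (num_tokens, num_experts)
    topk_logits, topk_idx = torch.topk(logits, k, dim=-1)
    topk_weights = F.softmax(topk_logits, dim=-1)     # normalize over the chosen experts only
    return topk_idx, topk_weights

# Each token's MoE output is then a weighted sum of the k selected experts'
# FFN outputs; the remaining experts are never evaluated, which is where the
# "active parameters per token" savings come from.
```

And here is a hedged sketch of the clamped, residual-style SwiGLU described above. The clamp limit and the sigmoid scale are assumptions drawn from the public reference code, not values we verified independently:

```python
import torch

def clamped_swiglu(gate, linear, alpha=1.702, limit=7.0):
    """Gated SwiGLU variant with clamping and a +1 shift on the linear branch
    (a small residual-like path through the gate). alpha and limit are assumed
    values and may not match gpt-oss exactly."""
    gate = gate.clamp(max=limit)
    linear = linear.clamp(min=-limit, max=limit)
    swish = gate * torch.sigmoid(alpha * gate)        # SiLU-style gating
    return swish * (linear + 1)                       # +1 lets signal through even when the gate is near zero
```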
We’re very intrigued by the quantization used here. The gpt-oss family of models is trained natively with Microscaling FP4 or MXFP4 where the MoE weights (90% of the total parameter count) are quantized to 4.25 bits per parameter. To better understand microscaling, we encourage you to read OCP Microscaling Formats (MX) Specification Version 1.0 - specifically section 5.
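To make the microscaling idea concrete, here is a deliberately simplified emulation of one MXFP4 block: 32 values share a single power-of-two scale, and each value is stored as a 4-bit FP4 (E2M1) element, which is where the roughly 4.25 bits per parameter figure comes from (32 × 4 bits plus one 8-bit shared scale, divided by 32). This is a sketch, not the OCP reference algorithm; the spec’s scale selection, rounding, and saturation rules differ in the details:

```python
import numpy as np

# Non-negative magnitudes representable by an FP4 E2M1 element.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(block):
    """Toy emulation of one 32-element MXFP4 block: choose a power-of-two
    shared scale so the block fits inside the FP4 range, then round each
    scaled element to the nearest representable FP4 magnitude."""
    assert block.size == 32
    amax = np.abs(block).max()
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale    # dequantized approximation

x = np.random.randn(32).astype(np.float32)
print("max abs error:", np.abs(x - quantize_mxfp4_block(x)).max())
```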
The o200k_harmony tokenizer, a BPE variant with a 200k-token vocabulary, is used across all training stages of the model. This open-source tokenizer is available in the tiktoken library and builds upon the o200k tokenizer used in other OpenAI models.
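If you want to poke at the tokenizer directly, recent tiktoken releases ship the o200k_harmony encoding (this sketch assumes you have such a release installed):

```python
import tiktoken

# Requires a tiktoken version that includes the o200k_harmony encoding,
# which was published alongside gpt-oss.
enc = tiktoken.get_encoding("o200k_harmony")

tokens = enc.encode("gpt-oss uses the o200k_harmony tokenizer.")
print(len(tokens), tokens[:8])
print(enc.decode(tokens))
print("vocab size (including special tokens):", enc.n_vocab)
```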
The focus of gpt-oss post-training is on reasoning, tool use (browsing, Python, and developer functions), safety through CoT RL techniques, and a Harmony Chat Format. To our knowledge, the datasets/RL environments for this model aren’t released.
There are a number of reasons why chat templates are critical. Consistency between the chat formats used in training and in deployment mitigates performance degradation. Like tokenizers, chat templates store information about how data is processed. OpenAI employs a custom chat format in the training of gpt-oss, referred to as the harmony chat format, with special tokens for message boundaries and role tags such as User, Assistant, System, and Developer.
The model follows a role hierarchy of System, Developer, User, Assistant, and Tool to resolve conflicts, and uses channels to manage message visibility for analysis, commentary, and final output.
This approach enables advanced agentic features like interleaved tool calls.
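A low-friction way to see the harmony format in action, without hand-writing its special tokens, is to let the checkpoint’s chat template render a conversation. The sketch below assumes the Hugging Face repo name openai/gpt-oss-20b and a transformers version recent enough to carry its template; check the rendered output against the harmony documentation rather than treating it as the spec:

```python
from transformers import AutoTokenizer

# Assumes access to the Hugging Face Hub and a recent transformers release.
tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "In one sentence, what is MXFP4?"},
]

# Rendering through the chat template (rather than concatenating strings by
# hand) is what keeps training-time and inference-time formatting consistent.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```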
The Illustrated GPT-OSS - by Jay Alammar: The visuals in this article are excellent for gaining a more intuitive understanding of the gpt-oss architecture and the message formatting the model uses. We really like how it explains the way different user roles (e.g., end users of ChatGPT, builders of LLM apps like Cursor, or those who post-train the model) are supported by the shapes of the model’s inputs and outputs (e.g., reasoning traces, tool interactions).
From GPT-2 to gpt-oss: Analyzing the Architectural Advances - by Sebastian Raschka: This article is great because it puts into perspective how far we have come since GPT-2. The explanations of the different concepts (e.g., RoPE, SwiGLU, Mixture of Experts, GQA, SWA, RMSNorm) are comprehensive and thorough.
What is SwiGLU? by jcarlosroldan: This article does a good job of providing context on how SwiGLU became the preferred activation function in modern LLMs.
Chat Templates: An End to the Silent Performance Killer: Chat templates are Jinja-based templates embedded in tokenizers that automatically format conversation messages into the structure the model was trained on. This article explains how, if you don’t format prompts exactly the way a model expects, performance can degrade silently: not necessarily with errors, but with poor results.
Recall gpt-oss uses the Harmony chat format.
OCP Microscaling Formats (MX) Specification Version 1.0 : This resource provides more information on microscaling formats from the Open Compute Project (OCP). Section 2 explains how the MX formats align with OCP’s core principles. They are open, jointly developed by major industry players and based on prior open standards; efficient, enabling reduced precision and memory use for lower cost and better performance; impactful, with broad backing that makes them likely to become an industry standard; scalable, designed for easy adoption on existing hardware; and sustainable, reducing energy use and carbon emissions in AI workloads.
GitHub - microsoft/microxcaling: PyTorch emulation library for Microscaling (MX)-compatible data formats: This library emulates MX-compatible formats and bfloat quantization in PyTorch. Computations use float32/bfloat16/fp16 but respect the representable range of MX or bfloat formats. It supports matrix multiplication (specifically torch.matmul, torch.linear, and torch.bmm) for MX tensors, as well as element-wise operations like GELU, softmax, and layernorm, with basic operations (such as add, sub, sqrt, and exp) performed in bfloat precision.
Microscaling Data Formats for Deep Learning: This is the paper that introduced the notion of microscaling data formats.
1.5x Faster MoE Training on Blackwell with MXFP8 Kernels Built from Scratch | Cursor - The AI Code Editor: This article describes how Cursor achieved a 1.5x speedup in end-to-end training of large language models on Blackwell GPUs. By redesigning Mixture-of-Experts (MoE) layers with custom MXFP8 kernels, they substantially reduced training time and costs, accelerating advancements and deployment of SOTA models.
Note that gpt-oss uses MXFP4 and not MXFP8 for the linear projection weights in the MoE layer.
Fine-tuning with gpt-oss and Hugging Face Transformers: “On a H100 GPU, this takes about 18 minutes to train, but may take longer depending on your hardware.”
Ollama (gpt-oss 20b, ~14GB of VRAM): “Ollama is supporting the MXFP4 format natively without additional quantizations or conversions. New kernels are developed for Ollama’s new engine to support the MXFP4 format. Ollama collaborated with OpenAI to benchmark against their reference implementations to ensure Ollama’s implementations have the same quality.”
Unsloth (gpt-oss 20b, ~14GB of VRAM): “We utilized OpenAI’s Triton Kernels library directly to allow MXFP4 inference. For finetuning / training however, the MXFP4 kernels do not yet support training, since the backwards pass is not yet implemented. We’re actively working on implementing it in Triton!”
Additional implementations can be found linked in the gpt-oss repo.
There are definitely some great resources out there, so feel free to comment on anything that was missed.
When doing research for this article, we were blown away by the sheer amount of content that came out on gpt-oss, whether news articles, YouTube videos, blog posts, externally developed base models, etc. It’s clear the community is excited about the release of open-source models from OpenAI, and we’re excited to see how these get leveraged and how they compare to other similar models.
Wait - did we mention that gpt-oss is available on the DigitalOcean Gradient™ AI Platform?
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
Melani is a Technical Writer at DigitalOcean based in Toronto. She has experience in teaching, data quality, consulting, and writing. Melani graduated with a BSc and Master’s from Queen's University.