By Andrew Dugan
Senior AI Technical Content Creator II

The United States produces some of the world’s most widely used open-weights models, spanning hyperscaler releases like Meta’s Llama and Google’s Gemma, hardware-tuned models from NVIDIA, small-but-capable models from Microsoft’s Phi team, and the fully transparent OLMo family from the nonprofit Allen Institute for AI. Together they range from the most openly documented models on earth to some of the most commercially restricted, and from 14-billion-parameter models that run on a laptop to 550-billion-parameter reasoning systems. The result is a uniquely diverse ecosystem with no single design philosophy holding it together.
In honor of the 250th anniversary of the United States, this article reviews the state of the country’s open-weights large language models (LLMs). We look at the top-performing options, analyze how US open-weights LLMs differ in architecture, and speculate on what improvements we might see in future open-source models. The goal is not to crown a single winner, but to map who builds open models in the US, how they build them, and where the ecosystem might be headed next.
American open-weights AI spans the full range from one of the most open model families in the world (Ai2’s OLMo) to the highest-scoring open model in the US (NVIDIA’s Nemotron 3 Ultra 550B), with strong self-hostable options like Gemma 4 31B in between.
American models stand out for their architectural diversity and are mostly distributed through incumbent platforms. Most lack techniques that are now common abroad, such as Multi-head Latent Attention (MLA), auxiliary-loss-free load balancing for Mixture-of-Experts (MoE) models, and reasoning-focused pretraining. By contrast, Chinese labs converge on shared architectures, while labs outside both countries, like Mistral, compete on sovereignty and permissive licensing.
US open releases have historically trailed the labs’ internal proprietary frontier, but newer entrants are closing that gap. NVIDIA’s hybrid Mamba-MoE Nemotron 3 line now tops US open benchmarks on both quality and throughput.
The tables below list verified benchmark results for the leading American open-weights models as of mid-2026, drawn from official model cards, technical reports, and the independent Artificial Analysis leaderboard.
Reasoning models and general-purpose models are listed in separate tables. A reasoning model with test-time chain-of-thought will generally beat a non-reasoning model on math and science benchmarks, so comparing the two directly is not fair. Benchmark versions and conditions also vary: LiveCodeBench versions differ (v3 vs. v6), American Invitational Mathematics Examination (AIME) editions differ by year (2024/2025/2026), and most scores are vendor-reported from self-run evals. GPQA Diamond and LiveCodeBench are high-variance, so treat differences of a point or two as noise.
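To see why a point or two can be noise, here is a rough sketch that estimates the sampling error of a single benchmark score. It assumes each question is an independent pass/fail trial (a binomial approximation), uses GPQA Diamond's 198-question size, and plugs in an illustrative score of 84.3. The function name is hypothetical, and the estimate ignores other sources of variance such as prompt format or sampling temperature.

```python
import math


def score_standard_error(score_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# Illustrative inputs: an 84.3% score on a 198-question benchmark.
se = score_standard_error(84.3, 198)
print(f"standard error: {se:.1f} points")
```

Under these assumptions the standard error comes out at roughly 2.6 points, so two models a point apart on a benchmark of this size are not clearly distinguishable from a single run.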
Scores are reasoning-mode (“thinking on”) where the model supports a toggle. All figures are from official model cards or technical reports.
| Model (Lab) | Params (total/active) | MMLU-Pro | GPQA Diamond | LiveCodeBench | AIME 2025 | Context | License |
|---|---|---|---|---|---|---|---|
| Nemotron 3 Ultra 550B (NVIDIA) | 550B / 55B MoE | 86.8 | 87.0 | 89.0 (v6) | — † | 1M | OpenMDW-1.1 (weights + data) |
| Gemma 4 31B (Google) | 31B dense | 85.2 | 84.3 | 80.0 (v6) | 89.2 ('26) | 256K | Apache 2.0 |
| Nemotron 3 Super 120B (NVIDIA) | 120B / 12B MoE | 83.7 | 79.2 | 81.2 | 90.2 | 1M | NVIDIA Open Model |
| OLMo 3 32B Think (Ai2) | 32B dense | — ‡ | 58.1 | 83.5 (v3) | 72.5 | 65K | Apache 2.0 (full stack) |
| Llama-Nemotron Ultra 253B (NVIDIA) | 253B dense | — | 76.0 | 66.3 | 72.5 | 128K | NVIDIA Open Model |
| Hermes 4 405B (Nous Research) | 406B dense | 80.5 | 70.5 | 61.3 (v6) | 78.1 | ~40K | Llama 3.1 |
| Nemotron 3 Nano 30B (NVIDIA) | 30B / ~3B MoE | 78.1 | 72.5 | 67.6 | 87.7 | 1M | NVIDIA Open Model |
| Phi-4-reasoning-plus (Microsoft) | 14B dense | 76.0 | 68.9 | 53.1 | 78.0 | 32K | MIT |
† The Nemotron 3 Ultra model card reports GPQA, MMLU-Pro, LiveCodeBench, SWE-Bench Verified (70.7), and RULER-1M (94.7) but not a standalone AIME 2025 score; on tool-augmented olympiad math (IMOAnswerBench) it scores 92.3.

‡ OLMo 3 32B Think reports standard MMLU 85.4 and MATH 96.1 rather than MMLU-Pro. It is the strongest fully open (weights + code + data) reasoning model.
The general-purpose (non-reasoning) models follow.
| Model (Lab) | Params (total/active) | MMLU-Pro | GPQA Diamond | LiveCodeBench | Throughput | Context | License |
|---|---|---|---|---|---|---|---|
| Llama 4 Maverick (Meta) | 400B/17B | 80.5 | 69.8 | 43.4 | ~108 tok/s | 1M | Llama 4 Community |
| Llama 4 Scout (Meta) | 109B/17B | 74.3 | 57.2 | 32.8 | ~95 tok/s | 10M | Llama 4 Community |
| Phi-4 (Microsoft) | 14B dense | 70.4 | 56.1 | — | — | 16K | MIT |
| Gemma 3 27B (Google) | 27B dense | 67.5 | 42.4 | 29.7 | — | 128K | Gemma |
| DBRX Instruct (Databricks) | 132B/36B | — ‡ | — | — | ~150 tok/s | 32K | Databricks Open |
‡ DBRX (March 2024) predates MMLU-Pro/GPQA becoming standard; it reports MMLU 73.7, HumanEval 70.1, and GSM8K 66.9. It is included as a size and speed reference point, not a current-quality contender.
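For readers sizing hardware, the parameter columns translate into a rough memory floor for self-hosting. The sketch below is a back-of-the-envelope estimate, not a vendor figure: it counts weights only, assumes common bytes-per-parameter values for each precision, and ignores KV cache, activations, and quantization overhead. For MoE models it uses the total parameter count, since all experts normally have to be resident in memory even though only the active subset runs per token. The function and dictionary names are hypothetical.

```python
# Assumed storage cost per parameter for a few common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params_billions: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return total_params_billions * BYTES_PER_PARAM[precision]


# Total parameter counts (in billions) taken from the tables above.
for name, total_b in [("Phi-4", 14), ("Gemma 4 31B", 31), ("Llama 4 Scout", 109)]:
    sizes = {p: weight_memory_gb(total_b, p) for p in BYTES_PER_PARAM}
    print(name, sizes)
```

By this estimate a 14B dense model needs about 7 GB for 4-bit weights, which is why it fits on a laptop, while a 109B MoE model needs about 218 GB in bf16 despite activating only 17B parameters per token.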
American open-weights AI is a collection of bets made by large technology incumbents, a chip vendor, and a few research nonprofits. Its defining trait is architectural diversity without consensus. Grouped Query Attention (GQA) is the most common building block, but beyond it American labs pursue radically different designs in parallel: Mamba-2 state space models, Meta’s interleaved Rotary Position Embedding (iRoPE) attention, NVIDIA’s LatentMoE, and various layer-wise scaling schemes, with no shared design philosophy connecting them. NVIDIA is the clearest example of a lab pushing its own direction. It co-evolves model and silicon, pretraining Nemotron in the NVIDIA FP4 (NVFP4) 4-bit format and building hardware-aware hybrid Mamba-2 architectures around its own GPUs.
How these models acquire their capabilities also differs from the Chinese approach. American labs have historically treated reasoning as a post-training problem, grafting it on through supervised fine-tuning and reinforcement learning rather than embedding it in pretraining. And because the largest players are platform companies, distribution is a structural advantage no independent lab has matched yet. Meta’s multi-billion-user footprint gives Llama real-world reach far beyond its benchmark standing.
Openness is where the ecosystem has the most contradictions. In past years, the pattern seemed to be “open after proprietary,” with open weights trailing a lab’s internal frontier product by a generation. NVIDIA’s Nemotron 3 line has recently upended that, shipping open weights (and, for the Ultra model, training data) that top the US benchmark tables outright. The Allen Institute’s OLMo family goes further still, releasing weights, training code, full training data, and intermediate checkpoints together, making it one of the most completely open releases anywhere. Yet the same country produces the most restricted licenses too, from Llama’s monthly-active-user cap to DBRX’s no-compete clause. The one throughline is efficiency at small scale. Microsoft’s 14-billion-parameter Phi-4-reasoning matches models many times its size, and Apple’s OpenELM has advanced layer-wise efficiency research.
Where American labs diverge, leading Chinese labs have converged. Multi-head Latent Attention (MLA), first introduced by DeepSeek, has since been adopted by Moonshot’s Kimi K2 and, as of GLM-5, Zhipu’s GLM line (earlier GLM versions used GQA). Several of these models also share a fine-grained Mixture-of-Experts design along with DeepSeek’s auxiliary-loss-free load balancing.
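The idea behind auxiliary-loss-free load balancing can be sketched in a few lines. The toy below is a simplified illustration, not DeepSeek's implementation: each expert carries a bias that is added to its routing score only when choosing the top-k experts, and after each step the bias is nudged down for overloaded experts and up for underloaded ones, with no extra loss term. All names, the step size `gamma`, and the token scores are assumptions made for the example.

```python
def route_top_k(scores, bias, k):
    """Select the k experts with the highest (score + bias) for one token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return ranked[:k]


def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


def balance(token_scores, num_experts, k=1, steps=20, gamma=0.03):
    """Repeat route-then-update; return the final bias and the last observed load."""
    bias = [0.0] * num_experts
    load = [0] * num_experts
    for _ in range(steps):
        load = [0] * num_experts
        for scores in token_scores:
            for expert in route_top_k(scores, bias, k):
                load[expert] += 1
        bias = update_bias(bias, load, gamma)
    return bias, load


# Toy batch: every token prefers expert 0 before any bias is applied.
tokens = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
bias, load = balance(tokens, num_experts=2)
print("bias:", bias)
print("load:", load)
```

In this toy run the split settles at two tokens per expert. The bias only changes which experts are selected; in a real model the original scores would still be used to weight the experts' outputs.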
Their training philosophy is also distinctive. Chinese labs increasingly treat reasoning as a pretraining target, dedicating whole “stage 2” pretraining phases to elevated math, code, and STEM data instead of relying on post-training alone. Some also build self-sustaining synthetic data loops. Alibaba, for instance, used specialized Qwen2.5-Math and Qwen2.5-Coder models to generate synthetic training data for Qwen3, reducing dependence on proprietary API teachers. And Chinese models appear to do it cheaply. DeepSeek reports that V3 was trained on 14.8 trillion tokens for roughly $5.6 million, which, if accurate, would make it one of the most cost-efficient frontier training runs on record.
Qwen leads global Hugging Face downloads, and DeepSeek leads open-weight reasoning leaderboards, trailing only the strongest closed models. The limit is transparency. These labs publish strong results and detailed arXiv papers, but they release weights only. None of the major frontier Chinese labs publish training code, training data, or intermediate checkpoints.
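The headline figure is simple arithmetic on two inputs. The sketch below reproduces it using the numbers the DeepSeek-V3 technical report is understood to give: about 2.788 million H800 GPU-hours and an assumed rental price of $2 per GPU-hour. Treat both as assumptions to check against the report. The figure covers GPU time for the final training run only, not earlier research, ablations, salaries, or hardware purchases.

```python
GPU_HOURS = 2_788_000      # reported H800 GPU-hours for the full training run
PRICE_PER_GPU_HOUR = 2.00  # assumed rental price in USD per GPU-hour


def training_cost_usd(gpu_hours: float, price_per_hour: float) -> float:
    """Estimated compute cost: GPU-hours multiplied by the hourly rental price."""
    return gpu_hours * price_per_hour


cost = training_cost_usd(GPU_HOURS, PRICE_PER_GPU_HOUR)
print(f"estimated training cost: ${cost / 1e6:.3f}M")
```

Because the result scales linearly with the assumed hourly price, a different rental rate changes the estimate proportionally.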
Outside the US and China, open-weights work seems to be driven as much by sovereignty and language coverage as by capability. France’s Mistral is the most frontier-competitive player, and its largest models now ship under Apache 2.0, a rarity at that scale. For most European efforts, though, the motivation is reducing dependence on American and Chinese platforms rather than beating them outright. That goal is supported by public money, most notably EuroHPC’s AI Factories, which give small and medium-sized enterprises (SMEs) and nonprofits free GPU access. In one striking case, a Latvian translation company trained a 30-billion-parameter model using entirely subsidized compute, something with no US equivalent.
The rest of the field tends to fill gaps the giants ignore. Multilingual coverage is a recurring theme. OpenEuroLLM spans the EU’s 24 official languages, Singapore’s SEA-LION covers Southeast Asian languages, and India’s Sarvam handles 22 Indian languages. Licensing approaches vary widely: Canada’s Cohere releases its Command A models for research under a non-commercial (CC-BY-NC) license, requiring a separate agreement for commercial use. Several one-time contenders have simply retreated. Germany’s Aleph Alpha left the frontier race for enterprise sovereignty software, and the UAE’s Falcon pulled back to far smaller models.
Memory-efficient attention, whether DeepSeek-style MLA or the hybrid Mamba-Transformer designs NVIDIA is now shipping, is on track to become standard because it cuts inference cost without sacrificing quality. Reasoning is likely to move earlier in the pipeline, treated as a pretraining objective rather than a post-training patch, and a single checkpoint will increasingly serve both a “thinking” and a “fast” mode instead of a lab shipping two separate models. More of the full stack, including training data, code, and even training-cost disclosures, could ship alongside weights.
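A rough calculation shows where the inference savings from memory-efficient attention come from. The sketch below compares the per-token KV cache of standard multi-head attention, GQA, and an MLA-style compressed cache. The model shape (60 layers, 128 heads of dimension 128), the 8 KV heads for GQA, and the MLA latent and RoPE dimensions are illustrative assumptions rather than the configuration of any specific model, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """MHA/GQA: cache one key and one value vector per KV head in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def mla_cache_bytes_per_token(num_layers, latent_dim, rope_dim, bytes_per_elem=2):
    """MLA-style: cache one compressed latent plus a small RoPE key per layer."""
    return num_layers * (latent_dim + rope_dim) * bytes_per_elem


# Hypothetical model shape, with 2-byte (bf16) cache entries.
LAYERS, HEADS, HEAD_DIM = 60, 128, 128

mha = kv_cache_bytes_per_token(LAYERS, HEADS, HEAD_DIM)
gqa = kv_cache_bytes_per_token(LAYERS, 8, HEAD_DIM)
mla = mla_cache_bytes_per_token(LAYERS, latent_dim=512, rope_dim=64)

for name, size in [("MHA", mha), ("GQA (8 KV heads)", gqa), ("MLA-style", mla)]:
    print(f"{name}: {size / 1024:.1f} KiB per token")
```

With these assumed shapes the cache shrinks from about 3.9 MB per token for full multi-head attention to about 246 KB for GQA and about 69 KB for the MLA-style layout, which is the kind of gap that matters at long context lengths.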
The larger opportunity is organizational rather than technical. Almost every major American open model is a side output of a company whose real product is something else, which means openness is always secondary to a proprietary roadmap. The Allen Institute’s OLMo shows how much a genuinely open-source-first organization can accomplish, but it’s practically alone. There is room in the US for more organizations whose primary mission is the open release itself, especially in the underserved 30-to-70-billion-parameter range where no fully open, architecturally modern, data-released American model yet exists.
Finally, not every opportunity is a bigger model. Some of the most valuable open-source work will be compact, hyper-specific models tuned for a single domain, plus the routers, model-selection architectures, and specialized verifiers for speculative decoding. These systems reward openness, because they need to be inspected, fine-tuned, and freely composed, and they play directly to America’s demonstrated strength in small-model efficiency. A future US open-source ecosystem may compete less on owning the single biggest model and more on offering a rich toolkit of small, interoperable, purpose-built ones.
Is American open-source AI behind China?
Not on openness, and not uniformly on capability. The US produces both one of the most open model families in the world (Ai2’s OLMo) and some of the most widely used open models (Llama, Gemma); however, Qwen (China) now leads global downloads and derivatives. Where American open models lag is top-line benchmark performance. The highest-scoring open-weights models are currently Chinese.
What is the most open American model?
Ai2’s OLMo family (OLMo 2 and the newer OLMo 3). It releases weights, training code, the full training dataset, and hundreds of intermediate checkpoints together under a permissive Apache 2.0 license. This is among the most complete open releases anywhere. The majority of “open” models release weights only.
Mistral has offices in the US. Is Mistral an American open-source project?
No. Mistral is a French company headquartered in Paris. Its US presence is a sales and operational office. Model development is controlled by the French parent. It is the strongest near-frontier open-weights lab outside the US and China.
Why don’t American labs use MLA?
A mix of timing, infrastructure lock-in, hardware privilege, and mission. Meta’s architecture predates MLA’s maturity. NVIDIA chose a competing approach. Google and Microsoft face serving-stack constraints. Underlying all of it, only Ai2 treats open weights as its primary mission, so most US labs adopt architecture on product timelines.
Has NVIDIA’s Mamba-2 bet paid off?
Increasingly, yes. The throughput advantage is well established, and with the Nemotron 3 line the hybrid Mamba-2 design now reaches near-frontier quality. Nemotron 3 Ultra tops the US open-weights benchmarks. What it has not shown is a quality advantage over the best pure-attention models. The top Chinese models still lead the composite leaderboards, and NVIDIA still keeps a small fraction of full attention layers rather than going pure state-space.
Two hundred and fifty years in, the state of American open-source AI is one of genuine leadership paired with self-imposed lag. Many US companies, and much of the country’s talent, are focused on leading the world in closed models, which leaves open-weights work with less attention. The US hosts some of the most open models in the world and some of the most widely used ones, but its strongest labs increasingly reserve their frontier work for closed products. Meta’s 2026 pivot to the proprietary Muse Spark is the clearest example.
The most interesting opportunities are structural rather than incremental. An open-source-first organization, building MLA-first at 30B+ scale, with reasoning in pretraining and the full training stack released, would close most of the gaps identified here at once. The techniques are already public. What is missing is an American lab whose primary mission is to use them.
Andrew is an NLP Scientist with 8 years of experience designing and deploying enterprise AI applications and language processing systems.
