By Andrew Dugan
Senior AI Technical Content Creator II

Paris-based Mistral AI has announced its newest family of open-weight models, including Mistral Large 3 and three smaller dense models. All of the models are released under the Apache 2.0 license, making them openly available for commercial use, self-hosting, and fine-tuning. This tutorial describes the Mistral 3 models, compares them with existing open-source LLM options, outlines potential use cases for each, explains hardware requirements, and demonstrates an example deployment.
The open-weight Mistral 3 models offer strong performance at a relatively low cost. They are token-efficient and sized to deploy with less compute power.
There are four model variants: a large 675B model and three smaller options (14B, 8B, and 3B). All of them have vision capabilities but excel at text analysis.
The smallest options can be deployed on smaller GPUs, such as the NVIDIA RTX 4000 Ada. The largest model needs eight H200s or H100s, depending on precision.
The largest and most capable model in the group is Mistral Large 3 675B. Trained on 3,000 NVIDIA H200s, Mistral Large 3 has a sparse Mixture of Experts architecture with 41B active parameters out of its 675B total, including a 2.5B-parameter vision encoder. On popular LLM benchmarks, it ranks comparably to DeepSeek 3.1 670B and Kimi K2 1.2T. It has vision capabilities for analyzing images, supports a 256k context window, and offers native function calling and JSON output.

According to Mistral, the best use cases for Mistral Large 3 are:

- long document understanding
- powerful daily-driver AI assistants
- agentic applications with tool use
- enterprise knowledge work
- general coding assistance
It is not a dedicated reasoning model and is not optimized for vision tasks, so it may not be the best option for reasoning-heavy use cases or multimodal tasks that demand strong vision capability. It is also quite large, so efficient deployment at scale requires substantial resources.
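Because Mistral Large 3 supports native function calling, it can drive tool use in agentic applications from behind an OpenAI-compatible endpoint. The sketch below is a minimal, illustrative example of declaring a tool through the standard chat completions `tools` parameter. It assumes you already have the model served behind an OpenAI-compatible API (for example with vLLM, as shown later in this tutorial); the endpoint URL, model ID, tool name, and schema are placeholders, and depending on your serving stack, tool calling may require additional server-side configuration.

```python
import requests

# Hypothetical OpenAI-compatible endpoint serving Mistral Large 3 (adjust URL and model ID).
url = "http://your_server_ip:8000/v1/chat/completions"

# Declare one tool the model is allowed to call; the name and schema are illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

data = {
    "model": "mistralai/Mistral-Large-3",  # placeholder model ID
    "messages": [{"role": "user", "content": "Where is order 8142?"}],
    "tools": tools,
}

response = requests.post(url, json=data)
# If the model decides to call the tool, the call appears on the first choice's message.
print(response.json()["choices"][0]["message"])
```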
The other three models in the release are the Ministral small language models of 14B, 8B, and 3B parameters. All three of these smaller models offer vision capabilities and the 256k context window, but they are optimized to run on a wider range of hardware, including laptops and edge devices. Each of these smaller models has been released in base, instruct, and reasoning variants to accommodate fine-tuning, inference, and accuracy-focused use cases, respectively.
Mistral AI’s smaller models tend to minimize unnecessary output tokens in their responses, which lets users get more out of the model for a lower cost. These three smaller models have a strong cost-to-performance ratio compared with other open-weight models.

Given their small size, these Ministral models are useful in architectures where either offline inference capabilities or cost take precedence over accuracy and performance. They can also be useful for parallel architectures that employ multiple small models working together to accomplish larger tasks.
Mistral suggests deploying the Large 3 model on a node of eight H200s in FP8 (8-bit floating point) precision or a node of eight H100s in NVFP4 precision. They recommend the 3B, 8B, and 14B models be deployed with at least 8 GB, 12 GB, and 24 GB of VRAM (video random access memory), respectively.
| Model | Precision | System Requirements |
|---|---|---|
| Large 3 (675B) | FP8 | 8 x H200 |
| Large 3 (675B) | NVFP4 | 8 x H100 |
| Ministral 3 14B | FP8 | 24 GB VRAM |
| Ministral 3 8B | FP8 | 12 GB VRAM |
| Ministral 3 3B | FP8 | 8 GB VRAM |
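These VRAM numbers roughly track the model sizes: at FP8, each parameter occupies one byte, so the weights alone need about as many gigabytes as the model has billions of parameters, plus headroom for the KV cache and activations. The snippet below is a back-of-the-envelope sketch, not an official sizing formula; the 20% overhead factor is an assumption for illustration only.

```python
# Rough VRAM estimate for serving a model at a given precision.
# Back-of-the-envelope sketch; the overhead factor is an assumed illustration,
# not an official Mistral or vLLM sizing rule.

def estimate_vram_gb(params_billions: float, bytes_per_param: float = 1.0, overhead: float = 1.2) -> float:
    """Weights footprint in GB, padded by an assumed overhead for KV cache and activations."""
    return params_billions * bytes_per_param * overhead

for name, size in [("Ministral 3 3B", 3), ("Ministral 3 8B", 8), ("Ministral 3 14B", 14)]:
    print(f"{name}: ~{estimate_vram_gb(size):.1f} GB at FP8 (plus context-length-dependent KV cache)")
```

The estimates land comfortably under the recommended 8 GB, 12 GB, and 24 GB figures, which leave extra headroom for longer contexts and batching.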
For an implementation example, you are going to deploy a Ministral 3 8B Instruct instance on an NVIDIA GPU.
First, sign in to your DigitalOcean account and create a GPU Droplet.
Choose AI/ML-Ready as your image and select any available NVIDIA GPU. The Ministral 3 8B, served with a reduced context length as shown below, can run on an NVIDIA RTX 4000 Ada (20 GB of VRAM), so we will use this option. Add or select an SSH key, and create the DigitalOcean Droplet.
Once the DigitalOcean Droplet is created, you will need to SSH (Secure Shell) into your server instance. Go to your command line and enter the following command, replacing the highlighted your_server_ip placeholder value with the Public IPv4 of your instance. You can find the IP in the Connection Details section of your GPU Instance Dashboard.
ssh root@your_server_ip
You may get a message that reads:
Output
The authenticity of host 'your_server_ip (your_server_ip)' can't be established.
...
Are you sure you want to continue connecting (yes/no/[fingerprint])?
If you do, you can type yes and press ENTER.
Next, ensure Python and pip are installed. While still connected to the Droplet over SSH, install them with the following command.
sudo apt install python3 python3-pip
It may notify you that additional space will be used and ask if you want to continue. If it does, type Y and press ENTER.
If you receive a “Daemons using outdated libraries” message asking which services to restart, you can press ENTER.
After Python has finished installing, install vLLM.
pip install vllm
This package might take a little while to install. After it is finished installing, you will be ready to serve the model.
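If you want to confirm the installation before serving, a quick import check like the one below prints the installed vLLM version. This is a minimal sketch; the filename is arbitrary.

```python
# check_vllm.py -- confirm vLLM imports correctly and print its version (filename is arbitrary)
import vllm

print(vllm.__version__)
```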
Specify exactly which model you want to serve using the model’s ID from Hugging Face, along with a few other parameters to ensure the model loads correctly. In particular, set --max-model-len, which lowers the maximum context length so the model can run on a small GPU.
vllm serve mistralai/Ministral-3-8B-Instruct-2512 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--max-model-len 4096 \
--host 0.0.0.0 --port 8000
The tokenizer_mode, config_format, and load_format flags are all used to ensure the Mistral model loads properly.
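Once the server reports that it is running, you can confirm the endpoint is reachable before writing any application code. The short check below is a minimal sketch that queries vLLM’s OpenAI-compatible /v1/models route with Python’s requests library; run it from another terminal, replacing your_server_ip with your Droplet’s IP (or use localhost on the Droplet itself).

```python
import requests

# List the models exposed by the vLLM OpenAI-compatible server;
# replace your_server_ip with the Droplet's public IP (or localhost on the Droplet).
resp = requests.get("http://your_server_ip:8000/v1/models", timeout=10)
resp.raise_for_status()

for model in resp.json()["data"]:
    print(model["id"])
```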
Once the model is loaded and being served on your instance with vLLM, you can make inference calls to the endpoint using Python locally or from another server. The following example shows how to send a request to the model.
import requests

# The vLLM server exposes an OpenAI-compatible completions endpoint.
# Replace your_server_ip with your Droplet's public IPv4 address.
url = "http://your_server_ip:8000/v1/completions"

data = {
    "model": "mistralai/Ministral-3-8B-Instruct-2512",
    "prompt": "Suggest a short and easy recipe using potatoes and cheese.",
    "max_tokens": 1000
}

# Send the request and print the generated text from the first choice.
response = requests.post(url, json=data)
response_message = response.json()['choices'][0]['text']
print(response_message)
You will see output similar to the following:
Output
Here's a **easy and tasty 5-ingredient recipe** you can try:
### **Loaded Baked Potato Bar Muffins**
#### **Ingredients:**
- 4 large potatoes
- 2 cups shredded cheddar or mozzarella cheese
- 1 cup hot sauce (or sriracha for extra kick)
- ½ cup Greek yogurt or sour cream (optional for creaminess)
- 1 egg (optional, for binding)
- Toppings: Butter, garlic powder, bacon bits, scallions, etc.
#### **Directions:**
1. **Prep Potatoes**: Boil whole potatoes until fork-tender (~15 mins). Drain, halve lengthwise, and scoop flesh into a bowl.
2. **Mix Filling**: Cut potatoes and place in a bowl with cheese, hot sauce, yogurt, and egg (if using). Mash lightly until semi-blended but still chunky.
3. **Load Toppings**: Lightly butter an oven-safe muffin tin or use silicone molds. Drop spoonfuls of mix into each well, layering cheese on top.
4. **Bake**: Set oven to **375°F (190°C)** and bake **25–30 mins** until golden and bubbly (check with a toothpick).
5. **Serve Warm**: Top with extra cheese, butter, and hot sauce if desired! Perfect with a side salad or as a meal prep favorite.
---
**Bonus Tip:** Use starchy potatoes (Russet or Yukon Gold) for fluffier texture. For extra flavor, swap hot sauce for green onions, horseradish, or bacon bits. Enjoy! 🍠🧀
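Because the served model is an instruct variant, you can also talk to it through vLLM’s OpenAI-compatible chat completions route, which applies the model’s chat template for you. The example below is a minimal sketch of the same request in chat form; the payload follows the standard OpenAI chat schema, and the system prompt is illustrative.

```python
import requests

# Chat-style request against the vLLM OpenAI-compatible server;
# replace your_server_ip with your Droplet's public IPv4 address.
url = "http://your_server_ip:8000/v1/chat/completions"

data = {
    "model": "mistralai/Ministral-3-8B-Instruct-2512",
    "messages": [
        {"role": "system", "content": "You are a concise cooking assistant."},
        {"role": "user", "content": "Suggest a short and easy recipe using potatoes and cheese."}
    ],
    "max_tokens": 1000
}

response = requests.post(url, json=data)
print(response.json()["choices"][0]["message"]["content"])
```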
Can a model as small as 3 billion parameters be used for anything practical?
Although 3-billion-parameter models are relatively small, they can still answer questions about basic topics, including recipes, common facts, and a range of elementary-school-level knowledge. They are often weaker at producing consistently formatted output when following instructions, and they are less reliable at deterministic fact finding, whether from their pre-training data or from data in the context. Experiment with the model to find its limitations before deploying it to a production use case.
Can I use Mistral 3 models commercially?
Yes, all Mistral 3 models are released under the Apache 2.0 license, which allows for commercial use, self-hosting, and fine-tuning without restrictions. Mistral AI has some models under alternative licenses with more restrictions, but these Mistral 3 models are available for you to use commercially without issue.
Which specific GPU do I need to run the 14B or 8B variants?
The specific GPU that can handle these models depends on several factors, including the context length you want available and the precision you plan to use. Using the chart above showing minimum VRAM requirements, find a GPU with at least that much VRAM, load the model, and test. The smallest models may even run on a MacBook, depending on its hardware. Inference speed will vary significantly depending on where you deploy them.
Should I use the Mistral 3 models for vision capabilities?
All Mistral 3 models include vision capabilities through a vision encoder, allowing them to analyze images. However, they are optimized for text analysis rather than vision-heavy tasks, so they may not be the best choice for applications requiring extensive image processing.
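If you do want to send an image to one of these models, the OpenAI-compatible chat API accepts image content alongside text. The snippet below is an illustrative sketch only: it assumes the served model and your vLLM configuration accept image inputs, and the image URL is a placeholder.

```python
import requests

# Multimodal chat request; assumes the served model and vLLM config accept image inputs.
url = "http://your_server_ip:8000/v1/chat/completions"

data = {
    "model": "mistralai/Ministral-3-8B-Instruct-2512",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is in this image."},
            # Placeholder URL; point this at an image the server can fetch.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
    }],
    "max_tokens": 300
}

response = requests.post(url, json=data)
print(response.json()["choices"][0]["message"]["content"])
```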
How do I choose between the 3B, 8B, and 14B Ministral models?
This decision should be defined by your app’s requirements and the GPU resources available. It should be determined through an iterative testing process where you deploy a smaller model, begin building and testing the application, and increase the model size to meet your app’s requirements. Larger models generally offer better accuracy and capabilities, but the smaller models provide excellent cost-to-performance ratios and are more token-efficient.
Are Mistral 3 models good for reasoning tasks?
The Mistral Large 3 is not a dedicated reasoning model, so it may not be optimal for complex reasoning use cases. However, the smaller Ministral models offer specialized reasoning variants (3B-Reasoning, 8B-Reasoning, 14B-Reasoning) that are optimized for accuracy-focused applications.
The Mistral 3 family of models offers a new set of open-weight options for applications that prioritize data privacy and control without sacrificing quality and efficiency. They offer competitive text capabilities with some vision capabilities added in. The smaller ones can be deployed on a wide range of machines, including smaller NVIDIA GPUs.
Next, you can use the LLM you deployed for relatively low-cost inference on basic tasks, with complete control over your prompts and inference data. For a production deployment, follow security best practices, including creating a non-root user for your instance and changing your port configuration to remove public access.
Leave a comment below with any follow-up questions or links to projects you’ve built with the Mistral 3 models.
Andrew is an NLP Scientist with 8 years of experience designing and deploying enterprise AI applications and language processing systems.