An Overview of Alibaba’s Multimodal Model: Ovis-U1

Published on July 17, 2025

Introduction

The push toward Artificial General Intelligence (AGI) and human-level task performance is increasingly driven by Multimodal Large Language Models (MLLMs). Combining multiple modalities allows for greater information density in inputs and enhanced capabilities during inference. We’ve covered several recent multimodal image generation models, including OmniGen2, BAGEL, Ming-lite-omni, and ICEdit. In this article, we cover Ovis-U1, an open-source 3-billion-parameter model released by Alibaba’s Ovis team, whose capabilities span understanding multimodal inputs, generating images from text, and editing uploaded images.

Key Takeaways

  • Ovis-U1 is an open-source 3-billion parameter multimodal LLM from Alibaba.
  • Capabilities include multimodal understanding, text-to-image generation, and image editing.
  • The model was trained on a diverse mix of datasets for various tasks (detailed below).
  • The model can be run on a DigitalOcean GPU Droplet or tested on HuggingFace Spaces.

Training Process

Stage 0
  • Trained parameters: Refiner + Visual Decoder
  • Tasks: Text-to-Image Generation
  • Steps / Batch Size / Learning Rate: 500 / 1024 / 1e-4
  • Description: Visual decoder pretraining, starting with random initialization to develop basic image generation capabilities. The visual decoder and refiner generate images from LLM embeddings using text-to-image data.

Stage 1
  • Trained parameters: Adapter
  • Tasks: Understanding, Text-to-Image Generation, Image Editing
  • Steps / Batch Size / Learning Rate: 1.5k / 8192 / 5e-4
  • Description: Adapter pretraining, aligning visual and textual embeddings. The adapter is randomly initialized and trained in this stage across understanding, text-to-image, and image editing tasks.

Stage 2
  • Trained parameters: Visual Encoder + Adapter
  • Tasks: Understanding, Text-to-Image Generation, Image Editing
  • Steps / Batch Size / Learning Rate: 2.6k / 8192 / 1e-4
  • Description: Visual encoder alignment, where both the visual encoder and adapter are fine-tuned to further align visual and textual representations. All three task types are used for training, with the generation task assisting in embedding alignment.

Stage 3
  • Trained parameters: Visual Encoder + Adapter + LLM
  • Tasks: Understanding
  • Steps / Batch Size / Learning Rate: 23 / 2240 / 5e-5
  • Description: Understanding learning, where parameters of the visual encoder, adapter, and LLM are trained on understanding tasks. These parameters are fixed after this stage to preserve understanding capability.

Stage 4
  • Trained parameters: Refiner + Visual Decoder
  • Tasks: Text-to-Image Generation
  • Steps / Batch Size / Learning Rate: 275 / 256 / 5e-5
  • Description: Generation learning, training the refiner and visual decoder to align with the optimized text and image embeddings after the LLM parameters are tuned in Stage 3. This stage shows improved text-to-image performance.

Stage 5
  • Trained parameters: Refiner + Visual Decoder
  • Tasks: Text-to-Image Generation, Image Editing
  • Steps / Batch Size / Learning Rate: 325 / 256 / 5e-5
  • Description: Generation fine-tuning, building on text-to-image capabilities by fine-tuning the decoder for both text-to-image and image editing tasks.
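
To make the staged schedule easier to work with, here is a minimal Python sketch of the same information expressed as a training configuration, plus a helper that freezes everything except the components trained in a given stage. The names (TrainingStage, set_trainable, the component and task labels) are illustrative and simply mirror the table above; they are not the Ovis team's actual training code.

from dataclasses import dataclass

@dataclass
class TrainingStage:
    """One stage of the schedule above (values copied from the table)."""
    stage: int
    trainable: list[str]   # components whose parameters are updated
    tasks: list[str]       # task mix used in this stage
    steps: int
    batch_size: int
    learning_rate: float

# Illustrative encoding of the six-stage schedule described in the table.
SCHEDULE = [
    TrainingStage(0, ["refiner", "visual_decoder"], ["t2i"], 500, 1024, 1e-4),
    TrainingStage(1, ["adapter"], ["understanding", "t2i", "editing"], 1500, 8192, 5e-4),
    TrainingStage(2, ["visual_encoder", "adapter"], ["understanding", "t2i", "editing"], 2600, 8192, 1e-4),
    TrainingStage(3, ["visual_encoder", "adapter", "llm"], ["understanding"], 23, 2240, 5e-5),
    TrainingStage(4, ["refiner", "visual_decoder"], ["t2i"], 275, 256, 5e-5),
    TrainingStage(5, ["refiner", "visual_decoder"], ["t2i", "editing"], 325, 256, 5e-5),
]

def set_trainable(components: dict, stage: TrainingStage) -> None:
    """Freeze every component except those listed as trainable for this stage.

    `components` maps names like "adapter" or "llm" to torch.nn.Module objects.
    """
    for name, module in components.items():
        requires_grad = name in stage.trainable
        for param in module.parameters():
            param.requires_grad = requires_grad

The key point the table makes is that the understanding path (visual encoder, adapter, LLM) is frozen after Stage 3, and only the generation path (refiner and visual decoder) is tuned afterwards.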

Data mix

Let’s take a look at the data used to train the model.

Multimodal understanding
  • Datasets used: COYO, Wukong, Laion-5B, ShareGPT4V, CC3M
  • Additional information: The researchers set up a data preprocessing pipeline that removes noisy data, improves caption quality, and balances data ratios to achieve the best training performance.

Text-to-Image Generation
  • Datasets used: Laion-5B, JourneyDB
  • Additional information: Using Laion-5B, the researchers initially select samples with an aesthetic score exceeding 6. They then use the Qwen2-VL model to create detailed descriptions for each chosen image, resulting in the Laion-aes6 dataset.

Image+Text-to-Image Generation (Image Editing)
  • Datasets used: OmniEdit, UltraEdit, SeedEdit
  • Additional information: Datasets used to improve the model’s image editing capabilities.

Reference-Image-Driven Image Generation
  • Datasets used: Subjects200K, SynCD, StyleBooth
  • Additional information: Subjects200K and SynCD were used to train for subject-driven image generation, and StyleBooth was used to train for style-driven image generation.

Pixel-Level Controlled Image Generation
  • Datasets used: MultiGen_20M
  • Additional information: Used to facilitate canny-to-image (canny = edge detection), depth-to-image, inpainting, and outpainting.

In-House Data
  • Additional information: Additional datasets that incorporated style-driven data, content removal, style translation, de-noise/de-blur data, colourization data, text rendering data, etc.
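
The Laion-aes6 construction described above (keep only LAION samples whose aesthetic score exceeds 6, then re-caption them with Qwen2-VL) is a reproducible pattern on its own. Below is a rough sketch of what such a filtering-and-recaptioning pass could look like with the datasets and transformers libraries. The file name, the aesthetic_score and image column names, and the captioning prompt are assumptions for illustration; the paper only specifies the threshold and the captioning model.

import torch
from datasets import load_dataset
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Hypothetical LAION-style shard with precomputed aesthetic scores; the actual
# pipeline and column names used by the Ovis team are not published.
shard = load_dataset("parquet", data_files="laion_shard.parquet", split="train")
kept = shard.filter(lambda row: row["aesthetic_score"] > 6.0)  # threshold from the paper

caption_model_id = "Qwen/Qwen2-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(caption_model_id)
captioner = Qwen2VLForConditionalGeneration.from_pretrained(
    caption_model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def recaption(image):
    """Ask Qwen2-VL for a detailed description of one image (prompt wording is an assumption)."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(captioner.device)
    output_ids = captioner.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens and decode only the newly generated caption.
    new_tokens = output_ids[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

# Re-caption the surviving samples to build a Laion-aes6-style dataset.
laion_aes6 = kept.map(lambda row: {"caption": recaption(row["image"])})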

What about RL?

In the paper’s conclusion, the authors acknowledge “that Ovis-U1 currently lacks a reinforcement learning stage, which has proven crucial for large model optimization. Developing effective methods to align unified multimodal models with human preferences remains an important open research question in this domain.” We recently covered MMADA, which introduces UniGRPO, and we’re curious whether a similar approach could be applied here. Let us know what you think in the comments.

Now that we’ve gone over the model architecture and training process, let’s run the model on DigitalOcean.

Implementation

Begin by spinning up a GPU Droplet. Once that’s done, clone the repo and install the required packages using the following shell commands in the terminal. Alternatively, you can try out the model on HuggingFace Spaces (AIDC-AI/Ovis-U1-3B).

# Install git-lfs for handling large files
apt install git-lfs

# Clone the Ovis-U1-3B repository from HuggingFace Spaces
git-lfs clone https://huggingface.co/spaces/AIDC-AI/Ovis-U1-3B

# Change directory into the cloned repository
cd Ovis-U1-3B

# Install pip for Python package management
apt install python3-pip

# Install required Python packages from requirements.txt
pip install -r requirements.txt

# Install additional Python packages for wheel and spaces
pip install wheel spaces

# Install PyTorch with CUDA 12.8 support and upgrade existing installations
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --upgrade

# Install xformers for optimized transformer operations
pip install -U xformers

# Install flash_attn for attention mechanism optimization
pip install flash_attn==2.7.4.post1

# Run the main application script
python app.py
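
Running python app.py launches the Gradio demo bundled with the Space. If you would rather call the checkpoint from your own code, the weights are also published as a model repo at AIDC-AI/Ovis-U1-3B. The snippet below is only a loading sketch: it assumes the checkpoint exposes a trust_remote_code entry point through AutoModelForCausalLM, as earlier Ovis releases do, so check the model card and the Space’s app.py for the exact task-specific inference calls.

import torch
from transformers import AutoModelForCausalLM

# Assumption: Ovis-U1-3B loads through AutoModelForCausalLM with trust_remote_code=True,
# like earlier Ovis checkpoints. Verify against the model card and app.py before relying on it.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis-U1-3B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()
model.eval()

# app.py in the cloned Space shows how prompts and images are routed to the
# three task families the unified model serves:
#   * multimodal understanding (image + question -> text)
#   * text-to-image generation (prompt -> image)
#   * image editing            (image + edit instruction -> image)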

Final Thoughts

We’re very excited about MLLMs. The datasets researchers choose to leverage, the architectural modifications they make, and how those choices translate into incremental improvements in capability are fascinating to watch. We encourage you to test the model out. How are you using multimodal models, and what use cases do you care about?

Learn more about DigitalOcean’s AI offerings. We have GPU Droplets you can spin up to train your models/run inference!


About the author

Melani Maheswaran

Melani is a Technical Writer at DigitalOcean based in Toronto. She has experience in teaching, data quality, consulting, and writing. Melani graduated with a BSc and Master’s from Queen's University.
