The advancement of Artificial General Intelligence (AGI) towards human-level task performance is largely driven by Multimodal Large Language Models (MLLMs). Combining multiple modalities allows for greater information density in inputs and enhanced capabilities during inference. We’ve covered several recent multimodal image generation models, such as OmniGen2, BAGEL, Ming-lite-omni, and ICEdit. In this article, we will cover Ovis-U1, an open-source 3-billion-parameter model released by the Alibaba Ovis team, with capabilities that span multimodal understanding, text-to-image generation, and editing of uploaded images.
Stage | Trained Parameters | Task | Steps / Batch Size / Learning Rate | Description |
---|---|---|---|---|
0 | Refiner + Visual Decoder | Text-to-Image Generation | 500 / 1024 / 1e-4 | Visual decoder pretraining, starting from random initialization to develop basic image generation capabilities. The visual decoder and refiner generate images from LLM embeddings using text-to-image data. |
1 | Adapter | Understanding, Text-to-Image Generation, Image Editing | 1.5k / 8192 / 5e-4 | Adapter pretraining, aligning visual and textual embeddings. The adapter is randomly initialized and trained in this stage across understanding, text-to-image, and image editing tasks. |
2 | Visual Encoder + Adapter | Understanding, Text-to-Image Generation, Image Editing | 2.6k / 8192 / 1e-4 | Visual encoder alignment, where both the visual encoder and adapter are fine-tuned to further align visual and textual representations. All three task types are used for training, with the generation task assisting in embedding alignment. |
3 | Visual Encoder + Adapter + LLM | Understanding | 23 / 2240 / 5e-5 | Understanding learning, where the parameters of the visual encoder, adapter, and LLM are trained on understanding tasks. These parameters are fixed after this stage to preserve understanding capability. |
4 | Refiner + Visual Decoder | Text-to-Image Generation | 275 / 256 / 5e-5 | Generation learning, training the refiner and visual decoder to align with the optimized text and image embeddings after the LLM parameters are tuned in Stage 3. This stage shows improved text-to-image performance. |
5 | Refiner + Visual Decoder | Text-to-Image Generation, Image Editing | 325 / 256 / 5e-5 | Generation fine-tuning, building on text-to-image capabilities by fine-tuning the decoder for both text-to-image and image editing tasks. |
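To make the schedule above easier to scan, here is a minimal, illustrative Python sketch that encodes the six stages as plain data. The field names are our own placeholders rather than anything from the Ovis-U1 codebase; the numbers simply mirror the table.

```python
# Illustrative only: the Ovis-U1 training schedule from the table above,
# encoded as plain Python data. Field names are placeholders, not the authors'.
STAGES = [
    {"stage": 0, "trains": ["refiner", "visual_decoder"],
     "tasks": ["t2i"],                "steps": 500,   "batch": 1024, "lr": 1e-4},
    {"stage": 1, "trains": ["adapter"],
     "tasks": ["und", "t2i", "edit"], "steps": 1_500, "batch": 8192, "lr": 5e-4},
    {"stage": 2, "trains": ["visual_encoder", "adapter"],
     "tasks": ["und", "t2i", "edit"], "steps": 2_600, "batch": 8192, "lr": 1e-4},
    {"stage": 3, "trains": ["visual_encoder", "adapter", "llm"],
     "tasks": ["und"],                "steps": 23,    "batch": 2240, "lr": 5e-5},
    {"stage": 4, "trains": ["refiner", "visual_decoder"],
     "tasks": ["t2i"],                "steps": 275,   "batch": 256,  "lr": 5e-5},
    {"stage": 5, "trains": ["refiner", "visual_decoder"],
     "tasks": ["t2i", "edit"],        "steps": 325,   "batch": 256,  "lr": 5e-5},
]

for s in STAGES:
    # Show which modules are unfrozen at each stage and the optimizer settings used.
    print(f"Stage {s['stage']}: train {', '.join(s['trains'])} on {', '.join(s['tasks'])} "
          f"({s['steps']} steps, batch {s['batch']}, lr {s['lr']})")
```

The property worth noticing is that the understanding path (visual encoder, adapter, LLM) is frozen after Stage 3, so the later generation-focused stages cannot degrade understanding performance.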
Let’s take a look at the data used to train the model.
Task | Datasets used | Additional information |
---|---|---|
Multimodal understanding | COYO, Wukong, Laion-5B, ShareGPT4V, CC3M | The researchers set up a data preprocessing pipeline that removes noisy data, improves caption quality, and balances data ratios to achieve the best training performance. |
Text-to-Image Generation | Laion-5B, JourneyDB | From Laion-5B, the researchers first select samples with an aesthetic score exceeding 6, then use the Qwen2-VL model to create a detailed description for each chosen image, forming the Laion-aes6 dataset (sketched in code after this table). |
Image+Text-to-Image Generation: Image Editing | OmniEdit, UltraEdit, SeedEdit | Datasets used to improve the model’s image editing capabilities. |
Image+Text-to-Image Generation: Reference-Image-Driven Image Generation | Subjects200K, SynCD, StyleBooth | Subjects200K and SynCD were used to train for subject-driven image generation, and StyleBooth was used to train for style-driven image generation. |
Image+Text-to-Image Generation: Pixel-Level Controlled Image Generation | MultiGen_20M | Used to facilitate canny-to-image (canny = edge detection), depth-to-image, inpainting, and outpainting. |
Image+Text-to-Image Generation: In-House Data | In-house datasets | Additional datasets incorporating style-driven data, content removal, style translation, de-noise/de-blur data, colourization data, text rendering data, etc. |
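The Laion-aes6 construction described in the table is easy to picture in code. Below is a minimal sketch, assuming a simple record schema; the column names and the `caption_with_qwen2_vl` helper are hypothetical placeholders, not the authors’ actual pipeline. Only the aesthetic-score threshold of 6 comes from the paper.

```python
# Hedged sketch of a Laion-aes6-style pass: keep only high-aesthetic samples,
# then recaption them with a vision-language model. The helper and the record
# schema are hypothetical placeholders, not the authors' actual code.
from typing import Iterable

AESTHETIC_THRESHOLD = 6.0  # threshold reported in the Ovis-U1 paper

def caption_with_qwen2_vl(image_path: str) -> str:
    """Placeholder: call a Qwen2-VL model/endpoint to describe the image in detail."""
    raise NotImplementedError("Wire this up to your Qwen2-VL deployment.")

def build_laion_aes6(records: Iterable[dict]) -> list[dict]:
    """records: dicts with 'image_path' and 'aesthetic_score' keys (assumed schema)."""
    curated = []
    for rec in records:
        if rec["aesthetic_score"] > AESTHETIC_THRESHOLD:
            curated.append({
                "image_path": rec["image_path"],
                "caption": caption_with_qwen2_vl(rec["image_path"]),
            })
    return curated
```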
In the paper’s conclusion, the authors acknowledge “that Ovis-U1 currently lacks a reinforcement learning stage, which has proven crucial for large model optimization. Developing effective methods to align unified multimodal models with human preferences remains an important open research question in this domain.” We recently covered MMaDA, which introduces UniGRPO, and we’re curious whether there’s an application here. Let us know what you think in the comments.
Now that we’ve gone over the model architecture and training process, let’s run the model on DigitalOcean.
Begin by spinning up a GPU Droplet. Once that’s done, clone the repo onto the Droplet and install the required packages. You can do this using the following shell commands in the terminal. Alternatively, you can try out the model on HuggingFace Spaces here.
```bash
# Update package lists and install git-lfs for handling large files
apt update && apt install -y git-lfs

# Clone the Ovis-U1-3B repository from HuggingFace Spaces
git-lfs clone https://huggingface.co/spaces/AIDC-AI/Ovis-U1-3B

# Change directory into the cloned repository
cd Ovis-U1-3B

# Install pip for Python package management
apt install -y python3-pip

# Install required Python packages from requirements.txt
pip install -r requirements.txt

# Install additional Python packages for wheel and spaces
pip install wheel spaces

# Install PyTorch with CUDA 12.8 support and upgrade existing installations
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --upgrade

# Install xformers for optimized transformer operations
pip install -U xformers

# Install flash_attn for attention mechanism optimization
pip install flash_attn==2.7.4.post1

# Run the main application script
python app.py
```
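Running `app.py` launches a Gradio demo. Gradio serves on port 7860 by default, so on a remote Droplet you will typically either forward that port over SSH (for example, `ssh -L 7860:localhost:7860 root@your-droplet-ip`) or adjust the launch call to expose a public share link, then open the UI in your local browser.

If you would rather call the model from Python than through the demo UI, the sketch below shows one way to load the checkpoint. It is a minimal sketch, assuming Ovis-U1-3B loads through the standard Hugging Face `trust_remote_code` path like earlier Ovis releases; the task-specific generation calls for understanding, text-to-image, and editing live in the repo’s custom code and `app.py`, so treat this as a starting point rather than the official usage.

```python
# Minimal sketch: load Ovis-U1-3B via Hugging Face transformers.
# Assumption: the model follows the trust_remote_code pattern used by earlier
# Ovis releases; see the model card and app.py for the task-specific calls.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis-U1-3B",
    torch_dtype=torch.bfloat16,   # bf16 keeps the 3B model comfortably within a single GPU
    trust_remote_code=True,       # the repo ships custom modeling code
).to("cuda").eval()

# Quick sanity check that the weights loaded.
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")
```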
We’re very excited about MLLMs. The datasets researchers choose to leverage, the architectural modifications they make, and how those choices translate into incremental improvements in capability are fascinating. We encourage you to test the model out. How are you using multimodal models, and what use cases do you care about?
Learn more about DigitalOcean’s AI offerings. We have GPU Droplets you can spin up to train your models or run inference!