Since Google introduced Gemini Diffusion at Google I/O, we can’t stop thinking about the use of diffusion models for text. For context, Large Language Models (LLMs) have traditionally been autoregressive: they generate text sequentially, one token at a time, with each new token predicted from the ones before it. Text diffusion models, however, produce outputs by iteratively refining “noise” - starting from a fully masked or corrupted sequence and revising it over a series of steps until a coherent output emerges. The appeal of text diffusion models lies in their low latency, since many tokens can be predicted in parallel at each step, while the outputs remain comparable in quality to those of autoregressive models. They could also prove more cost-effective, given the computational expense of strictly sequential token generation.
A notable paper, Large Language Diffusion Models, offers valuable insight into the use of diffusion models for language. The paper introduces LLaDA, a text diffusion model that inspired the development of MMaDA.
The figure below is from the LLaDA paper and gives a visual representation of the inference process. Darker colours correspond to tokens produced in later sampling stages and lighter colours correspond to earlier predictions.
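To make that process more concrete, here is a minimal, runnable sketch of the kind of iterative unmasking loop the figure depicts. Everything here (the toy model, the mask id, the confidence-based reveal schedule) is an illustrative stand-in rather than LLaDA’s actual implementation; a real model would condition on the prompt and on the tokens revealed so far.

```python
import torch

# Toy masked-diffusion sampler: start fully masked, predict all masked positions
# in parallel, keep the most confident predictions, and re-mask the rest.
VOCAB_SIZE, SEQ_LEN, STEPS = 100, 16, 4
MASK_ID = VOCAB_SIZE  # reserve one extra id for [MASK]

def toy_model(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in denoiser: returns per-position logits over the vocabulary."""
    return torch.randn(tokens.shape[0], VOCAB_SIZE)

tokens = torch.full((SEQ_LEN,), MASK_ID)      # fully masked starting sequence
for step in range(STEPS):
    probs = toy_model(tokens).softmax(dim=-1)
    conf, pred = probs.max(dim=-1)            # confidence and prediction per position
    conf[tokens != MASK_ID] = float("inf")    # tokens revealed earlier stay fixed
    num_revealed = SEQ_LEN * (step + 1) // STEPS
    keep = conf.topk(num_revealed).indices    # reveal the most confident positions
    new_tokens = torch.full_like(tokens, MASK_ID)
    new_tokens[keep] = torch.where(tokens[keep] != MASK_ID, tokens[keep], pred[keep])
    tokens = new_tokens
    print(f"step {step}: {(tokens != MASK_ID).sum().item()}/{SEQ_LEN} tokens revealed")
```

Because all masked positions are predicted in parallel at each step, the number of forward passes depends on the step count rather than the output length, which is where the latency advantage over token-by-token decoding comes from.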
After reading about LLaDA, we were curious about multimodal models that leverage text diffusion and have image-handling capabilities. After all, models that can handle visual inputs in addition to text are considerably more useful than text-only models.
In this article, we’re going to take a look at Multimodal Large Diffusion Language Models (MMaDA), a novel class of models for textual reasoning, multimodal understanding, and text-to-image generation.
There’s a demo available for your exploration on HuggingFace Spaces! We encourage you to check it out. We’ll also show you how you can run MMaDA on DigitalOcean GPU Droplets and discuss our thoughts on the model’s performance.
There are two main parts to this tutorial - an overview and an implementation. The overview assumes familiarity with topics such as Large Language Models (LLMs), tokenization, diffusion models, and multimodality; without this background, some of it may be hard to follow. The implementation section is quite straightforward: we’ll set up a GPU Droplet and show you how to run the code from the command line. Feel free to skip sections that aren’t of use to you.
Multimodal Large Language Models (MLLMs) typically combine autoregressive models for text with diffusion models for images. Multimodal Large Diffusion Language Models (MMaDA), however, takes a modality-agnostic approach to textual reasoning, multimodal understanding, and text-to-image generation. The researchers built MMaDA around a unified diffusion architecture with a shared probabilistic framework, designed to work across data types without modality-specific components. This design allows different forms of information to be integrated and processed within a single model.

They also employ a novel “mixed long chain-of-thought” (CoT) fine-tuning approach that standardizes the CoT format across modalities. By aligning reasoning traces over both text and visual data, this strategy gives the model a cold start for the subsequent reinforcement learning stage, improving its ability to handle complex problems from the outset. Furthermore, the researchers introduce UniGRPO, a policy-gradient reinforcement learning algorithm created specifically for diffusion foundation models. Through diverse reward modeling, UniGRPO unifies post-training across reasoning and generation tasks, leading to consistent improvements in performance.
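To give a feel for what “modality-agnostic” means in practice, here is a rough sketch of a unified masked-token training objective: text tokens and discrete image tokens (from the image tokenizer) are concatenated into one sequence, a random fraction is masked, and a single model is trained to predict the masked positions regardless of modality. The function names, shapes, and masking schedule are our own simplifications, not MMaDA’s actual code.

```python
import torch
import torch.nn.functional as F

VOCAB_SIZE = 10000          # shared vocabulary of text + discrete image tokens
MASK_ID = VOCAB_SIZE        # extra id for [MASK]

def unified_masked_loss(model, text_ids, image_ids, mask_ratio=0.5):
    """One masked-prediction loss over a single modality-agnostic sequence."""
    seq = torch.cat([text_ids, image_ids])       # treat both modalities uniformly
    mask = torch.rand(seq.shape) < mask_ratio    # sample positions to corrupt
    noisy = seq.clone()
    noisy[mask] = MASK_ID
    logits = model(noisy)                        # (seq_len, VOCAB_SIZE) predictions
    return F.cross_entropy(logits[mask], seq[mask])  # score only masked positions

# Usage with stand-in inputs and a random "model":
toy_model = lambda ids: torch.randn(ids.shape[0], VOCAB_SIZE)
text_ids = torch.randint(0, VOCAB_SIZE, (32,))
image_ids = torch.randint(0, VOCAB_SIZE, (64,))
print(unified_masked_loss(toy_model, text_ids, image_ids))
```

The mixed long-CoT fine-tuning and UniGRPO stages then build on this single unified model rather than on separate per-modality components.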
MMaDA builds directly on LLaDA: it was initialized with LLaDA-8B-Instruct’s pretrained weights and uses a pretrained image tokenizer from Show-o.
The datasets used to train the model are listed below.
| Dataset Category | Dataset Name | Purpose |
|---|---|---|
| Foundational Language and Multimodal Data | RefinedWeb | Basic text generation capabilities |
| | ImageNet | Multimodal understanding |
| | Conceptual 12M | Image-text pre-training |
| | Segment Anything (SAM) | Multimodal understanding |
| | LAION-Aesthetics-12M | Image-text dataset |
| | JourneyDB | Generative image understanding |
| Instruction Tuning Data | LLaVA-1.5 | Visual instruction tuning |
| | Stanford Alpaca | Textual instructions |
| | InstructBLIP | Vision-language instruction tuning |
| | Qwen-VL | Vision-language model |
| | mPLUG-Owl2 | Multi-modal instruction tuning |
| | LLaVA-Phi | Efficient multi-modal assistant |
| Reasoning Data | GeoQA | Geometric question answering |
| | CLEVR | Compositional language and visual reasoning |
| | ReasonFlux | Hierarchical LLM reasoning |
| | LIMO | Mathematical and logical reasoning |
| | s1k | Simple test-time scaling |
| | OpenThoughts | Mathematical and logical reasoning |
| | AceMath-Instruct | Math reasoning with post-training |
| | LMM-R1 | 3D LMMs with strong reasoning |
| Reinforcement Learning Data | Mathematical datasets | UniGRPO training |
| | Logical datasets | UniGRPO training |
Begin by setting up a DigitalOcean GPU Droplet: select AI/ML as the image and choose the NVIDIA H100 option.
Once your GPU Droplet finishes loading, you’ll be able to open up the Web Console.
In the web console, copy and paste the following code snippet:
apt install python3-pip python3.10
git clone https://github.com/Gen-Verse/MMaDA
cd MMaDA
Installing the requirements file ensures that your environment has all the dependencies needed to run `app.py`:
pip install -r requirements.txt
python3 app.py
The output will be a Gradio link that you can access within VS Code.
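If `app.py` errors out or falls back to the CPU instead of printing a link, a quick optional sanity check is to confirm that PyTorch can see the GPU:

python3 -c "import torch; print(torch.cuda.is_available())"

This should print `True` on the H100 Droplet; if it prints `False`, PyTorch isn’t picking up the GPU and the installation needs another look.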
In VS Code, click on “Connect to…” in the Start menu.
Choose “Connect to Host…”.
Click “Add New SSH Host…” and enter the SSH command to connect to your droplet. This command is usually in the format ssh root@[your_droplet_ip_address]. Press Enter to confirm, and a new VS Code window will open, connected to your droplet.
You can find your droplet’s IP address on the GPU droplet page.
In the new VS Code window connected to your droplet, open the command search bar, type `>sim`, and select “Simple Browser: Show”.
Paste the Gradio URL from the Web Console into the Simple Browser.
Multimodal and text-to-image generation will require a wandb account. Access is free for students and postdoctoral researchers, but will require a subscription otherwise. If you don’t have a wandb account, remember that you can still try out MMaDA on HuggingFace!
wandb login
python3 inference_mmu.py config=configs/mmada_demo.yaml mmu_image_root=./mmu_validation question='Please describe this image in detail.'
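If you’d like to try multimodal understanding on your own images, you can point `mmu_image_root` at a different folder and change the question. The folder name below is just a placeholder, and we’re assuming the script reads whatever images it finds there, as it appears to do with the bundled `./mmu_validation` examples:

python3 inference_mmu.py config=configs/mmada_demo.yaml mmu_image_root=./my_images question='Please describe this image in detail.'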
python3 inference_t2i.py config=configs/mmada_demo.yaml batch_size=1 validation_prompts_file=validation_prompts/text2image_prompts.txt guidance_scale=3.5 generation_timesteps=15 mode='t2i'
The model incorrectly describes the line in the distance-time graph as straight and misinterprets the motion it represents. Looks like high school physics is not the model’s strong suit. Oof.
The model does correctly identify the flavour of the ice cream in the image we uploaded, though. Nice.
While we were impressed by the speed of image generation, we found that prompt adherence for MMaDA’s text-to-image capabilities could still use some work. We encourage you to play around with the different parameters and let us know your thoughts in the comments below.
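For example, you might re-run the text-to-image script with a higher guidance scale and more generation timesteps (the values below are arbitrary starting points, not recommendations from the authors) to see whether prompt adherence improves at the cost of generation time:

python3 inference_t2i.py config=configs/mmada_demo.yaml batch_size=1 validation_prompts_file=validation_prompts/text2image_prompts.txt guidance_scale=5 generation_timesteps=50 mode='t2i'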
It’s really incredible how fast this field moves. The introduction of Multimodal Large Diffusion Language Models (MMaDA) represents a unique approach to textual reasoning, multimodal understanding, and text-to-image generation. By leveraging a modality-agnostic diffusion framework and building upon innovations like LLaDA, MMaDA offers a potential alternative to traditional autoregressive models, especially concerning latency and computational cost.
While the model demonstrates impressive speed in image generation and an ability to handle multimodal inputs (like identifying ice cream flavors), there are still areas for improvement, such as prompt adherence in text-to-image generation and complex reasoning tasks (as seen with the physics graph example). Despite these current limitations, the underlying diffusion methodology for language and its extension to multimodal contexts are quite interesting. As the field continues to evolve, we anticipate further refinements that will enhance the accuracy and robustness of MMaDA or similar multimodal diffusion models, making it an even more powerful tool for a wide range of applications.