Run BAGEL VLM on a DigitalOcean GPU Droplet

Published on May 22, 2025

By James Skelton

Technical Evangelist // AI Arcanist

The release of GPT-4o’s image generation capabilities on the ChatGPT platform sent shockwaves through the world of AI image generation. This single model, already popular for its standard Vision Language Model (VLM) capabilities, suddenly became able to edit and generate images with higher fidelity than any competitor. OpenAI’s proprietary model quickly rocketed to the top of the Artificial Analysis text-to-image leaderboard thanks to these capabilities.

Since this release, there has been a constant question from the AI community: could an open-source VLM be an image generator as capable as GPT-4o? We’ve covered this topic on the DigitalOcean Tutorials blog with Janus Pro, but that model was decidedly too small to match the likes of even Stable Diffusion 1.5 at image generation tasks, despite its reportedly higher ELO.

This week, the question may have been answered with ByteDance SEED’s newest release: the BAGEL model. This first-of-its-kind open-source unified VLM shows impressive results already. With 14B total (7B active) parameters, it is one of the largest openly released models of its kind. Not only can it describe images with detail akin to some of the best closed-source models, it can also generate and edit images with a high degree of accuracy. This is all thanks to its Mixture-of-Transformer-Experts (MoT) architecture, which selectively activates modality-specific parameters to optimize the results. For more information about how BAGEL works, check out their paper.

In this tutorial, we will show how to run BAGEL on a GPU Droplet using the Jupyter Notebook-based examples from the project’s repository on GitHub. Follow along for instructions on setting up a GPU Droplet to run BAGEL, followed by detailed explanations of the model’s capabilities in practice.

Running BAGEL on DigitalOcean’s GPU Droplets

In this section of the tutorial, we will show how to set up the environment for BAGEL on a GPU Droplet, and then show some examples of images we generated using the code provided.

Set Up the GPU Droplet for BAGEL

To get started, sign into your DigitalOcean account and spin up a GPU Droplet. We recommend either a single NVIDIA H100 or an 8x NVIDIA H100 GPU Droplet for running this model. Follow the instructions in this article to set up your environment for this tutorial, as we will need to have added our SSH keys to the GPU Droplet and installed Jupyter to continue.

Once your GPU Droplet is running, we can SSH into it from our local machine’s terminal. To set up the environment for this demo, paste the following code into your terminal:

apt install python3-pip python3.10-venv
pip install huggingface-hub wheel jupyter
git clone https://github.com/ByteDance-Seed/Bagel
cd Bagel/
vim requirements.txt

Then use Vim to delete line 16, flash_attn==2.5.8, from the file. This prevents a broken install later on. Alternatively, you can open the host machine in your local VS Code window and edit the file there, or script the change with the short Python snippet below. Once the line is removed, we can continue with the installation.
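
Here is a minimal Python sketch of our own that does the same thing; run it from inside the Bagel/ directory (for example with python3):

# Drop the pinned flash_attn line from requirements.txt so pip does not attempt the broken install
with open("requirements.txt") as f:
    lines = [line for line in f if not line.strip().startswith("flash_attn")]

with open("requirements.txt", "w") as f:
    f.writelines(lines)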

pip install -r requirements.txt
pip install git+https://github.com/Dao-AILab/flash-attention

This will complete the installation of all the required packages for this demo. Next, we recommend using the Hugging Face CLI to log in to your Hugging Face account. We can then download the model files for the demo.

huggingface-cli login

# After entering your access token
huggingface-cli download ByteDance-Seed/BAGEL-7B-MoT --local-dir ./BAGEL-7B-MoT

Alternatively, you can download the model files in your notebook later on using the following Python code:

from huggingface_hub import snapshot_download

save_dir = "./BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(cache_dir=cache_dir,
        local_dir=save_dir,
        repo_id=repo_id,
        local_dir_use_symlinks=False,
        resume_download=True,
        allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
        )
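
As a quick sanity check (our own addition, not part of the official notebook), you can list the downloaded files and their sizes to confirm the weights arrived intact:

import os

# Print each downloaded file in the model directory along with its size in GB
for name in sorted(os.listdir(save_dir)):
    path = os.path.join(save_dir, name)
    if os.path.isfile(path):
        print(f"{name}: {os.path.getsize(path) / 1e9:.2f} GB")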

Once your model weights are downloaded, we can open our Jupyter Notebook demo. Paste the following code into the terminal to launch Jupyter Lab.

jupyter lab --allow-root

Then copy the URL from the output and paste it into the Simple Browser in your VS Code window, as shown in the setup tutorial.

Now that we have accessed our Jupyter Lab environment in our local browser, let’s get started with BAGEL. Open the inference.ipynb notebook file and run the first seven code cells; each is labeled to describe its role in the setup process.

import os
from copy import deepcopy
from typing import (
    Any,
    AsyncIterable,
    Callable,
    Dict,
    Generator,
    List,
    NamedTuple,
    Optional,
    Tuple,
    Union,
)
import requests
from io import BytesIO

from PIL import Image
import torch
from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch, init_empty_weights

from data.transforms import ImageTransform
from data.data_utils import pil_img2rgb, add_special_tokens
from modeling.bagel import (
    BagelConfig, Bagel, Qwen2Config, Qwen2ForCausalLM, SiglipVisionConfig, SiglipVisionModel
)
from modeling.qwen2 import Qwen2Tokenizer
from modeling.bagel.qwen2_navit import NaiveCache
from modeling.autoencoder import load_ae
from safetensors.torch import load_file

This first cell loads all the required packages for the demo. Next, we prepare the model configuration and initialize the model with empty weights. Be sure to edit the value on line 1 to reflect the path to your model weights, ./BAGEL-7B-MoT.

model_path = "./BAGEL-7B-MoT"  
# LLM config preparing
llm_config = Qwen2Config.from_json_file(os.path.join(model_path, "llm_config.json"))
llm_config.qk_norm = True
llm_config.tie_word_embeddings = False
llm_config.layer_module = "Qwen2MoTDecoderLayer"

# ViT config preparing
vit_config = SiglipVisionConfig.from_json_file(os.path.join(model_path, "vit_config.json"))
vit_config.rope = False
vit_config.num_hidden_layers = vit_config.num_hidden_layers - 1

# VAE loading
vae_model, vae_config = load_ae(local_path=os.path.join(model_path, "ae.safetensors"))

# Bagel config preparing
config = BagelConfig(
    visual_gen=True,
    visual_und=True,
    llm_config=llm_config, 
    vit_config=vit_config,
    vae_config=vae_config,
    vit_max_num_patch_per_side=70,
    connector_act='gelu_pytorch_tanh',
    latent_patch_size=2,
    max_latent_size=64,
)

with init_empty_weights():
    language_model = Qwen2ForCausalLM(llm_config)
    vit_model = SiglipVisionModel(vit_config)
    model = Bagel(language_model, vit_model, config)
    model.vit_model.vision_model.embeddings.convert_conv2d_to_linear(vit_config, meta=True)

# Tokenizer Preparing
tokenizer = Qwen2Tokenizer.from_pretrained(model_path)
tokenizer, new_token_ids, _ = add_special_tokens(tokenizer)

# Image Transform Preparing
vae_transform = ImageTransform(1024, 512, 16)
vit_transform = ImageTransform(980, 224, 14)

Next, we will load the model for single or multi-GPU inference. Edit the value on line 1 to reflect the amount of VRAM available per GPU on your system (80GiB for an H100); the device map then spreads the model across however many GPUs are detected.

max_mem_per_gpu = "80GiB"  # Modify it according to your GPU setting

device_map = infer_auto_device_map(
    model,
    max_memory={i: max_mem_per_gpu for i in range(torch.cuda.device_count())},
    no_split_module_classes=["Bagel", "Qwen2MoTDecoderLayer"],
)
print(device_map)

same_device_modules = [
    'language_model.model.embed_tokens',
    'time_embedder',
    'latent_pos_embed',
    'vae2llm',
    'llm2vae',
    'connector',
    'vit_pos_embed'
]

if torch.cuda.device_count() == 1:
    first_device = device_map.get(same_device_modules[0], "cuda:0")
    for k in same_device_modules:
        if k in device_map:
            device_map[k] = first_device
        else:
            device_map[k] = "cuda:0"
else:
    first_device = device_map.get(same_device_modules[0])
    for k in same_device_modules:
        if k in device_map:
            device_map[k] = first_device

model = load_checkpoint_and_dispatch(
    model,
    checkpoint=model_path + "/ema.safetensors",
    device_map=device_map,
    offload_buffers=True,
    dtype=torch.bfloat16,
)

model = model.eval()
print('Model loaded')

In the next two cells, we import the InterleaveInferencer class and build our inference pipeline with it. We also fix the random seed so that results are reproducible.

from inferencer import InterleaveInferencer

inferencer = InterleaveInferencer(
        model=model,
        vae_model=vae_model,
        tokenizer=tokenizer,
        vae_transform=vae_transform,
        vit_transform=vit_transform,
        new_token_ids=new_token_ids
)

import random
import numpy as np

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

With that, we have everything set up! The model is loaded onto the GPU and the pipeline is ready for inference. In the next sections, we outline the model’s general capabilities as shown by the provided examples.

Generate images with BAGEL

To generate images, navigate to the section of the notebook labeled “Image Generation”. Run the first code cell to instantiate a dictionary with the parameters for the inference run. Then run the following cell to generate your image. Below is an example using the code from both cells:

inference_hyper=dict(
        cfg_text_scale=4.0,
        cfg_img_scale=1.0,
        cfg_interval=[0.4, 1.0],
        timestep_shift=3.0,
        num_timesteps=50,
        cfg_renorm_min=1.0,
        cfg_renorm_type="global",
)

prompt = '''draw the famous actor Keanu Reeves with long hair and a beard eating Ramen noodles while sitting next to an anthropomorphic bear wearing a bowtie, both laughing, drawing in anime style'''
print(prompt)
print('-' * 10)
output_dict = inferencer(text=prompt, **inference_hyper)
display(output_dict['image'])

The result can be seen below, along with a code snippet used for saving the image.

image generated with BAGEL

# output_dict['image'] is already a PIL Image, so we can save it directly
output_dict['image'].save('output_img_gen.png')

As we can see, the model does an impressive job of capturing the challenging nature of the prompt. It not only placed the subjects correctly, but also understood the desired style and expressions we described. It remains to be seen how BAGEL holds up against larger VLMs like GPT-4o or dedicated image generation models like FLUX.1 Pro, but we are impressed with the results so far.

Generate Images with Thinking

Next, we will take advantage of the VLM’s reasoning capabilities to augment the control of our image generation. In practice, it appears the VLM expands and rewrites the prompt to try and better express the user’s original intent. Below, we can see the code used to generate an image with thinking.

inference_hyper=dict(
        max_think_token_n=1000,
        do_sample=False,
        # text_temperature=0.3,
        cfg_text_scale=4.0,
        cfg_img_scale=1.0,
        cfg_interval=[0.4, 1.0],
        timestep_shift=3.0,
        num_timesteps=50,
        cfg_renorm_min=1.0,
        cfg_renorm_type="global",
)
prompt = '''draw the famous actor Keanu Reeves with long hair and a beard eating Ramen noodles while sitting next to an anthropomorphic bear wearing a bowtie, Keanu is wearing a shirt that says "DigitalOcean", both are laughing, drawing in anime style'''
print(prompt)
print('-' * 10)
output_dict = inferencer(text=prompt, think=True, **inference_hyper)
print(output_dict['text'])
display(output_dict['image'])

This produces the sample thinking text and the subsequent generated image. The edited prompt seems to have become the following:

“”" think Okay, let’s break this down. The user wants an anime-style drawing of Keanu Reeves with long hair and a beard, eating Ramen noodles. He’s sitting next to an anthropomorphic bear wearing a bowtie. Keanu is wearing a shirt that says “DigitalOcean,” and both are laughing. The scene should be vibrant and playful, with a focus on the characters’ expressions and the food. I need to ensure the bear is drawn in a cute, human-like manner, and the shirt text is clearly visible. The overall tone should be light-hearted and fun, with exaggerated anime features like big eyes and dynamic lines. end think “”"

This generated the example image below:

image

As we can see, the image is fairly similar to the original. We would argue that the quality is a bit higher, with greater prompt adherence, as shown by the more anime-like style and the presence of the writing on his shirt. Typically, image generation with thinking seems to outperform the image generation without thinking. We recommend trying these different techniques with your prompts and comparing the results for yourself!
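
If you want to reuse the image generated with thinking, as we do in the image editing example below, save it to disk first. This is a one-line sketch of our own; the filename matches the one the editing cell loads:

# Save the thinking-generated image so the editing example can load it as ./output_img_gen_think.png
output_dict['image'].save('output_img_gen_think.png')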

Image Editing

One of the most exciting prospects of the BAGEL VLM is the ability to edit images, including making changes to subjects, objects, and styles. ChatGPT’s GPT-4o has burst to the top of the image generation scene partially because of its awesome capabilities for image editing, and we suspect BAGEL will follow suit as adoption rises from the open-source community. Below, we can see the code used to run image editing. Edit the value on line 11 to reflect the path to the image you want to edit.

inference_hyper=dict(
        cfg_text_scale=4.0,
        cfg_img_scale=2.0,
        cfg_interval=[0.0, 1.0],
        timestep_shift=4.0,
        num_timesteps=50,
        cfg_renorm_min=1.0,
        cfg_renorm_type="text_channel",
)

image = Image.open('./output_img_gen_think.png')
prompt = 'make his shirt say "DIGITALOCEAN"'
display(image)
print(prompt)
print('-'*10)
output_dict = inferencer(image=image, text=prompt, **inference_hyper)
display(output_dict['image'])

This will run the pipeline. For our example, we used the image we generated from the second task. We instructed the model to edit the original image to have “DigitalOcean” written onto the human character’s shirt. We can view the example below.

image

As we can see, the model mostly succeeded at its task. The spelling is not perfect, and there is some smudging around the new text. Nonetheless, the model retained the features and stylistic traits of the original image while making the edit.
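
If you want to keep the edited result, or refine it further, you can save it and feed it straight back into the pipeline. This is a minimal sketch of our own, and the follow-up prompt is purely illustrative:

# Persist the edited image, then chain a second (hypothetical) edit on top of it
edited = output_dict['image']
edited.save('output_img_edit.png')

followup = inferencer(image=edited, text='make the bowl of ramen larger', **inference_hyper)
display(followup['image'])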

Image Editing with Thinking

By extending its reasoning capabilities to image editing, the model can edit images with a greater understanding of the input’s subject matter and with increased prompt adherence. Let’s look at the code to run image editing with thinking below, using the example provided by the authors of the project.

inference_hyper=dict(
        max_think_token_n=1000,
        do_sample=False,
        # text_temperature=0.3,
        cfg_text_scale=4.0,
        cfg_img_scale=2.0,
        cfg_interval=[0.4, 1.0],
        timestep_shift=3.0,
        num_timesteps=50,
        cfg_renorm_min=0.0,
        cfg_renorm_type="text_channel",
)

image = Image.open('./test_images/octupusy.jpg')
prompt = 'Could you display the sculpture that takes after this design?'

display(image)
print('-'*10)
output_dict = inferencer(image=image, text=prompt, think=True, **inference_hyper)
print(output_dict['text'])
display(output_dict['image'])

After running the example, we should be left with something approximating the image below:

image

As we can see, the rendering of the image’s subject as a 3D sculpture shows a deep understanding of both the prompt and the input image. Through reasoning, the model was able to work out what the task was, extract the subject from the photo, and generate a 3D-sculpture simulacrum of the original.

Image Understanding

Finally, we get to image understanding, the bread and butter of VLM use. Since Janus Pro, we have been watching open-source models closely for their capabilities on these sorts of captioning tasks. Use the code below to generate an image understanding statement about an input image. For this example, we are using the test image meme.jpg provided by the authors of the repo in the test_images directory.

inference_hyper=dict(
        max_think_token_n=1000,
        do_sample=False,
        # text_temperature=0.3,
)

image = Image.open('./test_images/meme.jpg')
prompt = "Can someone explain what’s funny about this meme??"

display(image)
print(prompt)
print('-'*10)
output_dict = inferencer(image=image, text=prompt, understanding_output=True, **inference_hyper)
print(output_dict['text'])

This provided us with the following output for the input image shown:

image

“The humor in this meme comes from the exaggerated change in handwriting style as the exam progresses. At the beginning, the handwriting is clear and legible, indicating a confident and well-prepared student. However, as the exam continues, the handwriting becomes increasingly difficult to read, suggesting that the student is becoming more anxious and less focused. The final part of the meme shows an electrocardiogram (ECG) reading, which humorously implies that the student’s heart rate is racing and their writing is becoming erratic due to stress. This progression from clear handwriting to a chaotic, almost illegible script is a relatable and amusing portrayal of the pressure and anxiety many students feel during exams.”

As we can see from the response, the model is robust both at understanding the subject of the image and at interpreting the complex comedic nature of a meme. This capability goes beyond simple interpretation of the image’s subject matter, showing a deeper understanding of the relationships between the objects in the image. The model also shows a strong ability to read text, which could have interesting applications for tasks like OCR.
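
To probe that text-reading ability, you can reuse the same understanding pipeline with a transcription-style prompt. This is a minimal sketch of our own, not part of the official notebook, and the prompt is just an example:

# OCR-style use of the understanding pipeline: ask the model to transcribe the visible text
image = Image.open('./test_images/meme.jpg')
prompt = "Transcribe all of the text that appears in this image, exactly as written."

output_dict = inferencer(image=image, text=prompt, understanding_output=True, **inference_hyper)
print(output_dict['text'])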

Closing Thoughts

BAGEL is one of the most exciting foundation model releases we have seen in 2025. The model’s ability to generate, edit, and understand images is beyond the capabilities of any open-source competitor. We are eager to see how development of the model progresses, especially as the open-source community adopts it and begins distributing fine-tuned versions.
