
Vision-language models are artificial intelligence (AI) systems designed to understand and process visual and textual data together. They combine the capabilities of computer vision and natural language processing. Trained on large datasets with powerful neural network architectures, they learn complex relationships between images and text, which enables applications such as image captioning, visual question answering, and text-to-image synthesis. This opens up possibilities for human-computer interaction and for intelligent systems that communicate in a more human-like way.
Large Multimodal Models (LMMs) are powerful, but they struggle with high-resolution input and scene understanding. Monkey was recently introduced to address these challenges. Monkey, a vision-language model, divides each input image into uniform patches, with each patch matching the size used in the original training of its vision encoder (e.g., 448×448 pixels).
This design allows the model to handle high-resolution images. Monkey employs a two-part strategy: first, it enhances visual capture through higher resolution; second, it uses a multi-level description generation method to enrich scene-object associations, creating a more comprehensive understanding of the visual data. Together, these improve learning from the data: capturing detailed visuals makes descriptive text generation more effective.
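To make the patching idea concrete, here is a minimal sketch that computes the boxes of uniform 448×448 patches for an image. It only illustrates the general idea and is not Monkey's actual preprocessing code. The function name `patch_boxes`, the example size of 1344×896, and the assumption that the image has already been resized to a multiple of the patch size are ours.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def patch_boxes(width: int, height: int, patch: int = 448) -> List[Box]:
    """Return non-overlapping patch boxes that tile a width x height image.

    Assumption: the image was already resized so that both sides are
    multiples of the patch size.
    """
    if width % patch or height % patch:
        raise ValueError("resize the image so both sides are multiples of the patch size")
    return [
        (left, top, left + patch, top + patch)
        for top in range(0, height, patch)   # rows, top to bottom
        for left in range(0, width, patch)   # columns, left to right
    ]


boxes = patch_boxes(1344, 896)
print(len(boxes))           # 6 patches: 3 columns x 2 rows
print(boxes[0], boxes[-1])  # first and last patch
```

Each box could then be cropped from the image and passed to the vision encoder at the resolution it was trained on.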

The Overall Monkey Architecture (Image Source)
Let’s break down this approach step by step.
This approach improves the model’s ability to understand complex images by combining local detail analysis with a global overview, leveraging advanced techniques like LoRA and cross-attention.
Overall, Monkey offers a sophisticated way to improve resolution and description generation in LMMs by using existing models more efficiently.
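LoRA, mentioned above, adapts a frozen weight matrix by adding a trainable low-rank update instead of retraining the full matrix. The sketch below shows the generic LoRA update W + (alpha / r)·B·A on plain Python lists. It is not taken from the Monkey code base; the helper names `matmul` and `lora_weight` and the toy values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows of A)."""
    r = len(a)
    delta = matmul(b, a)  # low-rank update with the same shape as W
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
b = [[1.0], [0.0]]            # 2x1 trainable factor
a = [[0.0, 2.0]]              # 1x2 trainable factor, so the rank is 1
print(lora_weight(w, b, a, alpha=1.0))  # [[1.0, 2.0], [0.0, 1.0]]
```

Only the small factors A and B are trained, which is why adapters of this kind are cheap to add to an existing encoder.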
To run Monkey and experiment with it, first start a notebook or open a terminal. We recommend using an A4000 GPU to run the model.
The NVIDIA A4000 GPU is a capable graphics card for AI and machine learning workloads, including visual question answering (VQA). Its Ampere architecture offers high throughput and efficiency, making it well suited to the computations required in VQA tasks.
!nvidia-smi
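The notebook command above prints the attached GPU. Outside a notebook, roughly the same check can be sketched from Python with the standard library. The helper name `list_gpus` is hypothetical, and the sketch returns an empty list when `nvidia-smi` is not installed rather than raising an error.

```python
import shutil
import subprocess
from typing import List


def list_gpus() -> List[str]:
    """Return one 'name, total memory' line per visible NVIDIA GPU, or [] if none is found."""
    if shutil.which("nvidia-smi") is None:
        return []  # NVIDIA driver tools are not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # the tool is present but could not query a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```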

Next, run the commands below. They clone the repository and install the dependencies listed in requirements.txt.
git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey
pip install -r requirements.txt
We can run the Gradio demo, which is fast and easy to use:
python demo.py
Alternatively, follow along with the code below.
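from transformers import AutoModelForCausalLM, AutoTokenizer
# Hugging Face checkpoint for the Monkey chat model
checkpoint = "echo840/Monkey-Chat"
# trust_remote_code=True lets Transformers run the custom model code shipped with the checkpoint
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# Pad on the left and reuse the end-of-document token as the padding token
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eod_id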
The code above loads the pre-trained model and tokenizer using the Hugging Face Transformers library.
“echo840/Monkey-Chat” is the name of the model checkpoint we load. The model weights and configuration are loaded with device_map='cuda', which places the model on a CUDA-enabled GPU for faster computation. The tokenizer is set to pad on the left and to use its end-of-document token (eod_id) as the padding token.
# Path to the input image and the question to ask about it
img_path = '/notebooks/quick_start_pytorch_images/image 2.png'
question = "provide a detailed caption for the image"
# The image path goes inside <img></img> tags, followed by the question
query = f'<img>{img_path}</img> {question} Answer: '
inputs = tokenizer(query, return_tensors='pt', padding='longest')
attention_mask = inputs.attention_mask
input_ids = inputs.input_ids
# Greedy decoding: no sampling and a single beam
pred = model.generate(
    input_ids=input_ids.cuda(),
    attention_mask=attention_mask.cuda(),
    do_sample=False,
    num_beams=1,
    max_new_tokens=512,
    min_new_tokens=1,
    length_penalty=1,
    num_return_sequences=1,
    output_hidden_states=True,
    use_cache=True,
    pad_token_id=tokenizer.eod_id,
    eos_token_id=tokenizer.eod_id,
)
# Drop the prompt tokens and decode only the newly generated ones
response = tokenizer.decode(pred[0][input_ids.size(1):].cpu(), skip_special_tokens=True).strip()
print(response)
This code generates a detailed caption, or any other output requested by the prompt, using Monkey. We specify the path where the image is stored and build a query string that contains the image reference and the question asking for a caption. The query is then tokenized with the tokenizer, which converts the input text into token IDs.
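The query string follows a fixed pattern, so it can be wrapped in a small helper when asking several questions about the same image. This is a minimal sketch that mirrors the format used in the code above. The function name `build_query` and the example path are hypothetical.

```python
def build_query(img_path: str, question: str) -> str:
    """Wrap the image path in <img> tags, then append the question and the answer cue."""
    return f"<img>{img_path}</img> {question} Answer: "


# Hypothetical image path and questions, for illustration only
questions = [
    "provide a detailed caption for the image",
    "what text is visible in the image?",
]
for q in questions:
    print(build_query("/notebooks/example.png", q))
```

Each resulting string can be passed to the tokenizer in place of the hand-written query.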
Setting do_sample=False and num_beams=1 selects greedy decoding, which makes the output deterministic. The max_new_tokens and min_new_tokens parameters bound the length of the generated sequence; length_penalty applies only to beam-based generation, so it has no effect with a single beam. After generation, the prompt tokens are dropped and the remaining output tokens are decoded back into human-readable text, skipping any special tokens, to form the final response, which is a caption describing the image. Finally, we print the generated caption.
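To show what greedy decoding does, here is a toy sketch in plain Python. It is not how Transformers implements generation internally. The scoring function `toy_scores` and its five-token vocabulary are hypothetical stand-ins for a real model.

```python
from typing import Callable, List, Sequence


def greedy_generate(
    next_token_scores: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: List[int],
    eos_token_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Repeatedly append the highest-scoring token until EOS or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax, no sampling
        ids.append(next_id)
        if next_id == eos_token_id:
            break
    return ids


def toy_scores(ids: Sequence[int]) -> List[float]:
    """Hypothetical 'model': always prefers the token after the last one, modulo 5."""
    target = (ids[-1] + 1) % 5
    return [1.0 if token == target else 0.0 for token in range(5)]


prompt = [0, 1]
output = greedy_generate(toy_scores, prompt, eos_token_id=4, max_new_tokens=10)
new_tokens = output[len(prompt):]  # strip the prompt, as the decode step above does
print(new_tokens)  # [2, 3, 4]
```

Because the highest-scoring token is always chosen, running the same prompt twice yields the same tokens.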
We first tried the model on a very simple image, a screenshot, and it does fairly well at recognizing what the image shows.

Monkey Output:- The image displays a computer interface with a blue background. The main focus is on a machine with the label “A4000” and the price “$0.76/hr”. The machine is equipped with 45 GB RAM and 8 CPUs. There are also several buttons and icons on the interface, including a blue “STOP MACHINE” button and a white “Auto-shutdown in 11 hours” button. The text “45 GB RAM” and “8 CPUs” are also visible. The overall layout suggests a user-friendly interface for managing virtual machines.

Monkey Output:- In the image, a golden retriever is captured in mid-air, leaping towards a yellow tennis ball that is suspended in the air. The dog’s body is stretched out, with its front paws reaching for the ball, and its tail is raised high in excitement. The background is a lush green lawn, providing a stark contrast to the dog’s white fur. The image is taken from a low angle, making the dog appear larger and more imposing. The dog’s focused expression and the dynamic pose suggest a moment of intense playfulness.

Prompt:- Provide a detailed caption for the image
Monkey Output:- This image is a painting of a city skyline at night. The sky is filled with swirling yellow and blue colors, creating a starry effect. In the foreground, there is a tall building with a clock on top. The painting also features a large tree with a yellow light shining on it. The overall effect is one of tranquility and beauty, reminiscent of the famous “Starry Night” painting by Vincent van Gogh.
We are quite impressed by the detailed descriptions and captions, which capture even minute details of the image.
The image below highlights Monkey’s capabilities in various VQA tasks. Monkey analyzes questions, identifies key image elements, perceives minute text, reasons about objects, and understands visual charts. The figure also demonstrates Monkey’s captioning ability, accurately describing objects and providing summaries.

Monkey’s results on various tasks (Image Source)
In qualitative analysis, Monkey was compared with GPT-4V and other LMMs on the task of generating detailed captions.

Monkey and GPT-4V identified an “Emporio Armani” store in the background, with Monkey providing additional details, such as a woman in a red coat and black pants carrying a black purse. (Image Source)
Further experiments show that in many cases Monkey performs impressively compared to GPT-4V at understanding complex text-based inquiries.

The VQA task comparison results in the figure below show that by scaling up the model size, Monkey achieves significant performance advantages in tasks involving dense text. It not only outperforms QwenVL-Chat [3], LLaVA-1.5 [29], and mPLUG-Owl2 [56] but also achieves promising results compared to GPT-4V [42]. According to the authors, this demonstrates the importance of scaling up model size for performance improvement in multimodal large models and validates the effectiveness of their method.

Monkey’s comparison with GPT-4V, QwenVL-Chat, LLaVA-1.5, and mPLUG-Owl2 on VQA task.