Extracting insights from images has long been a challenge across industries like finance, healthcare, and law. Traditional methods, such as Optical Character Recognition (OCR), have struggled with complex layouts and contextual understanding.
Llama 3.2 Vision, Meta's multimodal AI model, handles image-processing tasks such as Visual Question Answering (VQA) and OCR with the contextual understanding that traditional OCR lacks. By integrating this model with DigitalOcean's cloud infrastructure, this tutorial provides a scalable and efficient way to implement AI-powered image processing.
In this tutorial, you will set up Llama 3.2 Vision on DigitalOcean's cloud infrastructure and use it to extract employee IDs and names from images. We will cover the installation and configuration steps and provide examples of Visual Question Answering and OCR. By the end of this tutorial, you will have a solid understanding of how to apply Llama 3.2 Vision to your own image-processing needs.
Before proceeding, ensure you have:
- A DigitalOcean GPU Droplet (or another Ubuntu server with a GPU large enough for the 11B vision model) that you can reach over SSH.
- A Hugging Face account with an access token and approved access to the gated meta-llama/Llama-3.2-11B-Vision-Instruct model.
- A DigitalOcean Space, along with its access key and secret key.
- A DigitalOcean Managed MySQL database and its connection details.
Connect to your server via SSH:
ssh root@your_server_ip
Run the following commands to create and activate a Python virtual environment, then install PyTorch, the Hugging Face CLI, and Transformers. The huggingface-cli login step will prompt for your Hugging Face access token, which must have access to the gated Llama 3.2 model:
apt install python3.10-venv -y
python3.10 -m venv llama-env
source llama-env/bin/activate
pip install torch torchvision torchaudio
pip install -U "huggingface_hub[cli]"
huggingface-cli login
pip install --upgrade transformers
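Before continuing, it is worth confirming that PyTorch detects the GPU and that Transformers imports cleanly. A quick check, run with python3 inside the activated virtual environment:
import torch
import transformers

print("CUDA available:", torch.cuda.is_available())   # should print True on a GPU Droplet
print("Transformers version:", transformers.__version__)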
Boto3 is required to interact with DigitalOcean Spaces, which is S3-compatible.
pip install flask boto3
pip install mysql-connector-python
Install Nginx to act as a reverse proxy in front of your Flask application (which listens on port 5000 later in this tutorial):
sudo apt install nginx -y
Organize your project as follows:
llama-webapp/
├── app.py # Main Flask app file
├── static/
│ └── styles.css # Optional: CSS file for styling
└── templates/
└── index.html # HTML template for the web page
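The app.py shown below renders templates/index.html with result and image_url variables. The exact markup is up to you; a minimal template might look like this (a sketch you can adapt):
<!DOCTYPE html>
<html>
<head>
  <title>Employee ID Extractor</title>
  <!-- Optional stylesheet from static/ -->
  <link rel="stylesheet" href="{{ url_for('static', filename='styles.css') }}">
</head>
<body>
  <h1>Upload an Employee ID Card</h1>
  <form method="POST" enctype="multipart/form-data">
    <input type="file" name="image" accept="image/*" required>
    <button type="submit">Extract Details</button>
  </form>
  {% if result %}
    <p>Employee ID: {{ result.employee_id }}</p>
    <p>Employee Name: {{ result.employee_name }}</p>
  {% endif %}
  {% if image_url %}
    <img src="{{ image_url }}" alt="Uploaded image" width="300">
  {% endif %}
</body>
</html>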
Below is the Flask app (app.py) that loads the Llama 3.2 model, processes uploaded images, and extracts employee details.
import os
import json
import requests
from PIL import Image
from flask import Flask, request, render_template, session
from transformers import MllamaForConditionalGeneration, AutoProcessor
import boto3
import torch
import mysql.connector
import re

# Initialize Flask app
app = Flask(__name__)
app.secret_key = "your_secret_key"  # Replace with a strong secret key

# Load the Llama 3.2 model and processor
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# DigitalOcean Spaces setup
SPACE_NAME = "your_space_name"
SPACE_REGION = "your_region"
ACCESS_KEY = "your_access_key"
SECRET_KEY = "your_secret_key"

s3 = boto3.client(
    "s3",
    region_name=SPACE_REGION,
    endpoint_url=f"https://{SPACE_REGION}.digitaloceanspaces.com",
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY
)

# DigitalOcean Managed MySQL setup
DB_HOST = "your_mysql_host"
DB_PORT = 25060
DB_NAME = "your_database_name"
DB_USER = "your_username"
DB_PASSWORD = "your_password"

# Function to establish a database connection
def get_db_connection():
    try:
        conn = mysql.connector.connect(
            host=DB_HOST, port=DB_PORT, database=DB_NAME, user=DB_USER, password=DB_PASSWORD
        )
        print("✅ Database connection successful!")
        return conn
    except Exception as e:
        print(f"❌ Error connecting to the database: {e}")
        return None

# Function to extract Employee ID & Name from an image
def extract_employee_details(image_path):
    """Extracts the Employee Name and ID using Llama 3.2 Vision."""
    try:
        image = Image.open(image_path)
        prompt = (
            "Extract the Employee ID and Name from the given image. "
            "Provide output in valid JSON format with keys: 'employee_id' and 'employee_name'."
        )
        messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
        input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
        inputs = processor(image, input_text, return_tensors="pt").to(model.device)

        output = model.generate(**inputs, max_new_tokens=1024)
        raw_result = processor.decode(output[0])

        # Pull the first JSON object out of the model's response
        json_match = re.search(r"\{.*?\}", raw_result, re.DOTALL)
        extracted_data = json.loads(json_match.group(0)) if json_match else {}

        employee_id = str(extracted_data.get("employee_id", "")).strip()
        employee_name = str(extracted_data.get("employee_name", "")).strip()
        return employee_id, employee_name
    except Exception as e:
        print(f"❌ Error extracting employee details: {e}")
        return None, None

# Flask Routes
@app.route("/", methods=["GET", "POST"])
def index():
    """Handles image uploads and extracts the Employee Name & ID."""
    result = None
    image_url = session.get("image_url")

    if request.method == "POST":
        image_file = request.files.get("image")
        if not image_file:
            return "Error: Please upload an image.", 400

        filename = image_file.filename
        image_path = os.path.join("/tmp", filename)
        image_file.save(image_path)

        employee_id, employee_name = extract_employee_details(image_path)
        result = {"employee_id": employee_id, "employee_name": employee_name}

    return render_template("index.html", result=result, image_url=image_url)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
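The route above returns the extracted details to the template but does not persist them yet. To also upload the image to your Space and write the record to the managed MySQL database, you could add a helper along these lines to app.py. This is a sketch: it assumes an employees table (with employee_id, employee_name, and image_url columns) that you create yourself, and a Space that allows public reads.
def save_employee_record(employee_id, employee_name, image_path, filename):
    """Sketch: upload the image to Spaces and insert the extracted details into MySQL."""
    # Upload the image to the Space and build a public URL (assumes public-read is acceptable)
    s3.upload_file(image_path, SPACE_NAME, filename, ExtraArgs={"ACL": "public-read"})
    image_url = f"https://{SPACE_NAME}.{SPACE_REGION}.digitaloceanspaces.com/{filename}"

    # Insert the record; assumes a table such as:
    # CREATE TABLE employees (employee_id VARCHAR(50), employee_name VARCHAR(255), image_url TEXT);
    conn = get_db_connection()
    if conn:
        cursor = conn.cursor()
        cursor.execute(
            "INSERT INTO employees (employee_id, employee_name, image_url) VALUES (%s, %s, %s)",
            (employee_id, employee_name, image_url),
        )
        conn.commit()
        cursor.close()
        conn.close()
    return image_url
You would call save_employee_record() from the POST branch of index() after extract_employee_details(), and store the returned URL in session["image_url"] so the template can display the uploaded image.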
Start the Flask application:
python app.py
Open your browser and visit:
http://your_server_ip:5000
Upload an image, extract employee details, and verify data storage in the database.
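If you added the persistence step sketched above, you can confirm the inserts with a short script that reuses the same connection settings as app.py (again assuming the hypothetical employees table):
import mysql.connector

conn = mysql.connector.connect(
    host="your_mysql_host", port=25060, database="your_database_name",
    user="your_username", password="your_password"
)
cursor = conn.cursor()
cursor.execute("SELECT employee_id, employee_name, image_url FROM employees")
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()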
Llama 3.2 is a state-of-the-art AI model developed by Meta (Facebook) that builds upon its predecessor, Llama 3. It offers improved natural language understanding, better performance in multimodal tasks (including image processing), and enhanced efficiency when integrated with Hugging Face Transformers.
Yes, Llama 3.2 introduces vision models (11B and 90B) that can process and understand images directly, enabling tasks like image captioning, object recognition, and scene interpretation.
Llama 3.2 can assist with image processing tasks such as Visual Question Answering, OCR (for example, reading IDs and names from documents), image captioning, object recognition, and scene interpretation.
You can install the required libraries and load the model using the following steps:
pip install transformers torch torchvision
Then, load the model with:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/llama-3-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
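As a quick sanity check, you can generate a short completion with the loaded model (a minimal sketch; the prompt and token count are arbitrary):
inputs = tokenizer("DigitalOcean is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))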
If you are working with images, you may also want to pair it with one of the transformers library's vision models, such as CLIP:
from transformers import CLIPProcessor, CLIPModel
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
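For example, CLIP can score how well candidate labels describe an image (a short sketch; your_image.jpg and the labels are placeholders):
from PIL import Image

image = Image.open("your_image.jpg")
labels = ["an employee ID card", "a landscape photo"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = clip_model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # image-to-text similarity scores
print(dict(zip(labels, probs[0].tolist())))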
Llama 3.2's vision models can generate high-quality captions for images. The same checkpoint and MllamaForConditionalGeneration class used earlier in this tutorial work here; note that the processor expects an image placeholder in the prompt, which the chat template inserts for you:
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = MllamaForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("your_image.jpg")
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image in one sentence."}]}]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
# Decode only the newly generated tokens to get just the caption
caption = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print("Generated Caption:", caption)
Yes, you can fine-tune Llama 3.2 Vision using Hugging Face’s transformers library with LoRA (Low-Rank Adaptation).
Example fine-tuning setup:
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"]
)
fine_tuned_model = get_peft_model(model, config)
This allows efficient fine-tuning without retraining the entire model.
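You can confirm how small the trainable portion is with peft's built-in helper:
fine_tuned_model.print_trainable_parameters()  # reports trainable vs. total parameter counts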
In this tutorial, you learned how to extract employee IDs and names from images using the Llama 3.2 Vision model. We integrated DigitalOcean Spaces for storing images and used a managed MySQL database for structured data storage. This solution provides an automated way to process and manage employee verification data with AI-powered efficiency.
Continue building with DigitalOcean Gen AI Platform.