
Large Language Models (LLMs) and Vision-Language Models (VLMs) face a persistent challenge: the computational cost of processing long documents. As text length increases, so does the number of tokens, leading to higher memory usage, slower inference, and higher costs. Traditional OCR (Optical Character Recognition) systems convert documents into text tokens, which can quickly exceed context windows and strain computational resources. DeepSeek-OCR addresses this issue with optical context compression, a method that encodes document pages as a compact set of visual tokens instead of lengthy text sequences. This approach reduces token counts by 7–20x while maintaining accuracy, making it a practical solution for large-scale document processing and training data generation.
DeepSeek-OCR is an open-source vision-language model developed by DeepSeek-AI. It consists of two main components: DeepEncoder, which compresses document images into a small set of visual tokens, and DeepSeek-3B-MoE, a decoder that reconstructs the original text from these tokens. The model is designed to balance efficiency and accuracy, achieving competitive results on benchmarks such as OmniDocBench and Fox while using fewer tokens than existing solutions.
This article is part of a series on running OCR models on DigitalOcean, including Dolphin, olm-OCR, rolm-OCR, and smoldocling.

The DeepEncoder component is responsible for compressing high-resolution document images into a manageable number of visual tokens. It uses a two-stage process: a window-attention stage first extracts fine-grained local features from the high-resolution page, and a global-attention stage then models the page as a whole, with a 16x convolutional compressor between the two stages to cut the token count before the more expensive global attention runs.
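The pipeline below is a minimal, illustrative sketch of that compression step, not DeepSeek-OCR's actual code: assuming a 16x convolutional downsampler between the local and global stages, two stride-2 convolutions shrink a 64x64 patch grid (4,096 tokens) down to 16x16 (256 tokens), which matches the Base mode's token budget.

```python
# Illustrative sketch of 16x visual-token compression (hypothetical sizes,
# not the real DeepSeek-OCR DeepEncoder implementation).
import torch
import torch.nn as nn

patch_grid = 64   # e.g. a 1024x1024 page split into 16x16-pixel patches -> 64x64 grid
dim = 1024        # hypothetical embedding width

# Features produced by the local (window-attention) stage.
local_features = torch.randn(1, dim, patch_grid, patch_grid)

# Two stride-2 convolutions halve each spatial side twice: 64 -> 32 -> 16,
# i.e. 4,096 tokens -> 256 tokens (a 16x reduction).
compressor = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
    nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
)

compressed = compressor(local_features)                 # (1, 1024, 16, 16)
visual_tokens = compressed.flatten(2).transpose(1, 2)   # (1, 256, 1024)
print(visual_tokens.shape)                              # these feed the global-attention stage
```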
The decoder is based on DeepSeek’s Mixture-of-Experts (MoE) architecture, which activates only a subset of its 3B parameters during inference (approximately 570M). This is where MoE’s efficiency comes from: each token is routed to a small number of experts, so the decoder delivers performance comparable to larger dense models at a fraction of the compute. The decoder reconstructs the original text from the compressed visual tokens, preserving layout and content where possible.
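To make the routing idea concrete, here is a toy top-k MoE layer with made-up sizes (not DeepSeek-3B-MoE’s real configuration): each token only runs through a handful of the available experts, so only a fraction of the total parameters is active per step.

```python
# Toy top-k Mixture-of-Experts routing (illustrative sizes only).
import torch
import torch.nn as nn

num_experts, top_k, dim, hidden = 64, 6, 1024, 2048

experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
     for _ in range(num_experts)]
)
router = nn.Linear(dim, num_experts)

x = torch.randn(4, dim)                        # 4 token embeddings
scores = router(x).softmax(dim=-1)
weights, idx = scores.topk(top_k, dim=-1)      # each token picks its top-k experts

out = torch.zeros_like(x)
for t in range(x.size(0)):
    for w, e in zip(weights[t], idx[t]):
        out[t] += w * experts[int(e)](x[t])    # only top_k of num_experts run per token

print(f"active expert fraction per token: {top_k / num_experts:.2%}")
```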
DeepSeek-OCR was trained on an extensive and diverse dataset to ensure robust performance across various document types and languages. The training data includes over 30 million PDF pages spanning more than 100 languages, with particular emphasis on Chinese and English. Additionally, the model was trained on OCR 2.0 data comprising 10 million synthetic charts, 5 million chemical formulas, and 1 million geometric figures, which extends its capabilities beyond standard text extraction to handle specialized content such as scientific diagrams and financial charts. This comprehensive training approach enables the model to effectively process a wide range of document types and languages while maintaining strong performance on complex visual elements.
DeepSeek-OCR’s performance varies with compression ratio. At compression levels below 10x, the model achieves approximately 97% OCR precision, effectively reconstructing the original text with minimal loss. At 20x compression, accuracy drops to around 60%, which may still be sufficient for archival or secondary use cases.
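A rough way to reason about this tradeoff is to compare a page’s text-token count with the vision-token budget of each mode. The snippet below is a back-of-the-envelope sketch (the page size is hypothetical; the accuracy figures are the ones quoted above):

```python
# Estimate the compression ratio for a page and flag when it leaves the
# high-precision (<10x) regime described above.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

page_text_tokens = 2500  # hypothetical token count for a dense page
for vision_tokens in (64, 100, 256, 400):
    ratio = compression_ratio(page_text_tokens, vision_tokens)
    regime = "high precision (~97%)" if ratio <= 10 else "lossy (~60% near 20x)"
    print(f"{vision_tokens:>4} vision tokens -> {ratio:5.1f}x compression, {regime}")
```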
On the OmniDocBench benchmark, DeepSeek-OCR outperforms competing models while using fewer tokens. With 100 tokens per page, it surpasses GOT-OCR2.0, which typically uses 256 tokens per page. With fewer than 800 tokens per page, it outperforms MinerU2.0, which typically requires over 6,000 tokens per page.
DeepSeek-OCR is suited for several use cases. For large-scale document digitization, libraries, legal firms, and research institutions can process high volumes of documents efficiently. AI labs can use the model for training data generation to create text-image pairs for LLM pretraining, addressing data scarcity issues. The model’s support for over 100 languages makes it versatile for multilingual document processing in global applications. Additionally, its capability to parse charts, tables, and formulas makes it particularly useful for structured data extraction in technical and financial documents.
```python
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
).eval().cuda().to(torch.bfloat16)

# Run OCR on an image. The model ships a custom infer() helper (loaded via
# trust_remote_code) that handles image preprocessing and generation, following
# the usage shown on the Hugging Face model card.
prompt = "<image>\nFree OCR."
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="document.png",
    output_path="./ocr_output",
    save_results=True,
)
print(result)
```
DeepSeek-OCR offers several modes to match different document types and requirements. The Tiny mode operates at 512x512 resolution with 64 vision tokens, making it suitable for quick previews and low-resolution documents. The Small mode uses 640x640 resolution with 100 tokens for standard documents. The Base mode processes images at 1024x1024 resolution with 256 tokens for high-resolution pages. The Large mode handles 1280x1280 resolution with 400 tokens for complex layouts. Finally, the Gundam mode uses dynamic resolution with 795 or more tokens for multi-column and dense documents.
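If you use the infer() helper from the snippet above, each mode corresponds roughly to its sizing parameters. The mapping below follows the published model card; treat the exact values as assumptions and check the current DeepSeek-OCR README before relying on them.

```python
# Approximate mode-to-parameter mapping for model.infer() (values taken from
# the model card; verify against the current README, as they may change).
MODES = {
    "tiny":   dict(base_size=512,  image_size=512,  crop_mode=False),
    "small":  dict(base_size=640,  image_size=640,  crop_mode=False),
    "base":   dict(base_size=1024, image_size=1024, crop_mode=False),
    "large":  dict(base_size=1280, image_size=1280, crop_mode=False),
    "gundam": dict(base_size=1024, image_size=640,  crop_mode=True),  # dynamic tiling
}

result = model.infer(
    tokenizer,
    prompt="<image>\nFree OCR.",
    image_file="document.png",
    output_path="./ocr_output",
    save_results=True,
    **MODES["gundam"],
)
```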
There are several important considerations when using DeepSeek-OCR. On accuracy versus compression, ratios beyond 10x can reduce accuracy, particularly for dense or low-resolution documents. While the Gundam mode improves handling of multi-column layouts, highly complex documents such as newspapers may still require manual review. On the hardware side, the model requires an NVIDIA GPU with CUDA support for optimal performance.
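A quick environment check before loading the model can save a failed run; the snippet below assumes only that PyTorch is installed.

```python
# Verify that a CUDA-capable NVIDIA GPU is available and supports bfloat16
# before attempting to load DeepSeek-OCR.
import torch

assert torch.cuda.is_available(), "DeepSeek-OCR requires an NVIDIA GPU with CUDA support"
print("GPU:", torch.cuda.get_device_name(0))
print("bfloat16 supported:", torch.cuda.is_bf16_supported())
```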
DeepSeek-OCR is an open-source Vision-Language Model (VLM) developed by DeepSeek-AI, designed for highly efficient document processing. It converts document images into text using a unique optical context compression method to significantly reduce the computational burden.
It uses optical context compression via its DeepEncoder component. Instead of converting an entire page into a long sequence of text tokens, it compresses the visual information into a small set of visual tokens (7–20x fewer than standard text tokens), which are then decoded by the DeepSeek-3B-MoE decoder. This token reduction leads to faster inference and lower memory usage.
The model has a dual-component architecture: DeepEncoder, which compresses document images into a compact set of visual tokens, and the DeepSeek-3B-MoE decoder, which reconstructs the original text from those tokens.
DeepSeek-OCR maintains high accuracy (around 97% OCR precision) at moderate compression ratios (up to 10x). As the compression ratio increases beyond 10x (e.g., to 20x), the accuracy drops to around 60%. Users must select a compression mode that balances their required precision with computational efficiency.
The model was trained on an extensive dataset of over 30 million PDF pages in 100+ languages. Crucially, it was also trained on OCR 2.0 data, which includes millions of synthetic charts, chemical formulas, and geometric figures, enabling it to handle complex and specialized visual elements beyond plain text.
Yes. With training data spanning over 100 languages, including significant emphasis on Chinese and English, DeepSeek-OCR is well-suited for multilingual document processing and global applications.
DeepSeek-OCR supports multiple modes to cater to different document complexities and required resolutions:
| Mode | Resolution | Vision Tokens | Typical Use Case |
|---|---|---|---|
| Tiny | 512x512 | 64 | Quick previews, low-res documents |
| Small | 640x640 | 100 | Standard documents |
| Base | 1024x1024 | 256 | High-resolution pages |
| Large | 1280x1280 | 400 | Complex layouts |
| Gundam | Dynamic | 795+ | Multi-column, dense documents |
Key applications include:

- Large-scale document digitization for libraries, legal firms, and research institutions
- Training data generation, creating text-image pairs for LLM/VLM pretraining
- Multilingual document processing across 100+ languages
- Structured data extraction from charts, tables, and formulas in technical and financial documents
The model’s architecture, featuring DeepEncoder and DeepSeek3B-MoE-A570M, demonstrates practical value in generating training data for LLMs and VLMs. DeepSeek-OCR represents a practical advancement in document processing, offering a way to cut token counts and computational costs while retaining high accuracy at moderate compression ratios. Its combination of optical context compression, multi-resolution support, and open-source availability makes it a valuable tool for applications ranging from archival digitization to AI training data generation.
For those interested in exploring its capabilities, the model is available on GitHub and Hugging Face and can be run on a DigitalOcean GPU Droplet. Its architecture and performance suggest potential for broader applications in AI efficiency and long-context processing.