
Run LLMs with Ollama on H100 GPUs for Maximum Efficiency

Updated on August 6, 2025

AI/ML

By Shaoni Mukherjee

AI Technical Writer


Introduction

This article is a guide to running large language models (LLMs) with Ollama on the NVIDIA H100 GPUs offered by DigitalOcean. DigitalOcean GPU Droplets provide a powerful, scalable solution for AI/ML training, inference, and other compute-intensive tasks such as deep learning, high-performance computing (HPC), data analytics, and graphics rendering. Because these GPUs are designed to handle demanding workloads, GPU Droplets let businesses scale AI/ML operations on demand without paying for capacity they do not need. Offering simplicity, flexibility, and affordability, DigitalOcean’s GPU Droplets are quick to deploy and easy to use, which makes them well suited to developers and data scientists.

With support for NVIDIA H100 GPUs, users can develop, test, deploy, and optimize AI/ML applications without the extensive setup and maintenance typically associated with traditional platforms. Ollama is an open source tool that provides access to a diverse library of pre-trained models, installs easily across operating systems, and exposes a local API for integration into applications and workflows. Users can customize models, take advantage of hardware acceleration, and chat with models interactively.

Key takeaways:

Running large language models with Ollama on an NVIDIA H100 GPU combines an easy-to-use local model deployment tool with one of the most powerful AI accelerators available, allowing even very large models to be served with exceptional speed and throughput.
This guide shows how to provision a DigitalOcean GPU Droplet equipped with an H100 and set up Ollama to load and serve an LLM, so developers can deploy advanced models (like code assistants or chatbots based on Llama 2) on dedicated high-end hardware without complex infrastructure work.
Using an H100 dramatically accelerates model inference (and even fine-tuning tasks) thanks to its advanced Tensor Cores and huge memory; for example, the H100’s support for FP8 precision and massive parallelism translates to lower latency responses and the ability to handle larger batch sizes or context windows compared to using smaller GPUs.
This approach offers the flexibility of cloud scaling—spinning up a powerful GPU when needed—while Ollama provides a straightforward way to run the model, meaning you can harness cutting-edge model performance on demand without requiring deep ML operations expertise.

Prerequisites

Access to H100 GPUs: Ensure you have access to NVIDIA H100 GPUs, either through on-premises hardware or through GPU Droplets by DigitalOcean.
Required Skills: Familiarity with Python and Linux commands.
NVIDIA Driver and CUDA Installed: Ensure the NVIDIA driver and CUDA libraries are installed so that the GPU is visible to Ollama. cuDNN is worth installing if you also plan to run other deep learning frameworks on the same machine.
Sufficient Storage and Memory: Have ample storage and memory available to handle large model weights and datasets.
Basic Understanding of LLMs: A foundational understanding of large language models and their structure, so that you can manage and optimize them effectively.

These prerequisites help ensure a smooth and efficient experience when running LLMs with Ollama on H100 GPUs.

What is Ollama?

Ollama is a tool for downloading and running large language models from its extensive model library, which includes Llama 3.1, Mistral, Code Llama, Gemma, and many more. Ollama bundles model weights, configuration, and data into a single package defined by a Modelfile. It provides a flexible platform for creating, importing, and using custom or pre-existing language models, which makes it well suited to building chatbots, text summarization tools, and similar applications. It emphasizes privacy by running models locally, works on Windows, macOS, and Linux, and is free to use. It also exposes a REST API for real-time interaction, which makes it a good fit for LLM-powered web apps and tools.

Ollama works much like Docker. With Docker, we pull images from a central hub and run them in containers; with Ollama, we pull models from a central library and run them locally. Ollama also lets us customize a model by writing a Modelfile, as in the example below:

FROM llama2

# Set the temperature (higher is more creative, lower is more coherent)
PARAMETER temperature 1

# Set the system Prompt
SYSTEM """
You are a helpful teaching assistant created by Shaoni.
Answer questions asked based on Artificial Intelligence, Deep Learning.
"""

Next, create and run the custom model:

ollama create MLexp -f ./Modelfile
ollama run MLexp
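The same custom model can also be reached through Ollama's local REST API rather than the interactive prompt. The sketch below is a minimal illustration, assuming the server is listening on its default address (http://localhost:11434) and that the MLexp model has already been created; the helper names `build_generate_request` and `generate` are hypothetical and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default local address


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the text from the "response" field."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (needs a running Ollama server and the MLexp model):
# print(generate("MLexp", "Explain backpropagation in two sentences."))
```

Setting "stream" to False asks the server for one JSON object instead of a sequence of partial chunks, which keeps the client code short at the cost of waiting for the full answer.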

The Power of NVIDIA H100 GPUs

The H100 is one of NVIDIA’s most powerful data center GPUs and is designed specifically for artificial intelligence workloads. With 80 billion transistors, compared with roughly 54 billion in the A100, it can process large data sets considerably faster than its predecessor.
AI applications are data hungry and computationally expensive, and the H100 is built to manage workloads of this scale.
The H100 features fourth-generation Tensor Cores and a Transformer Engine with FP8 precision. It triples the double-precision Tensor Core floating-point operations per second (FLOPS) of the previous generation, delivering 60 teraflops of FP64 computing, which is crucial for precise calculations in HPC tasks. It can also perform single-precision matrix-multiply operations at one petaflop of throughput using TF32 precision without requiring any changes to existing code.
The H100 introduces DPX instructions that significantly boost performance for dynamic programming tasks, running up to 7X faster than the A100 and up to 40X faster than CPUs for specific algorithms such as DNA sequence alignment.
Each H100 offers about 3 terabytes per second (TB/s) of memory bandwidth, which keeps its compute units fed and allows large models and datasets to be handled efficiently.
The H100 scales through technologies such as NVLink and NVSwitch™, which allow multiple GPUs to work together effectively.
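To make the memory bandwidth figure concrete, here is a rough back-of-the-envelope sketch. It assumes an 80 GB H100 variant and uses a common rule of thumb, not a measured result: during single-stream text generation every weight is read once per token, so bandwidth divided by weight size gives a loose ceiling on tokens per second. It ignores the KV cache, batching, and kernel overheads, and the function names are hypothetical.

```python
H100_MEMORY_GB = 80        # assumption: 80 GB H100 variant
H100_BANDWIDTH_TB_S = 3.0  # memory bandwidth figure quoted above


def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def max_decode_tokens_per_s(weights_gb: float,
                            bandwidth_tb_s: float = H100_BANDWIDTH_TB_S) -> float:
    """Loose upper bound, assuming each generated token reads all weights once."""
    return bandwidth_tb_s * 1000 / weights_gb


examples = [
    ("8B at 16-bit", 8e9, 16),
    ("70B at 4-bit", 70e9, 4),
    ("70B at 16-bit", 70e9, 16),
]
for label, params, bits in examples:
    size = weight_size_gb(params, bits)
    fits = size < H100_MEMORY_GB
    ceiling = max_decode_tokens_per_s(size)
    print(f"{label}: ~{size:.0f} GB of weights, fits on one GPU: {fits}, "
          f"bandwidth ceiling ~{ceiling:.0f} tokens/s")
```

The estimate shows why quantization matters: a 70B model at 16-bit precision does not fit in a single 80 GB GPU, while the same model at 4-bit does, with room left for the context cache.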

GPU Droplets

DigitalOcean GPU Droplets offer a simple, flexible, and cost-effective solution for your AI/ML workloads. These scalable machines are ideal for reliably running training and inference tasks on AI/ML models. Additionally, DigitalOcean GPU Droplets are well-suited for high-performance computing (HPC) tasks, making them a versatile choice for a range of use cases including simulation, data analysis, and scientific computing. Try the GPU Droplets now by signing up for a DigitalOcean account.


Why Run LLMs with Ollama on H100 GPUs?

Ollama can run on a CPU, but responses are slow. For responsive inference, a GPU such as an NVIDIA card is strongly recommended.

Thanks to its advanced architecture, the H100 offers exceptional computing power, which significantly speeds up LLM inference.
Ollama lets users customize LLMs to meet their specific needs through prompt engineering, few-shot examples, and Modelfile settings, and it can run models that were fine-tuned elsewhere. Pairing Ollama with H100 GPUs shortens response times for developers and researchers.
H100 GPUs, particularly in multi-GPU configurations, have the capacity to handle very large models such as Falcon 180B, which makes them well suited to creating and deploying generative AI tools like chatbots or RAG applications.
H100 GPUs come with hardware optimizations such as Tensor Cores, which significantly accelerate the matrix-heavy operations at the core of LLMs.

Setting Up Ollama with H100 GPUs

Ollama runs on Windows, macOS, and Linux. Here we use Linux commands because GPU Droplets are based on a Linux OS.

Run the command below in your terminal to check the GPU specification.

nvidia-smi
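If you want to check the GPU from a script rather than by eye, nvidia-smi can emit machine-readable CSV. The sketch below is one possible way to wrap it; `parse_gpu_csv` and `query_gpus` are hypothetical helper names, and the sample line is made up for illustration rather than captured from a real Droplet.

```python
import subprocess


def parse_gpu_csv(csv_text: str) -> list[dict]:
    """Parse 'name, memory.total' rows from nvidia-smi CSV output (no header, no units)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name cannot break parsing.
        name, memory_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append({"name": name, "memory_mib": int(memory_mib)})
    return gpus


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


# Illustrative, made-up output line; on a GPU Droplet call query_gpus() instead.
sample = "NVIDIA H100 80GB HBM3, 81559\n"
print(parse_gpu_csv(sample))
```

Keeping the parsing separate from the subprocess call means the parser can be exercised on machines without a GPU.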

Next, install Ollama from the same terminal.

curl -fsSL https://ollama.com/install.sh | sh

This starts the Ollama installation immediately.

Once the installation is done, we can pull any LLM, such as Llama 3.1, Phi 3, Mistral, or Gemma 2, and start working with it.

To run and chat with a model, use the command below, replacing example_model with the name of the model you want. Running a model with Ollama is straightforward, and on the powerful H100, generating a response is fast and efficient.

ollama run example_model

For example, to run the 7B Qwen2 model:

ollama run qwen2:7b

If you see the error "could not connect to ollama app, is it running?", enable and start the Ollama service with the commands below:

sudo systemctl enable ollama

sudo systemctl start ollama
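A script can run the same check before sending any prompts. The sketch below assumes the server listens on its default address, http://localhost:11434, and treats any HTTP 200 response from the root URL as "up"; `is_ollama_up` is a hypothetical helper name.

```python
import urllib.request


def is_ollama_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to the server root succeeds with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # urllib.error.URLError, refused connections, and timeouts are all OSError subclasses.
        return False


if not is_ollama_up():
    print("Ollama is not reachable; try: sudo systemctl start ollama")
```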

Ollama supports a wide range of models. Here are some examples that can be downloaded and used.

Model	Parameters	Size	Command
Llama 3.1	8B	4.7GB	ollama run llama3.1
Llama 3.1	70B	40GB	ollama run llama3.1:70b
Llama 3.1	405B	231GB	ollama run llama3.1:405b
Phi 3 Mini	3.8B	2.3GB	ollama run phi3
Phi 3 Medium	14B	7.9GB	ollama run phi3:medium
Gemma 2	27B	16GB	ollama run gemma2:27b
Mistral	7B	4.1GB	ollama run mistral
Code Llama	7B	3.8GB	ollama run codellama
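As a quick way to read the table against a given GPU, the sketch below filters the listed models by download size. It rests on two assumptions that should be checked for a real deployment: that the download size roughly approximates the GPU memory needed for the weights, and that reserving 20% of memory as headroom for the context cache is adequate. `models_that_fit` is a hypothetical helper.

```python
# (model tag, download size in GB), copied from the table above
MODELS = [
    ("llama3.1", 4.7),
    ("llama3.1:70b", 40),
    ("llama3.1:405b", 231),
    ("phi3", 2.3),
    ("phi3:medium", 7.9),
    ("gemma2:27b", 16),
    ("mistral", 4.1),
    ("codellama", 3.8),
]


def models_that_fit(gpu_memory_gb: float, headroom: float = 0.2) -> list[str]:
    """Return tags whose download size fits in GPU memory after reserving headroom."""
    budget = gpu_memory_gb * (1 - headroom)
    return [tag for tag, size_gb in MODELS if size_gb <= budget]


# Assumption: a single H100 with 80 GB of memory.
print(models_that_fit(80))
```

Under these assumptions, every model in the table except the 405B variant fits on a single 80 GB GPU; the largest model needs a multi-GPU configuration.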

With Ollama, users can run LLMs conveniently without an internet connection once a model has been downloaded, because the model and its dependencies are stored locally. For example, here is a prompt and the model's response:

>>> Write a python code for a fibonacci series.


def fibonacci(n):
    """
    This function prints the first n numbers of the Fibonacci sequence.

    Parameters:
    @param n (int): The number of elements in the Fibonacci sequence to print.

    Returns:
    None
    """

    # Initialize the first two numbers of the Fibonacci sequence.
    a, b = 0, 1

    # Iterate over the range and generate Fibonacci sequence.
    for i in range(n):
        print(a)
        # Update the next number in the sequence
        a, b = b, a + b

# Test function with first 10 numbers of the Fibonacci sequence.
if __name__ == "__main__":
    fibonacci(10)

This python code defines a simple `fibonacci` function that takes an integer argument and prints the first n numbers in the Fibonacci sequence. The Fibonacci sequence starts with 0 and 1, and each subsequent number is the sum of the previous two.

The if __name__ == "__main__": block at the end tests this function by calling it with a parameter value of 10, which prints out the first 10 numbers in the Fibonacci sequence.

Conclusion

Ollama is a generative AI tool for working with large language models locally, offering enhanced privacy, customization, and offline accessibility. It has made working with LLMs simpler, and by letting users explore and experiment with open-source LLMs directly on their own machines, it promotes innovation and a deeper understanding of AI. To access a powerful GPU, consider using DigitalOcean GPU Droplets.

To get started with Python, we recommend checking out this beginner’s guide to setting up your system and preparing to run introductory tutorials.

FAQs

Q1. What is Ollama and how does it help with running LLMs?

Ollama is a lightweight, developer-friendly framework for running large language models (LLMs) locally or on servers. It simplifies model management, supports custom models, and optimizes inference performance with GPU acceleration.

Q2. Why should I use H100 GPUs for running LLMs?

NVIDIA H100 GPUs are specifically designed for AI workloads. They offer massive parallel processing, faster training, and lower inference latency—making them ideal for deploying LLMs at scale with maximum throughput.

Q3. Is Ollama compatible with H100 GPUs out of the box?

Yes, Ollama works well with H100s as long as your system has the necessary NVIDIA drivers and CUDA libraries. If you prefer to run Ollama inside Docker containers, you also need a GPU-enabled container runtime such as the NVIDIA Container Toolkit.

Q4. Can I use Ollama in production environments?

Absolutely. Ollama supports containerized deployment, GPU usage, and REST APIs—making it a great choice for production inference pipelines. It also integrates with popular tools for monitoring and scaling.

Q5. How does Ollama manage memory and GPU usage efficiently?

Ollama loads only what’s needed for inference and can stream model responses, reducing memory overhead. On H100s, it leverages tensor cores and high memory bandwidth for faster, more efficient performance.

Q6. What models can I run using Ollama on H100 GPUs?

You can run a wide range of models including LLaMA 2/3, Mistral, Gemma, and Mixtral. Ollama supports quantized and full-precision models, giving you flexibility based on your performance and quality needs.

Q7. Do I need to fine-tune models to use them with Ollama?

No, you can use pre-trained models directly. However, if you need domain-specific results, Ollama also allows you to run fine-tuned or custom models, for example models fine-tuned with LoRA or imported from GGUF files.
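As Q5 notes, Ollama can stream responses. By default the /api/generate endpoint returns newline-delimited JSON chunks, each carrying a fragment of text in its "response" field and a "done" flag on the final chunk. The sketch below shows one way a client might reassemble them; `collect_stream` is a hypothetical helper, and the sample chunks are hand-written for illustration rather than captured from a server.

```python
import json


def collect_stream(lines) -> str:
    """Join the "response" fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Hand-written sample chunks in the shape of a streamed reply.
sample_lines = [
    '{"model": "qwen2:7b", "response": "Hello", "done": false}',
    '{"model": "qwen2:7b", "response": ", world", "done": false}',
    '{"model": "qwen2:7b", "response": "", "done": true}',
]
print(collect_stream(sample_lines))
```

In a real client, the iterable could be the HTTP response object itself, since it yields one line at a time; printing each fragment as it arrives, instead of joining them at the end, gives the familiar typing effect.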


About the author

Shaoni Mukherjee


With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. Currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.


Category:

Tutorial

Tags:

AI/ML


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
