Splitting LLMs Across Multiple GPUs: Techniques, Tools, and Best Practices

Introduction

LLMs are fundamental to advanced NLP tasks, including conversational AI, text generation, and machine translation. Models with hundreds of billions of parameters deliver exceptional performance but require extensive GPU resources. Traditional single-GPU systems quickly run out of memory when loading or training large language models, so splitting LLMs across multiple GPUs has become essential for operational efficiency. This guide explains how to split and load LLMs across multiple GPUs while addressing GPU memory constraints and improving inference speed. It covers parallelism methods, including model and data parallelism, and explains how they apply to distributed LLM training.

Prerequisites

  • Familiarity with NLP models such as GPT, BERT, and T5 and their applications in natural language processing tasks.
  • Proficiency in writing and running Python scripts with popular deep-learning frameworks.
  • Understanding of GPU acceleration with CUDA and why deep learning benefits from GPU computation.
  • Knowledge of model training and evaluation workflows and their application in prediction tasks.
  • Foundational knowledge of data parallelism, model parallelism, and distributed training principles.

Why Split Large Language Models Across Multiple GPUs?

Modern large language models such as PaLM and Megatron-scale networks contain hundreds of billions of parameters. A single GPU often cannot host such models because the available VRAM (12 GB, 24 GB, or even 80 GB) may be insufficient to hold all model parameters, activations, and optimizer states.
Splitting an LLM across multiple GPUs addresses the bottleneck through two distinct methods:

  • Memory Scalability: Distributing model parameters across multiple GPUs reduces the likelihood of out-of-memory (OOM) errors. Very large models require this approach during both training and inference.
  • Performance Gains: Running computations in parallel significantly improves training and inference speed when implemented properly.

Splitting an LLM remains a fundamental practice for advanced AI workloads, whether you run multi-GPU deep learning on a single machine or distributed LLM training across multiple servers.

Model Parallelism vs. Data Parallelism

There are two main methods for using multiple GPUs with LLMs, each offering specific benefits for different applications:

Data Parallelism
This method runs a full replica of the model on each GPU and assigns each GPU a unique segment of the data to process. During training, each GPU computes gradients on its data subset, and the gradients are then synchronized across all GPUs.
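
As a quick single-machine illustration of the idea, PyTorch’s torch.nn.DataParallel replicates a model on every visible GPU and splits each batch among them (DistributedDataParallel, covered later in this article, is the preferred approach for serious training). A minimal sketch with illustrative sizes:

import torch
import torch.nn as nn

# Toy model standing in for a real network.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

if torch.cuda.device_count() > 1:
    # Replicate the model on every visible GPU and split each batch across them.
    model = nn.DataParallel(model)
model = model.to("cuda")

batch = torch.randn(64, 1024).to("cuda")
output = model(batch)  # each GPU processes a slice of the 64-sample batch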

Model Parallelism
The model parallelism approach splits the model across multiple GPUs, where each GPU handles specific layers or parameters of the model. The model’s parameters can be distributed across GPUs at various granularities, including tensor, layer, and pipeline stages.

Types of Model Parallelism

Model parallelism can be further divided into specific techniques.

Tensor Parallelism
Tensor parallelism involves splitting the weights of each layer across multiple GPUs at the tensor level. For large matrix multiplications, each GPU holds and processes a different slice of the weight matrix independently, and the partial results are then combined.
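
A minimal sketch of the idea in plain PyTorch, assuming two GPUs are available; production frameworks such as Megatron-LM add communication collectives and fused kernels on top of this pattern, and the sizes here are illustrative:

import torch

# Column-parallel linear layer, by hand: the weight matrix is split column-wise
# across two GPUs, each GPU computes a slice of the output, and the slices are
# concatenated at the end.
in_features, out_features = 1024, 4096
x = torch.randn(8, in_features)

full_weight = torch.randn(in_features, out_features)
shards = torch.chunk(full_weight, 2, dim=1)  # two [1024, 2048] column shards

partial_outputs = []
for rank, shard in enumerate(shards):
    device = torch.device(f"cuda:{rank}")
    partial_outputs.append(x.to(device) @ shard.to(device))

# Gather the partial results on one device and concatenate along the feature dim.
y = torch.cat([p.to("cuda:0") for p in partial_outputs], dim=1)
print(y.shape)  # torch.Size([8, 4096])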

Pipeline Parallelism
Pipeline Parallelism distributes different layers across multiple GPUs, so each GPU processes a specific segment of the model. For example, if GPU 0 completes the first batch during the forward pass, it quickly sends the output to GPU 1. That way, GPU 1 can process the data for the next stage, and this cycle continues. By smartly staggering these mini-batches, all the GPUs can operate simultaneously.
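
A minimal sketch of a two-stage pipeline in plain PyTorch, assuming two GPUs are available; real pipeline engines additionally split each batch into micro-batches so both stages stay busy, and the layer sizes here are illustrative:

import torch
import torch.nn as nn

# Toy pipeline: the first half of the network lives on GPU 0, the second half
# on GPU 1; activations are transferred between stages in the forward pass.
class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(8, 1024))  # output lives on cuda:1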

Sharded Data Parallelism
This technique combines data parallelism with parameter sharding (each GPU stores only a portion of the parameters) to reduce memory requirements while maintaining efficient model training.
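
PyTorch’s FullyShardedDataParallel (FSDP) is one implementation of this idea. A minimal sketch, assuming the process group has already been initialized with torch.distributed.init_process_group and that MyModel is a placeholder for your model class (as in the DDP example later in this article):

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes init_process_group(backend="nccl") has already been called and the
# current CUDA device has been set for this process.
model = MyModel().to(torch.cuda.current_device())

# FSDP shards parameters, gradients, and optimizer state across ranks and
# gathers full parameters only for the layers currently being computed.
model = FSDP(model)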

GPU Memory Management: The Hidden Challenge

GPU memory management is frequently the primary performance bottleneck in multi-GPU systems. The naive approach simply pushes model slices onto multiple GPUs while ignoring cross-GPU communication costs and potential memory fragmentation. During multi-GPU LLM inference, careful allocation of layers, tensors, and pipeline stages is essential.

Key considerations:

  • Batch Size: Increasing the batch size improves GPU utilization but risks out-of-memory (OOM) errors without careful management. Profiling tools such as PyTorch’s built-in profiler or external libraries help identify memory hotspots.
  • Activation Checkpointing: Activation checkpointing reduces memory consumption by recomputing the forward pass for selected layers during backpropagation (see the sketch after this list).
  • Offloading: Some frameworks can move currently unused tensors from GPU memory to CPU RAM or NVMe (Non-Volatile Memory Express) storage. This supports very large models but adds transfer latency.
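
As a brief illustration of activation checkpointing, the sketch below wraps a block with torch.utils.checkpoint (available in recent PyTorch versions); the block’s activations are discarded after the forward pass and recomputed during backpropagation. Sizes are illustrative:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Activations inside `block` are not stored; they are recomputed in the backward
# pass, trading extra compute for lower peak memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()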

You can gain comprehensive knowledge about GPU memory fundamentals by reading our Introduction to CUDA tutorial.

Tools and Libraries for Splitting LLMs Across Multiple GPUs

Many open-source frameworks enable multi-GPU deep learning for large models. The following section outlines some leading solutions and their contributions to GPU parallelism in machine learning for large language models.

PyTorch DistributedDataParallel

PyTorch’s DistributedDataParallel (DDP) is one of the most widely used approaches to distributed training for large language models. It provides straightforward gradient synchronization across multiple GPUs and nodes: each process runs the same model on a subset of the data, and gradients are averaged after every training iteration.

Key benefits:

  • Ease of Use: DDP wraps your model and synchronizes gradients automatically.
  • Scalability: Highly scalable for large clusters.
  • Versatile: Works well with single-node multi-GPU and multi-node setups.

HuggingFace Accelerate

HuggingFace’s Accelerate library provides a straightforward way to run multi-GPU inference with minimal code changes. It can shard a model automatically via the device_map="auto" parameter and includes tools for distributed inference. The following example illustrates this:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    torch_dtype=torch.float16,
    device_map="auto" # Automatically distributes across available GPUs
)

Accelerate initially fills up GPU 0 and moves to GPU 1 before proceeding to the remaining GPUs until the entire model is loaded. This method provides an automatic and straightforward solution for loading an LLM across multiple GPUs without manual partitioning. For a more in-depth explanation, check out Multi-GPU on Raw PyTorch with Hugging Face’s Accelerate Library.
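
Once the model is loaded this way, generation works the same as on a single GPU, with Accelerate routing activations between devices. A brief sketch continuing the snippet above (the prompt and generation length are illustrative):

# Continuing from the loading snippet above.
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
inputs = tokenizer("Explain pipeline parallelism in one sentence.", return_tensors="pt").to(model.device)

# generate() runs across the sharded model transparently.
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))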

Ollama Multiple GPUs

The Ollama framework provides a solution for running LLMs with efficient CPU and GPU inference. Ollama can use multiple GPUs by setting environment variables or editing a settings file to specify how the model is partitioned among them. The official Ollama documentation provides detailed instructions for splitting model weights.

Environment Variables for GPU Partitioning
GPU settings are configured by exporting the appropriate environment variables. For example:

export OLLAMA_GPU_COUNT=5
export OLLAMA_GPU_MEMORY_LIMIT=16GB

Here:

  • OLLAMA_GPU_COUNT specifies how many GPUs to use.
  • OLLAMA_GPU_MEMORY_LIMIT defines the upper limit for GPU memory allocation.

For more advanced usage of Ollama, check out how to run LLMs with Ollama on H100 GPUs for maximum efficiency.

vLLM

The vLLM library is a recent development for running large language model inference efficiently. It provides a highly optimized inference engine and introduces PagedAttention to handle the memory demands of the KV (key-value) cache when processing long prompts. vLLM supports distributed inference and serving.

You can use multiple GPUs or machines to serve a model through vLLM when it exceeds the capacity of a single GPU. When initializing a serving instance, users can set the tensor parallel size, as shown in the example below:

from vllm import LLM
# Initialize model with tensor parallelism across 4 GPUs
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4)
# Generate text for multiple prompts in parallel
outputs = llm.generate(["Write a book", "Explain artificial intelligence"])
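
Sampling behavior can be controlled with vLLM’s SamplingParams. A brief sketch continuing the example above (the parameter values are illustrative):

from vllm import SamplingParams

# Constrain output length and add some randomness to the generations.
params = SamplingParams(temperature=0.8, max_tokens=256)
outputs = llm.generate(["Write a book", "Explain artificial intelligence"], params)
for out in outputs:
    print(out.outputs[0].text)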

DeepSpeed

Microsoft developed DeepSpeed as a software library for optimizing large-scale model training. Its ZeRO (Zero Redundancy Optimizer) algorithm partitions model states across GPUs to eliminate memory redundancy.
ZeRO stages include:

  • ZeRO-1: shards optimizer states,
  • ZeRO-2: shards optimizer + gradients,
  • ZeRO-3: shards optimizer + gradients + parameters (the model weights themselves).

DeepSpeed also supports CPU and NVMe offloading through features known as ZeRO-Offload and ZeRO-Infinity.

You can enable ZeRO stage 3 and direct offload_param to use the CPU when required. The following example represents a portion of what you might typically find in a DeepSpeed configuration file.

{
  "train_batch_size": 8,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "cpu" }
  }
}

The DeepSpeed system will manage the distribution of models and gradients following the specified configuration.
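
Inside the training script, the model is wrapped with deepspeed.initialize, which applies this configuration. The sketch below assumes the configuration is saved as ds_config.json and that MyModel, train_loader, and loss_fn are placeholders defined elsewhere:

import deepspeed

model = MyModel()

# deepspeed.initialize wraps the model, builds the ZeRO-partitioned optimizer,
# and applies the sharding/offloading settings from ds_config.json.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for inputs, targets in train_loader:
    inputs = inputs.to(model_engine.local_rank)
    targets = targets.to(model_engine.local_rank)
    loss = loss_fn(model_engine(inputs), targets)
    model_engine.backward(loss)   # DeepSpeed-managed backward pass
    model_engine.step()           # optimizer step plus ZeRO bookkeeping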

Megatron-LM

NVIDIA offers this framework through a GitHub repository; it enables training of massive transformer models such as GPT-2, GPT-3, and T5. Megatron-LM combines tensor parallelism with pipeline parallelism to achieve massive parallel processing capabilities.

It allows users to define the tensor parallel size, which specifies GPU distribution per layer and the pipeline parallel size to control model stage segmentation. Megatron-LM contains advanced techniques beneficial for training models with billions of parameters from scratch.

Distributed Training Across Multiple Machines

Training an LLM across multiple machines requires setting up a distributed environment in which several GPUs operate on each node. PyTorch’s distributed communication backend (typically NCCL) connects all processes seamlessly.

  • Master Node Setup: Identify the master node IP address and port, which will manage all other nodes.
  • Rank and World Size: Each process or node is assigned a rank, and the total number of processes is the world size.
  • Launch: Use torch.distributed.launch or torchrun to initiate multiple processes across different nodes.
  • Network Optimization: A high-bandwidth network such as InfiniBand should be used to reduce synchronization overhead.

Distributed LLM Training with PyTorch DDP: A Minimal Example

PyTorch DistributedDataParallel (DDP) trains a model across multiple GPUs by replicating it on each GPU and synchronizing gradients so that training behaves as if it ran on a single device. The following steps summarize a minimal DDP setup.

1. Initialize the Process Group

You must configure the distributed backend to handle communication between processes. The following example shows how to use the NCCL (NVIDIA Collective Communications Library) backend for GPU operations:

import torch.distributed as dist
dist.init_process_group(backend="nccl")

It initializes a communication group that connects all processes. Each computing process will correspond to a single GPU.

2. Configure the Device and Wrap the Model with DDP

Identify the GPU for the current process by checking the LOCAL_RANK environment variable and then assign the model to that GPU. Then, wrap the model with DistributedDataParallel:

import os
import torch
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = MyModel().to(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

When loss.backward() is executed, DDP synchronizes and averages gradients across all processes.

3. Use a DistributedSampler in the DataLoader

Instead of randomly shuffling data with shuffle=True, use DistributedSampler to ensure each process receives a unique subset of the dataset. For example:

from torch.utils.data import DataLoader, DistributedSampler
sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=sampler)

The method prevents multiple processes from accessing the same training samples.

4. Implement the Training Loop

Within each process, run the usual training loop: retrieve batches from train_loader, move them to the GPU, compute the loss, call loss.backward(), and then optimizer.step().
DDP handles gradient synchronization automatically during the backward pass, with no extra code required. Once training concludes, clean up with dist.destroy_process_group(), as shown at the end of the sketch below.
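
This sketch assumes model, sampler, train_loader, optimizer, and local_rank come from the previous steps, and that num_epochs and loss_fn are defined elsewhere:

for epoch in range(num_epochs):
    # Reshuffle differently each epoch so every rank sees a new data ordering.
    sampler.set_epoch(epoch)
    for inputs, targets in train_loader:
        inputs = inputs.to(local_rank)
        targets = targets.to(local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()   # DDP averages gradients across processes here
        optimizer.step()

# Tear down the process group once training is complete.
dist.destroy_process_group()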

5. Launch Training with torchrun

Launch your training script with the torchrun utility:

torchrun --nproc_per_node=4 train_ddp.py

--nproc_per_node=4:

  • Sets how many processes to launch on each node (machine).
  • Typically, you set this equal to the number of GPUs available on the node.
  • Here, four processes run on the local machine (assuming it has 4 GPUs).

train_ddp.py:

  • The training script for DDP-based distributed training.

  • Inside this script, you should:

    • Initialize the process group (torch.distributed.init_process_group)
    • Wrap the model in torch.nn.parallel.DistributedDataParallel
    • Use a DistributedSampler for your dataset.

Thanks to this setup, each process acquires its appropriate LOCAL_RANK and synchronizes with the other processes through the predefined process group.

Common Errors and Debugging

This section summarizes frequent errors when distributing large language models across multiple GPUs, along with their causes and suggested solutions.

Memory Overflow

  • Description: The most common error when attempting to distribute an LLM across multiple GPUs.
  • Causes: Batch size too large (even with multi-GPU setups, very large batches can push memory usage to its limit); inefficient memory usage (not using mixed precision or gradient checkpointing leads to high memory consumption); incorrect sharding (improperly configured model parallelism can distribute parameters unevenly across GPUs).
  • Debugging Steps / Solutions: Start with a smaller batch size and scale up slowly; enable mixed precision (FP16 or BF16) to reduce the memory footprint; inspect GPU usage with tools like nvidia-smi to identify which GPU exceeds its memory limit first.

Slow Model Synchronization

  • Description: Synchronization overhead can hinder performance when models and gradients are frequently exchanged across GPUs.
  • Causes: High latency or low-bandwidth connections slow parameter updates, affecting both training and inference.
  • Debugging Steps / Solutions: Use high-bandwidth interconnects such as NVLink or InfiniBand; optimize communication with libraries like NCCL; overlap computation and communication (for example, with pipeline parallelism) to reduce idle GPU time.

Inefficient Parallelism

  • Description: Adding more GPUs does not always guarantee speedups, especially if workloads are unbalanced or data transfer is too slow.
  • Causes: Load imbalance (one GPU doing significantly more work than others); suboptimal batch sizes (very small batches cause excessive synchronization overhead); I/O bottlenecks (GPUs remain underutilized if data pipelines can’t keep up).
  • Debugging Steps / Solutions: Profile runtimes by timing each GPU or node to identify bottlenecks; use frameworks that auto-tune batch or chunk sizes for balanced workloads; use fast, distributed storage for multi-machine data access.

Addressing memory overflow, slow synchronization, and inefficient parallelism will significantly improve the stability and performance of your multi-GPU LLM environment.

Multimodal LLM Considerations

Multimodal LLMs, which combine text with images and potentially audio or video, are becoming more common. They are often larger than text-only models, making multi-GPU strategies essential.
Key points:

  • Additional Modalities: Each modality introduces specialized encoder or decoder blocks, which contribute to a higher parameter count.
  • Custom Layers: Image encoders (such as ViT) and audio encoders (such as Wav2Vec2) may need separate distribution strategies.
  • Intermodal Fusion: Fusing textual elements with visual or auditory components may introduce new pipeline stages.
  • Tool Support: Your chosen framework must support multimodal input/output and integrate with large model-parallel setups.

FAQ

Q1: Can LLM run on multiple GPUs?
Yes. Modern deep-learning frameworks such as PyTorch and TensorFlow let you train and run models across multiple GPUs and distributed nodes. Data and model parallelism techniques allow LLMs to operate efficiently across multiple GPUs.

Q2: How to run a model on multiple GPUs?
You can use DistributedDataParallel (PyTorch) or tf.distribute (TensorFlow) to wrap your model for distributed training. Alternatively, Hugging Face Accelerate and DeepSpeed automatically distribute model parameters and data on multiple GPUs.

Q3: Can you use multiple GPUs for machine learning?
Absolutely. Deep learning practitioners often use multiple GPUs to accelerate training and inference processes. Scaling performance is feasible through data distribution or model layer splitting.

Q4: Is it OK to use 2 GPUs at once?
Using two GPUs or more is acceptable and can enhance performance. Two GPUs installed on a single machine can nearly halve the training or inference time when configured correctly, even if you don’t have a cluster.

Q5: How to parallelize LLM?
To parallelize an LLM, you can use data parallelism, model parallelism, or a hybrid of both. Libraries like DeepSpeed, Megatron-LM, and Hugging Face Accelerate make this process more straightforward.

Q6: Can you have multiple LLM?
Yes, you can run multiple LLM instances if you have sufficient computational resources. However, running multiple large-scale models on the same GPUs can create resource conflicts and trigger out-of-memory errors.

Q7: Can I run multiple deep learning models on the same GPU?
Though it is possible to run them on the same GPU, multiple models will deliver better performance when distributed over multiple GPUs. Running multiple models on a single GPU can cause memory limitations and decrease data processing speed.

Q8: What is a double LLM?
The term “double LLM” describes a system architecture that combines two large language models operating together to improve task performance, accuracy, or efficiency. By harnessing the two models’ complementary strengths, this approach assigns different responsibilities to each model.

Q9: Can I train my own LLM?

Training your own LLM is possible with sufficient data, computational resources (such as GPUs), and a reliable training pipeline. Developers can train large language models from scratch with open-source frameworks such as Megatron-LM or DeepSpeed.

Q10: What is the difference between single and multimodal LLM?
A single-modal LLM processes only one input type. Multimodal LLMs can manage various data types, such as text combined with images or audio.

Q11: What is model parallelism, and how does it apply to LLMs?

Model parallelism involves splitting a single model’s parameters across multiple GPUs, with each GPU managing a portion of the network. It is essential for extremely large LLMs because it allows the model to fit into limited GPU memory and supports concurrent processing.

Conclusion

For researchers and developers working on cutting-edge NLP and multimodal AI systems today, splitting large language models across multiple GPUs has become essential.
This article has presented practical methods for handling modern LLMs’ computational and memory limitations. This includes model and data parallelism and tools like DeepSpeed, Hugging Face Accelerate, and Megatron-LM.

Applying proper strategies will enable multi-GPU systems to support scalable training and faster inference times. Future AI development will rest upon efficient GPU memory management and optimized distributed training methods as models evolve to integrate various data types.
Practitioners who understand the architecture of parallelism and use open-source tools while anticipating common performance issues can maximize the potential of large and multimodal LLMs.
