RDMA Explained: The Backbone of High-Performance Computing

Published on June 26, 2025

Introduction

Enterprises, cloud providers, and data centers increasingly depend on high-performance networking and low-latency data transfer to operate efficiently.
Traditional networking methods struggle to keep up with the growing data demands of AI/ML workloads.
That’s where Remote Direct Memory Access (RDMA) comes into play. RDMA allows direct data transfers between two computers’ memories without involving the CPU, operating system, or most network stack components.

In this article, we will provide an overview of RDMA technology, including its functionality and operation, main protocols, comparisons with TCP/IP networks, practical applications, and common pitfalls. Systems engineers, cloud architects, and anyone passionate about internet technology will find this guide a valuable resource on how RDMA can revolutionize network infrastructure.

Key Takeaways

  • RDMA enables ultra-fast, low-latency data transfers by allowing direct memory access between servers, bypassing the CPU and OS.
  • RDMA achieves much lower latency, higher throughput, and CPU efficiency than traditional TCP/IP networking.
  • Using RDMA requires specialized network interface cards, appropriate network configuration, and RDMA-aware applications.
  • RDMA technology is used in various fields, such as high-performance computing, artificial intelligence, cloud storage, virtualization, and real-time analytics.
  • Getting the best performance out of RDMA requires careful hardware and software configuration and ongoing tuning.

What is Remote Direct Memory Access?

RDMA enables direct data transfers from one computer’s memory to another without involving the remote system’s CPU or operating system.
Traditional networking systems, like TCP/IP, process data packets by moving them through each layer of the OS networking stack. This involves copying data between buffers at every stage and consuming CPU resources for each packet.

RDMA in Action: The “Zero-Copy” Concept

Traditional networking:

  • Data transitions from user space into kernel space, travels through the network stack, and returns to user space on the receiving end.
  • Each copy and stack traversal operation increases latency and consumes CPU processing power.


RDMA networking:

  • Data moves directly between the two machines' memory with no intermediate copies ("zero-copy" networking); a minimal sketch follows this list.
  • Once the initial setup is complete, the operating system is no longer involved in the data path.
  • The CPU can devote its resources to application logic.
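
To make the zero-copy idea concrete, below is a minimal, illustrative sketch using the libibverbs C API: the application registers one of its own buffers with the RNIC so the hardware can later read and write that memory directly, with no kernel copies on the data path. The buffer size, access flags, and build command are assumptions for illustration, and error handling is abbreviated.

/* Illustrative sketch: registering a buffer for zero-copy RDMA with libibverbs.
 * Build (assumption): gcc reg_mr.c -o reg_mr -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);  /* first RNIC on the host */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);               /* protection domain */

    /* The application owns this buffer; the RNIC will DMA directly into and
     * out of it, so the kernel never copies the payload. */
    size_t len = 4096;
    void *buf = malloc(len);

    /* Registration pins the pages and returns the keys (lkey/rkey) the RNIC
     * needs to access them. This is the one-time setup the OS takes part in;
     * afterwards, data transfers bypass the kernel entirely. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("registered %zu bytes: lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

The rkey printed here is what a remote peer would use to target this buffer in one-sided RDMA read or write operations.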

RDMA Architecture: How RDMA Works

The following describes the internal workflow of RDMA data transfer between two systems—Requester (initiator) and Responder (receiver):


Key Components

  • CPU: Manages the application logic and posts work queue entries (WQEs).
  • PCIe: The interface that connects the CPU’s memory to the network interface card (RNIC).
  • WQE Cache / DMA Engine: Offloads data transfer tasks from the CPU and directly interacts with memory to move data efficiently.
  • ORT (Outstanding Request Table), Reordering Buffer, Bitmaps, Transport: Maintain message order, ensure reliable data transport, and reassemble data within the RNIC.
  • Receiving Buffer / Basic NIC: The final storage for incoming data on the responder’s RNIC.

Workflow Steps

  1. The user submits a work request to the send queue; the RNIC retrieves this request and temporarily stores it in a local cache.
  2. Request processing continues with the RNIC parsing the request, converting the virtual address to a physical one using a memory translation table, and logging it in a table for quick retransmission if needed.
  3. Incoming data is queued into the receiving buffer, where any out-of-order packets are identified and reordered as necessary.
  4. Upon receipt of a data packet, the RNIC matches it to the corresponding receive work request, fetches the destination buffer address, and transfers the reordered data directly into host memory.
  5. It then sends an acknowledgment (ACK) back to the requester and informs the user application that the data has been received.
  6. Finally, when the ACK is received, the RNIC updates the completion queue to indicate that the original send request has been completed.

Summary

  • The Requester posts a send request, and the RNIC takes over data movement; the CPU is no longer involved after setup (see the verbs sketch below).
  • Data moves straight from the Requester’s memory to the Responder’s memory using zero-copy networking.
  • This architecture uses hardware acceleration, skipping most OS and CPU layers for high-performance and low-latency data transfer.
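
The steps above map almost directly onto the verbs API that RDMA applications program against. The fragment below is an illustrative requester-side sketch only: it assumes a connected queue pair, its completion queue, and a registered memory region already exist (their setup is omitted), and the helper function names are ours, not part of the libibverbs API.

/* Illustrative requester-side sketch of the workflow above, using libibverbs.
 * Assumes a connected queue pair (qp), its completion queue (cq), and a
 * registered memory region (mr) have already been set up.
 * Build (assumption): gcc -c post_and_poll.c -libverbs */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Step 1: post a work queue entry (WQE) describing an RDMA write from local
 * memory to a remote buffer identified by (remote_addr, rkey). */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    uint64_t remote_addr, uint32_t rkey, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,          /* local source buffer */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;                 /* echoed back in the completion */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE; /* one-sided: responder CPU not involved */
    wr.send_flags          = IBV_SEND_SIGNALED; /* ask for a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);     /* hands the WQE to the RNIC */
}

/* Step 6: after the ACK, the RNIC reports completion through the completion
 * queue; the application simply polls for it. */
int wait_for_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);            /* non-blocking poll */
    } while (n == 0);
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;
    return 0;                                    /* data is in remote memory */
}

IBV_WR_RDMA_WRITE is a one-sided operation, which is why the responder’s CPU never touches the transfer once its memory has been registered and advertised.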

RDMA Protocols: InfiniBand, RoCE, and iWARP

Understanding RDMA protocols will allow you to choose the optimal solution for your environment. RDMA protocols offer unique benefits and are suited for specific operational requirements.


InfiniBand

The InfiniBand protocol delivers the highest performance of the RDMA protocols, offering very low latency and high throughput. High-performance computing environments rely heavily on InfiniBand, which powers many of the world’s leading supercomputers.

Key characteristics of InfiniBand:

  • Native RDMA support with hardware-level reliability
  • Extremely low latency (sub-microsecond)
  • High bandwidth, with link speeds exceeding 200 Gbps
  • Requires specialized InfiniBand switches and adapters
  • Higher cost compared to Ethernet-based solutions

RDMA over Converged Ethernet (RoCE)

RoCE delivers RDMA functionality over standard Ethernet networks. This results in lower costs and improved accessibility when compared to InfiniBand. There are two versions of RoCE:

  • RoCEv1: Operates at Layer 2 (the link layer) and is limited to a single broadcast domain.
  • RoCEv2: Operates at Layer 3 (the network layer), so packets can be routed across subnets, making it suitable for larger, routed networks.

RoCE provides RDMA benefits for existing Ethernet setups, which creates an attractive solution for data centers wanting to enhance their network performance without replacing their existing infrastructure.

iWARP

The Internet Wide Area RDMA Protocol (iWARP) provides RDMA functionality over standard Ethernet networks using TCP/IP. The protocol wraps RDMA traffic inside TCP/IP, providing direct memory access over ordinary routed networks without requiring lossless Ethernet or specialized switches (iWARP-capable NICs are still needed).

Key characteristics of iWARP:

  • Works with standard Ethernet switches
  • Leverages existing TCP/IP infrastructure
  • Provides reliability through TCP’s error recovery mechanisms
  • Demands more memory resources than other RDMA protocols

RDMA vs TCP/IP: A Comprehensive Comparison

RDMA differentiates itself from traditional TCP/IP networking through its performance features, making it ideal for high-performance applications that require low latency. The following table contrasts their main attributes:

| Characteristic | Traditional TCP/IP | RDMA |
| --- | --- | --- |
| Latency | Multiple context switches and protocol processing; higher latency (typically tens of microseconds or more) | Direct memory access; ultra-low latency (as low as 1–2 microseconds) |
| Throughput | Requires multiple threads and high CPU usage to approach maximum bandwidth | Single-threaded throughput up to 40 Gbps with minimal CPU usage |
| CPU Utilization | Significant CPU resources required, especially at high data rates | Near-zero CPU usage for data transfer operations |
| Protocol Overhead | High overhead for error correction, congestion control, and packet sequencing | Hardware-level management reduces protocol overhead |
| Scalability | Bottlenecks emerge as node count increases due to CPU and protocol limitations | Scales efficiently, with consistent performance as nodes increase |

RDMA Hardware Requirements and Compatibility

Below are essential hardware components:

  • RDMA NIC: RDMA deployments rely on network interface cards that support RDMA functionality. The cards must support the InfiniBand, RoCE, or iWARP protocol; most deployments use RDMA-capable NICs from NVIDIA (Mellanox), Intel, or Chelsio.
  • Network Infrastructure: RoCE deployments need Ethernet switches with lossless features such as Priority Flow Control (PFC) and ECN to sustain high RDMA performance, while InfiniBand requires dedicated InfiniBand switches.
  • Memory Requirements: System memory must be sufficient for the buffers registered during RDMA operations, and the memory-lock limit (ulimit -l) must be configured so those buffers can be pinned (a small startup check is sketched after this list).
  • System Requirements: For the highest RDMA performance in GPU-accelerated systems, the RDMA adapter and GPU should sit under the same PCIe root complex, which minimizes latency and optimizes bandwidth.
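
Because registered RDMA buffers are pinned in physical memory, an application can sanity-check the memlock limit at startup before registering large regions. The check below is an illustrative sketch; the 1 GiB threshold is an assumption, not a requirement.

/* Illustrative sketch: verify the memlock limit before registering large RDMA
 * buffers. Registered memory is pinned, so RLIMIT_MEMLOCK must cover the total
 * size you intend to register (the threshold below is an assumption). */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    const rlim_t needed = 1ULL << 30;  /* e.g. 1 GiB of registered buffers */
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur < needed) {
        fprintf(stderr,
                "memlock limit too low (%llu bytes); raise it via ulimit -l "
                "or /etc/security/limits.d/\n",
                (unsigned long long)rl.rlim_cur);
        return 1;
    }
    printf("memlock limit OK\n");
    return 0;
}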

Operating System Support for RDMA

RDMA has broad OS support, with Linux leading in high-performance environments:

RDMA Implementation on Linux

Because Linux provides strong RDMA support, it is the operating system of choice for high-performance network environments. RDMA functionality is enabled through the rdma-core package and specific kernel modules that establish direct communication with RDMA hardware.

The Linux RDMA Stack

The Linux RDMA stack is built from these essential components:

  • Kernel Modules: The RDMA drivers live under /lib/modules/$(uname -r)/kernel/drivers/infiniband/, which contains the core modules (such as ib_core, ib_uverbs, and rdma_cm) alongside vendor-specific hardware drivers.
  • User Space Libraries: Developers can use user-space libraries such as libibverbs and librdmacm to build high-performance applications that talk directly to RDMA hardware. These libraries provide standardized APIs, so RDMA-aware software can be written without digging into kernel-level details (a short device-enumeration example follows this list).
  • Configuration Files: The file /etc/rdma/rdma.conf allows system administrators to control the selection and behavior of RDMA protocols. System configurations enable administrators to decide which drivers and protocols will remain active.
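
As an illustration of what libibverbs exposes, the sketch below enumerates the RDMA devices on a host and reports the state of each device's first port, roughly a subset of what the ibv_devinfo utility prints. The build command is an assumption.

/* Illustrative sketch: enumerate RDMA devices and their first port via
 * libibverbs, similar to part of what ibv_devinfo reports.
 * Build (assumption): gcc list_devices.c -o list_devices -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr dev_attr;
        struct ibv_port_attr port_attr;
        ibv_query_device(ctx, &dev_attr);
        ibv_query_port(ctx, 1, &port_attr);   /* port numbers start at 1 */

        printf("%s: ports=%u, port 1 state=%s\n",
               ibv_get_device_name(devs[i]),
               (unsigned)dev_attr.phys_port_cnt,
               ibv_port_state_str(port_attr.state));

        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}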

Practical Steps: RDMA Setup in Linux

Implementing RDMA on Linux involves several essential stages.


In the following example, we will show how to establish RDMA communication between two Linux nodes. We will assume both servers (let’s call them nodeA and nodeB) have RDMA-capable network adapters and are running Ubuntu 22.04.

Step 1: Install RDMA Core Libraries and Drivers: We can install the RDMA libraries, drivers, and debugging utilities using our distribution’s package manager. These packages include drivers and tools to configure, manage, and test RDMA. For Ubuntu, this can be done by running:

sudo apt update
sudo apt install rdma-core ibverbs-utils infiniband-diags

Step 2: Enable and Start the RDMA Management Service: Next, enable and start the RDMA management service, which manages RDMA devices and is essential for RDMA operations. (If your distribution does not ship an rdma.service unit, as is the case on some recent Ubuntu releases, the equivalent functionality may be provided by the rdma-ndd service and automatic module loading via udev; adjust the unit name accordingly.)

sudo systemctl enable rdma
sudo systemctl start rdma

With these commands, RDMA services will also start automatically upon system reboot.

Step 3: Configure Memory Limits: RDMA applications must be able to lock (pin) their registered buffers in memory, which is required for stability and performance. Raise the memlock limit for the users or groups involved in RDMA transfers by editing /etc/security/limits.d/rdma.conf; the following lines allow members of the rdma group to lock unlimited memory:

@rdma   soft   memlock   unlimited
@rdma   hard   memlock   unlimited

Step 4: Configure RDMA Network Interfaces: Assign IP addresses to the RDMA-capable network interfaces on both nodes. For example, if the name of your InfiniBand or RoCE interface is ib0, run the following commands:

On nodeA:

sudo ip addr add 192.168.100.1/24 dev ib0
sudo ip link set ib0 up

On nodeB:

sudo ip addr add 192.168.100.2/24 dev ib0
sudo ip link set ib0 up

This step prepares the interface for RDMA traffic.

Step 5: Test and Validate RDMA Connectivity: Once the setup is complete, you must ensure RDMA is working as expected.

Check for available RDMA devices:

ibv_devinfo

This command displays information about RDMA devices present on the system.

Test basic connectivity with ping:

ping 192.168.100.2 # From nodeA to nodeB

This ensures there is basic network connectivity between the nodes.

Test RDMA communication with rping:

On nodeB (as server):

rping -s -a 192.168.100.2

On nodeA (as client):

rping -c -a 192.168.100.2

Successful completion confirms the RDMA path is operational and performing as expected.
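
If you prefer to exercise the same path from application code rather than the bundled tools, librdmacm offers a socket-like connection API on top of the verbs layer. The client below is an illustrative sketch only: the port number (7471) and the presence of a matching server listening on nodeB are assumptions, and error handling is abbreviated.

/* Illustrative sketch: a minimal librdmacm client that connects to an RDMA
 * server and sends one message, similar in spirit to what rping exercises.
 * Assumptions: a matching server listens on nodeB (192.168.100.2:7471).
 * Build (assumption): gcc rdma_hello.c -o rdma_hello -lrdmacm -libverbs */
#include <stdio.h>
#include <string.h>
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>

int main(void)
{
    struct rdma_addrinfo hints, *res = NULL;
    struct ibv_qp_init_attr attr;
    struct rdma_cm_id *id = NULL;
    struct ibv_mr *mr;
    struct ibv_wc wc;
    char msg[64] = "hello over RDMA";

    memset(&hints, 0, sizeof(hints));
    hints.ai_port_space = RDMA_PS_TCP;            /* reliable connected service */
    if (rdma_getaddrinfo("192.168.100.2", "7471", &hints, &res)) {
        perror("rdma_getaddrinfo");
        return 1;
    }

    memset(&attr, 0, sizeof(attr));
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 1;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.sq_sig_all = 1;                          /* every send generates a completion */
    if (rdma_create_ep(&id, res, NULL, &attr)) {  /* creates the CM id and QP */
        perror("rdma_create_ep");
        return 1;
    }

    mr = rdma_reg_msgs(id, msg, sizeof(msg));     /* register the send buffer */

    if (rdma_connect(id, NULL)) {                 /* CM handshake with the server */
        perror("rdma_connect");
        return 1;
    }

    rdma_post_send(id, NULL, msg, sizeof(msg), mr, 0);
    if (rdma_get_send_comp(id, &wc) <= 0) {       /* blocks until the send completes */
        perror("rdma_get_send_comp");
        return 1;
    }
    printf("send completed, status=%d\n", wc.status);

    rdma_disconnect(id);
    rdma_dereg_mr(mr);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}

A corresponding server would typically resolve its address with rdma_getaddrinfo (RAI_PASSIVE), create a listening endpoint with rdma_create_ep and rdma_listen, accept the connection via rdma_get_request, and post a receive buffer before accepting.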

Windows Server RDMA Support

Microsoft Windows Server 2012 and subsequent versions incorporate RDMA support. Windows uses the NetworkDirect interface for native RDMA support. SMB Direct is a feature of SMB 3.0 that uses RDMA for faster file transfers.
Features:

  • Automatic detection of RDMA adapters
  • Multiple RDMA connections per session for fault tolerance
  • Smooth integration with existing SMB infrastructure

RDMA Integration in VMware

VMware vSphere 7.0 and later support RDMA for virtualization with RoCE v2 (RDMA over Converged Ethernet v2) adapters and NVMe over RDMA storage. This allows you to use RDMA inside virtualized environments to:

  • Connect to NVMe storage devices over RDMA fabrics
  • Reduce latency for storage operations
  • Increase virtualized workload performance

Real-World Use Cases and Examples

RDMA technology has improved performance, efficiency, and scalability in various fields and use cases. Here are the most common use cases where RDMA can provide tangible value:

| Use Case | How RDMA Adds Value | Example Technologies / Scenarios |
| --- | --- | --- |
| High-Performance Computing (HPC) Clusters | Enables ultra-low-latency message passing and shared storage between nodes. Essential for parallel scientific computing, simulations, and applications where every microsecond counts. | InfiniBand networks for supercomputers; MPI (Message Passing Interface); storage systems like Lustre and GPFS. |
| AI/ML Distributed Training | Enables fast parameter synchronization and data transfer between GPUs across nodes to improve scaling and speed up the training of large deep learning and AI models. | TensorFlow, PyTorch, NVIDIA NCCL, GPUDirect RDMA; RoCE/InfiniBand networks in multi-GPU clusters. |
| Cloud Storage & Data Services | Provides high-throughput, low-latency access to remote disks for fast storage services, improved data replication, and better performance for cloud databases and distributed caches. | NVMe over Fabrics (NVMe-oF); Microsoft SMB Direct & Storage Spaces Direct; iSER for block storage; MySQL. |
| Private Cloud & Virtualized Data Centers | Provides nearly native network speeds for virtual machines and containers, supporting low-latency applications and enabling scalable, high-performance cloud environments within virtualized setups. | VMware ESXi with PVRDMA or SR-IOV; Azure RDMA-enabled VMs; low-latency trading platforms. |
| Big Data Analytics & In-Memory Computing | Accelerates data transfers and inter-process communication in analytics and streaming frameworks, minimizing tail latencies and assisting real-time, large-scale data processing. | Apache Spark, Apache Kafka, Apache Ignite; RDMA-accelerated microservices and market data feeds. |
| Caching & Database Systems | Reduces latency for distributed cache operations and database replication, ensuring swift access to in-memory data across cluster nodes; ideal for cloud-native and SaaS solutions. | DigitalOcean Managed Valkey with RDMA support; RDMA-enabled Memcached. |

Common Mistakes and Misconceptions

Below is a table of the most common RDMA mistakes and misconceptions that occur during deployment, with a brief explanation for each one.

| Mistake or Misconception | Explanation |
| --- | --- |
| Assuming RDMA works on any network card | RDMA requires RDMA-capable NICs (RNICs) on both ends. Attempting RDMA with non-capable hardware, or without enabling the RDMA feature (e.g., through the drivers/firmware), will fail. |
| Misconfiguring the network for RoCE | When using RoCE, the network switches must be configured for PFC (Priority-based Flow Control) and ECN (Explicit Congestion Notification) to avoid packet drops and achieve optimal performance. |
| Confusing RDMA protocols and compatibility | InfiniBand and Ethernet are not directly compatible, and RoCE and iWARP are different, non-interoperable protocols. Mixing hardware or assuming "RDMA = InfiniBand" can cause compatibility issues. |
| Neglecting the need for application support | Enabling RDMA in hardware and on the network does not speed up every workload. Applications must be RDMA-aware and specifically configured to use RDMA transports to see the benefits. |
| Expecting RDMA to solve all performance problems | RDMA reduces network overhead, but it will not fix bottlenecks elsewhere, such as disk I/O, CPU limitations, or software inefficiencies. Thorough profiling is essential to ensure RDMA addresses the actual pain points. |
| Believing RDMA is unreliable or risky | Modern RDMA technologies such as InfiniBand and RoCE are open standards, are widely supported, and are highly reliable when configured correctly. Security features and interoperability have improved considerably. |
| Ignoring maintenance and tuning | Neglecting firmware updates, buffer tuning, and network monitoring can lead to performance drops or errors. Regular checks and monitoring keep RDMA running smoothly and efficiently. |

Knowledge of these mistakes can help you get the most out of your RDMA architecture. This way, you can avoid significant troubleshooting down the road.

FAQs

What is RDMA used for?
RDMA technology is used in high-performance computing, cloud storage, AI/ML clusters, virtualization, and financial systems for ultra-low latency and high-performance data transfers.

How is RDMA different from TCP/IP?
Unlike traditional network protocols like TCP/IP, RDMA bypasses the standard network stack. Instead, it allows direct memory access, which speeds up data transfers between applications and network hardware.

Does RDMA require special hardware?
Yes, to use RDMA, you’ll need network interface cards (NICs) that support RDMA. For some protocols, you might also need switches that support lossless data transmission to maximize performance.

What OS supports RDMA?
RDMA support is available across multiple platforms, including Linux (via rdma-core), Windows Server 2012 and later (via NetworkDirect), and various UNIX-based systems. In addition, VMware vSphere offers RDMA functionality for virtualized environments.

Is RDMA only for supercomputers?
No. While RDMA was originally developed for high-performance computing, it is now widely adopted in cloud services, enterprise data centers, AI and machine learning workloads, and virtualization setups.

Conclusion

Enterprises, cloud providers, and data centers can use RDMA to satisfy the demands of AI, ML, and other data-intensive applications for low latency, high throughput, and CPU-efficient networking.

With RDMA, you can bypass the TCP/IP stack in favor of protocols such as InfiniBand or RoCE, gaining lower latency, higher throughput, and better CPU efficiency. RDMA is well supported on Linux, Windows Server, and VMware platforms, given compatible hardware and network infrastructure.
If your organization is considering a data center upgrade, it is essential to understand RDMA's architecture, practical considerations, and use cases.

Integrating RDMA into your infrastructure will help ensure a scalable, future-ready network foundation for your most demanding workloads.

About the author(s)

Adrien Payong, AI consultant and technical writer
Shaoni Mukherjee, technical writer and editor