Linux has robust systems and tooling to manage hardware devices, including storage drives. In this article we’ll cover, at a high level, how Linux represents these devices and how raw storage is made into usable space on the server.
Block storage is another name for what the Linux kernel calls a block device. A block device is a piece of hardware that can be used to store data, like a traditional spinning hard disk drive (HDD), solid state drive (SSD), flash memory stick, and so on. It is called a block device because the kernel interfaces with the hardware by referencing fixed-size blocks, or chunks of space.
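As a rough illustration of the "fixed-size blocks" idea, the sketch below checks whether a path is a block device node and maps a byte offset to a block number. The 512-byte block size and the /dev/sda path are assumptions for the example; real devices may report other sizes and names.

```python
import os
import stat
from typing import Tuple

BLOCK_SIZE = 512  # assumed logical block size in bytes; many devices use 4096


def is_block_device(path: str) -> bool:
    """Return True if path exists and is a block device node."""
    try:
        return stat.S_ISBLK(os.stat(path).st_mode)
    except OSError:
        return False  # missing path or no permission to stat it


def locate_byte(offset: int, block_size: int = BLOCK_SIZE) -> Tuple[int, int]:
    """Map a byte offset to (block number, offset inside that block)."""
    return divmod(offset, block_size)


if __name__ == "__main__":
    # The result of the first call depends on the machine it runs on.
    print(is_block_device("/dev/sda"))
    print(locate_byte(1_000_000))  # (1953, 64)
```

Regular files and directories return False from is_block_device, which is one way to confirm that a path under /dev really refers to a storage device.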
In other words, block storage is what you think of as regular disk storage on a computer. Once it is set up, it acts as an extension of the current filesystem tree, and you should be able to write to or read information from each drive interchangeably.
Disk partitions are a way of breaking up a storage drive into smaller usable units. A partition is a section of a storage drive that can be treated in much the same way as a drive itself.
Partitioning allows you to segment the available space and use each partition for a different purpose. This gives a user more flexibility, allowing them to potentially segment a single disk for multiple operating systems, swap space, or specialized filesystems.
While disks can be formatted and used without partitioning, operating systems usually expect to find a partition table, even if there is only a single partition written to the disk. It is generally recommended to partition new drives for greater flexibility.
When partitioning a disk, it is important to know what partitioning format will be used. This generally comes down to a choice between MBR (Master Boot Record) and GPT (GUID Partition Table).
MBR is over 30 years old. Because of its age, it has some serious limitations. For instance, it cannot be used for disks over 2TB in size, and can only have a maximum of four primary partitions.
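The 2TB ceiling follows from the MBR on-disk format, which stores sector addresses in 32-bit fields. The arithmetic can be sketched as follows, assuming the traditional 512-byte logical sector size (the function name is made up for this example).

```python
SECTOR_SIZE = 512       # bytes; assumed traditional logical sector size
MBR_ADDRESS_BITS = 32   # MBR partition entries use 32-bit sector fields


def max_addressable_bytes(address_bits: int, sector_size: int = SECTOR_SIZE) -> int:
    """Largest capacity reachable with the given sector-address width."""
    return (2 ** address_bits) * sector_size


mbr_limit = max_addressable_bytes(MBR_ADDRESS_BITS)
print(mbr_limit)            # 2199023255552 bytes
print(mbr_limit / 2 ** 40)  # 2.0, i.e. 2 TiB
```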
GPT is a more modern partitioning scheme that resolves some of the issues inherent with MBR. Disks using GPT can have many more partitions, usually limited only by the restrictions imposed by the operating system itself. Additionally, the 2TB disk size limitation does not apply to GPT, and the partition table information is stored in multiple locations to guard against corruption. GPT can also write a “protective MBR” for compatibility with MBR-only tools.
In most cases, GPT is the better choice unless your operating system prevents you from using it.
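To make the difference between the two formats more concrete, here is a minimal sketch that classifies the first two sectors of a disk image. It relies on two well-known markers: the 0x55 0xAA boot signature at bytes 510 and 511 of the first sector, and the "EFI PART" signature at the start of the GPT header in the second sector. The function names and the synthetic sample image are assumptions for illustration; this is not a full partition table parser.

```python
SECTOR_SIZE = 512  # assumed logical sector size in bytes


def detect_partition_table(raw: bytes, sector_size: int = SECTOR_SIZE) -> str:
    """Classify the first two sectors of a disk as 'gpt', 'mbr', or 'unknown'."""
    if len(raw) < 2 * sector_size:
        raise ValueError("need at least the first two sectors")
    has_boot_signature = raw[510:512] == b"\x55\xaa"
    has_gpt_header = raw[sector_size:sector_size + 8] == b"EFI PART"
    if has_gpt_header:
        return "gpt"  # a GPT disk also carries a protective MBR in sector 0
    if has_boot_signature:
        return "mbr"
    return "unknown"


def make_sample(gpt: bool) -> bytes:
    """Build a synthetic two-sector image for demonstration (not a valid disk)."""
    image = bytearray(2 * SECTOR_SIZE)
    image[510:512] = b"\x55\xaa"
    if gpt:
        image[450] = 0xEE  # partition type byte used by a protective MBR
        image[SECTOR_SIZE:SECTOR_SIZE + 8] = b"EFI PART"
    return bytes(image)


print(detect_partition_table(make_sample(gpt=True)))   # gpt
print(detect_partition_table(make_sample(gpt=False)))  # mbr
```

On a real system the bytes would come from reading the start of the device file, which normally requires root privileges.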
While the Linux kernel can recognize a raw disk, it must be formatted to be used. Formatting is the process of writing a filesystem to the disk and preparing it for file operations. A filesystem is the system that structures data and controls how information is written to and retrieved from the underlying disk. Without a filesystem, you could not use the storage device for any standard filesystem operations.
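Once a filesystem exists and is in use, the operating system can report how it structures space. The sketch below uses Python's os.statvfs call to read the block size and capacity of the filesystem containing a given path. The function name and the returned dictionary layout are assumptions made for this example.

```python
import os


def filesystem_usage(path: str = "/") -> dict:
    """Report size figures for the filesystem that contains path."""
    st = os.statvfs(path)
    return {
        "block_size": st.f_frsize,                 # fundamental block size in bytes
        "total_bytes": st.f_frsize * st.f_blocks,  # size of the whole filesystem
        "free_bytes": st.f_frsize * st.f_bavail,   # space available to unprivileged users
    }


if __name__ == "__main__":
    # Values depend on the machine this runs on.
    print(filesystem_usage("/"))
```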
There are many different filesystem formats, each with trade-offs, including operating system support. They all present the user with a similar representation of the disk, but the features and the platforms that they support can be very different.
Some of the more popular filesystems for Linux are:
Additionally, Windows primarily uses NTFS and exFAT, and macOS primarily uses HFS+ and APFS. It is usually possible to read, and sometimes write, these filesystem formats on other platforms, but doing so may require additional compatibility tools.
In Linux, almost everything is represented by a file somewhere in the filesystem hierarchy. This includes hardware like storage drives, which are represented on the system as files in the /dev directory. Typically, files representing storage devices start with sd or hd followed by a letter. For instance, the first drive on a server is usually something like /dev/sda.
Partitions on these drives also have files within /dev, represented by appending the partition number to the end of the drive name. For example, the first partition on the drive from the previous example would be /dev/sda1.
While the /dev/sd* and /dev/hd* device files represent the traditional way to refer to drives and partitions, there is a significant disadvantage to using these values alone. The Linux kernel decides which device gets which name on each boot, so this can lead to confusing scenarios where your devices change device nodes.
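The naming convention described above (a drive name with an optional partition number appended) can be sketched as a small parser. The regular expression and function name are assumptions for illustration, and only the traditional sd and hd names are handled; other schemes, such as NVMe device names, are deliberately left out.

```python
import re
from typing import Optional, Tuple

# Matches traditional names such as /dev/sda, /dev/sdb2, or /dev/hda1.
_DEVICE_RE = re.compile(r"^/dev/([sh]d[a-z]+)(\d+)?$")


def split_device_name(path: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (drive, partition number or None), or None if the name does not match."""
    match = _DEVICE_RE.match(path)
    if match is None:
        return None
    drive, partition = match.groups()
    return (drive, int(partition) if partition else None)


print(split_device_name("/dev/sda"))   # ('sda', None)
print(split_device_name("/dev/sda1"))  # ('sda', 1)
```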
To work around this issue, the /dev/disk directory contains subdirectories corresponding with different, more persistent ways to identify disks and partitions on the system. These contain symbolic links that are created at boot back to the correct /dev/[sh]da* files. The links are named according to the directory’s identifying trait (for example, by partition label for the /dev/disk/by-partlabel directory). These links will always point to the correct devices, so they can be used as static identifiers for storage spaces.
Some or all of the following subdirectories may exist under /dev/disk:
by-label: Most filesystems have a labeling mechanism that allows the assignment of arbitrary user-specified names for a disk or partition. This directory consists of links named after these user-supplied labels.
by-uuid: A UUID, or universally unique identifier, is a long, unique string of letters and numbers that can be used as an ID for a storage resource. These are generally not very human-readable, but are almost always unique, even across systems. As such, it might be a good idea to use UUIDs to reference storage that may migrate between systems, since naming collisions are less likely.
by-partlabel and by-partuuid: GPT tables offer their own set of labels and UUIDs, which can also be used for identification. These function in much the same way as the previous two directories, but use GPT-specific identifiers.
by-id: This directory contains links generated by the hardware’s own serial numbers and the hardware they are attached to. This is not entirely persistent, because the way that the device is connected to the system may change its by-id name.
by-path: Like by-id, this directory relies on a storage device’s connection to the system itself. The links here are constructed using the system’s interpretation of the hardware used to access the device. This has the same drawbacks as by-id, as connecting a device to a different port can alter this value.
Usually, by-label or by-uuid are the best options for persistent identification of specific devices.
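Because the entries in these directories are ordinary symbolic links, they can be resolved to the device files they currently point to. The following is a minimal sketch; the function name is an assumption, and the /dev/disk/by-uuid path in the usage section may not exist on every system, in which case the function returns an empty mapping.

```python
import os
from typing import Dict


def resolve_links(directory: str) -> Dict[str, str]:
    """Map each symbolic link name in directory to the path it resolves to."""
    resolved: Dict[str, str] = {}
    if not os.path.isdir(directory):
        return resolved  # the subdirectory may not exist on this system
    for name in sorted(os.listdir(directory)):
        link = os.path.join(directory, name)
        if os.path.islink(link):
            resolved[name] = os.path.realpath(link)
    return resolved


if __name__ == "__main__":
    # Output depends on the disks attached to the machine.
    for name, target in resolve_links("/dev/disk/by-uuid").items():
        print(f"{name} -> {target}")
```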
Note: DigitalOcean block storage volumes control the device serial numbers reported to the operating system. This allows for the by-id categorization to be reliably persistent on this platform. This is the preferred method of referring to DigitalOcean volumes as it is both persistent and predictable on first boot.
In Linux and other Unix-like operating systems, the entire system, regardless of how many physical devices are involved, is represented by a single unified file tree. When a filesystem on a drive or partition is to be used, it must be hooked into the existing tree. Mounting is the process of attaching a formatted partition or drive to a directory within the Linux filesystem. The drive’s contents can then be accessed from that directory.
Drives are almost always mounted on dedicated empty directories (mounting on a non-empty directory means that the directory’s usual contents will be inaccessible until the drive is unmounted). There are many different mounting options that can be set to alter the behavior of a mounted device. For example, the drive can be mounted in read-only mode to ensure that its contents won’t be altered.
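On Linux, the kernel lists current mounts and their options in /proc/mounts, one filesystem per line. The sketch below parses text in that format and checks for the read-only (ro) option. The sample data and function names are made up for illustration and do not come from a real system.

```python
from typing import Dict, List

# Illustrative data in /proc/mounts format (device, mount point, type, options, ...).
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /mnt/data ext4 ro,noatime 0 0
"""


def parse_mounts(text: str) -> List[Dict[str, object]]:
    """Parse /proc/mounts-style text into one dict per mounted filesystem."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        mounts.append({
            "device": device,
            "mount_point": mount_point,
            "fs_type": fs_type,
            "options": options.split(","),
        })
    return mounts


def is_read_only(mount: Dict[str, object]) -> bool:
    """True when the mount options include 'ro'."""
    return "ro" in mount["options"]


for entry in parse_mounts(SAMPLE_MOUNTS):
    print(entry["mount_point"], "read-only" if is_read_only(entry) else "read-write")
```

To inspect a live system, the same parser could be fed the contents of /proc/mounts instead of the sample string.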
The Filesystem Hierarchy Standard recommends using /mnt or a subdirectory under it for temporarily mounted filesystems. It makes no recommendations on where to mount more permanent storage, so you can choose whichever scheme you’d like. In many cases, /mnt subdirectories are used for more permanent storage as well.
Linux systems use a file called /etc/fstab (filesystem table) to determine which filesystems to mount during the boot process. Filesystems that do not have an entry in this file will not be automatically mounted unless scripted by some other software.
Each line of the /etc/fstab file represents a different filesystem that should be mounted. Each line specifies the block device, the mount point to attach it to, the format of the drive, and the mount options, as well as a few other pieces of information.
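To show how those fields fit together, here is a minimal sketch of a parser for fstab-style text. The standard layout has six whitespace-separated fields: device, mount point, filesystem type, options, and two numeric fields used by the dump utility and for ordering filesystem checks at boot. The sample entry, including its UUID, is invented for illustration.

```python
from typing import List, NamedTuple


class FstabEntry(NamedTuple):
    device: str         # device path, or an identifier such as UUID=... or LABEL=...
    mount_point: str
    fs_type: str
    options: List[str]
    dump: int           # used by the dump backup utility; usually 0
    fsck_pass: int      # order of filesystem checks at boot; 0 disables the check


def parse_fstab(text: str) -> List[FstabEntry]:
    """Parse fstab-style text, skipping comments and blank lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines
        # The last two fields are optional and default to 0 when absent.
        dump = int(fields[4]) if len(fields) > 4 else 0
        fsck_pass = int(fields[5]) if len(fields) > 5 else 0
        entries.append(FstabEntry(fields[0], fields[1], fields[2],
                                  fields[3].split(","), dump, fsck_pass))
    return entries


# Hypothetical example content; the UUID below is not a real identifier.
SAMPLE_FSTAB = """\
# <device>  <mount point>  <type>  <options>  <dump>  <pass>
UUID=0a1b2c3d-0000-4000-8000-123456789abc  /mnt/data  ext4  defaults,nofail  0  2
"""

for entry in parse_fstab(SAMPLE_FSTAB):
    print(entry.mount_point, entry.fs_type, entry.options)
```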
While many use cases will be accommodated by these core features, there are more complex management paradigms available for joining together multiple disks, notably RAID.
RAID stands for redundant array of independent disks. RAID is a storage management and virtualization technology that allows you to group drives together and manage them as a single unit with additional capabilities.
The characteristics of a RAID array depend on its RAID level, which defines how the disks in the array relate to each other. Some of the more common levels are:
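As a rough illustration of how the RAID level changes what an array provides, the sketch below computes usable capacity for five widely used levels (0, 1, 5, 6, and 10). It is a simplified model that assumes equally sized disks, conventional minimum disk counts, and two-way mirrors for RAID 10; actual implementations may allow other layouts. The function name is hypothetical.

```python
def usable_capacity(level: str, disks: int, disk_size: float) -> float:
    """Usable capacity of an array of equally sized disks (same unit as disk_size)."""
    if level == "0":      # striping, no redundancy
        minimum, usable = 2, disks * disk_size
    elif level == "1":    # mirroring: every disk holds the same data
        minimum, usable = 2, disk_size
    elif level == "5":    # striping with one disk's worth of parity
        minimum, usable = 3, (disks - 1) * disk_size
    elif level == "6":    # striping with two disks' worth of parity
        minimum, usable = 4, (disks - 2) * disk_size
    elif level == "10":   # striped mirrors, assuming two-way mirrors
        minimum, usable = 4, disks * disk_size / 2
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum:
        raise ValueError(f"RAID {level} needs at least {minimum} disks")
    return usable


# Four 2 TB disks under different levels:
for level in ("0", "1", "5", "6", "10"):
    print(level, usable_capacity(level, disks=4, disk_size=2.0))
```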
If you have a new storage device that you wish to use in your Linux system, this article will guide you through the process of partitioning, formatting, and mounting your new filesystem. This should be sufficient for most use cases where you are mainly concerned with adding additional capacity. To learn how to perform storage administration tasks, check out How To Perform Basic Administration Tasks for Storage Devices in Linux.
DigitalOcean Block Storage allows you to attach additional storage volumes to your Droplets quickly and easily. Block Storage volumes function like regular block devices when attached to your servers, allowing you to use familiar tools to manage your storage needs. In this series, we will introduce basic Linux storage terminology, cover how to create and manage Block Storage volumes, and how to perform a variety of administrative tasks to keep your volumes running smoothly.