Recently, our site reliability engineering team started getting alerted about memory pressure on some of our Redis instances which have very small working sets *1. As we started digging into the issue, it became clear that there were problems with freeing memory after initial allocation because there were a relatively small number of keys but a comparatively large amount of memory allocated by `redis-server` processes. Despite initially looking like a leak, the problem was actually an issue between an alternative memory allocator and transparent huge pages.
If you already know what transparent huge pages are and how `madvise(2)` works, you can skim this section. For those who don’t, read on.
A page is a fixed-size chunk of memory, typically 4KB, and is the unit in which the kernel hands memory to applications. When an application accesses memory, its virtual address has to be resolved to the physical address of the underlying page. The structure that maps virtual addresses to physical ones is called the page table. For every 1GB of memory allocated in 4KB pages, there are 262,144 entries in the page table and, of course, the more entries in the page table, the longer it takes to translate addresses.
Virtual memory makes page management even more complicated by allowing applications to address pages which don’t actually exist in main memory. When this happens, it causes a fault, but the kernel knows how to handle faults in virtual memory and will pull pages off of secondary storage (e.g. local spinning rust or flash, NAS, etc.) without the application knowing the fault happened.
Huge pages are exactly what they sound like: pages that are much larger than 4KB in size. They cut down on the number of entries in the page table, thereby reducing the number of table lookups needed to find where a specific range of virtual memory is mapped.
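To make the arithmetic concrete, here is a minimal sketch that counts the pages needed to map 1GB at two page sizes. The 2MB figure is an assumption (a common huge page size on x86-64), and the helper name `pages_needed` is purely illustrative.

```shell
# pages_needed BYTES PAGE_SIZE: number of pages needed to map BYTES,
# rounding up to a whole page. Hypothetical helper for illustration.
pages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

one_gb=$((1024 * 1024 * 1024))
small=$(pages_needed "$one_gb" $((4 * 1024)))         # 4KB pages
huge=$(pages_needed "$one_gb" $((2 * 1024 * 1024)))   # 2MB huge pages (assumed size)

echo "4KB pages: $small"
echo "2MB pages: $huge"
echo "reduction: $((small / huge))x"
```

With these sizes, the same 1GB needs 512 times fewer pages when it is mapped with 2MB pages.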
Linux implements support for huge pages *2, but software running in user space has to be changed to take advantage of the potential performance benefits. They come in two varieties (2MB and 1GB; the available sizes depend upon the CPU in use) and have to be configured at boot time via parameters passed to the kernel.
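If you want to see how a machine is currently set up, the explicit huge page counters are exposed in `/proc/meminfo`. The following is a small sketch rather than a complete tool; the function name `hugepage_info` and its optional file argument are our own additions so that it can also be pointed at a saved copy of the file.

```shell
# hugepage_info [FILE]: print the explicit huge page counters.
# FILE defaults to /proc/meminfo; the argument is a hypothetical
# convenience for reading a saved copy instead of the live file.
hugepage_info() {
    grep -E '^(HugePages_(Total|Free)|Hugepagesize):' "${1:-/proc/meminfo}"
}

# Only query the live file where it exists (i.e. on Linux).
if [ -r /proc/meminfo ]; then
    hugepage_info || echo "no huge page counters found"
fi
```

A `HugePages_Total` of 0 means no explicit huge pages have been reserved, which says nothing about whether transparent huge pages are in use.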
The implementation of huge pages itself is pretty boring, so let’s talk about transparent huge pages. This is where the fun begins.
User space software has traditionally had to implement its own support for huge pages, but doing so is difficult and requires lots of testing to be effective. Rather than having these user space applications manage their interactions with huge pages, transparent huge pages allow applications to use huge pages... well, transparently. This manifests itself as the kernel doing some additional management of memory being allocated, marked, and subsequently freed with (in our case) 2MB underlying pages.
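Whether transparent huge pages are active on a given machine can be read from sysfs. The files under `/sys/kernel/mm/transparent_hugepage/` list every possible value and put the active one in square brackets, for example `always [madvise] never`. Here is a rough sketch for pulling that value out; `thp_setting` is a name we made up for this example.

```shell
# thp_setting FILE: print the active value from a THP sysfs file, whose
# contents look like "always [madvise] never" with the active value bracketed.
# Hypothetical helper for illustration.
thp_setting() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

thp_dir=/sys/kernel/mm/transparent_hugepage
for name in enabled defrag; do
    # Skip quietly on kernels (or operating systems) without THP support.
    if [ -r "$thp_dir/$name" ]; then
        echo "$name: $(thp_setting "$thp_dir/$name")"
    fi
done
```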
This all sounds useful, but it turns out that some alternative memory allocators don’t play nicely with transparent huge pages.
`madvise(2)` is not part of POSIX, but it is inspired by the POSIX function `posix_fadvise(2)`*3. It gives advice to the kernel about what it should do with a specific range of memory when it comes time to evict pages. The advice must be given for a specific range of memory starting at an address for `n` bytes after that address.
It also takes a parameter describing the advice itself, like “free this address and `n` bytes after it whenever you’re ready” (`MADV_DONTNEED`) or “this address and `n` bytes after it are going to be used soon, so you should probably read some pages ahead” (`MADV_WILLNEED`).
Some memory allocators, like the one included as part of glibc, don’t deal with marking pages using `madvise(2)`. However, `jemalloc(3)` *does* mark ranges with `madvise(…, MADV_DONTNEED)`, but it’s important to note that it’s on a range rather than at the “left” and “right” edges of a specific page or group of pages.
This rabbit hole began when a `redis-server` process, which had recently been switched to load `jemalloc.so` via `LD_PRELOAD`, began using significant amounts of memory. Initial signs suggested that the alternative allocator might be part of the issue, so that’s where we started digging.
It turns out that `jemalloc(3)` uses `madvise(2)` extensively to notify the operating system that it’s done with a range of memory which it had previously `malloc`'ed. Because the machine used transparent huge pages, the page size was 2MB. As such, a lot of the memory which was being marked with `madvise(…, MADV_DONTNEED)` was within ranges substantially smaller than 2MB. This meant that the operating system was never able to reclaim pages which had ranges marked as `MADV_DONTNEED`, because the entire page would have to be unneeded to allow it to be reused.
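The mismatch is easier to see with numbers. The sketch below is a simplified model of the explanation above, not the kernel's actual reclaim logic: it counts how many 2MB huge pages lie entirely inside a given byte range. The helper name `whole_huge_pages` and the sample offsets are made up for illustration.

```shell
huge=$((2 * 1024 * 1024))   # 2MB huge page size, as in the text

# whole_huge_pages START LEN: count huge pages lying entirely inside the
# byte range [START, START+LEN). Hypothetical helper; sets the globals
# first and last as scratch variables.
whole_huge_pages() {
    first=$(( ($1 + huge - 1) / huge * huge ))   # round START up to a boundary
    last=$(( ($1 + $2) / huge * huge ))          # round the end down to a boundary
    if [ "$last" -gt "$first" ]; then
        echo $(( (last - first) / huge ))
    else
        echo 0
    fi
}

whole_huge_pages 4096 $((64 * 1024))         # a 64KB range covers no whole huge page
whole_huge_pages 0 "$huge"                   # an aligned 2MB range covers exactly one
whole_huge_pages 4096 $((4 * 1024 * 1024))   # an unaligned 4MB range covers only one
```

In this model, a range much smaller than 2MB never covers a whole huge page, which matches what we saw: lots of memory marked as unneeded, none of it actually given back.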
So although this initially looked like a leak, the real problem was that the operating system itself was unable to free memory because of the interaction between `madvise(2)` and transparent huge pages. *4 This led to sustained memory pressure on the machine and `redis-server` eventually getting OOM killed.
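One way to check whether a process is affected is to compare its resident memory with how much of it is backed by transparent huge pages. On Linux, `/proc/<pid>/smaps` reports both per mapping, as `Rss` and `AnonHugePages`. The following is a sketch under that assumption; `smaps_total` is an illustrative name, and the example inspects the current shell rather than a real `redis-server`.

```shell
# smaps_total FIELD FILE: sum one per-mapping field (in kB) across an
# smaps file such as /proc/<pid>/smaps. Hypothetical helper for illustration.
smaps_total() {
    awk -v field="$1:" '$1 == field { sum += $2 } END { print sum + 0 }' "$2"
}

# Example against the current shell's own mappings, where smaps is available.
pid=$$
if [ -r "/proc/$pid/smaps" ]; then
    echo "Rss:           $(smaps_total Rss "/proc/$pid/smaps") kB"
    echo "AnonHugePages: $(smaps_total AnonHugePages "/proc/$pid/smaps") kB"
fi
```

To look at Redis instead, set `pid` to the PID of the `redis-server` process; reading another user's `smaps` usually requires running as that user or as root.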
by Sam Kottler
Bugs around memory allocation often become more apparent with data stores because they tend to allocate and free memory at a relatively rapid pace. We use Redis as a cache and queue for ephemeral jobs, meaning that it allocates and frees substantial amounts of memory given the types of operations we are doing.
Huge pages are also incorporated into some other widely used Unix kernels, like FreeBSD, as superpages; the same concept is available on Windows as large pages. Despite the different names, the functionality is fundamentally the same.
`posix_fadvise(2)` is specifically targeted towards file access rather than direct memory management, and takes a file descriptor as the first argument rather than a pointer to an address.
Note that disabling transparent huge pages requires manually echoing settings into the files under `/sys/kernel/mm/transparent_hugepage/` at or after boot, either from a boot script or by hand:
    if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
        echo never > /sys/kernel/mm/transparent_hugepage/enabled
    fi
    if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
        echo never > /sys/kernel/mm/transparent_hugepage/defrag
    fi