kaashif's blog

Programming, with some mathematics on the side

The problem with using splice(2) for a faster cat(1)

2023-03-12

A few weeks ago, I was reading a Hacker News post about a clipboard manager. I can't remember which one exactly, but an example is gpaste - they let you have a clipboard history, view that history, persist things to disk if you want, and so on.

One comment caught my eye: it asked why clipboard managers didn't use the splice(2) syscall. After all, splice allows copying the contents of a file descriptor to a pipe without any copies between userspace and kernelspace.

Indeed, replacing a read-write combo with splice does yield massive performance gains, and we can benchmark that. That got me thinking: why don't other tools use splice too, like cat? What are the performance gains? Are there any edge cases where it doesn't work? How can we profile this?

There are blog posts from a while ago lamenting the lack of usage of splice, e.g. https://endler.dev/2018/fastcat/ and interestingly enough, things may have changed since 2018 (specifically, in 2021), giving us new reasons to avoid splice.

The conclusion is basically that splice isn't generic enough, but the details are pretty interesting.

What's our performance metric?

The basic question we're trying to answer is how fast can a program take a filename and write the contents to stdout? We're measuring performance in bytes per second.

One important point is that we want to benchmark with the kernel read cache warmed, i.e. we run the benchmarks a few times until the number settles down. This is important because the only difference between any of our methods will be a memory-to-memory copy, which is always going to be multiple times faster than a disk-to-memory read, even with DMA.

Warming the read cache means everything is memory-to-memory and differences in how we do that will show up.

I'll create a file with 10,000M of zeroes and benchmark cat using pv as follows:

$ dd if=/dev/zero of=10g_zero bs=1M count=10000
$ cat 10g_zero | pv > /dev/null
...
$ !! # Repeat to warm cache
9.77GiB 0:00:02 [4.72GiB/s] [   <=>                                                             ]

So 4.72GiB/s is the number to beat!

read-write implementation

This is the dumb way you'd write a file to stdout. Make a buffer, open the file, read it out in chunks, and write those chunks to stdout. The only thing to tune here is really the buffer size, I think; 32k seems to get the best performance on my machine.

Here's the code, no error handling:

#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>

int main(int argc, char* argv[]) {
    /* 32k seems to be the sweet spot on my machine. */
    size_t buf_size = 32 * 1024;
    char *buf = malloc(buf_size);

    char *fname = argv[1];
    int fd = open(fname, O_RDONLY);

    while (1) {
        /* Each iteration is a kernel-to-user copy on read,
         * then a user-to-kernel copy on write. */
        ssize_t bytes_read = read(fd, buf, buf_size);

        /* 0 means end of file. */
        if (bytes_read == 0) {
            return EXIT_SUCCESS;
        }

        write(STDOUT_FILENO, buf, bytes_read);
    }
}

I called this slow.c. Here's the benchmark:

$ ./slow 10g_zero | pv > /dev/null
9.77GiB 0:00:01 [7.38GiB/s] [  <=>                                                              ]

So that's actually faster than cat already. 7.38 GiB/s vs 4.72 GiB/s. But this is doing unnecessary memory-to-memory copies from kernelspace to userspace on read, then from userspace to kernelspace on write. Our ideal solution would just move (not even copy) pages from the file to stdout, with all buffers owned by the kernel.

splice implementation

The splice implementation is a bit more complex, but not much. Looking at the man page for splice with man 2 splice, we can see the description:

splice() moves data between two file descriptors without copying between kernel address space and user address space. It transfers up to len bytes of data from the file descriptor fd_in to the file descriptor fd_out, where one of the file descriptors must refer to a pipe.

Here's my code for my splice-based cat:

#define _GNU_SOURCE

#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[]) {
    size_t buf_size = 16 * 1024;

    char *fname = argv[1];

    int fd = open(fname, O_RDONLY);
    off64_t offset = 0;

    while (1) {
        /* Move up to buf_size bytes from the file straight to stdout.
         * splice updates offset for us, so there's no position to track. */
        ssize_t bytes_spliced = splice(fd, &offset, STDOUT_FILENO, NULL, buf_size, SPLICE_F_MOVE | SPLICE_F_MORE);

        /* 0 means end of file. */
        if (bytes_spliced == 0) {
            return EXIT_SUCCESS;
        }

        if (bytes_spliced < 0) {
            fprintf(stderr, "%s\n", strerror(errno));
            return EXIT_FAILURE;
        }
    }
}

I called this fast.c.

Some notes about this:

  • #define _GNU_SOURCE gives us access to splice, which is a non-standard (where the standard is POSIX) extension to fcntl.h. This is one reason splice probably isn't used more widely - it's not portable.

  • The flag SPLICE_F_MOVE is a no-op: it used to be a hint to the kernel to move pages where possible, but now does literally nothing. I added it because I do want a move, even though I know it does nothing.

  • SPLICE_F_MORE is a hint saying more data is coming in a future splice. It's true for most splices in our case (all but the last). Not sure how useful it is outside of socket programming, where it's sometimes not obvious to the kernel that more data is coming.

Enough with the notes! Let's see some performance numbers!

$ ./fast 10g_zero | pv > /dev/null
9.77GiB 0:00:00 [26.8GiB/s] [ <=>                                                               ]

Whoa, holy shit, 26.8 GiB/s? That's more than 5.6x as fast as cat! This warrants some further investigation.

Profiling, fast and slow

This section title is a reference to "Thinking, Fast and Slow" by Daniel Kahneman, which I haven't read.

fast is so fast I feel like we have to look into it to make sure nothing weird is going on.

We can use perf to profile our programs and see where we're spending time. You can install it via the linux-tools package matching your kernel version. I'm on Ubuntu so I needed to do:

$ sudo apt install linux-tools-5.19.0-32-generic

Let's look at cat first. Here's the command to run your program and record performance in perf.data:

$ sudo perf record -- cat ../10g_zero > /dev/null

Why sudo? Without sudo, perf says something about kernel symbols and symbol map restrictions if you're not root, so I just run everything here as root. Sue me. It's not like we're running untrusted code here!

To generate a breakdown with the percentage of time spent in each function:

$ sudo perf report

For the above case, the report looks like:

Overhead  Command  Shared Object      Symbol
  75.43%  cat      [kernel.kallsyms]  [k] copy_user_generic_string                              
   3.22%  cat      [kernel.kallsyms]  [k] filemap_read                                          
   2.75%  cat      [kernel.kallsyms]  [k] filemap_get_read_batch

Then a bunch of negligible <1% stuff.

The function copy_user_generic_string copies to/from userspace. It's clear that's what's taking the vast majority of time. The perf report for slow looks the same:

Overhead  Command  Shared Object         Symbol
  70.53%  slow     [kernel.kallsyms]     [k] copy_user_generic_string
   3.86%  slow     [kernel.kallsyms]     [k] filemap_read
   3.82%  slow     [kernel.kallsyms]     [k] filemap_get_read_batch

This is as expected. Let's look at the perf report for fast:

$ sudo perf record ../fast ../10g_zero > /dev/null
Invalid argument

Oh, that's because at least one of the input and output has to be a pipe, and in this case, both are files. Let's just throw a cat in there:

$ sudo perf record ../fast ../10g_zero | cat > /dev/null
Invalid argument

Huh? What? This is annoying, maybe perf does something dodgy to stdout so we can't splice to it? Let's try making perf output to a file:

$ sudo perf record -o perf.out -- ../fast ../10g_zero | cat > /dev/null

That finally works. What an ordeal. The report looks like this:

Overhead  Command  Shared Object         Symbol
  60.55%  fast     [kernel.kallsyms]     [k] mutex_spin_on_owner
   7.86%  fast     [kernel.kallsyms]     [k] filemap_get_read_batch
   2.95%  fast     [kernel.kallsyms]     [k] copy_page_to_iter
   2.86%  fast     [kernel.kallsyms]     [k] __mutex_lock.constprop.0
   2.47%  fast     [kernel.kallsyms]     [k] copy_user_generic_string

Notice how little time we're spending copying pages between user and kernel. It's clear that the stories of increased performance are true.

The final straw: why splice isn't more widely used

Our journey has led us to a few reasons why splice isn't used more widely:

  • Not portable: this is kind of a non-reason because everyone just uses Linux, but maybe someone cares about this.

  • Not general: you can't splice from a file to a file (you can just use sendfile for that anyway), or from a socket to a socket - one end of the splice has to be a pipe. This means file-to-file operations like cat f1 f2 f3 > f4 are impossible with a single splice.

  • Not universally supported: not all filesystems actually let you splice to/from them. It's possible to try a fast implementation and fall back to a slow one if we're on a non-splice filesystem (there's a sketch of that after this list), but that adds complexity for little gain.
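
To make that last point concrete, here's a rough sketch of what the try-splice-then-fall-back approach could look like. This isn't fast.c: it assumes the only failure we care about is splice returning EINVAL (stdout isn't a pipe, or the filesystem refuses to splice), and error handling is still minimal:

#define _GNU_SOURCE

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Plain read/write loop, same idea as slow.c. */
static int copy_read_write(int fd) {
    size_t buf_size = 32 * 1024;
    char *buf = malloc(buf_size);

    while (1) {
        ssize_t n = read(fd, buf, buf_size);
        if (n == 0)
            return EXIT_SUCCESS;
        if (n < 0)
            return EXIT_FAILURE;
        write(STDOUT_FILENO, buf, n);
    }
}

int main(int argc, char *argv[]) {
    int fd = open(argv[1], O_RDONLY);
    off64_t offset = 0;

    while (1) {
        ssize_t n = splice(fd, &offset, STDOUT_FILENO, NULL,
                           16 * 1024, SPLICE_F_MOVE | SPLICE_F_MORE);

        if (n == 0)
            return EXIT_SUCCESS;

        if (n < 0 && errno == EINVAL) {
            /* Can't splice to stdout (not a pipe, or the filesystem
             * refuses): seek past whatever we already spliced, since
             * splice never moved fd's own offset, and fall back. */
            lseek(fd, offset, SEEK_SET);
            return copy_read_write(fd);
        }

        if (n < 0) {
            fprintf(stderr, "%s\n", strerror(errno));
            return EXIT_FAILURE;
        }
    }
}

It's not a lot of code, but you now have two copy paths to test and keep correct, which is most of the "complexity for little gain".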

And here's the kicker IMO: there are still bugs. Here's one: you still can't splice from /dev/zero to a pipe:

$ ./fast /dev/zero | pv > /dev/null
Invalid argument

Here's a thread on the kernel mailing list about that: https://lore.kernel.org/all/202105071116.638258236E@keescook/t/. It's slightly unfair to call this a bug since it was intentional - the death of generic splice was a planned affair:

The general loss of generic splice read/write is known.

The ultimate reason for this /dev/zero funkiness is that there's no real demand for it to work, I guess. Instead of directly using /dev/zero, I used actual zero files.

Conclusion

My advice is to use splice where you can, but keep in mind its drawbacks and lack of generality. If you control the types of fds passed in and the filesystem, then you can really go crazy and experience almost zero-copy file copies.
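
As an illustration of that, here's a hypothetical sketch of a file-to-file copy that bounces data through a pipe created with pipe(2), so both splices stay inside the kernel. The chunk size matches the default pipe capacity of 64k, and error handling is mostly skipped:

#define _GNU_SOURCE

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int in = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);

    /* splice needs a pipe at one end, so make our own to relay through. */
    int p[2];
    if (pipe(p) < 0)
        return EXIT_FAILURE;

    while (1) {
        /* File -> pipe: at most one pipe's worth at a time. */
        ssize_t n = splice(in, NULL, p[1], NULL, 64 * 1024, SPLICE_F_MOVE);
        if (n == 0)
            return EXIT_SUCCESS;
        if (n < 0)
            return EXIT_FAILURE;

        /* Pipe -> file: drain exactly what we just put in. */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, out, NULL, n, SPLICE_F_MOVE);
            if (m <= 0)
                return EXIT_FAILURE;
            n -= m;
        }
    }
}

Whether this actually beats a plain read/write loop (or copy_file_range) for your workload is something to measure rather than assume.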

But if you're writing a general tool in the vein of cat or tee, it's probably best to stay away from splice unless you really handle all of the weird cases.