bfs 3.0: the fastest find yet!

Tavian Barnes GitHub

bfs is a tool I wrote to do breadth-first search through a filesystem. It started out simply enough, but over the years it's grown to include almost every feature from every other find implementation I could find, plus many of its own innovations. The new 3.0 release series includes the biggest new feature so far: asynchronous, parallel directory traversal.

The kitchen sync

In pseudocode, the way bfs used to work looked like this:

while (const char *path = popdir()) {
    DIR *dir = opendir(path);
    while (struct dirent *de = readdir(dir)) {
        if (visit(de)) {
            pushdir(de);
        }
    }
    closedir(dir);
}

Here, visit() is doing the filtering, printing, pruning, etc. that was written on the command line, e.g.

$ bfs -nohidden -name '*.c' -ls

For today, we can ignore all of that and just focus on what dominates the execution time: I/O. A timeline of the I/O operations that bfs does looks like this:

opendir()openat() readdir()getdents() readdir() ... closedir()close() opendir()openat() ...

The C library functions are shown on top, and the syscalls they call are highlighted. These syscalls are invoked synchronously, meaning execution pauses and waits for them to complete before continuing. It would be better to invoke them asynchronously, and do other useful work (i.e. visit()) while we wait for them to complete.

I/O queues

bfs 3.0 introduces I/O queues, an API which offloads these syscalls to background threads, making them appear asynchronous. The new main loop looks more like this now:

while (DIR *dir = ioq_pop()) {
    while (struct dirent *de = readdir(dir)) {
        if (visit(de)) {
            ioq_opendir(path);
        }
    }
    ioq_closedir(dir);
}

(The real implementation is somewhat more complicated, but the main loop is largely similar.) The timeline now looks something like this:

main ioq_pop() readdir() ioq_opendir() ... ioq_closedir() ioq_pop() ...
ioq opendir()openat() readdir()getdents() closedir()close() opendir()openat() readdir()getdents() ...

By going asynchronous, bfs can be more efficient because it doesn't have to stall waiting for I/O. The main thread may occasionally block waiting on the I/O queue, but in the best case it can pop entries with just a couple atomic memory operations. This benefit is separate from parallelism—even on a single-core machine, bfs can now do useful work where it previously would have been blocked on I/O. That said, parallelism is beneficial too:

main ioq_pop() readdir() ioq_opendir() ... ioq_closedir() ioq_pop() ...
ioq opendir()openat() readdir()getdents() closedir()close() ...
ioq opendir()openat() readdir()getdents() closedir()close() ...
...

bfs 3.0 will create up to 8 threads to process I/O queue operations, depending on how many cores you have. It does not scale very well beyond that point, being bottlenecked on the single main thread. You can override the default thread count choice with -j1, -j2, etc., (like make).

What about io_uring?

If you've been keeping up with recent developments in async I/O (at least on Linux), you may be wondering why bfs doesn't use io_uring. The answer is pretty simple: io_uring doesn't support the getdents() syscall that underlies readdir(). There is a patchset implementing it that's on track to being merged, and bfs has a pull request ready to take advantage of it.

Interestingly, just using a single io_uring instance was not competitive with bfs's I/O queues in my benchmarks. My current best implementation uses one io_uring per thread, and uses the same I/O queue implementation to send commands back and forth with the main thread. Of course, performance may be different once IORING_OP_GETDENTS is integrated into the kernel.

So how fast is it?

The point of the ~5,000 lines of code it took to implement I/O queues was to improve performance. To measure that benefit, I used a combination of hyperfine and my new benchmark harness tailfin. The results speak for themselves.

For exploring my entire home directory (~7.6 million files), the new bfs is 2.2× faster than the previous release series, and almost 3× faster than GNU find. Surprisingly, fd was the slowest in this benchmark.

Benchmark 1: fd -u '^$' ~
  Time (mean ± σ):      8.209 s ±  0.023 s    [User: 31.908 s, System: 339.069 s]
  Range (minmax):    8.170 s8.250 s    16 runs

Benchmark 2: find ~ -false
  Time (mean ± σ):      7.021 s ±  0.045 s    [User: 1.138 s, System: 5.783 s]
  Range (minmax):    6.968 s7.103 s    16 runs

Benchmark 3: bfs-2.6.3 ~ -false
  Time (mean ± σ):      5.307 s ±  0.030 s    [User: 0.779 s, System: 4.355 s]
  Range (minmax):    5.268 s5.395 s    16 runs

Benchmark 4: bfs-3.0.1 ~ -false
  Time (mean ± σ):      2.416 s ±  0.023 s    [User: 2.823 s, System: 11.547 s]
  Range (minmax):    2.369 s2.454 s    16 runs

Summary
  bfs-3.0.1 ~ -false ran
    2.20 ± 0.02 times faster than bfs-2.6.3 ~ -false
    2.91 ± 0.03 times faster than find ~ -false
    3.40 ± 0.03 times faster than fd -u '^$' ~

I used the expression -false for these benchmarks so that visit() does the absolute minimum amount of work possible, and the benchmark focuses on just directory traversal. Similarly, for fd I used the regex ^$ which never matches a filename.

Why breadth-first?

A big motivation for bfs is that in my experience, if you're looking for a particular file, breadth-first search usually finds it earlier than depth-first. That's because most interesting files tend to be close to the root (e.g. ~/Downloads/pay_slip.pdf), while most files tend to be deeper in the tree. (And if you're looking for a deep file, you'll probably have to search a large portion of the tree anyway.) This makes it great for interactive use with things like fzf.

I measured its advantage for this use case using a command line that quits as soon as a match is found. My whole home directory contains only one file with this name, at depth 3.

Benchmark 1: fd -usg1 kd-forest.mkv ~
  Time (mean ± σ):      2.783 s ±  1.761 s    [User: 10.441 s, System: 113.302 s]
  Range (minmax):    0.847 s5.316 s    10 runs

Benchmark 2: find ~ -name kd-forest.mkv -quit
  Time (mean ± σ):      4.786 s ±  0.023 s    [User: 1.476 s, System: 3.245 s]
  Range (minmax):    4.757 s4.829 s    10 runs

Benchmark 3: bfs-2.6.3 ~ -name kd-forest.mkv -quit
  Time (mean ± σ):      15.9 ms ±   1.1 ms    [User: 6.9 ms, System: 8.8 ms]
  Range (minmax):     6.7 ms24.4 ms    168 runs

Benchmark 4: bfs-3.0.1 ~ -name kd-forest.mkv -quit
  Time (mean ± σ):      11.2 ms ±   1.1 ms    [User: 5.2 ms, System: 37.4 ms]
  Range (minmax):     6.0 ms14.8 ms    219 runs

Summary
  bfs-3.0.1 ~ -name kd-forest.mkv -quit ran
    1.42 ± 0.17   times faster than bfs-2.6.3 ~ -name kd-forest.mkv -quit
  248.68 ± 159.28 times faster than fd -usg1 kd-forest.mkv ~
  427.64 ± 42.68  times faster than find ~ -name kd-forest.mkv -quit

This is the main use case I care about (basically recursive tab completion), and bfs is 430× faster than GNU find.