bfs
3.0: the fastest find
yet!
bfs
is a tool I wrote to do breadth-first search through a filesystem.
It started out simply enough, but over the years it's grown to include almost every feature from every other find
implementation I could find, plus many of its own innovations.
The new 3.0 release series includes the biggest new feature so far: asynchronous, parallel directory traversal.
The kitchen sync
In pseudocode, the way bfs
used to work looked like this:
while (const char *path = popdir()) {
DIR *dir = opendir(path);
while (struct dirent *de = readdir(dir)) {
if (visit(de)) {
pushdir(de);
}
}
closedir(dir);
}
Here, visit()
is doing the filtering, printing, pruning, etc. that was written on the command line, e.g.
$ bfs -nohidden -name '*.c' -ls
For today, we can ignore all of that and just focus on what dominates the execution time: I/O.
A timeline of the I/O operations that bfs
does looks like this:
opendir()
openat()
readdir()
getdents()
readdir()
...
closedir()
close()
opendir()
openat()
...
The C library functions are shown on top, and the syscalls they call are highlighted.
These syscalls are invoked synchronously, meaning execution pauses and waits for them to complete before continuing.
It would be better to invoke them asynchronously, and do other useful work (i.e. visit()
) while we wait for them to complete.
I/O queues
bfs
3.0 introduces I/O queues, an API which offloads these syscalls to background threads, making them appear asynchronous.
The new main loop looks more like this now:
while (DIR *dir = ioq_pop()) {
while (struct dirent *de = readdir(dir)) {
if (visit(de)) {
ioq_opendir(path);
}
}
ioq_closedir(dir);
}
(The real implementation is somewhat more complicated, but the main loop is largely similar.) The timeline now looks something like this:
main
ioq_pop()
readdir()
ioq_opendir()
...
ioq_closedir()
ioq_pop()
...
ioq
opendir()
openat()
readdir()
getdents()
closedir()
close()
opendir()
openat()
readdir()
getdents()
...
By going asynchronous, bfs
can be more efficient because it doesn't have to stall waiting for I/O.
The main thread may occasionally block waiting on the I/O queue, but in the best case it can pop entries with just a couple atomic memory operations.
This benefit is separate from parallelism—even on a single-core machine, bfs
can now do useful work where it previously would have been blocked on I/O.
That said, parallelism is beneficial too:
main
ioq_pop()
readdir()
ioq_opendir()
...
ioq_closedir()
ioq_pop()
...
ioq
opendir()
openat()
readdir()
getdents()
closedir()
close()
...
ioq
opendir()
openat()
readdir()
getdents()
closedir()
close()
...
...
bfs
3.0 will create up to 8 threads to process I/O queue operations, depending on how many cores you have.
It does not scale very well beyond that point, being bottlenecked on the single main thread.
You can override the default thread count choice with -j1
, -j2
, etc., (like make
).
What about io_uring?
If you've been keeping up with recent developments in async I/O (at least on Linux), you may be wondering why bfs
doesn't use io_uring.
The answer is pretty simple: io_uring doesn't support the getdents()
syscall that underlies readdir()
.
There is a patchset implementing it that's on track to being merged, and bfs
has a pull request ready to take advantage of it.
Interestingly, just using a single io_uring instance was not competitive with bfs
's I/O queues in my benchmarks.
My current best implementation uses one io_uring per thread, and uses the same I/O queue implementation to send commands back and forth with the main thread.
Of course, performance may be different once IORING_OP_GETDENTS
is integrated into the kernel.
So how fast is it?
The point of the ~5,000 lines of code it took to implement I/O queues was to improve performance. To measure that benefit, I used a combination of hyperfine and my new benchmark harness tailfin. The results speak for themselves.
For exploring my entire home directory (~7.6 million files), the new bfs
is 2.2× faster than the previous release series, and almost 3× faster than GNU find.
Surprisingly, fd
was the slowest in this benchmark.
Benchmark 1: fd -u '^$' ~
Time (mean ± σ): 8.209 s ± 0.023 s [User: 31.908 s, System: 339.069 s]
Range (min … max): 8.170 s … 8.250 s 16 runs
Benchmark 2: find ~ -false
Time (mean ± σ): 7.021 s ± 0.045 s [User: 1.138 s, System: 5.783 s]
Range (min … max): 6.968 s … 7.103 s 16 runs
Benchmark 3: bfs-2.6.3 ~ -false
Time (mean ± σ): 5.307 s ± 0.030 s [User: 0.779 s, System: 4.355 s]
Range (min … max): 5.268 s … 5.395 s 16 runs
Benchmark 4: bfs-3.0.1 ~ -false
Time (mean ± σ): 2.416 s ± 0.023 s [User: 2.823 s, System: 11.547 s]
Range (min … max): 2.369 s … 2.454 s 16 runs
Summary
bfs-3.0.1 ~ -false ran
2.20 ± 0.02 times faster than bfs-2.6.3 ~ -false
2.91 ± 0.03 times faster than find ~ -false
3.40 ± 0.03 times faster than fd -u '^$' ~
I used the expression -false
for these benchmarks so that visit()
does the absolute minimum amount of work possible, and the benchmark focuses on just directory traversal.
Similarly, for fd
I used the regex ^$
which never matches a filename.
Why breadth-first?
A big motivation for bfs
is that in my experience, if you're looking for a particular file, breadth-first search usually finds it earlier than depth-first.
That's because most interesting files tend to be close to the root (e.g. ~/Downloads/pay_slip.pdf
), while most files tend to be deeper in the tree.
(And if you're looking for a deep file, you'll probably have to search a large portion of the tree anyway.)
This makes it great for interactive use with things like fzf.
I measured its advantage for this use case using a command line that quits as soon as a match is found. My whole home directory contains only one file with this name, at depth 3.
Benchmark 1: fd -usg1 kd-forest.mkv ~
Time (mean ± σ): 2.783 s ± 1.761 s [User: 10.441 s, System: 113.302 s]
Range (min … max): 0.847 s … 5.316 s 10 runs
Benchmark 2: find ~ -name kd-forest.mkv -quit
Time (mean ± σ): 4.786 s ± 0.023 s [User: 1.476 s, System: 3.245 s]
Range (min … max): 4.757 s … 4.829 s 10 runs
Benchmark 3: bfs-2.6.3 ~ -name kd-forest.mkv -quit
Time (mean ± σ): 15.9 ms ± 1.1 ms [User: 6.9 ms, System: 8.8 ms]
Range (min … max): 6.7 ms … 24.4 ms 168 runs
Benchmark 4: bfs-3.0.1 ~ -name kd-forest.mkv -quit
Time (mean ± σ): 11.2 ms ± 1.1 ms [User: 5.2 ms, System: 37.4 ms]
Range (min … max): 6.0 ms … 14.8 ms 219 runs
Summary
bfs-3.0.1 ~ -name kd-forest.mkv -quit ran
1.42 ± 0.17 times faster than bfs-2.6.3 ~ -name kd-forest.mkv -quit
248.68 ± 159.28 times faster than fd -usg1 kd-forest.mkv ~
427.64 ± 42.68 times faster than find ~ -name kd-forest.mkv -quit
This is the main use case I care about (basically recursive tab completion), and bfs
is 430× faster than GNU find.