I'm Tavian, a master's student in the Reliable Computer Systems lab at the David R. Cheriton School of Computer Science, University of Waterloo. I used to work at Microsoft Research Montreal, and before that at a startup called Maluuba, and at the University of Calgary.

Check out some of the projects I'm working on, or read some of the highlights from my blog (the full archive is in the sidebar):

I'm probably @tavianator on your favourite website:


Here are some projects I'm currently working on:

  • bfs: A breadth-first version of the UNIX find command
  • acap: Nearest neighbor search in Rust
  • bistring: Non-destructive string processing

These are some archived projects I'm no longer actively maintaining:

You can browse many of these projects on my cgit instance as well.


Breadth-first search for your files.


bfs is a variant of the UNIX find command that operates breadth-first rather than depth-first. It is otherwise compatible with many versions of find, including

If you're not familiar with find, the GNU find manual provides a good introduction.


As Close As Possible — nearest neighbor search in Rust.

use acap::euclid::Euclidean;
use acap::vp::VpTree;
use acap::NearestNeighbors;

let tree = VpTree::balanced(vec![
    Euclidean([3, 4]),
    Euclidean([5, 12]),
    Euclidean([8, 15]),
    Euclidean([7, 24]),

let nearest = tree.nearest(&[7, 7]).unwrap();
assert_eq!(nearest.item, &Euclidean([3, 4]));
assert_eq!(nearest.distance, 5);


The bistring library provides non-destructive versions of common string processing operations like normalization, case folding, and find/replace. Each bistring remembers the original string, and how its substrings map to substrings of the modified version.

For example:

>>> from bistring import bistr
>>> s = bistr('𝕿𝖍𝖊 𝖖𝖚𝖎𝖈𝖐, 𝖇𝖗𝖔𝖜𝖓 🦊 𝖏𝖚𝖒𝖕𝖘 𝖔𝖛𝖊𝖗 𝖙𝖍𝖊 𝖑𝖆𝖟𝖞 🐶')
>>> s = s.normalize('NFKD')     # Unicode normalization
>>> s = s.casefold()            # Case-insensitivity
>>> s = s.replace('🦊', 'fox')  # Replace emoji with text
>>> s = s.replace('🐶', 'dog')
>>> s = s.sub(r'[^\w\s]+', '')  # Strip everything but letters and spaces
>>> s = s[:19]                  # Extract a substring
>>> s.modified                  # The modified substring, after changes
'the quick brown fox'
>>> s.original                  # The original substring, before changes
'𝕿𝖍𝖊 𝖖𝖚𝖎𝖈𝖐, 𝖇𝖗𝖔𝖜𝖓 🦊'


Try selecting some of the “original” or “modified” text below, or editing the code block!

import BiString, * as bistring from "bistring";

let s = (function() {


s.original == ""

s.modified == ""


2020-09-07 Tavian Barnes GitHub

Memory allocation failures are notoriously hard to handle correctly. Since they're generally rare, and non-deterministic when they do happen, C code paths that handle malloc() returning NULL are typically under-tested and thus frequently buggy. Compounding the issue is the bad advice found often online that since Linux will overcommit by default, checking malloc() for NULL is useless. Not only can malloc() sometimes return NULL even with the default overcommit setting anyway, that's not the only configuration that your code may run in. It's perfectly fine for your program to explicitly abort() if malloc() fails, but library code should more carefully propagate these errors to allow people to write more resilient applications. In writing bfs, I've tried to properly check for and handle all allocation failures, mainly to prove to myself that it's possible.

Recently while staring at the code of bfs, I noticed a bug in an error handling path. While parsing the group table, if a memory allocation failed at exactly the wrong spot, a struct group would be left in an inconsistent state: some fields would be newly allocated, while others would be copied from the original struct. The error path assumed every field was our own allocation, and so tried to free them all. But some of those pointers weren't ours to free, and the invalid free() would lead to a crash.

The fix was pretty simple: moving the initialization to NULL of the gr_mem field up above any allocations, so that the error paths always see NULL instead of the pointer they shouldn't be freeing. Confirming the fix took more effort, since the bug is basically impossible to reproduce. And since error paths like this are notoriously hard to test, I was worried there were more bugs like this lurking elsewhere. So, I made a tool to help me find them.

Injecting faults

In order to test code paths that handle malloc() failure, I built a library that overrides malloc() and friends in order to deterministically inject failures. The implementation is fairly simple: it maintains a count of the number of allocations that have happened so far, and if that count reaches a particular value, it returns NULL instead of delegating to the real malloc() implementation. Since it simulates an Out Of Memory (OOM) condition, I've called it OOMify.

static bool should_inject(void) {
        size_t n = __atomic_fetch_add(&stats.total, 1, __ATOMIC_SEQ_CST);

        if (n == ctl.inject_at) {
                if (ctl.stop) {
                return true;
        } else if (ctl.inject_after && n >= ctl.inject_at) {
                return true;
        } else {
                return false;

void *malloc(size_t size) {
        __atomic_fetch_add(&stats.malloc, 1, __ATOMIC_RELAXED);

        if (should_inject()) {
                errno = ENOMEM;
                return NULL;
        } else {
                return __libc_malloc(size);

The above implementation is specific to glibc, and therefore mainly useful on Linux, but it could be easily adjusted to support more platforms. This code is compiled into a library liboomify.so. A separate oomify program uses LD_PRELOAD to inject this library into another program which it spawns, communicating with liboomify via pipes.

By default, oomify first runs the given program once without injecting any failures, to count the number of allocations it normally performs. Then it runs the program again that many times, making each of those allocations fail separately. As such, it works best for quick, derterministic, single-threaded programs.

Finding bugs

Running oomify on an unpatched bfs finds the bug that motivated its existence, which is good.

$ oomify bfs >/dev/null
oomify: bfs did 2311 allocations
malloc(): Cannot allocate memory
malloc(): Cannot allocate memory
cfdup(): Cannot allocate memory
cfdup(): Cannot allocate memory
cannot allocate memory for thread-local data: ABORT
Assertion 's' failed at src/shared/json.c:1760, function json_variant_format(). Aborting.
oomify: alloc 1285: bfs terminated with signal 6 (Aborted)
Assertion 's' failed at src/shared/json.c:1760, function json_variant_format(). Aborting.
oomify: alloc 1314: bfs terminated with signal 6 (Aborted)
munmap_chunk(): invalid pointer
oomify: alloc 1486: bfs terminated with signal 6 (Aborted)
munmap_chunk(): invalid pointer
oomify: alloc 1490: bfs terminated with signal 6 (Aborted)
oomify: alloc 1494: bfs terminated with signal 11 (Segmentation fault)

We can pass -n1486 to fail just that allocation, and -s to trigger a SIGSTOP at that point in order to attach a debugger.

$ oomify -n1486 -s bfs >/dev/null &
$ gdb bfs $(pgrep bfs)
GNU gdb (GDB) 9.2
0x00007fdac07325f3 in raise () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007fdac07325f3 in raise () from /usr/lib/libc.so.6
#1  0x00007fdac08d4283 in should_inject () at oominject.c:31
#2  0x00007fdac08d42ca in malloc (size=5) at oominject.c:49
#3  0x00007fdac0784b4f in strdup () from /usr/lib/libc.so.6
#4  0x000055f3e91db121 in bfs_parse_groups () at pwcache.c:187
#5  0x000055f3e91d8620 in parse_cmdline (argc=1, argv=0x7ffd9f01d958) at parse.c:3555
#6  0x000055f3e91cdfdb in main (argc=1, argv=0x7ffd9f01d958) at main.c:103
(gdb) frame 4
#4  0x000055f3e91db121 in bfs_parse_groups () at pwcache.c:187
187                     ent->gr_name = strdup(ent->gr_name);
(gdb) cont

Program received signal SIGABRT, Aborted.
0x00007fdac0732615 in raise () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007fdac0732615 in raise () from /usr/lib/libc.so.6
#1  0x00007fdac071b862 in abort () from /usr/lib/libc.so.6
#2  0x00007fdac07745e8 in __libc_message () from /usr/lib/libc.so.6
#3  0x00007fdac077c27a in malloc_printerr () from /usr/lib/libc.so.6
#4  0x00007fdac077c6ac in munmap_chunk () from /usr/lib/libc.so.6
#5  0x00007fdac08d43a1 in free (ptr=0x55f3ea69f8b9) at oominject.c:82
#6  0x000055f3e91db42b in bfs_free_groups (groups=0x55f3ea69f850) at pwcache.c:271
#7  0x000055f3e91db2fb in bfs_parse_groups () at pwcache.c:240
#8  0x000055f3e91d8620 in parse_cmdline (argc=1, argv=0x7ffd9f01d958) at parse.c:3555
#9  0x000055f3e91cdfdb in main (argc=1, argv=0x7ffd9f01d958) at main.c:103
(gdb) frame 6
#6  0x000055f3e91db42b in bfs_free_groups (groups=0x55f3ea69f850) at pwcache.c:271
271                                     free(entry->gr_mem[j]);

This is exactly what we expect of the original bug. With the patch applied, we see no more of those munmap_chunk(): invalid pointer messages or segmentation faults:

$ oomify bfs >/dev/null
oomify: bfs did 2147 allocations
malloc(): Cannot allocate memory
malloc(): Cannot allocate memory
cfdup(): Cannot allocate memory
cfdup(): Cannot allocate memory
cannot allocate memory for thread-local data: ABORT
Assertion 's' failed at src/shared/json.c:1760, function json_variant_format(). Aborting.
oomify: alloc 1282: bfs terminated with signal 6 (Aborted)
Assertion 's' failed at src/shared/json.c:1760, function json_variant_format(). Aborting.
oomify: alloc 1597: bfs terminated with signal 6 (Aborted)
Assertion 's' failed at src/shared/json.c:1760, function json_variant_format(). Aborting.
oomify: alloc 1659: bfs terminated with signal 6 (Aborted)
malloc(): Cannot allocate memory
malloc(): Cannot allocate memory
bftw(): Cannot allocate memory
bftw(): Cannot allocate memory

However, there are some other strange errors. The JSON error is particularly surprising as bfs has nothing to do with JSON. I detailed on Twitter how I tracked that bug through systemd of all things to the underlying issue in glibc.

With a workaround for that bug in place, a new issue emerges:

oomify: alloc 1282: bfs terminated with signal 11 (Segmentation fault)

This bug originates in something called libp11-kit:

(gdb) bt
#0  0x00007fdeca2685f3 in raise () from /usr/lib/libc.so.6
#1  0x00007fdeca459283 in should_inject () at oominject.c:31
#2  0x00007fdeca4592ca in malloc (size=232) at oominject.c:49
#3  0x00007fdeca26023a in newlocale () from /usr/lib/libc.so.6
#4  0x00007fdec9614135 in ?? () from /usr/lib/libp11-kit.so.0
#5  0x00007fdeca4702de in call_init.part () from /lib64/ld-linux-x86-64.so.2
(gdb) cont

Program received signal SIGSEGV, Segmentation fault.
0x00007fdeca26077a in freelocale () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007fdeca26077a in freelocale () from /usr/lib/libc.so.6
#1  0x00007fdec96140ae in ?? () from /usr/lib/libp11-kit.so.0
#2  0x00007fdeca47068b in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#3  0x00007fdeca26adb7 in __run_exit_handlers () from /usr/lib/libc.so.6
#4  0x00007fdeca26af5e in exit () from /usr/lib/libc.so.6
#5  0x00007fdeca253159 in __libc_start_main () from /usr/lib/libc.so.6
#6  0x0000561b772096ee in _start ()

I tracked this one down and submitted a pull request to fix it.

Another bug surfaces when checking ACLs:

oomify: alloc 2260: bfs terminated with signal 11 (Segmentation fault)

This one is in libacl:

(gdb) bt
#0  0x00007f0e425fc5f3 in raise () from /usr/lib/libc.so.6
#1  0x00007f0e427ed283 in should_inject () at oominject.c:31
#2  0x00007f0e427ed2ca in malloc (size=56) at oominject.c:49
#3  0x00007f0e42796d9c in ?? () from /usr/lib/libacl.so.1
#4  0x00007f0e42795529 in ?? () from /usr/lib/libacl.so.1
#5  0x00007f0e42795d72 in acl_from_mode () from /usr/lib/libacl.so.1
#6  0x00007f0e427953b6 in acl_get_file () from /usr/lib/libacl.so.1
#7  0x0000562e29551584 in bfs_check_acl (ftwbuf=0x7ffd6401fa18) at fsade.c:218
(gdb) cont

Program received signal SIGSEGV, Segmentation fault.
0x00007f0e427946c4 in ?? () from /usr/lib/libacl.so.1
(gdb) bt
#0  0x00007f0e427946c4 in ?? () from /usr/lib/libacl.so.1
#1  0x00007f0e42795e00 in acl_from_mode () from /usr/lib/libacl.so.1
#2  0x00007f0e427953b6 in acl_get_file () from /usr/lib/libacl.so.1
#3  0x0000562e29551584 in bfs_check_acl (ftwbuf=0x7ffd6401fa18) at fsade.c:218

It's a case of passing NULL to a function that doesn't handle it when an allocation fails. I reported it here, and it's also already fixed.

Funnily enough, after all this effort I didn't find any more bugs in my own code, only in others'. But I'm happy to have found and fixed a few issues in some other open source software, and maybe other people will find OOMify useful. You can find it on GitHub if you want to try it out.


2020-07-27 Tavian Barnes

If you've been here before, you've probably noticed that this site has gotten a little makeover recently. The old Wordpress site has been replaced with this new one, a static site generated by mdBook. I ported over most of the content, and tried to avoid breaking any links. If I missed anything, let me know! I did use the opportunity to clean up a few old posts I felt weren't accurate or useful any more.

Actually, it's the "let me know" part that made me drag my feet on this for so long. The one disadvantage to static sites is there's no straightforward way to handle comments. But honestly, the old comment system didn't work great either. There were no push notifications in either direction, so it was hard to have a back and forth conversation. My new approach to comments is simple: send them to me on social media, e.g. on Twitter. I'll edit in any relevant conversations. That should scale fairly well for now.

The source code to this site is on GitHub as well. Feel free to use the issues tab to ask questions or report mistakes, or better yet, send a PR to fix them :)

Porting k-d forests to Rust

2020-05-03 Tavian Barnes Reddit

I recently decided to give a serious go to learning Rust. As my first non-toy project, I decided to port an old piece of code I'd been wanting to dust off anyway: k-d forests. The basic premise of it is to generate images with every possible 8-bit RGB color, by placing each color next to the most similarly-colored pixel placed so far.

The core functionality of that code is an implementation of nearest neighbor search in a metric space, so an obvious place to start is defining a trait to represent a distance metric. The most obvious definition would be something like this:

/// A metric space.
pub trait Metric {
    /// Computes the distance between this point and another point.
    fn distance(&self, other: &Self) -> f64;

I made two enhancements. The first was to allow distances to be represented by an arbitrary orderable type. Often we just want to know whether one point is closer than another, and we can do that faster than actually computing the exact distances. This is true with Euclidean distance, for example, because we can compare distances without taking expensive square roots. This is known as an order embedding of the actual distance.

// An order embedding for distances.
pub trait Distance: Copy + From<f64> + Into<f64> + Ord {}

The second enhancement was to make Metric generic, to allow computing distances between different types. This is useful for nearest neighbor search, where the dataset may have more information than the query points (imagine searching for the closest Business to a GpsLocation). The revised Metric trait looks like this:

/// A metric space.
pub trait Metric<T: ?Sized = Self> {
    /// The type used to represent distances.
    type Distance: Distance;

    /// Computes the distance between this point and another point.
    fn distance(&self, other: &T) -> Self::Distance;

To actually perform nearest neighbor searches against a set of points, we need data structures that index those points:

/// A nearest neighbor to a target.
pub struct Neighbor<T> {
    /// The found item.
    pub item: T,
    /// The distance from the target.
    pub distance: f64,

/// A nearest neighbor search index.
pub trait NearestNeighbors<T, U: Metric<T> = T> {
    /// Returns the nearest match to `target` (or `None` if this index is empty).
    fn nearest(&self, target: &U) -> Option<Neighbor<&T>>;

    /// Returns the nearest match to `target` within the `threshold`, if one exists.
    fn nearest_within(&self, target: &U, threshold: f64) -> Option<Neighbor<&T>>;

    /// Returns the up to `k` nearest matches to `target`.
    fn k_nearest(&self, target: &U, k: usize) -> Vec<Neighbor<&T>>;

    /// Returns the up to `k` nearest matches to `target` within the `threshold`.
    fn k_nearest_within(
        target: &U,
        k: usize,
        threshold: f64,
    ) -> Vec<Neighbor<&T>>;

These methods could all be implemented in terms of the ones below them, e.g. fn nearest(...) { self.k_nearest(target, 1).pop() }. But it will be handy to introduce a special type for accumulating results during a search:

/// Accumulates nearest neighbor search results.
pub trait Neighborhood<T, U: Metric<T>> {
    /// Returns the target of the nearest neighbor search.
    fn target(&self) -> U;

    /// Check whether a distance is within this neighborhood.
    fn contains(&self, distance: f64) -> bool {
        distance < 0.0 || self.contains_distance(distance.into())

    /// Check whether a distance is within this neighborhood.
    fn contains_distance(&self, distance: U::Distance) -> bool;

    /// Consider a new candidate neighbor.
    fn consider(&mut self, item: T) -> U::Distance;

The nearest*() methods use a simple implementation that keeps track of the best match seen so far. The k_nearest*() methods use a more complicated implementation that keeps track of the k best matches so far with a priority queue. Ultimately, all of the NearestNeighbors functionality can be provided by implementing just one method:

pub trait NearestNeighbors<T, U: Metric<T> = T> {
    /// Search for nearest neighbors and add them to a neighborhood.
    fn search<'a, 'b, N>(&'a self, neighborhood: N) -> N
        T: 'a,
        U: 'b,
        N: Neighborhood<&'a T, &'b U>;

The simplest possible implementation just does an exhaustive search:

/// A NearestNeighbors implementation that does exhaustive search.
pub struct ExhaustiveSearch<T>(Vec<T>);
impl<T, U: Metric<T>> NearestNeighbors<T, U> for ExhaustiveSearch<T> {
    fn search<'a, 'b, N>(&'a self, neighborhood: N) -> N
        T: 'a,
        U: 'b,
        N: Neighborhood<&'a T, &'b U>,
        for e in &self.0 {

Vantage-point trees

Before covering k-d trees, I'm going to go over my implementation of vantage-point trees (VP trees). VP trees are structurally similar to k-d trees, but more general as they work for any metric space, not just $\mathbb{R}^n$. Every node in a VP tree holds a point and a radius, and two subtrees for the points inside/outside the radius.

/// A node in a VP tree.
struct VpNode<T> {
    /// The vantage point itself.
    item: T,
    /// The radius of this node.
    radius: f64,
    /// The subtree inside the radius, if any.
    inside: Option<Box<Self>>,
    /// The subtree outside the radius, if any.
    outside: Option<Box<Self>>,

/// A vantage-point tree.
pub struct VpTree<T> {
    /// The root node of the tree.
    root: Option<Box<VpNode<T>>>,

Searching is fairly simple: we recurse down the tree, using the triangle inequality property of metric spaces to prune subtrees. For a target point $t$, current distance threshold $\tau$, and a node with point $p$ and radius $r$, it's only necessary to search the "inside" subtree when $d(t,p) - r \le \tau$. Similarly, we only need to search the outside subtree when $r - d(t,p) \le \tau$. As a heuristic, we search the subtree that contains the target point first. In code, that looks like:

impl<'a, T, U, N> VpSearch<'a, T, U, N> for VpNode<T>
    T: 'a,
    U: Metric<&'a T>,
    N: Neighborhood<&'a T, U>,
    fn search(&'a self, neighborhood: &mut N) {
        let distance = neighborhood.consider(&self.item).into();
        if distance <= self.radius {
            self.search_inside(distance, neighborhood);
            self.search_outside(distance, neighborhood);
        } else {
            self.search_outside(distance, neighborhood);
            self.search_inside(distance, neighborhood);

    fn search_inside(&'a self, distance: f64, neighborhood: &mut N) {
        if let Some(inside) = &self.inside {
            if neighborhood.contains(distance - self.radius) {

    fn search_outside(&'a self, distance: f64, neighborhood: &mut N) {
        if let Some(outside) = &self.outside {
            if neighborhood.contains(self.radius - distance) {

k-d trees

k-d trees have a similar structure. Every node in a k-d tree holds a point $p$, and splits its children into two subtrees based on a coordinate $i$. Points $q$ with $q_i < p_i$ go in the left subtree, and those with $q_i > p_i$ go in the right.

/// A node in a k-d tree.
struct KdNode<T> {
    /// The value stored in this node.
    item: T,
    /// The left subtree, if any.
    left: Option<Box<Self>>,
    /// The right subtree, if any.
    right: Option<Box<Self>>,

/// A k-d tree.
pub struct KdTree<T> {
    root: Option<Box<KdNode<T>>>,

In order to extract coordinate values from points, we define a trait to represent points embedded in Cartesian space:

/// A point in Cartesian space.
pub trait Cartesian: Metric<[f64]> {
    /// Returns the number of dimensions necessary to describe this point.
    fn dimensions(&self) -> usize;

    /// Returns the `i`th coordinate of this point (`i < self.dimensions()`).
    fn coordinate(&self, i: usize) -> f64;

/// Marker trait for cartesian metric spaces.
pub trait CartesianMetric<T: ?Sized = Self>:
    Cartesian + Metric<T, Distance = <Self as Metric<[f64]>>::Distance>

With that out of the way, the k-d tree implementation is also simple:

impl<T: Cartesian> KdNode<T> {
    /// Recursively search for nearest neighbors.
    fn search<'a, U, N>(&'a self, i: usize, closest: &mut [f64], neighborhood: &mut N)
        T: 'a,
        U: CartesianMetric<&'a T>,
        N: Neighborhood<&'a T, U>,

        let target = neighborhood.target();
        let ti = target.coordinate(i);
        let si = self.item.coordinate(i);
        let j = (i + 1) % self.item.dimensions();

        let (near, far) = if ti <= si {
            (&self.left, &self.right)
        } else {
            (&self.right, &self.left)

        if let Some(near) = near {
            near.search(j, closest, neighborhood);

        if let Some(far) = far {
            let saved = closest[i];
            closest[i] = si;
            if neighborhood.contains_distance(target.distance(closest)) {
                far.search(j, closest, neighborhood);
            closest[i] = saved;

We can optimize the layout a bit by storing all the nodes together in a flat Vec rather than separate Boxes.


k-d trees and VP trees are typically constructed by bulk-loading all points up front. While it's possible to add and remove points from these trees after construction, doing so tends to unbalance the trees. Luckily, there is a general technique to create a dynamic data structure from a static one called "dynamization." k-d forests are a straightforward application of dynamization to k-d trees.

To dynamize insertion, we replace our single tree with multiple trees of exponentially increasing size. As described in the original post, we typically only have to rebuild a few small trees on every insertion. Think of it like incrementing a binary number; only when we have a carry at a particular bit do we have to rebuild the corresponding tree.

/// A dynamic wrapper for a static nearest neighbor search data structure.
pub struct Forest<T>(Vec<Option<T>>);

impl<T, U> Forest<U>
    U: FromIterator<T> + IntoIterator<Item = T>,
    /// Add a new item to the forest.
    pub fn push(&mut self, item: T) {
        let mut items = vec![item];

        for slot in &mut self.0 {
            match slot.take() {
                // Collect the items from any trees we encounter...
                Some(tree) => {
                // ... and put them all in the first empty slot
                None => {
                    *slot = Some(items.into_iter().collect());


/// Implement NearestNeighbors for Forest<V> whenever it's implemented for V.
impl<T, U, V> NearestNeighbors<T, U> for Forest<V>
    U: Metric<T>,
    V: NearestNeighbors<T, U>,
    fn search<'a, 'b, N>(&'a self, neighborhood: N) -> N
        T: 'a,
        U: 'b,
        N: Neighborhood<&'a T, &'b U>,
            .fold(neighborhood, |n, t| t.search(n))

This can be optimized further in the case that multiple elements are inserted at once.

Soft deletion

To dynamize deletion, we do a "soft delete": simply set a deleted flag to true for that item. When the number of deleted items reaches a threshold fraction of the total size, we rebuild the trees without those items. A custom Neighborhood implementation lets us easily skip the deleted items during searches, without having to modify the underlying trees at all! I think that's pretty slick.

/// A trait for objects that can be soft-deleted.
pub trait SoftDelete {
    /// Check whether this item is deleted.
    fn is_deleted(&self) -> bool;

/// Neighborhood wrapper that ignores soft-deleted items.
struct SoftNeighborhood<N>(N);

impl<T, U, N> Neighborhood<T, U> for SoftNeighborhood<N>
    T: SoftDelete,
    U: Metric<T>,
    N: Neighborhood<T, U>,
    fn target(&self) -> U {

    fn contains_distance(&self, distance: U::Distance) -> bool {

    fn consider(&mut self, item: T) -> U::Distance {
        if item.is_deleted() {
        } else {

/// A NearestNeighbors implementation that supports soft deletes.
pub struct SoftSearch<T>(T);

impl<T, U, V> NearestNeighbors<T, U> for SoftSearch<V>
    T: SoftDelete,
    U: Metric<T>,
    V: NearestNeighbors<T, U>,
    fn search<'a, 'b, N>(&'a self, neighborhood: N) -> N
        T: 'a,
        U: 'b,
        N: Neighborhood<&'a T, &'b U>,


This code is available on GitHub here. I had a good experience re-writing it in Rust. I had to ask a couple questions, but got good answers quickly and with a minimum of trademarked Stack Overflow condescension. Prior to this, the only Rust I had written were these 36 lines which were actually just rearranged from existing code, so I was surprised how easy it was to pick up — with the caveat that I've been reading about Rust for quite a while now, even if I hadn't written much.

The new code is quite a bit more general than the old C code, which had the data structure hard-coded to its application of picking nearby colors. Despite the extra levels of abstraction, the Rust implementation is only about 10–20% slower than the C one. And there's even a k-d tree building optimization in the C version that I didn't get around to implementing in Rust, since it requires a bit of tricky interior mutability. That'll probably be the subject of a future post.

The extra generality let me easily implement a couple new features too: you can now use an existing image as the source of colors, instead of the all-RGB-colors cube (that's how I generated the banner image, from this painting). And you can use an image as a target too, placing pixels over their closest counterparts in a given image. Finally, I rendered some videos of the images being generated, which you can find on this YouTube playlist (updated to fix YouTube issues).

spawn() of Satan

2018-09-21 Tavian Barnes Reddit

The "Unix philosophy" is all about small programs that do one thing well, but easily work together to accomplish complicated things. The summary I'm most familiar with is Peter H. Salus's

  • Write programs that do one thing and do it well.
  • Write programs to work together.
  • Write programs to handle text streams, because that is a universal interface.

The Unix creators themselves, Ken Thompson and Dennis Ritchie, listed as their first design consideration

  • Make it easy to write, test, and run programs.

One could reasonably assume, then, that it would be easy to write a program that runs other programs, in the programming language of Unix (C). Maybe something like this:

pid_t pid = spawn("echo", "Hello", "World!", NULL);

But anyone familiar with Unix programming knows that's not how it works. Rather than creating new processes from scratch, you first create an exact copy of yourself with fork(), then use exec() to replace that copy with what you wanted to execute.

pid_t pid = fork();
if (pid < 0) {
} else if (pid == 0) {
	// Child
	execlp("echo", "echo", "Hello", "World!", NULL);
	// If exec() returns, it failed
} else {
	// Parent

The good

fork() and exec() seem like weird primitives to provide at first. With most things in life, we think about creating them by building them up out of their constituents, not by first copying something else and then molding the copy into what you wanted. But despite the apparent strangeness, fork() and exec() are actually pretty convenient once you want to do anything slightly more complicated than just launch a new process.

Want to run the new process in a new working directory? Or launch it with reduced privileges? Maybe redirect its standard streams to files or pipe it to another process? With the fork()/exec() model, that's all easy: just do whatever you want to do after fork() and before exec().

} else if (pid == 0) {
	// This is something like the shell command
	//     cd ~; grep -ri 'password' >passwords 2>/dev/null
	// (error checking omitted)
	const char *home = getpwuid(getuid())->pw_dir;
	dup2(open("passwords", O_WRONLY), STDOUT_FILENO);
	dup2(open("/dev/null", O_WRONLY), STDERR_FILENO);
	execlp("grep", "grep", "-ri", "password", NULL);
} ...

If you had some API that combined everything into one call, you'd need parameters for every possible thing you might want to customize about the new process. Windows, for example, has at least four variants of the CreateProcess() function, each taking 10+ arguments.

The bad

Even once you're used to the strangeness of fork(), it does have a few real downsides. For one, it's slow. Though we're long passed the days when the entire memory of the forking process had to be copied, a modern copy-on-write fork() implementation still has to copy metadata like page tables that grows with the size of the process's address space. This makes fork() slow for big processes, and that work is often wasted if the forked process is just going to exec() soon. There is an alternative, vfork(), which doesn't do any unnecessary copying, but it's difficult to use safely. vfork() creates something more like a new thread than a new process, where the child shares memory with the parent. The child must then be arduously careful to avoid corrupting the parent's state, negating many of the "good" things above about fork().

Another downside to fork() is that it's basically impossible to use safely from a multithreaded program. fork() copies just the calling thread, potentially in the middle of operations by other threads. If, for example, another thread had locked a mutex (maybe within some implementation detail of a C library function like malloc()), it will remain locked in the child process, easily leading to deadlock. POSIX provides pthread_atfork() as an attempt to mitigate this, but its man page notes

The intent of pthread_atfork() was to provide a mechanism whereby the application (or a library) could ensure that mutexes and other process and thread state would be restored to a consistent state. In practice, this task is generally too difficult to be practicable.

After a fork(2) in a multithreaded process returns in the child, the child should call only async-signal-safe functions (see signal-safety(7)) until such time as it calls execve(2) to execute a new program.

fork() is also basically impossible to implement on systems without an MMU, since the parent and child should share pointer values that refer to different objects in memory.

The ugly

There is an attempt to mitigate these issues called posix_spawn(). Like other such APIs, it supports a laundry list of configuration options for flexibility in launching the new process. The above example might be rewritten like this using posix_spawn():

#include <spawn.h>

// Error checking omitted
posix_spawn_file_actions_t actions;
posix_spawn_file_actions_addopen(&actions, STDOUT_FILENO, "passwords", O_WRONLY, 0);
posix_spawn_file_actions_addopen(&actions, STDERR_FILENO, "/dev/null", O_WRONLY, 0);

const char *home = getpwuid(getuid())->pw_dir;
// No posix_spawn() API to do this in the child, so do it in the parent

char *argv[] = {"grep", "-ri", "password", NULL};
extern char **environ;

pid_t pid;
int ret = posix_spawnp(&pid, "grep", &actions, NULL, argv, environ);
if (ret != 0) {
	errno = ret;

Spawning processes is inherently complex, but posix_spawn() is unnecessarily complicated and verbose while still being too limited and incomplete for many use cases. The API looks like this:

int posix_spawn(pid_t *pid, const char *path,
                const posix_spawn_file_actions_t *file_actions,
                const posix_spawnattr_t *attrp,
                char *const argv[], char *const envp[]);
int posix_spawnp(pid_t *pid, const char *file,
                 const posix_spawn_file_actions_t *file_actions,
                 const posix_spawnattr_t *attrp,
                 char *const argv[], char *const envp[]);

posix_spawnp() is like posix_spawn() but it resolves the executable on your PATH, like execvp() vs. execv(). But these functions already take a fair few parameters, including an attributes object (posix_spawnattr_t) with its own flags. Surely, whether to search the PATH could have been controlled with a flag like POSIX_SPAWN_USEPATH instead.

The calling convention is unfamiliar and inconsistent with the rest of POSIX. posix_spawn() returns 0 on success, and a (positive) error code on failure. In contrast, fork() returns the new process ID on success, or -1 on failure, with the error code in errno. This convention would save one function parameter, and make error handling more uniform with the rest of POSIX.

The two types posix_spawn_file_actions_t and posix_spawnattr_t both control various things about the new process. We could save another parameter, and a whole lot of typing, by merging these into one type with a shorter name. In my opinion, this would be a better interface:

pid_t posix_spawn(const char *file, const posix_spawn_t *attrs,
                  char *const argv[], char *const envp[]);

Its use would look something like this:

posix_spawn_t attrs;
posix_spawn_setflags(&attrs, POSIX_SPAWN_USEPATH);

const char *home = getpwuid(getuid())->pw_dir;
posix_spawn_addchdir(&attrs, home);
posix_spawn_addopen(&attrs, STDOUT_FILENO, "passwords", O_WRONLY, 0);
posix_spawn_addopen(&attrs, STDERR_FILENO, "/dev/null", O_WRONLY, 0);

char *argv[] = {"grep", "-ri", "password", NULL};
pid_t pid = posix_spawn("grep", &attrs, argv, NULL);
if (pid < 0) {


The posix_spawn() specification has a section Asynchronous Error Notification that says

A library implementation of posix_spawn() or posix_spawnp() may not be able to detect all possible errors before it forks the child process. …

Thus, no special macros are available to isolate asynchronous posix_spawn() or posix_spawnp() errors. Instead, errors detected by the posix_spawn() or posix_spawnp() operations in the context of the child process before the new process image executes are reported by setting the child's exit status to 127.

Many important error conditions can only be detected by this special 127 error code. This includes things like non-existent executables, failures in any of the file actions, and overlong argument lists. None of these conditions can be distinguished from each other, and furthermore none of them can be distinguished from a successfully launched program returning 127 for some other reason.

The standard states that nothing could be done about this because no more information is available from wait() and they didn't want to extend it just for posix_spawn(). But wait() is not the only available method of interprocess communication! Rich Felker, author of the musl C standard library, realized it could easily be easily achieved with pipes, something like this:

int ret;

int pipefd[2];
if (pipe2(pipefd, O_CLOEXEC) != 0) {
	return errno;

*pid = fork();
if (*pid < 0) {
	ret = errno;
} else if (*pid == 0) {
	// Child

	execve(path, argv, envp);

	// exec() failed, write errno to the pipe
	ret = errno;
	while (write(pipefd[1], &ret, sizeof(ret)) < sizeof(ret));
} else {
	// Parent

	// Read errno from the pipe
	if (read(pipefd[0], &ret, sizeof(ret)) == sizeof(ret)) {
		// Error occurred, clean up the process
		int wstatus;
		waitpid(*pid, &wstatus, 0);
	} else {
		// No error
		ret = 0;

return ret;


Several potentially convenient things are missing from the specification. Most frustrating for me is the lack of posix_spawn_file_actions_add[f]chdir(), though there is now a proposal to add it to the next version (at my suggestion).

Another thing notably missing from POSIX is the execvpe() function, an exec() that lets you specify both the argument vector and environment variables while following the PATH. This would be helpful for implementing posix_spawnp(). Luckily there's a cheeky way to emulate it:

int execvpe(const char *file, char *const argv[], char *const envp[]) {
	extern char **environ;
	environ = envp; // Yeah, just clobber the global variable
	return execvp(file, argv);

For more general use, one would need to be more careful about things like thread safety and restoring the original environment on failure. But it's good enough for posix_spawnp() since we've already fork()ed.


I'm not just venting frustration — I've decided to do something about all this by writing a replacement spawn()-type API here. It's part of bfs, but it should be fairly self-contained. If there's enough interest I could separate it out into its own project.

Cracking DHCE (Diffie-Hellman color exchange)

2018-09-21 Tavian Barnes Reddit

I recently saw a video that explains Diffie–Hellman key exchange in terms of mixing colors of paint. It's a wonderfully simple and informative analogy, that Wikipedia actually uses as well. If you don't know about Diffie-Hellman, definitely watch the video and/or read the Wikipedia page to get a handle on it—it's not that complicated once you get the "trick." The color analogy intrigued me because I know just enough about both cryptography and color theory to be dangerous. So in this post, I'm going to attack the security of the color exchange protocol. ("Real" Diffie-Hellman remains secure, as far as I know.)

To summarize and somewhat formalize the "DHCE" protocol, suppose we have a $\mathrm{mix}(x, y, r)$ function that mixes the colors $x$ and $y$ in an $r : (1 - r)$ ratio. Starting with a shared, public color $S$, Alice and Bob do the following:

Picks a secret color $a$Picks as secret color $b$
Sends $A = \mathrm{mix}(S, a, 1/2)$Sends $B = \mathrm{mix}(S, b, 1/2)$
Computes $k = \mathrm{mix}(a, B, 1/3)$Computes $k = \mathrm{mix}(b, A, 1/3)$

In the end, they end up with the same secret color $k$ because paint mixing is associative and commutative, i.e. it doesn't matter what order you mix the paints in. The $1/3$ ratio in the final mixing step is because $A$ and $B$ contain twice as much paint as $a$ and $b$, and we want the final mixture to contain equal parts of $a$, $b$, and $S$.

But how secret is $k$, really? In order for it to be secret, our eavesdropper Eve must be unable to feasibly compute $k$ given the public information $S$, $A$, and $B$. How hard this is for her depends on the details of the $\mathrm{mix}$ function, specifically how hard it is to invert. If there were a simple $\mathrm{unmix}(x, y, r)$ function such that $\mathrm{unmix}(S, A, 1/2) = a$, that would be terrible for the security of our protocol—Eve could trivially compute a and b from the public information, and use them to reconstruct the key. In fact, such a function does exist.

Recall that mixing paint is a subtractive process. That is, each pigment absorbs some wavelengths of light while reflecting others. The mixture of two pigments absorbs much of the light that either pigment would have absorbed, so adding a pigment to another tends to reduce the number of wavelengths that are strongly reflected. Unfortunately for us, literal subtraction is too simplistic to be a realistic model of subtractive color mixing; a weighted geometric average is more accurate. This excellent article by Scott Allen Burns explains why.

An accurate color mixing model would consider the absorption/reflection of many different ranges of wavelengths, but to a first approximation you can pretend that light is only composed of three components: the primary colors red, green, and blue. Either way, pigments can be represented by a vector in $\mathbb{R}^n$ whose components are the reflectivity of that pigment to different ranges of light wavelengths. Our mixing function is then

\mathrm{mix}(x, y, r) = x^r * y^{1-r}

where $x^r$ denotes elementwise exponentiation, and $*$ is elementwise multiplication. The inverse function is then

\mathrm{unmix}(x, y, r) = (y * x^{-r})^{1/(1-r)}.

To break our protocol, Eve simply computes

a = \mathrm{unmix}(S, A, 1/2) = A^2 * S^{-1}

and uses it to compute

k = \mathrm{mix}(a, B, 1/3) = A^{2/3} * B^{2/3} * S^{-1/3}.

So what's the moral of this story? I guess the point I'm trying to make is that in cryptography, the devil is in the details. The video explains that the security comes from the fact that "given a mixed color, it's hard to reverse it in order to find the exact original colors." But in reality, Eve is not just given the mixed colors, she also knows one of the original colors from before they were mixed. Knowing just $A$ or $B$ doesn't help, but together with $S$, she can reconstruct the secret colors. Sure, she may have to buy some expensive lab equipment to accurately measure the reflectivity of the mixture to many different wavelengths, but she only has to buy it once to read an unlimited number of messages between Alice and Bob. The real Diffie-Hellman protocol remains secure because even knowing both $A$ and $S$, it is hard to invert the "mixing" operation to expose the secret number $a$ needed to compute the secret key.

bfs from the ground up, part 3: optimization

2017-10-12 Tavian Barnes Reddit GitHub

I recently released version 1.1.3 of bfs, my breadth-first drop-in replacement for the UNIX find command. The major change in this release is a refactor of the optimizer, so I figured it would be a good time to write up some of the details of its implementation.

After bfs parses its command line, it applies a series of optimizations to the resulting expression tree, in the hopes that this will make it run faster. bfs supports different levels of optimization (-O1, -O2, etc.), so this post will talk about each level in order, and what new optimization each one introduces.


At -O1, bfs enables a basic set of Boolean simplifications, things like $(\texttt{-true} \mathbin{\texttt{-and}} B) \iff B$. This actually comes up quite often, because options that have no per-file effect are still part of the expression tree. For example,

$ bfs -follow -print

results in the same expression tree as

$ bfs -true -print

(but with the follow flag set).

The full list of simplifications applied by -O1 is:

{} & \texttt{-not} & \texttt{-true} & \iff & \texttt{-false} \\
{} & \texttt{-not} & \texttt{-false} & \iff & \texttt{-true} \\
\texttt{-not} & \texttt{-not} & A & \iff & A & \text{(double negation)} \\
A & \texttt{-and} & \texttt{-true} & \iff & A & \text{(conjunction elimination)} \\
\texttt{-true} & \texttt{-and} & B & \iff & B & \text{(conjunction elimination)}\\
\texttt{-false} & \texttt{-and} & B & \iff & \texttt{-false} \\
A & \texttt{-or} & \texttt{-false} & \iff & A & \text{(disjunctive syllogism)} \\
\texttt{-false} & \texttt{-or} & B & \iff & B & \text{(disjunctive syllogism)} \\
\texttt{-true} & \texttt{-or} & B & \iff & \texttt{-true} \\
\texttt{-not} \, A & \texttt{-and} & \texttt{-not} \, B & \iff & \texttt{-not} \, (A \mathbin{\texttt{-or}} B) & \text{(De Morgan's laws)} \\
\texttt{-not} \, A & \texttt{-or} & \texttt{-not} \, B & \iff & \texttt{-not} \, (A \mathbin{\texttt{-and}} B) & \text{(De Morgan's laws)} \\

Most of these optimizations are just implemented directly with manual pattern matching, e.g.

struct expr *optimize_not_expr(struct expr *expr) {
	if (optlevel >= 1)
		if (expr->rhs == &expr_true) {
			return &expr_false;
		} else if (expr->rhs == &expr_false) {
			return &expr_true;
		} else if (expr->rhs->eval == eval_not) {
			return extract_child_expr(expr, &expr->rhs->rhs);

Some tempting optimizations like $(A \mathbin{\texttt{-and}} \texttt{-false}) \iff \texttt{-false}$ are not actually valid unless we know that $A$ has no side effects. The next level is responsible for that kind of optimization.


Dead code elimination

At -O2, bfs begins to distinguish between expressions that have side effects and those that don't. Expressions like -type f with no side effects can be completely removed if their return value is never used. These laws take into account the purity of the subexpressions:

\mathrm{pure} & \texttt{-and} & \texttt{-false} & \iff & \texttt{-false} \\
\mathrm{pure} & \texttt{-or} & \texttt{-true} & \iff & \texttt{-true} \\
\mathrm{pure} & \texttt{,} & B & \iff & B \\
(A \mathbin{\{\texttt{-and}, \texttt{-or}, \texttt{,}\}} \mathrm{pure}) & \texttt{,} & B & \iff & A \mathbin{\texttt{,}} B \\

The implementation keeps track of a pure flag for each (sub)expression, and tests this to know whether these optimizations are safe to apply:

struct expr *optimize_and_expr(struct expr *expr) {
	if (optlevel >= 2) {
		if (expr->lhs->pure && expr->rhs == &expr_false) {
			return extract_child_expr(expr, &expr->rhs);

Data-flow analysis

Data-flow analysis is a broad term for optimizations that take into account the possible values an expression may have in its context, and propagate those values to future contexts. As an example, given this command line:

$ bfs -type f -and -type d

bfs's data-flow analysis determines that -type d can never be true at the point it's evaluated (after all, a path can't be both a regular file and a directory). This analysis proceeds by attaching a set of data flow facts to every node in the tree, at three different points in the evaluation: those facts known to be true before the expression is evaluated, and those known to be true after the expression returns either true or false.

Currently, bfs keeps track of three data flow facts: the minimum and maximum depth a path may be in the directory tree, and its possible file types. Unreachable situations are represented by impossible facts, such as an empty set of possible types, or an empty range of possible depths. Facts are inferred as the expression tree is walked recursively:

 * Data flow facts about an evaluation point.
struct opt_facts {
	/** Minimum possible depth at this point. */
	int mindepth;
	/** Maximum possible depth at this point. */
	int maxdepth;

	/** Bitmask of possible file types at this point. */
	enum bftw_typeflag types;

 * Optimizer state.
struct opt_state {
	/** The command line we're optimizing. */
	const struct cmdline *cmdline;

	/** Data flow facts before this expression is evaluated. */
	struct opt_facts facts;
	/** Data flow facts after this expression returns true. */
	struct opt_facts facts_when_true;
	/** Data flow facts after this expression returns false. */
	struct opt_facts facts_when_false;

static void infer_type_facts(struct opt_state *state, const struct expr *expr) {
	state->facts_when_true.types &= expr->idata;
	state->facts_when_false.types &= ~expr->idata;

struct expr *optimize_and_expr_recursive(struct opt_state *state, struct expr *expr) {
	struct opt_state lhs_state = *state;
	expr->lhs = optimize_expr_recursive(&lhs_state, expr->lhs);

	struct opt_state rhs_state = *state;
	// Due to short-circuit evaluation, rhs is only evaluated if lhs returned true
	rhs_state.facts = lhs_state.facts_when_true;
	expr->rhs = optimize_expr_recursive(&rhs_state, expr->rhs);

	state->facts_when_true = rhs_state.facts_when_true;
	            &lhs_state.facts_when_false, &rhs_state.facts_when_false);

	return optimize_and_expr(expr);


struct expr *optimize_expr_recursive(struct opt_state *state, struct expr *expr) {
	if (expr->eval == eval_type) {
		infer_type_facts(state, expr);
	} else if (expr->eval == eval_and) {
		expr = optimize_and_expr_recursive(state, expr);

	if (expr->pure) {
		if (facts_impossible(&state->facts_when_true) {
			return &expr_false;
		} else if (facts_impossible(&state->facts_when_false) {
			return &expr_true;

	return expr;

-O3 (default)

Some tests are faster to execute than others. -type f is nearly instantaneous, just checking a single integer. -name '*.c' is a little slower, requiring the machinery of fnmatch(). -links 2 is even slower, needing a stat() system call, which may even require I/O, to count the number of hard links to the file.

At -O3, the default optimization level, bfs re-orders expressions to reduce their expected cost. Most primitives are initialized with a cost and probability of returning true, which I measured on my own computer. Some heuristics are applied as well, considering e.g. -name '*.c' to be more likely than -name 'foo.c'.

The cost and probability for compound expressions are computed from their constituent parts:

\mathrm{cost}(A \mathbin{\texttt{-and}} B) & = & \mathrm{cost}(A) + \Pr(A)\,\mathrm{cost}(B) \\
\mathrm{cost}(A \mathbin{\texttt{-or}} B) & = & \mathrm{cost}(A) + (1 -
\Pr(A))\,\mathrm{cost}(B) \\
\mathrm{cost}(A \mathbin{\texttt{,}} B) & = & \mathrm{cost}(A) + \mathrm{cost}(B) \\

It's impossible in general to compute the probabilities, without knowing how $A$ and $B$ are related. bfs makes the naïve assumption that $A$ and $B$ are independent:

\Pr(A \mathbin{\texttt{-and}} B) & \approx & \Pr(A)\,\Pr(B) \\
\Pr(A \mathbin{\texttt{-or}} B) & \approx & 1 - (1 - \Pr(A))\,(1 - \Pr(B)) \\
\Pr(A \mathbin{\texttt{,}} B) & = & \Pr(B) \\

Then, for -and and -or, if $A$ and $B$ are both pure and swapping them would reduce the expected cost, they get swapped:

struct expr *optimize_and_expr(struct expr *expr) {
	struct expr *lhs = expr->lhs;
	struct expr *rhs = expr->rhs;


	expr->cost = lhs->cost + lhs->probability*rhs->cost;
	expr->probability = lhs->probability*rhs->probability;

	if (optlevel >= 3 && lhs->pure && rhs->pure) {
		double swapped_cost = rhs->cost + rhs->probability*lhs->cost;
		if (swapped_cost &lt; expr->cost) {
			expr->lhs = rhs;
			expr->rhs = lhs;
			expr->cost = swapped_cost;

	return expr;


The final level of optimization, which is not enabled by default, contains aggressive optimizations that may affect correctness in corner cases. Its main effect is to set -maxdepth to the highest depth inferred by data-flow analysis for any impure expression in the entire tree. For example,

$ bfs -O4 -depth 5 -or -depth 6

will infer -mindepth 5 and -maxdepth 6, skipping traversal of deeper paths entirely. This optimization is unsafe in general because skipping these paths can change what errors bfs reports, and therefore its exit status. Exemplified:

$ mkdir -p foo/bar/baz && chmod -r foo/bar/baz
$ bfs foo -depth 1
'foo/bar/baz': Permission denied
$ echo $?
$ bfs -O4 foo -depth 1
$ echo $?

This optimization is actually so aggressive it will skip the entire traversal if no side effects are reachable, by setting -maxdepth to -1:

$ time bfs -O4 / -false
bfs -O4 / -false  0.00s user 0.00s system 88% cpu 0.002 total

The other thing -O4 does is treat certain expressions as pure that may have side effects in corner cases. -empty will attempt to read directories, and -xtype will follow symbolic links, either of which may affect the result of bfs if they encounter a permissions error. Therefore we don't consider these tests pure except at -O4.

Try it out

bfs implements a debugging flag -D opt that logs details about the optimizations being performed. Expressions are dumped in a Lisp-style S-expression syntax:

$ bfs -D opt -not -false
-O1: constant propagation: (-not (-false)) <==> (-true)
-O1: conjunction elimination: (-a (-true) (-true)) <==> (-true)
-O1: conjunction elimination: (-a (-true) (-print)) <==> (-print)
$ bfs -D opt -type f -or -not -type f
-O1: conjunction elimination: (-a (-true) (-type f)) <==> (-type f)
-O2: data flow: (-type f) --> (-false)
-O1: constant propagation: (-not (-false)) <==> (-true)
-O2: purity: (-or (-type f) (-true)) <==> (-true)
-O1: conjunction elimination: (-a (-true) (-print)) <==> (-print)
$ bfs -D opt -name '*.c' -type f
-O1: conjunction elimination: (-a (-true) (-name *.c)) <==> (-name *.c)
-O3: cost: (-a (-name *.c) (-type f)) <==> (-a (-type f) (-name *.c)) (~420 --> ~383.909)
$ bfs -D opt -O4 -type f -type d
-O1: conjunction elimination: (-a (-true) (-true)) <==> (-true)
-O1: conjunction elimination: (-a (-true) (-type f)) <==> (-type f)
-O2: data flow: (-type d) --> (-false)
-O2: purity: (-a (-type f) (-false)) <==> (-false)
-O1: short-circuit: (-a (-false) (-print)) <==> (-false)
-O2: data flow: mindepth --> 2147483647
-O4: data flow: maxdepth --> -1

bfs from the ground up, part 2: parsing

2017-04-24 Tavian Barnes GitHub

Today is the release of version 1.0 of bfs, a fully-compatible* drop-in replacement for the UNIX find command. I thought this would be a good occasion to write more about its implementation. This post will talk about how I parse the command line.

The find command has the most complex command line syntax of any command line tool I'm aware of. Unlike most utilities which take a sequence of options, values, and paths, find's command line is a rich short-circuiting Boolean expression language. A command like

find \( -type f -o -type d \) -a -print

means something like this, in pseudocode:

for each file {
    (type == f || type == d) && print()

Due to the short-circuiting behaviour of -a/&&, this only prints files and directories, skipping symbolic links, device nodes, etc.

To shorten some common commands1, find lets you omit -a, treating LHS RHS the same as LHS -a RHS. Sadly this confuses a lot of users, who write things like

find -print -type f

without understanding how find will interpret it as an expression. Plenty of invalid bug reports have been made about this over the years. In fact, the find command line syntax was deeply confusing to me before I wrote bfs; I hope not everyone has to implement find from scratch just to understand it.

We've seen two kinds of parameters so far: tests, like -type f, that check something about the current file and return a truth value; and actions, like -print, that perform some side effect (and usually return true, unless the action fails). There is a third kind of parameter: options, that affect the overall behaviour of find, but don't do any per-file work. An example is -follow, which instructs find to follow symbolic links. A command line like

find -follow -type f -print

acts something like this:

follow_symlinks = true;
for each file {
    true && type == f && print()

Nesting an option in the middle of an expression can be confusing, so GNU find warns unless you put them at the beginning.

If you don't include any actions in your expression, find automatically adds -print for you, so

find -type f -o -type d

is the same as

find \( -type f -o -type d \) -print

(which is the same as ... -a -print, of course).

The find command line can also include root paths to start from. If you don't specify any paths, GNU find looks in . for you. There are also some flags that can come before the paths, and are not part of the expression. Overall, the command line is structured like this:

find [flags...] [paths...] [expression...]

If you get the order wrong, you get a frustrating error:

$ find -type f .
find: paths must precede expression: .

bfs does away with this particular restriction, because I hate it when a computer knows what I want to do but refuses to do it.


In order to parse a find-style command line, we'll need to come up with a grammar for it. find supports expressions like

  • ( EXPR )
  • ! EXPR, -not EXPR
  • EXPR -o EXPR, EXPR -or EXPR,

in decreasing order of precedence. People often write grammars like this:

     | ! EXPR | -not EXPR
     | EXPR -a EXPR | EXPR -and EXPR | EXPR EXPR
     | EXPR -o EXPR | EXPR -or EXPR
     | EXPR , EXPR

and use some feature of their parser generator to specify the different precedences of the various production rules. But you can always encode the precedences right in the grammar itself:

LITERAL : -follow | -type TYPE | -print | ...

       | ( EXPR )
       | ! FACTOR | -not FACTOR

     | TERM -a FACTOR
     | TERM -and FACTOR

       | CLAUSE -o TERM
       | CLAUSE -or TERM

     | EXPR , CLAUSE

bfs uses a recursive descent parser to parse the command line, where each nonterminal symbol gets a function that parses it, e.g. parse_clause(). These functions recursively call other parsing functions like parse_term(), resulting in code that is structured very much like the grammar itself.

The major limitation of recursive descent parsers is they can't handle left-recursive rules like


because the immediate parse_clause() call recurses infinitely. Rules like this can always be re-written right-recursively like


but naïvely that will change a left-associative rule into a right-associative one (terms like a -o b -o c will group like a -o (b -o c) instead of (a -o b) -o c. Care must be taken2 to parse right-recursively, but build the expression tree left-associatively.


Here's an example implementation of a technique that handles left-recursive rules:

static struct expr *parse_clause(struct parser_state *state, struct expr *lhs) {
	// Parse with the right-recursive rules
	   // CLAUSE : TERM
	//        | TERM -o CLAUSE
	struct expr *expr = parse_term(state);

	// But build the expression tree left-associatively
	if (lhs) {
		expr = new_or_expr(lhs, expr);

	const char *arg = state->argv[0];
	if (strcmp(arg, "-o") != 0 && strcmp(arg, "-or") != 0) {
		return expr;

	parser_advance(state, T_OPERATOR, 1);
	return parse_clause(state, expr);

The trick is to thread the left-hand side of the expression through the recursive calls. Actually, the tail-recursion can be replaced with a loop. The actual implementation of parse_clause() looks more like this (omitting error checking):

static struct expr *parse_clause(struct parser_state *state) {
	struct expr *clause = parse_term(state);

	while (true) {

		const char *arg = state->argv[0];
		if (strcmp(arg, "-o") != 0 && strcmp(arg, "-or") != 0) {
		parser_advance(state, T_OPERATOR, 1);

		struct expr *lhs = clause;
		struct expr *rhs = parse_term(state);
		clause = new_or_expr(state, lhs, rhs, argv);

	return clause;

parse_term() is a little trickier: we need a bit of lookahead to decide whether to apply the TERM : TERM FACTOR rule:

if (strcmp(arg, "-o") == 0 || strcmp(arg, "-or") == 0
    || strcmp(arg, ",") == 0
    || strcmp(arg, ")") == 0) {

These are all the tokens that may directly follow a TERM, and cannot be the first token of a FACTOR. Not all grammars can be parsed with a single such token of lookahead—the ones that can are known as LL(1).

Notice the skip_paths() call. That's how bfs flexibly handles paths anywhere in the command line, before, after, or even inside the expression. Non-paths are either (, ), ,, !, or start with -, so we can always tell them apart. There are a couple corner cases though:

static void skip_paths(struct parser_state *state) {
	while (true) {
		const char *arg = state->argv[0];
		if (!arg) {

		if (arg[0] == '-') {
			if (strcmp(arg, "--") == 0) {
				// find uses -- to separate flags from the rest
				// of the command line.  We allow mixing flags
				// and paths/predicates, so we just ignore --.
				parser_advance(state, T_FLAG, 1);
			if (strcmp(arg, "-") != 0) {
				// - by itself is a file name.  Anything else
				// starting with - is a flag/predicate.

		// By POSIX, these are always options
		if (strcmp(arg, "(") == 0 || strcmp(arg, "!") == 0) {

		if (state->expr_started) {
			// By POSIX, these can be paths.  We only treat them as
			// such at the beginning of the command line.
			if (strcmp(arg, ")") == 0 || strcmp(arg, ",") == 0) {

		parse_root(state, arg);
		parser_advance(state, T_PATH, 1);

find supports unusual command lines like

find \) , -print

where \) and , are treated as paths because they come before the expression starts. To maintain 100% compatibility, bfs supports these weird filenames too, but only before it sees any non-path, non-flag tokens. parser_advance() updates state->expr_started based on the types of tokens it sees.


There are many different literals (options, tests, and actions) supported by bfs. parse_literal() is driven by a table that looks like this:

typedef struct expr *parse_fn(struct parser_state *state, int arg1, int arg2);

struct table_entry {
	const char *arg;
	bool prefix;
	parse_fn *parse;
	int arg1;
	int arg2;

static const struct table_entry parse_table[] = {
	{"follow", false, parse_follow, BFTW_LOGICAL | BFTW_DETECT_CYCLES, true},
	{"print", false, parse_print},
	{"type", false, parse_type, false},

The table contains entries with a function pointer to the parsing function for each literal, and up to two integer arguments that get passed along to it. The extra arguments let one function handle multiple similar literals; for example parse_type() handles -type and -xtype. The prefix flag controls whether extra characters after the literal should be accepted; this is used by a couple of literals like -O and -newerXY.

There are two stages to lookup: the first is an exact match, which could be done by binary search, but is currently just a linear scan. If that fails, a fuzzy match pass is run to provide a hint to the user. The fuzzy matching is based on Levenshtein distance (or "min-edit" distance). It takes the standard QWERTY keyboard layout into account, so for example -dikkiq is considered very close to -follow:

$ bfs -diqqik
error: Unknown argument '-dikkiq'; did you mean '-follow'?


To help see how the command line was parsed, GNU find and bfs both support a -D tree flag that dumps the parsed expression tree. bfs outputs Lisp-style S-expressions:

$ bfs -D tree -follow -type f -print
-L -D tree . -color (-a (-type f) (-print))


This also plays into a nice analogy between Boolean algebra and "regular" algebra, where "or" is like addition and "and" is like multiplication. Just like we can write $xy$ instead of $x \times y$, we can omit the explicit -a.


Actually, as all of find's binary operators are associative, this doesn't really matter. But it's still a useful technique to know.

A quick trick for faster naïve matrix multiplication

2016-11-12 Tavian Barnes

If you need to multiply some matrices together very quickly, usually it's best to use a highly optimized library like ATLAS. But sometimes adding such a dependency isn't worth it, if you're worried about portability, code size, etc. If you just need good performance, rather than the best possible performance, it can make sense to hand-roll your own matrix multiplication function.

Unfortunately, the way that matrix multiplication is usually taught:

C_{i,j} = \sum_k A_{i,k} \, B_{k,j}

$\Bigg($$\Bigg) = \Bigg($$\Bigg) \, \Bigg($$\Bigg)$

leads to a slow implementation:

void matmul(double *dest, const double *lhs, const double *rhs,
            size_t rows, size_t mid, size_t cols) {
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            const double *rhs_row = rhs;
            double sum = 0.0;
            for (size_t k = 0; k < mid; ++k) {
                sum += lhs[k] * rhs_row[j];
                rhs_row += cols;
            *dest++ = sum;
        lhs += mid;

This function multiplies a rows×mid matrix with a mid×cols matrix using the "linear algebra 101" algorithm. Unfortunately, it has a bad memory access pattern: we loop over dest and lhs pretty much in order, but jump all over the place in rhs, since it's stored row-major but we need its columns.

Luckily there's a simple fix that's dramatically faster: instead of computing each cell of the destination separately, we can update whole rows of it at a time. Effectively, we do this:

C_{i} = \sum_j A_{i,j} \, B_j

$\Bigg($$\Bigg) = \Bigg($$\Bigg) \, \Bigg($$\Bigg)$

In code, it looks like this:

void matmul(double *dest, const double *lhs, const double *rhs,
            size_t rows, size_t mid, size_t cols) {
    memset(dest, 0, rows * cols * sizeof(double));

    for (size_t i = 0; i < rows; ++i) {
        const double *rhs_row = rhs;
        for (size_t j = 0; j < mid; ++j) {
            for (size_t k = 0; k < cols; ++k) {
                dest[k] += lhs[j] * rhs_row[k];
            rhs_row += cols;
        dest += cols;
        lhs += mid;

On my computer, that drops the time to multiply two 256×256 matrices from 37ms to 13ms (with gcc -O3). ATLAS does it in 5ms though, so always use something like it if it's available.

bfs from the ground up, part 1: traversal

2016-09-10 Tavian Barnes GitHub

bfs is a tool I've been writing for about a year, with the goal of being a drop-in replacement for the UNIX find command. This series of posts will be deep technical explorations of its implementation, starting from the lower levels all the way up to the user interface. bfs is small (only about 3,500 lines of C code), which makes it possible to do a fairly complete analysis. But the codebase is fairly clean and highly optimized, which should make the analysis interesting.

The most important feature of bfs, from which it gets its name, is that it operates breadth-first, rather than depth-first. This means if you have a directory structure like

./linux/{the entire Linux kernel source tree}

there's a good chance find gets bogged down looking through the Linux kernel sources before it finds the file you're looking for, target. bfs, on the other hand, will return it either first or second.

The actual file traversal is implemented by a function called bftw(), which is similar to the POSIX nftw() function. It walks a directory tree in breadth-first order, invoking the callback fn for each file it encounters.

 * Breadth First Tree Walk (or Better File Tree Walk).
 * Like ftw(3) and nftw(3), this function walks a directory tree recursively,
 * and invokes a callback for each path it encounters.  However, bftw() operates
 * breadth-first.
 * @param path
 *         The starting path.
 * @param fn
 *         The callback to invoke.
 * @param nopenfd
 *         The maximum number of file descriptors to keep open.
 * @param flags
 *         Flags that control bftw() behavior.
 * @param ptr
 *         A generic pointer which is passed to fn().
 * @return
 *         0 on success, or -1 on failure.
int bftw(const char *path, bftw_fn *fn, int nopenfd, enum bftw_flags flags, void *ptr);

At this point, it might help to open up bftw.c and follow along.

The queue

Like most breadth-first search implementations, bftw() uses a FIFO queue to keep track of the nodes it has yet to visit. This is the primary downside of breadth-first search—this queue can take up significantly more memory than the stack used in depth-first search. As a result, bfs uses more memory than find for large directory hierarchies, but not an unreasonable amount in practice.

struct dirqueue is implemented as a circular buffer backed by an array:

 * A queue of 'dircache_entry's to examine.
struct dirqueue {
	/** The circular buffer of entries. */
	struct dircache_entry **entries;
	/** Bitmask for circular buffer indices; one less than the capacity. */
	size_t mask;
	/** The index of the front of the queue. */
	size_t front;
	/** The index of the back of the queue. */
	size_t back;

front and back are the read and write indices, respectively. Whenever one reaches the end, it wraps around to the beginning. The size of the queue is always a power of two, so the wrapping is implemented by a bitwise AND with mask.

entries back front mask = 7 (capacity = 8)

Popping is easy:

/** Remove an entry from the dirqueue. */
static struct dircache_entry *dirqueue_pop(struct dirqueue *queue) {
	if (queue->front == queue->back) {
		return NULL;

	struct dircache_entry *entry = queue->entries[queue->front];
	queue->front += 1;
	queue->front &= queue->mask;
	return entry;

Pushing is straightforward too, a little more complicated only in the case that it has to grow the array..

Using this queue implementation, bftw()'s implementation might look something like this (in pseudocode):

int bftw(const char *path, bftw_fn *fn, int nopenfd, enum bftw_flags flags, void *ptr) {
	struct dirqueue queue;
	struct dircache cache;
	struct dircache_entry *entry = dircache_add(&cache, path);


	do {
		DIR *dir = opendir(entry);
		for (child in readdir(dir)) {
			path = entry->path + "/" + child;
			if (isdir(path)) {
				dirqueue_push(&queue, dircache_add(&cache, path));

		entry = dirqueue_pop(&queue);
	} while (entry);

	return 0;

The cache

What's the point of the dircache type, and the dircache_entry type used to represent directories? The goal is to avoid re-traversing path components that we don't have to. When a system call like open("a/b/c") is made, the kernel has to resolve a, then a/b, and finally a/b/c, potentially following links and/or mount points for each component. Modern Unices provide a set of syscalls ending in at() that allow applications to avoid redundant traversals if they keep a file descriptor to the relevant directory open. For example, if int fd is an open file descriptor to a/b, openat(fd, "c") is equivalent to open("a/b/c"), without having to re-resolve a/b.

The dircache is primarily used to hold these open file descriptors so that future directories can be opened efficiently with openat(). struct dircache_entry looks like this:

 * A single entry in the dircache.
struct dircache_entry {
	/** The parent entry, if any. */
	struct dircache_entry *parent;
	/** This directory's depth in the walk. */
	size_t depth;

	/** Reference count. */
	size_t refcount;
	/** Index in the priority queue. */
	size_t heap_index;

	/** An open file descriptor to this directory, or -1. */
	int fd;

	/** The device number, for cycle detection. */
	dev_t dev;
	/** The inode number, for cycle detection. */
	ino_t ino;

	/** The offset of this directory in the full path. */
	size_t nameoff;
	/** The length of the directory's name. */
	size_t namelen;
	/** The directory's name. */
	char name[];

Each entry holds some information about the directory it refers to, such as its depth, inode number, and of course, its name. The names are stored inline with the entries using a C99 flexible array member. To avoid the wasted space of storing "a", "a/b", and "a/b/c", only the name itself ("c") is stored. Thus to reconstruct the path, you have to follow the chain of parents to the root:

 * Get the full path to a dircache_entry.
 * @param entry
 *         The entry to look up.
 * @param[out] path
 *         Will hold the full path to the entry, with a trailing '/'.
static int dircache_entry_path(const struct dircache_entry *entry, char **path) {
	size_t namelen = entry->namelen;
	size_t pathlen = entry->nameoff + namelen;

	if (dstresize(path, pathlen) != 0) {
		return -1;

	// Build the path backwards
	do {
		char *segment = *path + entry->nameoff;
		namelen = entry->namelen;
		memcpy(segment, entry->name, namelen);
		entry = entry->parent;
	} while (entry);

	return 0;

(dstresize() is part of a simple dynamic string implementation bfs uses.) The entries effectively form a tree of unexplored directories and their ancestors.

parent: NULL depth: 0 refcount: 3 nameoff: 0 namelen: 2 name: "a/" parent: depth: 1 refcount: 1 nameoff: 2 namelen: 2 name: "b/" parent: depth: 1 refcount: 1 nameoff: 2 namelen: 2 name: "d/" parent: depth: 2 refcount: 1 nameoff: 4 namelen: 2 name: "c/" queue:

Since the number of file descriptors an application may have open at any given time is limited, not every entry can have an open file descriptor. struct dircache holds a size-limited priority queue of open entries. As a heuristic, the entries with the highest reference counts are kept open, because those have the greatest number of unexplored descendants.

 * A directory cache.
struct dircache {
	/** A min-heap of open entries, ordered by refcount. */
	struct dircache_entry **heap;
	/** Current heap size. */
	size_t size;
	/** Maximum heap size. */
	size_t capacity;

The priority queue implementation is an array-backed binary heap. Incrementing/decrementing a reference count results in a bubble-down/up operation, which use the entry's heap_index field to locate the entry in the middle of the heap.

Reading directories

bftw() uses the standard DIR * type and readdir() function to read the contents of directories. But since there's no standard opendirat() function, the dircache_entry_open() function emulates it with open() followed by fdopendir(). Some extra logic handles EMFILE (too many open files) by shrinking the cache and retrying. Since DIR takes up quite a bit of memory, it's closed as soon as possible, hanging onto the file descriptor only with dup() (actually F_DUPFD_CLOEXEC).

For each directory entry we read, we need to know its type (file, directory, symbolic link, etc.), to tell if we need to descend into it, and to pass to the callback function. Normally one would use stat() to determine this, but stat()ing each file we encounter is slow, due to extra I/O and syscalls. Luckily, many Unices provide this information as an extension in their struct dirent, in the d_type field. When available, bftw() uses this information to avoid a stat() call entirely. The effect is noticeable:

$ strace bfs 2>&1 >/dev/null | grep stat | wc -l
$ strace find 2>&1 >/dev/null | grep stat | wc -l

Almost all the remaining stat() calls in bfs are actually from within glibc's fdopendir() implementation; these could be avoided if glibc checked the fcntl(fd, F_GETFL) result for O_DIRECTORY first.


One interesting feature that find has is the ability to do a post-order traversal, where a directory's descendants are processed before the directory itself. Confusingly, this is triggered by the -depth option. Mostly this gets used as part of the -delete action, since a directory can't be deleted until all of its children are first.

Breadth-first search doesn't have a straightforward analogue of post-order traversal, but we can use struct dircache_entry's reference counts to implement something good enough for -delete: whenever an entry's reference count drops to zero, we know it has no descendants left to process and we can invoke the callback again. Passing the BFTW_DEPTH flag (which keeps the confusing name for symmetry with nftw()'s matching FTW_DEPTH flag) causes bftw() to invoke the callback again right before freeing an entry. The resulting order is a little weird—files are listed immediately, while directories are listed after their contents—but it works!

$ bfs dir -depth

Putting it all together

The remainder of the code is pretty much glue and plumbing to implement the desired interface and handle errors. bftw()'s real implementation is factored into a few subroutines that deal with struct bftw_state:

 * Holds the current state of the bftw() traversal.
struct bftw_state {
	/** bftw() callback. */
	bftw_fn *fn;
	/** bftw() flags. */
	int flags;
	/** bftw() callback data. */
	void *ptr;

	/** The appropriate errno value, if any. */
	int error;

	/** The cache of open directories. */
	struct dircache cache;
	/** The queue of directories left to explore. */
	struct dirqueue queue;
	/** The current dircache entry. */
	struct dircache_entry *current;
	/** The current traversal status. */
	enum bftw_status status;

	/** The current path being explored. */
	char *path;

	/** Extra data about the current file. */
	struct BFTW ftwbuf;
	/** stat() buffer for the current file. */
	struct stat statbuf;

bftw() provides the callback with a lot of details about the current path in a buffer of type struct BFTW:

 * Data about the current file for the bftw() callback.
struct BFTW {
	/** The path to the file. */
	const char *path;
	/** The string offset of the filename. */
	size_t nameoff;

	/** The depth of this file in the traversal. */
	size_t depth;
	/** Which visit this is. */
	enum bftw_visit visit;

	/** The file type. */
	enum bftw_typeflag typeflag;
	/** The errno that occurred, if typeflag == BFTW_ERROR. */
	int error;

	/** A stat() buffer; may be NULL if no stat() call was needed. */
	const struct stat *statbuf;

	/** A parent file descriptor for the *at() family of calls. */
	int at_fd;
	/** The path relative to atfd for the *at() family of calls. */
	const char *at_path;
	/** Appropriate flags (like AT_SYMLINK_NOFOLLOW) for the *at() family of calls. */
	int at_flags;

The bftw_init_buffers() function is responsible for filling in this information for the callback. Invocations of the callback are done by the bftw_handle_path() function. The callback can control the traversal by returning different enum bftw_action values:

enum bftw_action {
	/** Keep walking. */
	/** Skip this path's siblings. */
	/** Skip this path's children. */
	/** Stop walking. */

so every bftw_handle_path() call is followed by a switch that does the appropriate thing for the action, often with a goto.


bftw()'s tuned implementation allows bfs to be very fast. Over my home directory (1,937,127 files, 267,981 directories), it's about 15–25% faster than GNU find 4.6.0 with warm caches:

$ time bfs > /dev/null
bfs > /dev/null  0.60s user 2.48s system 99% cpu 3.096 total
$ time find > /dev/null
find > /dev/null  1.08s user 2.96s system 99% cpu 4.062 total

On the other hand, with cold caches and rotational disks, it's about 100% slower for me, because the breadth-first ordering makes it jump around a lot more.

$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time bfs > /dev/null
bfs > /dev/null  1.54s user 11.04s system 8% cpu 2:29.57 total
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time find > /dev/null
find > /dev/null  2.52s user 12.12s system 14% cpu 1:44.54 total

On the plus side, since breadth-first search is likely to find what I'm looking for before depth-first search does, I can often ^C it faster even if a complete traversal would be slower.

The Approximating and Eliminating Search Algorithm

2016-03-15 Tavian Barnes Reddit

Nearest neighbour search is a very natural problem: given a target point and a set of candidates, find the closest candidate to the target. For points in the standard k-dimensional Euclidean space, k-d trees and related data structures offer a good solution. But we're not always so lucky.

More generally, we may be working with "points" in an exotic space that doesn't nicely decompose into separate dimensions like $\mathbb{R}^n$ does. As long as we have some concept of distance, it still makes sense to ask what the nearest neighbour of a point is. If our notion of distance is completely unconstrained, we may not be able to better than exhaustive search. But if the distance function is a metric, we can use that to our advantage.

To be a distance metric, a function $d(x, y)$ has to satisfy these three laws:

  • $d(x, y) = 0$ if and only if $x = y$
  • $d(x, y) = d(y, x)$
  • $d(x, z) \le d(x, y) + d(y, z)$ for all $y$

The last condition is known as the triangle inequality, and it's the key to performing nearest neighbour searches efficiently. Many common distance functions happen to be metrics, such as Euclidean distance, Manhattan distance, Hamming distance, and Levenshtein distance.

For searching in general metric spaces, many nice data structures such as vantage point trees and BK-trees exist. But I'd like to talk about another, less popular but supremely interesting one: the Approximating and Eliminating Search Algorithm (AESA).


AESA is a bazooka of an algorithm; it takes $O(n^2)$ time and memory to pre-process the set of candidate points, and $O(n)$ time to answer a nearest neighbour query. The remarkable thing about it is it reduces the number of distance computations per query to $O(1)$ on average. That bears repeating: the distance function is invoked an expected constant number of times, totally independent of the number of candidates! This is very useful when your distance function is expensive to compute, like Levenshtein distance is. Variants of the algorithm reduce both the quadratic pre-processing time and the linear per-query overhead, and I'll talk about these variants in future posts, but for now let's go over the basic AESA.

The idea is to pre-compute the distance between every single pair of candidates (hence $O(n^2)$). These pre-computed distances are used to derive successively better and better lower bounds from the target to each candidate. It looks like this:

t a lower bound c b

Here, $t$ is the target point, $b$ is the best match so far, $a$ is the "active" candidate, and $c$ is another candidate being considered. By calculating $d(t, a)$, and using the pre-computed value of $d(a, c)$, we can eliminate $c$ as a possibility without even computing $d(t, c)$.

Formally, the lower bound is obtained by rearranging the triangle inequality:

d(t,c) & \ge \phantom| d(t,a) - d(c,a) \phantom| \\
d(c,t) & \ge \phantom| d(c,a) - d(t,a) \phantom| \\
d(t,c) & \ge | d(t,a) - d(a,c) |

If this lower bound is larger than the distance to the best candidate we've found so far, $c$ cannot possibly be the nearest neighbour. AESA uses the algorithm design paradigm of best-first branch and bound, using the lower bounds to both prune candidates, and as a heuristic to select the next active candidate.


A simple Python implementation looks like this:

import math

class Aesa:
    def __init__(self, candidates, distance):
        Initialize an AESA index.

        candidates: The list of candidate points.
        distance: The distance metric.

        self.candidates = candidates
        self.distance = distance

        # Pre-compute all pairs of distances
        self.precomputed = [[distance(x, y) for y in candidates] for x in candidates]

    def nearest(self, target):
        """Return the nearest candidate to 'target'."""

        size = len(self.candidates)

        # All candidates start out alive
        alive = list(range(size))
        # All lower bounds start at zero
        lower_bounds = [0] * size

        best_dist = math.inf

        # Loop until no more candidates are alive
        while alive:
            # *Approximating*: select the candidate with the best lower bound
            active = min(alive, key=lambda i: lower_bounds[i])
            # Compute the distance from target to the active candidate
            # This is the only distance computation in the whole algorithm
            active_dist = self.distance(target, self.candidates[active])

            # Update the best candidate if the active one is closer
            if active_dist < best_dist:
                best = active
                best_dist = active_dist

            # *Eliminating*: remove candidates whose lower bound exceeds the best
            old_alive = alive
            alive = []

            for i in old_alive:
                # Compute the lower bound relative to the active candidate
                lower_bound = abs(active_dist - self.precomputed[active][i])
                # Use the highest lower bound overall for this candidate
                lower_bounds[i] = max(lower_bounds[i], lower_bound)
                # Check if this candidate remains alive
                if lower_bounds[i] < best_dist:

        return self.candidates[best]


Let's run a little experiment to see how many times it really calls the distance metric.

from random import random

dimensions = 3
def random_point():
    return [random() for i in range(dimensions)]

count = 0
def euclidean_distance(x, y):
    global count
    count += 1

    s = 0
    for i in range(len(x)):
        d = x[i] - y[i]
        s += d*d
    return math.sqrt(s)

points = [random_point() for n in range(1000)]
aesa = Aesa(points, euclidean_distance)

print('{0} calls during pre-computation'.format(count))
count = 0


print('{0} calls during nearest neighbour search'.format(count))
count = 0

for i in range(1000):

print('{0} calls on average during nearest neighbour search'.format(count / 1000))
count = 0

On a typical run, this prints something like

1000000 calls during pre-computation
6 calls during nearest neighbour search
5.302 calls on average during nearest neighbour search

Raising the number of points to 10,000, pre-processing takes much longer, but the average number of distance metric evaluations stays at around 5.3!

100000000 calls during pre-computation
5 calls during nearest neighbour search
5.273 calls on average during nearest neighbour search


Vidal (1986). An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recognition Letters, Volume 4, Issue 3, July 1986, pp. 145–157.

Micó, Oncina, Vidal (1994). A new version of the Nearest-Neighbour Approximating and Eliminating Search Algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognition Letters, Volume 15, Issue 1, January 1994, pp. 9–17.

Vilar (1995). Reducing the overhead of the AESA metric space nearest neighbour searching algorithm. Information Processing Letters, Volume 56, Issue 5, 8 December 1995, pp. 265–271.

Micó, Oncina, Carrasco (1996). A fast branch & bound nearest neighbour classifier in metric spaces. Pattern Recognition Letters, Volume 17, Issue 7, 10 June 1996, pp. 731–739.

Java autoboxing performance

2015-07-11 Tavian Barnes Comments Reddit

In Java, primitives (int, double, char) are not Objects. But since a lot of Java code requires Objects, the language provides boxed versions of all primitive types (Integer, Double, Character). Autoboxing allows you to write code like

Character boxed = 'a';
char unboxed = boxed;

which is desugared into

Character boxed = Character.valueOf('a');
char unboxed = boxed.charValue();

Unfortunately, the Java Virtual Machine can't always see through this process, so avoiding unnecessary boxing is often key to getting good performance. That's why specialized types like OptionalInt and IntStream exist, for example. In this post, I'm going to outline one of the reasons that autoboxing is hard for the JVM to eliminate.

The scenario

For a motivating example, lets say we want a generic Levenshtein distance computation that works on any type of data that can be viewed as a sequence:

public class Levenshtein<T, U> {
    private final Function<T, List<U>> asList;

    public Levenshtein(Function<T, List<U>> asList) {
        this.asList = asList;

    public int distance(T a, T b) {
        // Wagner-Fischer algorithm, with two active rows

        List<U> aList = asList.apply(a);
        List<U> bList = asList.apply(b);

        int bSize = bList.size();
        int[] row0 = new int[bSize + 1];
        int[] row1 = new int[bSize + 1];

        for (int i = 0; i <= bSize; ++i) {
            row0[i] = i;

        for (int i = 0; i < bSize; ++i) {
            U ua = aList.get(i);
            row1[0] = row0[0] + 1;

            for (int j = 0; j < bSize; ++j) {
                U ub = bList.get(j);
                int subCost = row0[j] + (ua.equals(ub) ? 0 : 1);
                int delCost = row0[j + 1] + 1;
                int insCost = row1[j] + 1;
                row1[j + 1] = Math.min(subCost, Math.min(delCost, insCost));

            int[] temp = row0;
            row0 = row1;
            row1 = temp;

        return row0[bSize];

As long as your objects can be viewed as a List<U>, this class can compute the edit distance between them. Now we're probably going to want to compute the distance between Strings, so we need a way to view a String as a List:

public class StringAsList extends AbstractList<Character> {
    private final String str;

    public StringAsList(String str) {
        this.str = str;

    public Character get(int index) {
        return str.charAt(index); // Autoboxing!    }

    public int size() {
        return str.length();


Levenshtein<String, Character> lev = new Levenshtein<>(StringAsList::new);
lev.distance("autoboxing is fast", "autoboxing is slow"); // 4

Because of the way Java generics are implemented, we can't have a List<char>, so we settle for a List<Character> and all the boxing that entails. (As a side note, this limitation may be lifted in Java 10.)


To see what kind of performance we're getting, we need to benchmark the distance() method. Microbenchmarking is very hard to get right in Java, but luckily OpenJDK ships a tool called JMH (Java Microbenchmark Harness) that does most of the hard work for us. If you're interested, I recommend reading the documentation and samples; it's fascinating stuff. Here's the benchmark:

public class MyBenchmark {
    private Levenshtein<String, Character> lev = new Levenshtein<>(StringAsList::new);

    public int timeLevenshtein() {
        return lev.distance("autoboxing is fast", "autoboxing is slow");

(Returning the value from the method allows JMH to do some gymnastics to convince the runtime that the value is used, preventing dead-code elimination from screwing with the results.)

And here are the results:

$ java -jar target/benchmarks.jar -f 1 -wi 8 -i 8
# JMH 1.10.2 (released 3 days ago)
# VM invoker: /usr/lib/jvm/java-8-openjdk/jre/bin/java
# VM options:
# Warmup: 8 iterations, 1 s each
# Measurement: 8 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: com.tavianator.boxperf.MyBenchmark.timeLevenshtein

# Run progress: 0.00% complete, ETA 00:00:16
# Fork: 1 of 1
# Warmup Iteration   1: 1517.495 ns/op
# Warmup Iteration   2: 1503.096 ns/op
# Warmup Iteration   3: 1402.069 ns/op
# Warmup Iteration   4: 1480.584 ns/op
# Warmup Iteration   5: 1385.345 ns/op
# Warmup Iteration   6: 1474.657 ns/op
# Warmup Iteration   7: 1436.749 ns/op
# Warmup Iteration   8: 1463.526 ns/op
Iteration   1: 1446.033 ns/op
Iteration   2: 1420.199 ns/op
Iteration   3: 1383.017 ns/op
Iteration   4: 1443.775 ns/op
Iteration   5: 1393.142 ns/op
Iteration   6: 1393.313 ns/op
Iteration   7: 1459.974 ns/op
Iteration   8: 1456.233 ns/op

Result "timeLevenshtein":
  1424.461 ±(99.9%) 59.574 ns/op [Average]
  (min, avg, max) = (1383.017, 1424.461, 1459.974), stdev = 31.158
  CI (99.9%): [1364.887, 1484.034] (assumes normal distribution)

# Run complete. Total time: 00:00:16

Benchmark                    Mode  Cnt     Score    Error  Units
MyBenchmark.timeLevenshtein  avgt    8  1424.461 ± 59.574  ns/op


In order to see what's happening on the hot path(s) of your code, JMH integrates with the Linux tool perf to show you the JIT-compiled assembly of the hottest code regions. (To view the assembly, you'll need the hsdis plugin for your platform. If you're running Arch, I have packaged it for you on the AUR.) Simply add -prof perfasm to the JMH command line to see the results:

$ java -jar target/benchmarks.jar -f 1 -wi 8 -i 8 -prof perfasm
cmp    $0x7f,%eax
jg     0x00007fde989a6148  ;*if_icmpgt
                           ; - java.lang.Character::valueOf@3 (line 4570)
                           ; - com.tavianator.boxperf.StringAsList::get@8 (line 14)
                           ; - com.tavianator.boxperf.StringAsList::get@2 (line 5)
                           ; - com.tavianator.boxperf.Levenshtein::distance@121 (line 32)
cmp    $0x80,%eax
jae    0x00007fde989a6103  ;*aaload
                           ; - java.lang.Character::valueOf@10 (line 4571)
                           ; - com.tavianator.boxperf.StringAsList::get@8 (line 14)
                           ; - com.tavianator.boxperf.StringAsList::get@2 (line 5)
                           ; - com.tavianator.boxperf.Levenshtein::distance@121 (line 32)

There's a lot of output, but sections like the above indicate that the boxing hasn't been optimized out. And what's the deal with the comparisons to 0x7f/0x80? The answer is in the source for Character.valueOf():

    private static class CharacterCache {
        private CharacterCache(){}

        static final Character cache[] = new Character[127 + 1];

        static {
            for (int i = 0; i < cache.length; i++)
                cache[i] = new Character((char)i);

    public static Character valueOf(char c) {
        if (c <= 127) { // must cache
            return CharacterCache.cache[(int)c];
        return new Character(c);

It turns out the Java Language Standard mandates that Character.valueOf() reuse cached Character instances for the first 127 chars. The goal is to reduce allocations and garbage collections, but it seems like premature optimization to me. And here it's impeding other optimizations! The JVM can't tell that Character.valueOf(c).charValue() == c, because it doesn't know the contents of the cache array. So instead it grabs a Character from the cache and reads its value, just to end up with the same value it started with.

The workaround

The workaround is pretty simple:

@@ -11,7 +11,7 @@ public class StringAsList extends AbstractList<Character> {
     public Character get(int index) {
-        return str.charAt(index); // Autoboxing!
+        return new Character(str.charAt(index));

By replacing the autoboxing with explicit boxing, avoiding Character.valueOf() altogether, the code becomes much easier for the JVM to reason about:

    private final char value;

    public Character(char value) {
        this.value = value;

    public char charValue() {
        return value;

Even though we've "added" an allocation, the JVM sees right through it and simply grabs the char directly out of the String. The performance boost is noticeable:

$ java -jar target/benchmarks.jar -f 1 -wi 8 -i 8
# Run complete. Total time: 00:00:16

Benchmark                    Mode  Cnt     Score    Error  Units
MyBenchmark.timeLevenshtein  avgt    8  1221.151 ± 58.878  ns/op

A full 14% faster. -prof perfasm confirms that char values are loaded right out of the String and compared in registers:

movzwl 0x10(%rsi,%rdx,2),%r11d  ;*caload
                                ; - java.lang.String::charAt@27 (line 648)
                                ; - com.tavianator.boxperf.StringAsList::get@9 (line 14)
                                ; - com.tavianator.boxperf.StringAsList::get@2 (line 5)
                                ; - com.tavianator.boxperf.Levenshtein::distance@121 (line 32)
cmp    %r11d,%r10d
je     0x00007faa8d404792  ;*if_icmpne
                           ; - java.lang.Character::equals@18 (line 4621)
                           ; - com.tavianator.boxperf.Levenshtein::distance@137 (line 33)


Boxing is still a weak area for HotSpot, and I'd like to see it become better. It should take more advantage of the semantics of boxed types to eliminate boxing in more cases, removing the need for these workarounds.

All the code used for these benchmarks is available on GitHub.


pron 2015-07-11

There is some ongoing work on improving "box elision": http://mail.openjdk.java.net/pipermail/valhalla-dev/2014-November/000380.html

Fast, Branchless Ray/Bounding Box Intersections, Part 2: NaNs

2015-03-23 Tavian Barnes Comments Reddit

In part 1, I outlined an algorithm for computing intersections between rays and axis-aligned bounding boxes. The idea to eliminate branches by relying on IEEE 754 floating point properties goes back to Brian Smits in 1, and the implementation was fleshed out by Amy Williams. et al. in 2.

To quickly recap, the idea is to replace the naïve slab method:

bool intersection(box b, ray r) {
    double tmin = -INFINITY, tmax = INFINITY;

    for (int i = 0; i < 3; ++i) {
        if (ray.dir[i] != 0.0) {
            double t1 = (b.min[i] - r.origin[i])/r.dir[i];
            double t2 = (b.max[i] - r.origin[i])/r.dir[i];

            tmin = max(tmin, min(t1, t2));
            tmax = min(tmax, max(t1, t2));
        } else if (ray.origin[i] <= b.min[i] || ray.origin[i] >= b.max[i]) {
            return false;

    return tmax > tmin && tmax > 0.0;

with the equivalent but faster:

bool intersection(box b, ray r) {
    double tmin = -INFINITY, tmax = INFINITY;

    for (int i = 0; i < 3; ++i) {
        double t1 = (b.min[i] - r.origin[i])*r.dir_inv[i];
        double t2 = (b.max[i] - r.origin[i])*r.dir_inv[i];

        tmin = max(tmin, min(t1, t2));
        tmax = min(tmax, max(t1, t2));

    return tmax > max(tmin, 0.0);

Are the two algorithms really equivalent? We've eliminated the $\mathtt{ray.dir}_i \ne 0$ checks by relying on IEEE 754 floating point behaviour. When $\mathtt{ray.dir}_i = \pm 0$, $\mathtt{ray.dir\_inv}_i = \pm \infty$. If the ray origin's $i$ coordinate is inside the box, meaning $\mathtt{b.min}_i < \mathtt{r.origin}_i < \mathtt{b.max}_i$, we'll have $t_1 = -t_2 = \pm \infty$. Since $\max(n, -\infty) = \min(n, +\infty) = n$ for all $n$, $t_{\min}$ and $t_{\max}$ will remain unchanged.

On the other hand, if the $i$ coordinate is outside the box ($\mathtt{r.origin}_i < \mathtt{b.min}_i$ or $\mathtt{r.origin}_i > \mathtt{b.max}_i$), we'll have $t_1 = t_2 = \pm \infty$, and therefore either $t_{\min} = +\infty$ or $t_{\max} = -\infty$. One of those values will stick around through the rest of the algorithm, causing us to return false.

Unfortunately the above analysis has a caveat: if the ray lies exactly on a slab ($\mathtt{r.origin}_i = \mathtt{b.min}_i$ or $\mathtt{r.origin}_i = \mathtt{b.max}_i$), we'll have (say)

t_1 &= (\mathtt{b.min}_i - \mathtt{r.origin}_i) \cdot \mathtt{r.dir\_inv}_i \\
&= 0 \cdot \infty \\
&= \mathrm{NaN}

which behaves a lot less nicely than infinity. Correctly handling this edge (literally!) case depends on the exact behaviour of min() and max().

On min and max

The most common implementation of min() and max() is probably

#define min(x, y) ((x) < (y) ? (x) : (y))
#define max(x, y) ((x) > (y) ? (x) : (y))

This form is so pervasive it was canonised as the behaviour of the min/max instructions in the SSE/SSE2 instruction sets. Using these instructions is key to getting good performance out of the algorithm. That being said, this form has some odd behaviour involving NaN. Since all comparisons with NaN are false,

\min(x, \mathrm{NaN}) & = & \max(x, \mathrm{NaN}) & = & \mathrm{NaN} \\
\min(\mathrm{NaN}, x) & = & \max(\mathrm{NaN}, x) & = & x.

The operations neither propagate nor suppress NaNs; instead, when either argument is NaN, the second argument is always returned. (There is also similar odd behaviour concerning signed zero, but it doesn't affect this algorithm.)

In contrast, the IEEE 754-specified min/max operations (called "minNum" and "maxNum") suppress NaNs, always returning a number if possible. This is also the behaviour of C99's fmin() and fmax() functions. On the other hand, Java's Math.min() and Math.max() functions propagate NaNs, staying consistent with most other binary operations on floating point values. 3 and 4 have some more discussion about the various min/max implementations in the wild.

The problem

The IEEE and Java versions of min() and max() provide consistent behaviour: all rays that lie exactly on a slab are considered not to intersect the box. It's easy to see why for the Java version, as the NaNs eventually pollute all the computations and make us return false. For the IEEE version, $\min(t_1, t_2) = \max(t_1, t_2) = \pm \infty$, which is the same as when the ray is entirely outside the box.

(Since this is an edge case, you might wonder why we don't choose to return true instead of false for rays on the boundary. It turns out to be much harder to get this behaviour with efficient code.)

With the SSE-friendly min/max implementations, the behaviour is inconsistent. Some rays that lie on a slab will intersect, even if they are completely outside the box in another dimension:

Ray/box NaN example

In the above image, the camera lies in the plane of the top face of the cube, and intersections are computed with the above algorithm. The top face is seen to extend out past the sides, due to improper NaN handling.

The workaround

When at most one argument is NaN, we can simulate the IEEE behaviour with

\mathrm{minNum}(x, y) &= \min(x, \min(y, +\infty)) \\
\mathrm{maxNum}(x, y) &= \max(x, \max(y, -\infty))

Thierry Berger-Perrin applies a similar strategy in 5, effectively computing

tmin = max(tmin, min(min(t1, t2), INFINITY));
tmax = min(tmax, max(max(t1, t2), -INFINITY));

in the loop instead. It is also fine to do

tmin = max(tmin, min(min(t1, t2), tmax));
tmax = min(tmax, max(max(t1, t2), tmin));

which is around 30% faster because CPUs are slower at handling floating point special cases (infinities, NaNs, subnormals). For the same reason, it's better to unroll the loop like this, to avoid dealing with any more infinities than necessary:

bool intersection(box b, ray r) {
    double t1 = (b.min[0] - r.origin[0])*r.dir_inv[0];
    double t2 = (b.max[0] - r.origin[0])*r.dir_inv[0];

    double tmin = min(t1, t2);
    double tmax = max(t1, t2);

    for (int i = 1; i < 3; ++i) {
        t1 = (b.min[i] - r.origin[i])*r.dir_inv[i];
        t2 = (b.max[i] - r.origin[i])*r.dir_inv[i];

        tmin = max(tmin, min(min(t1, t2), tmax));
        tmax = min(tmax, max(max(t1, t2), tmin));

    return tmax > max(tmin, 0.0);

It's a little harder to see why this version is correct: any NaNs from the $x$ coordinate will propagate through to the end, while NaNs from other coordinates will result in $t_{\min} \ge t_{\max}$; in both cases, false is returned.

With GCC 4.9.2 at -O3 this implementation handles just over 93 million rays per second, meaning the runtime is around 30 clock cycles, even without vectorization!

The other workaround

Sadly, that's still about 15% slower than the version with no explicit NaN handling. And since this algorithm is generally used when traversing a bounding volume hierarchy, the worst thing that can happen is you traverse more nodes than necessary in degenerate cases. For many applications, this is well worth it, and it should never result in any visual artifacts in practice if the ray/object intersection functions are implemented correctly. For completeness, here's a fast implementation (108 million rays/second) that doesn't attempt to handle NaNs consistently:

bool intersection(box b, ray r) {
    double t1 = (b.min[0] - r.origin[0])*r.dir_inv[0];
    double t2 = (b.max[0] - r.origin[0])*r.dir_inv[0];

    double tmin = min(t1, t2);
    double tmax = max(t1, t2);

    for (int i = 1; i < 3; ++i) {
        t1 = (b.min[i] - r.origin[i])*r.dir_inv[i];
        t2 = (b.max[i] - r.origin[i])*r.dir_inv[i];

        tmin = max(tmin, min(t1, t2));
        tmax = min(tmax, max(t1, t2));

    return tmax > max(tmin, 0.0);

The program I used to test various intersection() implementations is given in 6. In my next post on this topic, I'll talk about low-level implementation details, including vectorization, to get the most performance possible out of this algorithm.


Brian Smits: Efficiency Issues for Ray Tracing. Journal of Graphics Tools (1998).


Amy Williams. et al.: An Efficient and Robust Ray-Box Intersection Algorithm. Journal of Graphics Tools (2005).


Sven-Hendrik Haase 2015-11-20

For some reason, your code doesn't yield any results in my program. I'm using glm and wrote it like this: https://gist.github.com/a64045811d5dcf378b6a

It just never collides with anything. However, this: https://gist.github.com/a64045811d5dcf378b6a from scratchapixel.com works just fine.

I would like to use your algorithm, though, since it's likely faster. Can you help me figure out where I went wrong?

Tavian Barnes 2015-12-01

Well, here, you are multiplying by dir.x instead of dir_inv.x.

Sven-Hendrik Haase 2015-12-03

Oh gee, thanks man! Awesome work.

Anna 2017-02-08

Hi! Why is this article using "return tmax > tmin" rather then "return tmax >= tmin" from the previous article?

Tavian Barnes 2017-02-18

You will have tmax == tmin whenever the ray exactly intersects an edge or corner of the bounding box. Since we already decided that a ray that lies exactly in the plane of a face isn't an intersection, I thought it made sense for the edges and corners to not count either.

Sebastien Maraux 2018-05-28

Nice article. I tried the SSE implementation, and some unit tests at home are failing on specific case, namely BBoxes with very small size in one dimension : if the dimension size is lost by the float / double precision when substracted to origin, a comparison with tmax >= tmin is mandatory. This occurs in my case with a bbox of a ground roughly 0 (+/- 1e-15) in height, and the origin position height was at 1e-6, using floats. The same issue can occur with double precision and bigger difference. I consider that points on edges / corners are inside the intersection to get rid of this, as it does not lead to issues in my use case.

Matas Peciukonis 2017-03-26

How do you get the normals of the box, I don't understand, without them , how do you shade anything?

Tavian Barnes 2017-02-18

This is for bounding boxes, not box objects in a scene. But if you want to use this approach for boxes in your scene, you can compute the intersection point in object space (assuming your box is from (-1, -1, -1) to (1, 1, 1)) and then clip all the values that aren't very close to -1 or 1 to zero. For example, if the intersection point is (-1, 0.1, 0.9), the normal is (-1, 0, 0).

John Novak 2017-06-13

I actually came up with a somewhat similar solution for a fast box normal calculation routine. As usual, there were a few subtleties to this...


Aleksei 2018-08-02

Code doesn't work if you want to intersect a ray with infinity thin box, like AABB of a plane. However it works if we change it to: return tmax >= max(tmin, 0.0);

Tara 2019-03-25

I spent half a fucking day debugging this shit cuz no attention was paid to that problem!

Use >= instead of >!!!

Eric James 2018-09-09

An interesting approach of eliminating floating point round off errors during the computation of the Slabs Method t intersection distances can be found here: https://github.com/constantinides/RAABB

It seems to me it is not related at all with software speed up, but it shows how to eliminate false negatives when using floating point. Did you know about this problem?

Efficient Integer Exponentiation in C

2014-10-05 Tavian Barnes

It's surprisingly difficult to find a good code snippet for this on Google, so here's an efficient computation of integer powers in C, using binary exponentiation:

// Computes b**e (mod UINT32_MAX)
ipow(uint32_t b, uint32_t e)
  uint32_t ret;
  for (ret = 1; e; e >>= 1) {
    if (e & 1) {
      ret *= b;
    b *= b;
  return ret;

GCC 4.9.1 (and likely other versions) is smart enough to replace the if (e & 1) branch with a conditional move, generating very fast code.

Of course, this computes the result modulo UINT32_MAX. To use a different modulus, just reduce after each multiplication:

// Computes b**e (mod m)
ipowm(uint32_t b, uint32_t e, uint32_t m)
  uint32_t ret;
  b %= m;
  for (ret = m > 1; e; e >>= 1) {
    if (e & 1) {
      ret = (uint64_t)ret * b % m;
    b = (uint64_t)b * b % m;
  return ret;

(Note the ret = m > 1 instead of ret = 1, to handle the case e == 0 && m == 1.)

Unfortunately, GCC isn't smart enough to realise the limited range of the operands and generates a full 64-bit multiply and divide for each ... * b % m operation. For extra performance, this bit of inline assembly for x86 and x86-64 gives about a 15% boost:

// Computes a * b (mod m), as long as a / b is representable in 32 bits
static uint32_t
reduced_multiply(uint32_t a, uint32_t b, uint32_t m)
  uint32_t q, r;

  __asm__ ("mull %3\n\t"
           "divl %4"
           : "=a" (q), "=&d" (r)
           : "0" (a), "rm" (b), "rm" (m));

  return r;

// Computes b**e (mod m)
ipowm(uint32_t b, uint32_t e, uint32_t m)
  uint32_t ret;
  b %= m;
  for (ret = m > 1; e; e >>= 1) {
    if (e & 1) {
      ret = reduced_multiply(ret, b, m);
    b = reduced_multiply(b, b, m);

  return ret;

Standards-compliant* alloca()

2014-07-31 Tavian Barnes Reddit

The alloca() function in C is used to allocate a dynamic amount of memory on the stack. Despite its advantages in some situations, it is non-standard and will probably remain so forever.

There's a feature of C99 (made optional in C11) called variable-length arrays (VLAs for short) that works in many of the same situations. So instead of something like this:

char *path = alloca(strlen(dir) + strlen(file) + 2);
sprintf(path, "%s/%s", dir, file);

you can write this and have a conforming program:

char path[strlen(dir) + strlen(file) + 2];
sprintf(path, "%s/%s", dir, file);

But sometimes VLAs aren't quite as powerful as alloca(). The most recent example I encountered was trying to allocate a struct with a flexible array member on the stack. Here's a simpler example:

Imagine a library author has hidden the definition of a struct to ensure better forward-compatibility. To allow clients to allocate the struct on the stack, she makes the size of the structure available in a global variable, intended to be used like this:

// Library header
struct foo;
extern const size_t foo_size /* = sizeof(struct foo) */;

// Client code
void bar(void) {
    struct foo *foo = alloca(foo_size);

If we try to replace the alloca() with a VLA, we get this:

// Client code
void bar(void) {
    char buffer[foo_size];
    struct foo *foo = (struct foo *)buffer;

Unfortunately that's not quite standards-compliant, because buffer likely has a weaker alignment requirement than foo. Furthermore, char can't be replaced with any other type, because that would violate strict aliasing rules*.

C11 provides us with a solution: the alignas specifier. Using it, we can force the memory block to have the strictest possible alignment:

#include <stdalign.h>

void bar(void) {
    alignas(max_align_t) char buffer[foo_size];
    struct foo *foo = (struct foo *)buffer;

The ugliness can be hidden inside a macro that's almost as convenient as alloca():

#if __STDC_VERSION__ < 201112L || defined(__STDC_NO_VLA__)
#error "C11 with VLAs is required"

#include <stdalign.h>

#define ALLOCA(var, size)             \
    ALLOCA2(var, size, __LINE__)
#define ALLOCA2(var, size, unique_id) \
    ALLOCA3(var, size, unique_id)
#define ALLOCA3(var, size, unique_id)                         \
    alignas(max_align_t) char alloca_buffer##unique_id[size]; \
    var = (void *)alloca_buffer##unique_id

void bar(void) {
    ALLOCA(struct foo *foo, foo_size);

Aside from the slightly uglier syntax, the only difference is in the lifetime of the allocation. Traditionally, memory allocated with alloca() is freed at the end of the calling function, while VLAs are freed ad the end of the enclosing block.

*Reddit user nooneofnote points out that there is still an aliasing issue, because while char * can alias any other type, other types may not necessarily alias an array of char. This trick does work in practice but is sadly not as compliant as I thought.

The Visitor Pattern in Python

2014-06-19 Tavian Barnes Comments

The visitor pattern is tremendously useful when working with certain kinds of information like abstract syntax trees. It's basically a poor man's version of sum types for languages that don't natively support them. Unfortunately, they take advantage of function overloading, something which duck-typed languages like Python lack.

This blog post by Chris Lamb presents a clever workaround, but stops short of giving the actual implementation for the relevant decorators. The idea looks like this:

class Lion: pass
class Tiger: pass
class Bear: pass

class ZooVisitor:
    def visit(self, animal):
        return "Lions"

    def visit(self, animal):
        return "tigers"

    def visit(self, animal):
        return "and bears, oh my!"

animals = [Lion(), Tiger(), Bear()]
visitor = ZooVisitor()
print(', '.join(visitor.visit(animal) for animal in animals))
# Prints "Lions, tigers, and bears, oh my!"

It looks a little suspicious (after all, we've defined three conflicting methods on the same class), but you can write @visitor in a way that makes it work:

# A couple helper functions first

def _qualname(obj):
    """Get the fully-qualified name of an object (including module)."""
    return obj.__module__ + '.' + obj.__qualname__

def _declaring_class(obj):
    """Get the name of the class that declared an object."""
    name = _qualname(obj)
    return name[:name.rfind('.')]

# Stores the actual visitor methods
_methods = {}

# Delegating visitor implementation
def _visitor_impl(self, arg):
    """Actual visitor method implementation."""
    method = _methods[(_qualname(type(self)), type(arg))]
    return method(self, arg)

# The actual @visitor decorator
def visitor(arg_type):
    """Decorator that creates a visitor method."""

    def decorator(fn):
        declaring_class = _declaring_class(fn)
        _methods[(declaring_class, arg_type)] = fn

        # Replace all decorated methods with _visitor_impl
        return _visitor_impl

    return decorator

The trick here is that the decorator replaces all the visit methods with _visitor_impl (redefining an existing method is fine in Python). But before it does that, it stores the original method in a dictionary, _methods, keyed by the visitor class and the desired argument type. Then, when visit is invoked, _visitor_impl looks up the appropriate implementation and invokes it based on the argument type.


Kenji Noguchi 2015-01-30

Here is another implementation.

Exact Bounding Boxes for Spheres/Ellipsoids

2014-06-08 Tavian Barnes Comments

Finding the tightest axis-aligned bounding box for a sphere is trivial: the box extends from the center by the radius in all dimensions. But once the sphere is transformed, finding the minimal bounding box becomes trickier. Rotating a sphere, for example, shouldn't change its bounding box, but naïvely rotating the bounding box will expand it unnecessarily. Luckily there's a trick to computing minimal bounding boxes by representing the transformed sphere as a quadric surface.

A description of the technique can be found on Yining Karl Li's blog here, and in this Stack Overflow answer. Unfortunately, both descriptions gloss over the math, as well as some details that give rise to a simple and efficient implementation. The post on Li's blog also contains a small bug. In contrast, this post examines the technique in full detail. It relies on the projective representation of a quadric.

A sphere of radius 1, centred at the origin, can be defined as the set of points where $x^2 + y^2 + z^2 = 1$. For the homogeneous point $\vec{p} = \begin{bmatrix}x & y & z & 1\end{bmatrix}^\top$, the same sphere can be defined by

\vec{p}^\top \, \mathbf{S} \, \vec{p} = 0,


\mathbf{S} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & -1

If an arbitrary affine transformation $\mathbf{M}$ is applied to the sphere, the equation becomes

(\mathbf{M}^{-1}\,\vec{p})^\top \, \mathbf{S} \, (\mathbf{M}^{-1}\,\vec{p}) & = \vec{p}^\top \, (\mathbf{M}^{-\mathrm{T}}\,\mathbf{S}\,\mathbf{M}^{-1}) \, \vec{p} \\
{} & = 0.

We denote this transformed quadric matrix by $\mathbf{Q} = \mathbf{M}^{-\mathrm{T}}\,\mathbf{S}\,\mathbf{M}^{-1}$. Note that $\mathbf{Q}$ is symmetric.

The Dual Form

The minimal bounding box for this surface will be tangent to it, so we need to find an expression for its tangent planes. In homogeneous coordinates, a plane can be represented by a row vector $\vec{u}^\top$ such that $\vec{u}\cdot\vec{p} = 0$. For a point $\vec{p}$ on the surface, the plane $\vec{u}^\top = \vec{p}^\top \, \mathbf{Q}$ touches the quadric at $\vec{p}$, since

\vec{u}\cdot\vec{p} & = \vec{u}^\top \, \vec{p} \\
{} & = \vec{p}^\top \, \mathbf{Q} \, \vec{p} \\
{} & = 0.

Furthermore, this must be the unique intersection point. To see why, consider a point $\vec{q} = \vec{p} + r\,\hat{r}$ that falls on both the plane and the quadric.

\vec{q}^\top \, \mathbf{Q} \, \vec{q} & = (\vec{p} + r\,\hat{r})^\top \, \mathbf{Q} \, \vec{q} \\
{} & = \vec{u}\cdot\vec{p} + r\,\hat{r}^\top \, \mathbf{Q} \, \vec{q} \\
{} & = r \, (\vec{q}^\top \, \mathbf{Q} \, \hat{r})^\top \\
{} & = r^2 \, (\hat{r}^\top \, \mathbf{Q} \, \hat{r}) \\
{} & = 0.

If $\hat{r}^\top \, \mathbf{Q} \, \hat{r}$ is zero, then the whole line $\vec{p} + t\,\hat{r}$ must lie on the quadric, which is impossible since its surface is curved. Therefore $r = 0$ and $\vec{q} = \vec{p}$. Since the plane has exactly one point in common with the quadric, $\vec{u}^\top$ must be the tangent plane at that point. This gives a nice characterization of the tangent planes:

\vec{u}^\top \, \mathbf{Q}^{-1} \, \vec{u} & = \vec{p}^\top \, \mathbf{Q} \, \mathbf{Q}^{-1} \, \mathbf{Q} \, \vec{p} \\
{} & = \vec{p}^\top \, \mathbf{Q} \, \vec{p} \\
{} & = 0.

Solving for the Planes

Let $\mathbf{R} = \mathbf{Q}^{-1} = \mathbf{M} \, \mathbf{S}^{-1} \, \mathbf{M}^\top$. Like $\mathbf{Q}$, $\mathbf{R}$ is symmetric. We want to find axis-aligned planes $\vec{u}^\top$ such that $\vec{u}^\top \, \mathbf{R} \, \vec{u} = 0$. The plane perpendicular to $\hat{x}$ looks like $\vec{u}^\top = \begin{bmatrix}1 & 0 & 0 & -x\end{bmatrix}$, so we have

\vec{u}^\top \, \mathbf{R} \, \vec{u} & =
1 & 0 & 0 & -x
1 \\
0 \\
0 \\
-x \\
\end{bmatrix} \\
{} & =
1 & 0 & 0 & -x
\mathbf{R}_{1,1} - \mathbf{R}_{1,4}\,x \\
\mathbf{R}_{2,1} - \mathbf{R}_{2,4}\,x \\
\mathbf{R}_{3,1} - \mathbf{R}_{3,4}\,x \\
\mathbf{R}_{4,1} - \mathbf{R}_{4,4}\,x \\
\end{bmatrix} \\
{} & = \mathbf{R}_{4,4}\,x^2 - (\mathbf{R}_{1,4} + \mathbf{R}_{4,1})\,x + \mathbf{R}_{1,1} \\
{} & = \mathbf{R}_{4,4}\,x^2 - 2\,\mathbf{R}_{1,4}\,x + \mathbf{R}_{1,1} \\
{} & = 0 \\
x & = \frac{\mathbf{R}_{1,4} \pm \sqrt{\mathbf{R}_{1,4}^2 - \mathbf{R}_{1,1}\,\mathbf{R}_{4,4}}}{\mathbf{R}_{4,4}}. \\

Similarly, the $\hat{y}$ and $\hat{z}$ planes are given by

y & = \frac{\mathbf{R}_{2,4} \pm \sqrt{\mathbf{R}_{2,4}^2 - \mathbf{R}_{2,2}\,\mathbf{R}_{4,4}}}{\mathbf{R}_{4,4}} \\
z & = \frac{\mathbf{R}_{3,4} \pm \sqrt{\mathbf{R}_{3,4}^2 - \mathbf{R}_{3,3}\,\mathbf{R}_{4,4}}}{\mathbf{R}_{4,4}}. \\

This is where the above linked posts stop, but a lot more simplification can be done.

Implementation Details

Several details of the problem can make computing the planes more efficient. The first is that $\mathbf{S}$ is involutory, meaning $\mathbf{S}^{-1} = \mathbf{S}$. This means that the product $\mathbf{M}\,\mathbf{S}^{-1}$ can be computed implicitly: it is simply $\mathbf{M}$ with its last column negated. The last column of $\mathbf{R} = \mathbf{M} \, \mathbf{S}^{-1} \, \mathbf{M}^\top$ is the same, because the last column of $\mathbf{M}^\top$ is $\begin{bmatrix}0 & 0 & 0 & 1\end{bmatrix}^\top$. In particular, $\mathbf{R}_{4,4} = -1$.

Not all values of $\mathbf{R}$ are used; in fact, only values from the last column and the diagonal appear in the formulae. We know the last column implicitly, and the diagonal has the formula

\mathbf{R}_{i,i} = \left(\sum_{j=1}^3 \mathbf{M}_{i,j}^2\right) - \mathbf{M}_{i,4}^2.

Plugging this identity into the plane equations simplifies them greatly:

x & = \mathbf{M}_{1,4} \pm \sqrt{\mathbf{M}_{1,1}^2 + \mathbf{M}_{1,2}^2 + \mathbf{M}_{1,3}^2} \\
y & = \mathbf{M}_{2,4} \pm \sqrt{\mathbf{M}_{2,1}^2 + \mathbf{M}_{2,2}^2 + \mathbf{M}_{2,3}^2} \\
z & = \mathbf{M}_{3,4} \pm \sqrt{\mathbf{M}_{3,1}^2 + \mathbf{M}_{3,2}^2 + \mathbf{M}_{3,3}^2}.

This makes at least some intuitive sense, since the fourth column of $\mathbf{M}$ determines the sphere's translation. See here for an implementation in my ray tracer Dimension.


Michael Doube 2015-03-24


Very nice simplification. How does M relate to the ellipsoid's eigenvectors and eigenvalues? Or, put another way: I already know the ellipsoid's centre, its radii & eigenvalues (U, 3 × 3) and an eigenvector roation matrix (V, 3 × 3). The centre is M[1,2,3],4, but how can I get the other M elements from what I know? Or another way: is the top left 3×3 of the 4×4 affine transform simply UV (which I also already have)?


I'd like to implement your solution in my Ellipsoid class, to exclude ellipsoids from other calculations:




Tavian Barnes 2015-03-25

If I'm not mistaken, you have $\mathbf{A} = \mathbf{V} \, \mathbf{U} \, \mathbf{V}^\mathrm{T}$ (with $\mathbf{U} = \mathrm{diag}(r_x^{-2}, r_y^{-2}, r_z^{-2})$) such that your ellipsoid is determined by $(\vec{x} - \vec{v})^\mathrm{T} \, \mathbf{A} \, (\vec{x} - \vec{v}) = 1$. In this case, $\mathbf{A}$ takes the place of $\mathbf{Q}$ above, but the expressions are in terms of $\mathbf{R} = \mathbf{Q}^{-1}$.

A more convenient formulation would have $\mathbf{U} = \mathrm{diag}(r_x^{-1}, r_y^{-1}, r_z^{-1})$. Then $\mathbf{A} = \mathbf{M}^{-\mathrm{T}} \, \mathbf{M}^{-1} = \mathbf{V} \, \mathbf{U} \, (\mathbf{V} \, \mathbf{U})^\mathrm{T}$, so $\mathbf{M}^{-1} = (\mathbf{V} \, \mathbf{U})^\mathrm{T} = \mathbf{U} \, \mathbf{V}^\mathrm{T}$ and $\mathbf{M} = \mathbf{V}^{-\mathrm{T}} \, \mathbf{U}^{-1}$. Since $\mathbf{V}$ is a rotation matrix, $\mathbf{V}^{-\mathrm{T}} = \mathbf{V}$, so $\mathbf{M} = \mathbf{V} \, \mathbf{U}^{-1}$. $\mathrm{U}^{-1}$ is of course $\mathrm{diag}(r_x, r_y, r_z)$. That means your bounding box will be given by

x &= v_x \pm \sqrt{r_x^2\,\mathbf{V}_{1,1}^2 + r_y^2\,\mathbf{V}_{1,2}^2 + r_z^2\,\mathbf{V}_{1,3}^2} \\
y &= v_y \pm \sqrt{r_x^2\,\mathbf{V}_{2,1}^2 + r_y^2\,\mathbf{V}_{2,2}^2 + r_z^2\,\mathbf{V}_{2,3}^2} \\
z &= v_z \pm \sqrt{r_x^2\,\mathbf{V}_{3,1}^2 + r_y^2\,\mathbf{V}_{3,2}^2 + r_z^2\,\mathbf{V}_{3,3}^2}.

Michael Doube 2015-03-25

Hi Tavian,

Thanks for the reply. I've worked out the affine transform, based on sequential scaling, rotation and transformation. The scaling matrix S is diag(ra, rb, rc), the rotation is V and the transform, M is given by M = VS. Then I use the values from M as you originally proposed. (seems to work and a massive speedup, thank you!)

The result looks very similar to your comment above except that the off-diagonal terms are also multiplied, because the rotation matrix is postmultiplied by the diagonal scaling matrix. (Each column of the affine transform is multiplied by the corresponding element of the diagonal scaling transform)

In Java:

public double[] getXMinAndMax() {
	final double m11 = ev[0][0] * ra;
	final double m12 = ev[0][1] * rb;
	final double m13 = ev[0][2] * rc;
	final double d = Math.sqrt(m11 * m11 + m12 * m12 + m13 * m13);
	double[] minMax = { cx - d, cx + d };
	return minMax;

public double[] getYMinAndMax() {
	final double m21 = ev[1][0] * ra;
	final double m22 = ev[1][1] * rb;
	final double m23 = ev[1][2] * rc;
	final double d = Math.sqrt(m21 * m21 + m22 * m22 + m23 * m23);
	double[] minMax = { cy - d, cy + d };
	return minMax;

public double[] getZMinAndMax() {
	final double m31 = ev[2][0] * ra;
	final double m32 = ev[2][1] * rb;
	final double m33 = ev[2][2] * rc;
	final double d = Math.sqrt(m31 * m31 + m32 * m32 + m33 * m33);
	double[] minMax = { cz - d, cz + d };
	return minMax;

Tavian Barnes 2015-03-25

Yep, I forgot how to multiply matrices apparently. Fixed! Glad you found this post useful.

William H Walker 2019-12-01

I've been trying to understand the practical application of SVD to a simple data set, and have constructed an ellipsoid analogous to what is shown in:

I'm not a professional, and never did any work in the fields of imaging, statistics, or linear algebra, but can follow most of what is going on.
I'm using POV-Ray to visualize what is going on, and do most of the post-SVD matrix math and geometric transforms.
I generate random points on a sphere, anisotropically scale the system, and then rotate it around z, y, and x to give me an ellipsoid.
Over the last two weeks, I have spent much of my free time trying to do two things:

  1. Construct a set of basis vectors from the SVD that properly align with the ellipsoid, and then rotate the system so that the ellipsoid is aligned with the cardinal Euclidean axes.

  2. Understand the rest of the SVD results, and actually apply them to the ellipsoid to regenerate the original spherical system.

Could you possibly offer me some assistance, at least to reliably achieve goal #1? I've tried to contact several other people directly or through fora already, and I'm met with no reply, a complete failure to understand the situation and issue(s), or ... worse. :(

I have seen it stated in one variation or another that, "the columns of U are the axes of the transformed data hyperellipse and the corresponding diagonal elements of S are their lengths."

I'm struggling with this for a number of reasons. There is the narrow or thin SVD, and the wide or full SVD. So sometimes it's unclear to me which matrix is which. In either case, one of the matrices from the SVD is a max(n,m) x max (n,m) matrix.
If I have 30, or 400 data points, that's a 30x30 or 400 x 400 matrix. How are axes and their lengths extracted from that?

I've done both the narrow and wide SVDs, and I think I always wind up getting the smaller 3x3 matrix to be an orthonormal matrix whose axes get aligned with the cardinal axes when I apply the matrix-transpose. That makes sense to me, as the axes would need to be unit lengths in order to be properly scaled by S, and being orthonormal and having a determinant of 1 shows that it's a purely rotational matrix. I confirm that graphically, but the axes don't line up with the axes of the ellipsoid data points.

For goal #2, understanding the rest of the SVD:

Going back to, "the columns of U are the axes of the transformed data hyperellipse and the corresponding diagonal elements of S are their lengths," I have also seen it claimed that, "To approximate the original matrix, it makes sense to use the U and V matrices of the original matrix and then to adjust the S matrix. That way we can get the original matrix back for a full rank approximation by using the complete S matrix. "

I guess I'm naively believing that since [A] = [U][S][VT], that S would/should give me the scaling factors that I use to generate the ellipsoid from the sphere. Perhaps I could accept the explanation that a composition of the matrices of the SVD is only an approximation of the original matrix, but I find it extremely curious that using 3 different tools to calculate the SVD gives very closely matching results. Shouldn't I get an accurate set of scaling factors from S?

And now the real beast: "multiplying by U rotates the scaled data." That sounds great and looks even better when written symbolically, but the practical details completely elude me.

When I create a transform matrix for my sphere, I make a matrix for scaling and three matrices for the rotations around z, y, and x. The composition of these matrices gives me a final overall matrix ([A]?) that converts the sphere into the ellipsoid. That matrix is still only 3x3. How does one get a max(n,m) x max (n,m) matrix, and more importantly (for me), how does one use that huge matrix as a "rotation matrix"?

A Beautiful Ray/Mesh Intersection Algorithm

2014-05-30 Tavian Barnes Comments

In my last post, I talked about a beautiful method for computing ray/triangle intersections. In this post, I will extend it to computing intersections with triangle fans. Since meshes are often stored in a corner table, which is simply an array of triangle fans, this gives an efficient algorithm for ray tracing triangle meshes.

The aforementioned algorithm computes ray/triangle intersections with 1 division, 20 multiplications, and up to 18 additions. It also required storing an affine transformation matrix, which takes up 4/3 as much space as just storing the vertices. But if we have many triangles which all share a common vertex, we can exploit that structure to save time and memory.

Say our triangle fan is composed of triangles $ABC$, $ACD$, $ADE$, etc. As before, we compute

P_{ABC} =
\overrightarrow{AB} & \overrightarrow{AC} & \overrightarrow{AB} \times \overrightarrow{AC} & \overrightarrow{A} \\
0 & 0 & 0 & 1

Computing the change of basis from here to the next triangle is even easier. We want to find new coordinates $\langle u', v', w' \rangle$ such that

\vec{p} &=& u\hphantom{'}\,\overrightarrow{AB} &+& v\hphantom{'}\,\overrightarrow{AC} &+& w\hphantom{'}\,\overrightarrow{AB}\times\overrightarrow{AC} &+& \overrightarrow{A} \\
   {}   &=& u'\,\overrightarrow{AC} &+& v'\,\overrightarrow{AD} &+& w'\,\overrightarrow{AC}\times\overrightarrow{AD} &+& \overrightarrow{A}.

Two things are apparent: the first is that there is no further translation to perform because both coordinate systems have their origin at $\overrightarrow{A}$. The second is that only $u'$ depends on $v$. This means the matrix $P^{ABC}_{ACD}$ that takes us to the new basis has the special form

P^{ABC}_{ACD} = P_{ACD}\,P^{-1}_{ABC} =
a & 1 & d & 0 \\
b & 0 & e & 0 \\
c & 0 & f & 0 \\
0 & 0 & 0 & 1

so we only need to store two of its columns, and transforming the ray into the new space can be done much faster than with a full matrix multiplication. The following transformation is applied to both the ray origin and direction:

u' &= a\,u + v + d\,w \\
v' &= b\,u + e\,w \\
w' &= c\,u + f\,w.

In the end, for a triangle fan with $n$ vertices, the ray intersection can be computed with

n & \text{ divisions,} \\
6 + 14\,n & \text{ multiplications, and} \\
7 + 11\,n & \text{ additions}.

The multiplications and additions are also easily vectorisable. The storage requirement is $6\,(n + 1)$ floating-point values, which is equivalent to storing all the vertices and precomputed normals.

The implementation of this algorithm in my ray tracer Dimension is available here.


Mo 2014-05-30

Pretty cool, I wonder if this can be extended to triangle strips as well. That one might be a lot more challenging though.

$$ Also I wonder if these dollar signs will break your page with a MathJax injection. I imagine it's more intelligent than that.

Mo 2014-05-30

Nope. :)

Tavian Barnes 2014-05-30

Haha yeah the MathJax is all done with a WordPress plugin, I write [latex]m^at_h[/latex] and it prettifies it. I'm hoping it doesn't scan the page and replace $m^at_h$ with MathJax.

And you're right, it looks hard to do this well with triangle strips because they don't all share a common vertex. The above still works except the translation comes back which wastes time and space.

A Beautiful Ray/Triangle Intersection Method

2014-05-23 Tavian Barnes Comments

3D ray/triangle intersections are obviously an important part of much of computer graphics. The Möller–Trumbore algorithm, for example, computes these intersections very quickly. But there is another method that I believe is more elegant, and in some cases allows you to compute the intersection for “free.”

Say your triangle has corners at $A$, $B$, and $C$. If your triangle is not degenerate, those three points define a plane. Writing the vector from $A$ to $B$ as $\overrightarrow{AB} = \overrightarrow{B} - \overrightarrow{A}$, and similarly for $\overrightarrow{AC}$, the normal vector perpendicular to that plane is given by

\overrightarrow{N} = \overrightarrow{AB} \times \overrightarrow{AC}

Since $\overrightarrow{AB}$, $\overrightarrow{AC}$, and $\overrightarrow{N}$ are all linearly independent, they form an alternative basis for all of 3-space. That is, any point $\overrightarrow{p}$ can be represented by a unique triple $\langle u, v, w \rangle$ such that

\vec{p} = u\,\overrightarrow{AB} + v\,\overrightarrow{AC} + w\,\overrightarrow{N}.

We can make this space even nicer by translating $A$ to the origin, giving

\vec{p} = u\,\overrightarrow{AB} + v\,\overrightarrow{AC} + w\,\overrightarrow{N} + \overrightarrow{A}.

In this space, the three corners of the triangle are at $\langle 0,0,0 \rangle$, $\langle 1,0,0 \rangle$, and $\langle 0,1,0 \rangle$. It is easy to see that a point is within the triangle if and only if $u,v \ge 0$, $u + v \le 1$, and $w = 0$.

To transform a point $(x,y,z)$ into this nicer space, we can use an affine change of basis matrix

P = \begin{bmatrix}
\overrightarrow{AB} & \overrightarrow{AC} & \overrightarrow{N} & \overrightarrow{A} \\
0 & 0 & 0 & 1

A line defined by $\overrightarrow{o} + t\,\overrightarrow{n}$ can thus be transformed into the new space by

\mathrlap{o_u}\hphantom{n_u} \\
\mathrlap{o_v}\hphantom{n_v} \\
\mathrlap{o_w}\hphantom{n_w} \\
& =
\mathrlap{o_x}\hphantom{n_x} \\
\mathrlap{o_y}\hphantom{n_y} \\
\mathrlap{o_z}\hphantom{n_z} \\
\end{bmatrix} \\
n_u \\
n_v \\
n_w \\
& =
n_x \\
n_y \\
n_z \\

Solving for the intersection point is easy:

t & = -o_w/n_w \\
u & = o_u + t\,n_u \\
v & = o_v + t\,n_v

And as above, this can be quickly checked with $u, v \ge 0$, $u + v \le 1$ (and $t \ge 0$ to only count intersections “in front” of the ray).

The beauty of this method is that often, objects are already associated with a transformation matrix, so if you left-multiply that matrix by $P$, the change of basis is performed for free at the same time as other transformations, and only the short block above is needed to actually perform the intersection test. At a cost of 1 division, 2 multiplications, 2 or 3 additions, and some comparisons, that's about as fast as possible without special-casing triangles versus other objects.

Of course, if you're working with lots of triangles (a mesh, for example), you can save memory and time by not associating a transformation matrix with each triangle. In that case, other algorithms such as the one linked above may be faster as they can avoid the two matrix multiplications.

An implementation of this method can be seen here in my ray tracer Dimension.


Raphael 2014-08-09

I'm a bit confused about the "cost of 1 division, 2 multiplications, 2 or 3 additions, and some comparisons". Don't you still have to multiply each ray by the matrix P? So even if P is precomputed and stored ahead of time, that's still 16 multiplications and 12 additions, plus the computations of t,u, and v.

Tavian Barnes 2014-08-10

That's in the case that "objects are already associated with a transformation matrix," in which case you have to do a matrix-ray multiplication anyway, so it doesn't count against you. Also I think you've underestimated the cost of a matrix-ray multiplication, it ought to be 18 multiplications and 15 additions.

Vladimir 2017-01-18

Hi! I wonder what is nw is 0? I watched sources from the dimension repo, that case is not handled at all:

Could you please explain what to do if nw is 0?

Tavian Barnes 2017-01-31

Geometrically, if $n_w$ is zero, then the line is exactly parallel to the plane of the triangle. For this degenerate case, it's easiest to just return "no intersection." There are a few sub-cases:

  • $o_w = 0$: Then we'll have $t = \mathrm{NaN}$, so $t \ge 0$ fails and we return false
  • $o_w > 0$: We'll have $t = -\infty$, so again $t \ge 0$ fails and we return false
  • $o_w < 0$: We get $t = \infty$. Depending on $n_u$ and $n_v$,
    • $n_u < 0$ or $n_v < 0$: Then $u = -\infty$, $v = -\infty$, so at least one of $u \ge 0$ or $v \ge 0$ fails and we return false
    • $n_u = 0$ or $n_v = 0$: Results in $u = \mathrm{NaN}$ or $v = \mathrm{NaN}$, so again one of $u \ge 0$ or $v \ge 0$ fails and we return false
    • $n_u > 0$ and $n_v > 0$: That means $u = v = \infty$, so $u + v \le 1$ fails and we return false

Without explicitly handling the $n_w = 0$ case, the properties of IEEE floating point arithmetic have given us the desired behaviour in all cases.

Announcing Sangria

2014-04-02 Tavian Barnes GitHub

Sangria is a new project of mine to release various Guice extensions I've been working on recently. Right now the coolest thing it can do is context-sensitive injections, allowing (among other things) first-class Logger injection for more than just java.util.logging.

For example, so allow SLF4J Logger injection, all you have to do is this:

import com.google.inject.AbstractModule;
import com.tavianator.sangria.slf4j.SangriaSlf4jModule;

public class YourModule extends AbstractModule {
    protected void configure() {
        install(new SangriaSlf4jModule());

And now this will just work:

import org.slf4j.Logger;

public class YourClass {
    private final Logger logger;

    YourClass(Logger logger) {
        this.logger = logger;

To create your own context-sensitive injections, implement the ContextSensitiveProvider interface:

import com.tavianator.sangria.contextual.ContextSensitiveProvider;

public class YourProvider implements ContextSensitiveProvider<YourType> {
    public YourType getInContext(InjectionPoint injectionPoint) {
        // Create an instance. The type you're being injected into is available
        // as injectionPoint.getDeclaringType().

    public YourType getInUnknownContext() {
        // Create an instance for an unknown context. The context will be
        // unknown for Provider<YourType> bindings and for Provider method
        // parameters. If the context is required, it is valid to throw an
        // exception here.

Then use ContextSensitiveBinder to bind it:

import com.google.inject.AbstractModule;
import com.tavianator.sangria.contextual.ContextSensitiveBinder;

public class YourModule extends AbstractModule {
    protected void configure() {

Sangria is released under the Apache License, Version 2.0.

k-d Forests

2014-03-10 Tavian Barnes Comments Reddit

Recently I saw an interesting Code Golf problem: create an image using all possible RGB colours. The top ranked submission, by József Fejes, contained some truly beautiful images, but they took an immense amount of time to generate.

As described in the answer and a follow-up blog post, the images were generated by putting a random colour in the centre, then placing the next colour adjacent to the most similar colour, repeatedly. The submitted code used exhaustive search over the boundary pixels to find the best location; using some geometric data structures, we can do a lot better.

The problem of finding where to place the current pixel amounts to finding the nearest neighbour to the colour being placed, using Euclidean distance over the RGB colour space. The most common data structure for nearest neighbour search is the k-d tree, but they don't perform well in the face of rapid insertion and removal. Since the boundary changes every time we place a pixel, we'll need a more dynamic data structure.

Usually R-trees are used for problems like these, but I came up with another idea. Instead of a single k-d tree, we can use a collection of k-d trees (hence the name "k-d forest") with increasing power-of-2 sizes. It looks like this:

Rather than re-build a large tree on every insertion, we usually only have to rebuild the first few small trees in the forest. Since k-d trees can be built in $O(n \lg n)$ time, we get this performance for n insertions:

\sum_{i=0}^{\lg n} \frac{n}{2^{i+1}} \, O \left( 2^i \, \lg 2^i \right)
& = \sum_{i=0}^{\lg n} O \left( \frac{n}{2^{i+1}} \, i \, 2^i \right) \\
{} & = \sum_{i=0}^{\lg n} O \left( n \, i \right) \\
{} & = O \left( n \, \sum_{i=0}^{\lg n} i \right) \\
{} & = O \left( n \, \lg^2 n \right).

Allowing node removal destroys this nice amortized performance though, because you could repeatedly remove and add a node right when the largest tree has to get rebuilt. So we cheat: instead of really deleting nodes, simply set a "removed" flag to true. To avoid wasting time searching through removed nodes, we can rebuild the whole forest without them once they reach, say, 50% of the total nodes.

Query performance is also nice. Since a single k-d tree takes $O(\lg n)$ on average to answer a nearest-neighbour query, and we have up to $O(\lg n)$ trees, our average running time per query is

\sum_{i=0}^{\lg n} O \left( \lg 2^i \right)
& = \sum_{i=0}^{\lg n} O \left( i \right) \\
{} & = O \left( \lg^2 n \right).

So what's the real-world performance like? Well, it takes only about four minutes to generate a 4096×4096 image using all possible 24-bit colours. A straight-across comparison isn't exactly fair due to me using C instead of C♯, but the original code took dozens of hours to generate the same images.

Rather than making the images available, I've hosted the code here. It takes about as long to generate the images (even for the full 24 bit spectrum) as it would to download them anyway. Simply checkout the code and run make image to generate an image. You can customize some #defines in main.c to generate different kinds and sizes of images.

EDIT: Since József Fejes's 24-bit images aren't up anymore, I've shared a Google Drive folder with mine here.

Thanks to József Fejes for the concept.


Andrew McDowell 2014-03-12

There is some more info on this sort of thing at http://en.wikipedia.org/wiki/Dynamization, http://wwwisg.cs.uni-magdeburg.de/ag/lehre/SS2009/GDS/slides/S12.pdf, and http://www.mpi-inf.mpg.de/~mehlhorn/ftp/mehlhorn27.pdf.

Tavian Barnes 2014-03-12

Thanks for the links! I've seen the technique applied before but it's interesting to see the black-box analysis.

Leland Batey 2014-03-13

Hey, I encountered some errors with building this, and I wanted to share how I resolved them.

Basically, running make yielded linker-errors saying it couldn't find references to a variety of functions in libpng. To fix this, I had to modify the make command to be:

cc -std=c99 -pipe -O2 -Werror -Wall -Wpedantic -Wextra -Wno-sign-compare -Wno-unused-parameter -Wunreachable-code -Wshadow -Wpointer-arith -Wwrite-strings -Wcast-align -Wstrict-prototypes -lm -lpng -o kd-forest kd-forest.c util.c color.c main.c -lpng

The change is the addition of -lpng on the end. That resolved my compilation problem.

Tavian Barnes 2014-03-13

Yeah it's good practice to put the -l*s after the source files but I forgot. Fixed now, thanks!

Gregor_TL 2014-03-19

How did you generate *hue and *hue2 images? What makes them so different?

Tavian Barnes 2014-03-19

You can see in József's hue-sorted images that there are some artefacts near the edges. This happens because there are naturally "holes" in the image as it's generated, and by the time the purple pixels are being placed, those holes are the only empty places for them.

In the *hue.png images, I've tried to mitigate it by doing multiple passes through the sorted colours. You can see that in the code here. That's why you see multiple "rings" of the entire spectrum.

In the *hue2.png images, I've used a better method. Instead of multiple evenly-spaced stripes, I do every other colour, then every 4th, then every 8th, 16th, etc. You can see that in the code here.

sammko 2014-07-30

Hi, I rendered a 22-bit animation (2048x2048) and cropped it down to 1080p. You can see the result on YouTube. Thanks for the algorithm :D

MrNosco 2015-09-01

You shouldn't use uint8_t pixel[3] notation in function arguments.
It might look like sizeof(pixel) should be sizeof(uint8_t[3]), but it's actually sizeof(uint_t*).

Tavian Barnes 2015-09-02

Yeah I read this thread too. But because this is my code, I get to have my own opinions on what I should and shouldn't do. The [3] is useful documentation. You'd have a point if I actually used sizeof(pixel) anywhere, but I don't.

Henry 2020-01-15

Hey I'm trying to do something similar a, but I have a question for you.

I understand how the minimum selection would be implemented with the KD-tree by querying for a nearest neighbor to the target color and then placing it in a randomly chosen adjacent location, but how did you do the average selection?

If you were to compute the average color for a location using it's 8 neighbors then you could nearest neighbor between the target and the average and place at the resulting spot. But I think that gives a slightly different result to the original guy who minimized the average of the distances to the neighbors , rather than minimizing the distance to the average of the neighbors.

Am I correct in thinking your averaging scheme is slightly different to the original guy's, or if not how did you do nearest neighbor search for the average selection case?

Tavian Barnes 2020-01-30

You're right! I didn't realize that's what the original code did. Actually it looks like the original code computes the average of the *squared* distances.

Big Numbers

2014-01-08 Tavian Barnes Comments

Take a number, say, 264, and write it in binary:

264 = 2^8 + 2^3.

Because we like binary so much, write the exponents in binary too:

264 &= 2^{2^3} + 2^{2 + 1} \\
&= 2^{2^{2 + 1}} + 2^{2 + 1}

and so on, until every number is 2 or less. This is the hereditary base-2 representation of 264. Now, transform that number by increasing the base from 2 to 3, and subtracting one:

G_2(264) &= 3^{3^{3 + 1}} + 3^{3 + 1} - 1 \\
&= 3^{3^{3 + 1}} + 2 \cdot 3^3 + 2 \cdot 3^2 + 2 \cdot 3 + 2 \\
&\approx 4.4 \cdot 10^{38}.

And again:

G_3(G_2(264)) &= 4^{4^{4 + 1}} + 2 \cdot 4^4 + 2 \cdot 4^2 + 2 \cdot 4 + 2 \\
&\approx 3.2 \cdot 10^{616}.

Goodstein sequences are defined by these repeated applications of $G_n$. The Goodstein sequence starting at $m$ is $m_1 = m$, $m_2 = G_2(m_1)$, $m_3 = G_3(m_2)$, etc.

Goodstein's theorem

Goodstein's theorem states that every Goodstein sequence eventually reaches zero. Despite the deceivingly explosive initial increase (the next value for $m = 264$ above is $m_4 \approx 2.5 \cdot 10^{10921}$), the surprising truth is that for all initial values of $m$, $m_k = 0$ for some finite $k$.

This counterintuitive theorem is actually fairly easy to prove. We simply find an upper bound on the base $b$ used in the hereditary base-$b$ representations in the sequence. Finding the upper bound is actually the easiest part:

b < \infty

"But wait," you say, "that's obvious and unhelpful!" Well, by using a particular kind of infinity, this upper bound becomes useful. Precisely, we use $\omega$, the first transfinite ordinal number. For $m = 264$ as before, the bounds look like this:

m_1 & = & 2^{2^{2 + 1}} + 2^{2 + 1} & \le & \omega^{\omega^{\omega + 1}} + \omega^{\omega + 1} \\
m_2 & = & 3^{3^{3 + 1}} + 2 \cdot 3^3 + 2 \cdot 3^2 + 2 \cdot 3 + 2 & \le & \omega^{\omega^{\omega + 1}} + \omega^\omega \cdot 2 + \omega^2 \cdot 2 + \omega \cdot 2 + 2 \\
m_3 & = & 4^{4^{4 + 1}} + 2 \cdot 4^4 + 2 \cdot 4^2 + 2 \cdot 4 + 1 & \le & \omega^{\omega^{\omega + 1}} + \omega^\omega \cdot 2 + \omega^2 \cdot 2 + \omega \cdot 2 + 1 \\
m_4 & = & 5^{5^{5 + 1}} + 2 \cdot 5^5 + 2 \cdot 5^2 + 2 \cdot 5 & \le & \omega^{\omega^{\omega + 1}} + \omega^\omega \cdot 2 + \omega^2 \cdot 2 + \omega \cdot 2 \\
m_5 & = & 6^{6^{6 + 1}} + 2 \cdot 6^6 + 2 \cdot 6^2 + 6 + 5 & \le & \omega^{\omega^{\omega + 1}} + \omega^\omega \cdot 2 + \omega^2 \cdot 2 + \omega + 5 \\
m_6 & = & 7^{7^{7 + 1}} + 2 \cdot 7^7 + 2 \cdot 7^2 + 7 + 4 & \le & \omega^{\omega^{\omega + 1}} + \omega^\omega \cdot 2 + \omega^2 \cdot 2 + \omega + 4 \\

These upper bounds are well-defined using ordinal arithmetic. They are also strictly decreasing, and since the ordinals are well-ordered, every such strictly-decreasing sequence must reach zero in a finite number of steps. Note that we write $\omega \cdot 2$ rather than $2 \cdot \omega$ above because ordinal arithmetic is not commutative; in fact, $2 \cdot \omega = \omega < \omega \cdot 2$.


Despite being a theorem about natural numbers, Goodstein's theorem is independent of the Peano axioms. It's a natural, concrete example of Gödel's incompleteness theorem for those axioms.

Kirby and Paris proved its independence by essentially showing that any proof of Goodstein's theorem requires induction over sets too large for Peano's axioms to handle. They showed that Goodstein's theorem implies Gentzen's theorem.

The Goodstein function

How long are Goodstein sequences? It turns out they are extremely long. Let the Goodstein function $\mathcal{G}(m)$ be the length of the Goodstein sequence starting at $m_1 = m$.

$\mathcal{G}(12)$ is larger than Graham's number. $\mathcal{G}(m)$ grows faster than the Ackermann function $A(m, m)$. $\mathcal{G}(m)$ grows faster than the iterated Ackermann function $A^{A(m, m)}(m, m) = A(A(\cdots), A(\cdots))$. In fact, $\mathcal{G}(m)$ grows faster than any function that can be proven to be total in Peano arithmetic. That means $\mathcal{G}(m)$ grows faster than any function that can be shown to even exist by induction over the natural numbers.



Luqman 2016-03-17

This is good explanation of ordinals: http://www.sjsu.edu/faculty/watkins/ordinals.htm

Java Generics Quirks

2013-07-18 Tavian Barnes

Can you guess which of these statements are valid Java 7? I know I can't! :)

Hint: Eclipse, javac, and the JLS disagree on these, so test-compiling won't help you.

abstract class ListOfListOf<T> implements List<List<T>> {

class Quirks {
    static void quirks(List<List<Number>> list) {
        // Easy one to warm up with
        List<List<? extends Number>> warmUp = list;

        // These casts are type-safe
        ListOfListOf<Number> normalCast = (ListOfListOf<Number>) list;
        ListOfListOf<?> wildcardCast = (ListOfListOf<?>) list;

        // So are these ones
        List<? extends List<? extends Number>> wider = list;
        ListOfListOf<?> narrowingCast = (ListOfListOf<?>) wider;


  • List<List<? extends Number>> does not capture a List<List<Number>>. In fact, it does not capture at all, so the assignment warmUp = list is invalid.
  • ListOfListOf<Number> normalCast = (ListOfListOf<Number>) list;
    Eclipse says yes, javac says yes.
  • ListOfListOf<?> wildcardCast = (ListOfListOf<?>) list;
    Eclipse says "no way man!" javac says "sure."
  • List<? extends List<? extends Number>> wider = list;
    Just a normal widening conversion. Eclipse and javac say "whatever, man."
  • ListOfListOf<?> narrowingCast = (ListOfListOf<?>) wider;
    Eclipse says "no problem," javac says "I'm sorry, Dave. I'm afraid I can't do that."

Ready to go again?

class CanYouInferT<T extends Comparable<? super T>> {

class Quirks {
    static <T extends Comparable<? super T>> void canYouInferT() {

    static void quirks() {
        // Note that Object is not Comparable, so what is T?
        CanYouInferT<?> canYouInferT = new CanYouInferT<>();

Answers: Eclipse is perfectly happy to infer, um, something for T in both cases. javac chokes on new CanYouInferT<>(), but somehow still manages to infer something for the static call of canYouInferT().

Does anyone know what the spec says about these corner cases?

Fair and Square, or How to Count to a Googol

2013-04-15 Tavian Barnes Comments Reddit

Fair and Square is a problem from the qualification round of Google Code Jam 2013. The gist of the problem is to find out how many integers in a given range are both a palindrome, and the square of a palindrome. Such numbers are called "fair and square." A number is a palindrome iff its value is the same when written forwards or backwards, in base 10.

The small input has very modest bounds on size: the interval $[A, B]$ is bounded by $1 \le A \le B \le 1000$. It's perfectly reasonable to check for each integer in that range whether it is a palindrome, and whether it is a perfect square of a palindrome. This has complexity $10^3 \approx 2^{10}$. But we know that harder inputs are coming, so lets be a little smart.

Rather than check whether each number is a perfect square, we can enumerate all the perfect squares in $[A, B]$ by squaring each integer in $\left[ \lceil\sqrt{A}\rceil, \lfloor\sqrt{B}\rfloor \right] $. This essentially reduces the size of our search space to its square root, or $\sqrt{10^3} \approx 2^5$. Here is an implementation of this algorithm in Python:

#!/usr/bin/env python3

def isqrt(n):
  """Returns floor(sqrt(n))."""
  if n == 0:
    return 0

  lg = (n.bit_length() + 1)//2

  x = 1 << lg
  while True:
    r = (x + n//x)//2
    if y >= x:
    x = y

  return x

def is_palindrome(n):
  """Return whether n is a base-10 palindrome."""
  sn = str(n)
  # sn[::-1] is tricky slice syntax that reverses sn
  return sn == sn[::-1]

def palindromes(m, n):
  """Generates all the palindromic integers in range(m, n)."""
  for i in range(m, n):
    if is_palindrome(i):
      yield i

# T is the number of test cases
T = int(input())

for i in range(T):
  line = input().split(' ')
  # A and B are the bounds
  A = int(line[0])
  B = int(line[1])

  # m and n are the square roots of the bounds
  m = isqrt(A - 1) + 1  # ceil(sqrt(A))
  n = isqrt(B)          # floot(sqrt(B))

  count = 0
  for j in palindromes(m, n + 1):
    if is_palindrome(j*j):
      count += 1

  print('Case #%d: %d' % (i + 1, count))

The definition of isqrt() is based on this excellent StackOverflow answer. I encourage you to work out for yourself why it necessarily converges to $\lfloor\sqrt{n}\rfloor$.

But can we go faster?

Of course we can! The first "large" input for this problem has the much weaker bound $1 \le A \le B \le 10^{14}$ on the size of the interval. Our current algorithm has complexity $\sqrt{10^{14}} = 10^7 \approx 2^{23}$ on this input. But we're still wasting a lot of time iterating over integers that are not palindromes. It's easy to generate only the palindromes in an interval by incrementing only the left "half" of a number, and "mirroring" it to get the full palindrome. For example, from 10, 11, 12, ..., we can generate the palindromes 101, 111, 121, ..., as well as 1001, 1111, 1221, etc. Since we're now enumerating numbers with half the length, our search space is reduced to its square root again: $\sqrt{10^7} \approx 2^{12}$.

Here's a new implementation of palindromes(m, n) that uses this algorithm:

def palindromes(m, n):
  sm = str(m)

  q, r = divmod(len(sm), 2)
  # Note here that r = len(sm)%2.  If r is 0, we duplicate the middle digit and
  # generate palindromes like 1221.  If r is 1, we don't duplicate the middle
  # digit, instead generating palindromes like 12321.

  # lh is the "left half" of the palindrome we are generating
  lh = int(sm[:(q + r)])

  while True:
    slh = str(lh)

    # Check for rollover (99 becoming 100, for example)
    if len(slh) != q + r:
      if r == 0:
        # We go from generating numbers like 9999 to 10001, i.e. with an odd
        # length
        r = 1
        # We go from generating numbers like 99999 to 100001, i.e. with an even
        # length
        q, r = q + 1, 0
        # We don't want lh to increase in length yet
        lh = lh//10
        slh = slh[:-1]

    # wh is the "whole" palindrome, made by mirroring lh
    if r == 0:
      # slh[::-1] is the same tricky slice syntax for reversing a string
      wh = int(slh + slh[::-1])
      # More tricky slice syntax: slh[-2::-1] reverses slh, except for the last
      # character
      wh = int(slh + slh[-2::-1])

    if wh >= n:
      # We hit the upper bound
    elif wh >= m:
      yield w

    # Increment the left half and go again
    lh += 1

How about even faster?

The next large input has a gargantuan input size restriction of $1 \le A \le B \le 10^{100}$. That's right, the size of the interval is bounded by a googol.

Our tricks so far have only decreased the size of the search space to its 4th root. But $\sqrt[4]{10^{100}} = 10^{25} \approx 2^{83}$ is clearly still infeasible. We need to shrink our search space even more, so lets look at the result space. The square roots of the "fair and square" numbers look like this: 1, 2, 3, 11, 22, 101, 111, 121, 202, 212, 1001, 1111, 2002, etc.

Notice how all the numbers are composed of only the digits 0, 1, and 2 (except for the number 3 itself). If this result always holds, then we've reduced our search space to $3^{25} \approx 2^40$, which is just barely feasible.

Can we prove it?

Consider the square of a palindrome $n$ whose digits are $a b c \cdots c b a$:

\begin{array}{cccccccccccccc} {} & {} & {} & {} & {} & {} & {} & a & b & c & \cdots & c & b & a \\
\times & {} & {} & {} & {} & {} & {} & a & b & c & \cdots & c & b & a \\
\hline \\[-2.0ex]
{} & {} & {} & {} & {} & {} & {} & \boldsymbol{a^2} & a \cdot b & a \cdot c & \cdots & a \cdot c & a \cdot b & a^2 \\
{} & {} & {} & {} & {} & {} & b \cdot a & \boldsymbol{b^2} & b \cdot c & \cdots & b \cdot c & b^2 & b \cdot a & {} \\
{} & {} & {} & {} & {} & c \cdot a & c \cdot b & \boldsymbol{c^2} & \cdots & c^2 & c \cdot b & c \cdot a & {} & {} \\
{} & {} & {} & {} & ⋰ & ⋰ & ⋰ & \vdots & ⋰ & ⋰ & ⋰ & {} & {} & {} \\
{} & {} & {} & c \cdot a & c \cdot b & c^2 & \cdots & \boldsymbol{c^2} & c \cdot b & c \cdot a & {} & {} & {} & {} \\
{} & {} & b \cdot a & b^2 & b \cdot c & \cdots & b \cdot c & \boldsymbol{b^2} & b \cdot a & {} & {} & {} & {} & {} \\
+ & a^2 & a \cdot b & a \cdot c & \cdots & a \cdot c & a \cdot b & \boldsymbol{a^2} & {} & {} & {} & {} & {} & {} \\

Imagine if a carry $\alpha$ occurred while calculating the bolded middle column. In order for the result to be a palindrome, an equivalent (mod 10) carry $\beta$ must have occurred in the opposite column. Since that column has an identical copy on the left, that column must also produce a carry $\beta$. If we continue to carry out this logic, we are eventually forced to have a carry $\omega$ over the rightmost column, which is impossible, or into a new column on the left, which would make the result a non-palindrome.

\begin{array}{cccccccccccccc} {} & {} & {} & {} & {} & {} & {} & a & b & c & \cdots & c & b & a \\
\times & {} & {} & {} & {} & {} & {} & a & b & c & \cdots & c & b & a \\
\hline \\[-2.0ex]
\boldsymbol{\omega?} & {} & \cdots & {} & \boldsymbol\beta & {} & \boldsymbol\alpha & {} & \boldsymbol\beta & {} & \cdots & {} & {} & \boldsymbol{\omega?} \\
{} & {} & {} & {} & {} & {} & {} & a^2 & a \cdot b & a \cdot c & \cdots & a \cdot c & a \cdot b & a^2 \\
{} & {} & {} & {} & {} & {} & b \cdot a & b^2 & b \cdot c & \cdots & b \cdot c & b^2 & b \cdot a & {} \\
{} & {} & {} & {} & {} & c \cdot a & c \cdot b & c^2 & \cdots & c^2 & c \cdot b & c \cdot a & {} & {} \\
{} & {} & {} & {} & ⋰ & ⋰ & ⋰ & \vdots & ⋰ & ⋰ & ⋰ & {} & {} & {} \\
{} & {} & {} & c \cdot a & c \cdot b & c^2 & \cdots & c^2 & c \cdot b & c \cdot a & {} & {} & {} & {} \\
{} & {} & b \cdot a & b^2 & b \cdot c & \cdots & b \cdot c & b^2 & b \cdot a & {} & {} & {} & {} & {} \\
+ & a^2 & a \cdot b & a \cdot c & \cdots & a \cdot c & a \cdot b & a^2 & {} & {} & {} & {} & {} & {} \\

Thus, we know that the sum $a^2 + b^2 + c^2 + \cdots + c^2 + b^2 + a^2$ must not produce a carry, so it must be less than 10. Since $1^2 + 3^2 = 10$, it is impossible for n to contain any digits greater than 2, except in the case that $n = 3$.

We could stop now, implement the $3^{25} \approx 2^40$ complexity search, maybe multi-thread it, and be done. But we can do better!

A stronger result

Now that we know $a^2 + b^2 + c^2 + \cdots + c^2 + b^2 + a^2 < 10$ is a necessary condition for $n^2$ to be fair and square, is it also sufficient? Yes! Since that sum has more terms in it than any other column, and the pairwise products $a \cdot b$, $a \cdot c$, $b \cdot c$, etc. are bounded by at least one of $a^2$, $b^2$, $c^2$, etc., that column must have the largest value. Feel free to prove this rigorously if my hand-waving doesn't convince you :)

Enumerating all palindromes which avoid a carry in that column will thus give us exactly the set of fair and square numbers. These numbers must obey one of the following rules. Let $n$ be the square root of a fair and square number, and $l$ be its size in digits.

  • If $l$ is even,
    • $n$ starts with a 1, contains at most eight 1s, and all other digits are 0
    • OR, $n$ starts and ends with a 2, and contains no other digits except 0s
  • If $l$ is odd,
    • $n$ starts with a 1, contains at most nine 1s, and all other digits are 0
    • OR, $n$ starts with a 1, has a 2 as its middle digit, contains at most four 1s, and all other digits are 0
    • OR, $n$ starts and ends with a 2, and has either a 0 or a 1 as its middle digit, with all other digits being 0s

As we see in the next section, there are $O(l^3)$ fair and square numbers of length $l$, so it is possible to compute and store all fair and square numbers $\le 10^{100}$. It's then straightforward to count the number of them which lie in a given interval.

#!/usr/bin/env python3

import itertools

def fair_square_roots(n):
  """Generates all candidate square roots of length <= 2*n + 1."""
  yield 1
  yield 2
  yield 3

  for h in range(n):
    # There are between one and five 1s on the left half of the palindrome
    for n_ones in range(1, 5):
      # Get all possible locations for the 1s
      for ones in itertools.combinations(range(h), n_ones - 1):
        # Put a 1 at each chosen location, and a 0 everywhere else
        a = ['0'] * h
        for i in ones:
          a[i] = '1'

        s = '1' + ''.join(a)
        rs = s[::-1]

        # Generate some candidates
        yield int(s + rs)
        yield int(s + '0' + rs)
        yield int(s + '1' + rs)

        # If we have two or fewer 1s, we can afford a 2 as the middle digit
        if n_ones <= 2:
          yield int(s + '2' + rs)

    # The cases starting with 2 are simpler
    s = '2' + '0'*h
    rs = s[::-1]
    yield int(s + rs)
    yield int(s + '0' + rs)
    yield int(s + '1' + rs)

square = lambda x: x*x

# We want all fair and square numbers with length <= 101, so we get the square
# roots bounded by length <= 2*25 + 1 = 51
fair_squares = list(map(square, fair_square_roots(25)))

T = int(input())

for i in range(1, T + 1):
  line = input().split(' ')
  A = int(line[0])
  B = int(line[1])

  count = 0
  for fs in fair_squares:
    if A <= fs <= B:
      count += 1
  print('Case #%d: %d' % (i, count))

A non-enumerative approach

Here is an algorithm that computes the number of candidates directly, without enumerating them:

def n_fair_and_square(l):
  Returns the number of fair and square numbers with a square root of length l.
  if l == 1:
    # 1, 4, and 9
    return 3

  # h is the number of "free" digits we have -- half the number of digits, not
  # counting the first digit
  h = l//2 - 1

  # First consider the number of values starting with 1

  # There are 1 + h + (h choose 2) + (h choose 3) ways to place at most 3 1s in
  # positions 1..h
  n = 1 + (5*h + h**3)//6

  if l%2 == 1:
    # We can set the middle digit to 0 or 1 as well
    n *= 2

    # We can set the middle digit to 2 also, but then we're limited to at most
    # one 1 in positions 1..h
    n += 1 + h

    # There are two solutions starting with 2: 200...000...002, and
    # 200...010...002
    n += 2
    # There is one solution starting with 2: 2000...0002
    n += 1

  return n

Unfortunately, that's doesn't directly help us solve the challenge, because we don't just care about the number of fair and square numbers of a certain length -- they also have to be within some exact numerical bounds. However, I feel like this function could be modified in some way to account for those bounds.


Zoli 2013-05-06

Really nice analysis of the problem! Unfortunately during the competition I did not realize the part that the square root can only contain 0s, 1s, 2s or 3, so my algorithm was not fast enough for the large dataset.

Since you seem to understand algorithm problems quite well, could you please give me some pointers on how I should get better at these kind of problems? Should I read lots of algorithm books or compete on TopCoder or other competitions, or what do you suggest? How long did it take for you to analyze this problem quite thoroughly?

Tavian Barnes 2013-05-30

It took me the day to get that problem done (and it was the only one I did), and about another to write this up.

There's certainly a lot of benefit to practice, and I'd definitely recommend trying out TopCoder / Project Euler / old Code Jam problems to get more acquainted with these kinds of problems. But mostly I'd recommend just reading about lots of algorithms in your spare time, so you start to see the patterns and tricks used to make great algorithms.

Iterating Over Binary Trees

2012-04-19 Tavian Barnes

Binary trees are great. If you ever have to implement one yourself though, you're probably either using C or you need to look at the documentation for your language's standard library more closely. Even POSIX C has the tsearch family of functions from <search.h>.

Recently I found myself attempting to implement a bootstrapping compiler in a rather restrictive subset of Java, and found the need for some kind of associative array data structure. I chose a splay tree, because, well, I like them.

Soon I found that I wanted the ability to iterate over the entire tree, in sorted order. This ability is provided by C++'s std::map::iterators, for example. I considered using a threaded tree, then an iterator holding a stack, before I realised that there is a much simpler solution with equal performance. It requires only that the nodes store a reference to their parent node, which mine could with little extra effort.

The principle is simple. If a node has a right subtree, its successor is the minimal (leftmost) node in that subtree. Otherwise, its successor is the parent of its first ancestor which is not a right child. In code,

public class Node {
    private Node left, right;

    public Node next() {
        Node next;

        if (right == null) {
            // Node has no right child
            next = this;
            while (next.parent != null && next == next.parent.right) {
                next = next.parent
            next = next.parent;
        } else {
            // Find the leftmost node in the right subtree
            next = right;
            while (next.left != null) {
                next = next.left;

        return next;

But how fast is this? An individual call to node.next() could traverse from the bottom to the top of the tree, at a cost of $O(h)$, where $h$ is the height of the tree. If there are $n$ nodes in the tree, could we traverse $O(n h)$ edges to visit the whole tree? This would be particularly bad for a splay tree, which is not necessarily balanced at any given time.

It turns out that we need only traverse $O(n)$ edges to visit the entire tree. For an informal proof, note that our algorithm is basically equivalent to tracing an outline of the tree like this:

Clearly we can only trace around any edge at most twice (on its left and right side). Since any tree has $n - 1$ edges, we traverse at most $2 (n - 1) \in O(n)$ edges in the process of exploring the whole tree.


2011-12-06 Tavian Barnes

This animation is of one of Dimension's test scenes, physically integrated with collision detection and gravity. Each collision is perfectly elastic. The 1000-sphere scene was integrated and collision-detected 100 times per frame. The integration and rendering took 30 minutes on 12 cores, or 3 hours of CPU time. You should watch in HD.

Collisions were detected using a naïve $O(n^2)$ algorithm, but rendering time (23 minutes wall-clock, 173 minutes CPU time) still dwarfed integration time (7 minutes).

Ray / Priority R-Tree Intersection

2011-08-06 Tavian Barnes

These animations, rendered with Dimension, show the ray / bounding box intersections that occur when rendering a single pixel. The first video shows the regular search, and the second shows the faster case of a cache hit, as described in the previous post. Each ray/box intersection takes up one second in these videos, which aptly illustrates the difference between the cache miss and cache hit cases. For this pixel, the improvement is more than a factor of two.

Query without caching:

Query with cache hit:

Priority R-Trees

2011-07-31 Tavian Barnes

PR-trees, as used by Dimension, provide a bounding-volume hierarchy with the unique property of bounded query complexity. Combined with intersection caching, they work as a very efficient BVH, comparable to POV-Ray's performance. This is a rendering (using Dimension) of the PR-tree generated for one of its test scenes: (you may want to watch in HD)

Dimension uses a $O(n \log n)$ bulk-loading algorithm to construct the PR-tree. Because intersection (ray-object test) and containment (point-inside-object test) queries are pre-order traversals, it can then flatten the tree into an array of nodes. Each node stores a bounding box, a distance to skip if the bounding box intersection fails, and an optional pointer to an object.

Intersection queries against PR-trees have a complexity that grows with the number of intersecting objects. In fact, reminiscent of kD-trees, the complexity is $O(n^{2/3} + t)$ in three dimensions, for a tree with n objects and a query that intersects t objects. Thus, the non-intersecting case is already fast. To optimize the intersecting case, note that a ray is very likely to intersect the same object that it did in the previous pixel. Caching that object, and testing the cached object before the full search of the tree, allows everything behind that object to be ignored in the case of a cache hit.

Dimension's C PR-tree implementation can be seen here.

Fast, Branchless Ray/Bounding Box Intersections

2011-05-02 Tavian Barnes Comments

(Update: part 2)

Axis-aligned bounding boxes (AABBs) are universally used to bound finite objects in ray-tracing. Ray/AABB intersections are usually faster to calculate than exact ray/object intersections, and allow the construction of bounding volume hierarchies (BVHs) which reduce the number of objects that need to be considered for each ray. (More on BVHs in a later post.) This means that a ray-tracer spends a lot of its time calculating ray/AABB intersections, and therefore this code ought to be highly optimised.

The fastest method for performing ray/AABB intersections is the slab method. The idea is to treat the box as the space inside of three pairs of parallel planes. The ray is clipped by each pair of parallel planes, and if any portion of the ray remains, it intersected the box.

A simple implementation of this algorithm might look like this (in two dimensions for brevity):

bool intersection(box b, ray r) {
    double tmin = -INFINITY, tmax = INFINITY;

    if (ray.n.x != 0.0) {
        double tx1 = (b.min.x - r.x0.x)/r.n.x;
        double tx2 = (b.max.x - r.x0.x)/r.n.x;

        tmin = max(tmin, min(tx1, tx2));
        tmax = min(tmax, max(tx1, tx2));

    if (ray.n.y != 0.0) {
        double ty1 = (b.min.y - r.x0.y)/r.n.y;
        double ty2 = (b.max.y - r.x0.y)/r.n.y;

        tmin = max(tmin, min(ty1, ty2));
        tmax = min(tmax, max(ty1, ty2));

    return tmax >= tmin;

However, those divisions take quite a bit of time. Since when ray-tracing, the same ray is tested against many AABBs, it makes sense to pre-calculate the inverses of the direction components of the ray. If we can rely on the IEEE 754 floating-point properties, this also implicitly handles the edge case where a component of the direction is zero - the tx1 and tx2 values (for example) will be infinities of opposite sign if the ray is within the slabs, thus leaving tmin and tmax unchanged. If the ray is outside the slabs, tx1 and tx2 will be infinities with the same sign, thus making tmin == +inf or tmax == -inf, and causing the test to fail.

The final implementation would look like this:

bool intersection(box b, ray r) {
    double tx1 = (b.min.x - r.x0.x)*r.n_inv.x;
    double tx2 = (b.max.x - r.x0.x)*r.n_inv.x;

    double tmin = min(tx1, tx2);
    double tmax = max(tx1, tx2);

    double ty1 = (b.min.y - r.x0.y)*r.n_inv.y;
    double ty2 = (b.max.y - r.x0.y)*r.n_inv.y;

    tmin = max(tmin, min(ty1, ty2));
    tmax = min(tmax, max(ty1, ty2));

    return tmax >= tmin;

Since modern floating-point instruction sets can compute min and max without branches, this gives a ray/AABB intersection test with no branches or divisions.

My implementation of this in my ray-tracer Dimension can be seen here.


Phil 2012-04-01

This doesn't seem to work out for me for negative direction vectors. Looking forward, I can see a box that's there, but looking backwards, I again see the same box mirrored, even though in this direction it's not "there". Following the algorithm step by step manually for both a positive and a negative z-dir (with x-dir and y-dir set to 0) gives the same near and far planes in both directions:

box = MIN 0,0,0  MAX 256,256,256
ray POS 128,128,-512
case 1: ray DIR 0,0,0.9 -- inverse: (inf,inf,1.1111)
case 2: ray DIR 0,0,-0.9 -- inverse: (inf,inf,-1.1111)
picker.tx1, picker.tx2 = (me.Min.X - ray.Pos.X) * ray.invDir.X, (me.Max.X - ray.Pos.X) * ray.invDir.X	//	-inf,inf			-inf,inf
picker.ty1, picker.ty2 = (me.Min.Y - ray.Pos.Y) * ray.invDir.Y, (me.Max.Y - ray.Pos.Y) * ray.invDir.Y	//	-inf,inf			-inf,inf
picker.tz1, picker.tz2 = (me.Min.Z - ray.Pos.Z) * ray.invDir.Z, (me.Max.Z - ray.Pos.Z) * ray.invDir.Z	//	-142.22,142.22		142.22,-142.22
picker.txn, picker.txf = math.Min(picker.tx1, picker.tx2), math.Max(picker.tx1, picker.tx2)		//	-inf,inf			-inf,inf
picker.tyn, picker.tyf = math.Min(picker.ty1, picker.ty2), math.Max(picker.ty1, picker.ty2)		//	-inf,inf			-inf,inf
picker.tzn, picker.tzf = math.Min(picker.tz1, picker.tz2), math.Max(picker.tz1, picker.tz2)		//	-142.22,142.22		-142.22,142.22
picker.tnear = math.Max(picker.txn, math.Max(picker.tyn, picker.tzn))					//	-142.22				-142.22
picker.tfar = math.Min(picker.txf, math.Min(picker.tyf, picker.tzf))					//	142.22				142.22
if picker.tfar < picker.tnear {
	return true

Tavian Barnes 2012-04-03

Right, because the test is only for whether the line intersects the box at all. The line extends both forwards and backwards. Just add a tmax >= 0 check. It's tmax, not tmin, since tmin will be < 0 if the ray originates inside the box.

Sergiy 2012-12-30

Thanks. Works nicely and fast. I updated it a bit, to use SSE (though Vectormath), floats only.


Tavian Barnes 2013-01-04

You're welcome! I can't see that gist though (says "OAuth failure"). What kind of performance did the vectorisation give you?

Bram Stolk 2014-12-29

To get the actual intersection, would I just use tmin, and multiply it with ray dir, adding ray origin? And what if I'm just interested in which face was intersected? x+,x-,y+,y-,z+ or z-?

Tavian Barnes 2015-01-11

Yes, that's what I'd do. Except if the ray origin is inside the box (tmin < 0), you need to use tmax instead.

To see what face was intersected, there's a few different ways. You can keep track of which slab is intersecting in the above algorithm at all times, but that slows it down.

For a cube centered at the origin, a neat trick is to take the component of the intersection point with the largest absolute value.

Bram Stolk 2015-01-11

Thanks Tavianator,

Yeah, the cube-trick requires sorting.
So for now I test proximity to face within an epsilon.
If close enough to face, I assume that face was hit.
I can live with the few false positives, as I shoot over 100M photons each frame anyway.

Bram Stolk 2015-01-11

Oops... that should be 100K photons of course.

Tavian Barnes 2015-01-12

Haha I was really impressed for a second :)

The cube trick does not require sorting, just selecting the max from three candidates.

Francisco 2015-03-12

So if I return a vec3 with tmin, tmax and a float hit = step(tmin,tmax)*step(0,tmax), I basically know that if hit > 0 then ro+rd*tmin is my lower bound intersection point (entry point) and ro+rd*tmax is my higher bound intersection point (exit point), right? However, if tmin < 0, I'm inside the bounding box, which means I don't need the entry point and I can just use the ray origin as my starting point.

Tavian Barnes 2015-03-12

Sorry, not sure what you mean by step(). You can get the starting point as ro + td*max(tmin, 0.0).

Francisco 2015-04-01

step(edge,A) returns 1 if A >= edge, 0 elsewise. So I know if it was a hit if tmax >= tmin AND tmax >= 0. Then we also know tmin is the closest intersection and tmax is the furthest.

I implemented this in C++ and it works perfectly for any ray direction. However, in GLSL it seems to have problems with negative directions.

Jon olick 2015-01-22

Something to consider here is that 0 * inf =nan which occurs when the ray starts exactly on the edge of a box

Tavian Barnes 2015-01-31

True, if you want consistent handling of that case while staying branch-free, you have to do a little more work. This is worth another post actually, I'll write one up.

Josh 2017-08-14

Did you do another post?

Tavian Barnes 2017-08-16

Yep: https://tavianator.com/2015/03/fast-branchless-raybounding-box-intersections-part-2-nans/

Chris 2015-03-08

Thanks for posting this (so long ago)! I used this in my own code, and wondered if it is possible to save 2 of the subtractions by taking advantage of the fact that min(x+a, y+a) = min(x,y)+a (for some constant a):

r.offset = (r.n_inv.y * r.x0.y) - (r.n_inv.x * r.x0.x);

intersection(box b, ray r)
  double tx1 = b.min.x * r.n_inv.x;
  double tx2 = b.max.x * r.n_inv.x;

  double tmin = min(tx1, tx2) + r.offset;
  double tmax = max(tx1, tx2) + r.offset;

  double ty1 = b.min.y * r.n_inv.y;
  double ty2 = b.max.y * r.n_inv.y;

  tmin = max(tmin, min(ty1, ty2));
  tmax = min(tmax, max(ty1, ty2));

  return tmax >= tmin;

Tavian Barnes 2015-03-11

I think that should work, but there's a couple bugs in your example code. And actually for the test as written you can omit r.offset entirely since it doesn't affect the tmax >= tmin check at the end. But since you probably want a tmax >= 0.0 check too, this should work:

r.offset = (r.n_inv.y * r.x0.y) + (r.n_inv.x * r.x0.x);

intersection(box b, ray r)
  double tx1 = b.min.x * r.n_inv.x;
  double tx2 = b.max.x * r.n_inv.x;

  double tmin = min(tx1, tx2);
  double tmax = max(tx1, tx2);

  double ty1 = b.min.y * r.n_inv.y;
  double ty2 = b.max.y * r.n_inv.y;

  tmin = max(tmin, min(ty1, ty2));
  tmax = min(tmax, max(ty1, ty2));

  return tmax >= max(tmin, r.offset);

I'll try it out and see how much faster it is, thanks!

Mario 2015-05-16

What if I want to know the 't' of the intersection?

Tavian Barnes 2015-05-16

$t = t_{\min}$, unless $t_{\min} < 0$, in which case you're inside the box and $t = t_{\max}$.

Cody Bloemhard 2015-05-29

Very useful! faster the any method i tried. not because of the !(use of divisions), but because the implementation of the slab method is far simpler than others do. They use things(slow things) like square roots, DOT products and lots of checks with points. now my raycasting method has the speed i wanted. Thanks for that.

Ciyo 2015-12-01

Hello, many thanks for this useful information!

I have a question about the surface normal. Is there an easy way to get the normal of the intersection point?

Thank you!

Phlimy 2016-09-06

I would really like to know too!

Diego Sinay 2016-05-12

Hi, great read!
One quick question.
I had no problem using the first implementation for my ray-tracing algorithm, but can't implement that faster version since I can't get the inverse of the direction vector(doesn't it have to be squared?).
Any help is appreciated, Thanks!

Tavian Barnes 2016-05-16

You can just compute (1/x, 1/y, 1/z) as the inverse. You don't have to square it.

SW 2018-02-27

I know this is from a couple of years but ago now, but I’m really confused by this.

You say you can calculate the inverse of a ray with (1/x, 1/y, 1/z) but I don’t understand that.

Suppose I had a ray going straight in the x direction (1,0,0). Wouldn’t this give:

(1/1,1/0,1/0) = (1,0,0)

Which is surely the same ray?

Sorry if this is a stupid question!

Tavian Barnes 2018-02-27

1/0 is not zero. In IEEE 754 arithmetic, it is +inf.

SW 2018-02-27

So the inverse of a ray going positive X direction is (1, +inf, +inf)?
I'm having a hard time understanding that! I'm trying to find some sources to explain this but comin up short.
I would have though the inverse ray would be (x*-1, y*-1, z*-1), making an inverse ray of positive X as (-1, 0, 0). Then it's a ray going in the opposite direction.
By the way, thank you for answering, I realise I'm probably asking really stupid questions.

Tavian Barnes 2018-02-27

The "inverse" is not a ray at all. It's just three numbers that are the (multiplicative) inverses of the direction components.

Phil 2016-06-23

Hey there,

considering I have box with min(1,1,1) and max(2,2,2) and a ray with origin(3,3,3) and direction (2,2,2). Oh. Wait a second. I think I just answered my question. It's ray direction, not target position.

If I want my ray to go from (3,3,3) towards the direction of point (2,2,2) I need a direction like (-1,-1,-1), correct? Oh my, this made me struggle way longer than it should have.

Nevermind me. Thank you for the write-up in both articles. :)

Dave 2016-09-27

Does anyone know why this doesn't seem to work in GLSL for all angles?

INFINITY is defined as const float INFINITY = 1.0 / 0.0;

float rayAABBIntersection (Ray ray, AABB aabb)
	float tx1 = (aabb.min.x - ray.origin.x)*ray.inverseDirection.x;
	float tx2 = (aabb.max.x - ray.origin.x)*ray.inverseDirection.x;

	float tmin = min(tx1, tx2);
	float tmax = max(tx1, tx2);

	float ty1 = (aabb.min.y - ray.origin.y)*ray.inverseDirection.y;
	float ty2 = (aabb.max.y - ray.origin.y)*ray.inverseDirection.y;

	tmin = max(tmin, min(ty1, ty2));
	tmax = min(tmax, max(ty1, ty2));

	float tz1 = (aabb.min.z - ray.origin.z)*ray.inverseDirection.z;
	float tz2 = (aabb.max.z - ray.origin.z)*ray.inverseDirection.z;

	tmin = max(tmin, min(tz1, tz2));
	tmax = min(tmax, max(tz1, tz2));

	if (tmin > tmax)
		return INFINITY;

	if (tmin < 0)
		return tmax;

	return tmin;

Tavian Barnes 2016-09-29

What do you mean "doesn't work"? What happens? Do you have a numerical example that gets evaluated wrong?

skewed 2018-01-31

INFINITY is defined as const float INFINITY = 1.0 / 0.0;

You can't divide by zero like this. The result is undefined in glsl, which means the implementation is free to do whatever nonsense it wants to when it encounters this statement (except - according to the spec - crash).

Ali 2017-02-01

How to find the intersection points of a Cartesian grid and a boundary (e.g. a circle, an ellipse, an arbitrary shape)? I am interested in the intersection of the boundary with the cartesian grid. Any suggestion/algorithm, kindly email me.

Mark R 2017-05-16


Having read both this and part 2, I'm unable to work out what changes I need to make in order to reliably detect intersections when a ray is exactly on the edge of an AABB. I have your implementation here with a simple failing test case:


I've stepped through the code and tried making the changes you suggested in the other article (exchanging Math.max and Math.min for implementations that have the semantics of maxNumber and minNumber) but I can't seem to get it to work.

Maybe I've misunderstood the intent of the other article: What changes do I need to make to reliably catch ray/edge intersections? I care less about efficiency and more about avoiding false negatives.

Tavian Barnes 2017-05-17

If you're not concerned with efficiency too much, just take the "naïve" code from part 2 and swap <=/>= with </>:

bool intersection(box b, ray r) {
    double tmin = -INFINITY, tmax = INFINITY;

    for (int i = 0; i < 3; ++i) {
        if (ray.dir[i] != 0.0) {
            double t1 = (b.min[i] - r.origin[i])/r.dir[i];
            double t2 = (b.max[i] - r.origin[i])/r.dir[i];

            tmin = max(tmin, min(t1, t2));
            tmax = min(tmax, max(t1, t2));
        } else if (ray.origin[i] < b.min[i] || ray.origin[i] > b.max[i]) {
            return false;

    return tmax >= tmin && tmax >= 0.0;

Mark R 2017-05-17

Ah, OK, thanks!

Amomum 2017-06-09

Can you please explain what is this 'double t' parameter is?

static inline bool dmnsn_ray_box_intersection(dmnsn_optimized_ray optray, dmnsn_aabb box, double t)

It hurts me to say this but I believe that some of the functions do require comments about their input parameters. Or more obvious names.

Tavian Barnes 2017-06-13

A common way of defining lines (see dmnsn_ray) is by the parametric formula origin + t*direction. Commonly we restrict t >= 0 to get a half-line starting from the given origin. So generally t specifies a position along a line. In this case, t is the closest object intersection found so far, so if the bounding box is farther away than t, we can ignore it.

I agree that should be better documented! :)

Kanzaki 2018-04-02

Hi, thanks a lot for sharing.

I want to know how I can calculate the normal of my cube ? I know each side has its own. And I know I will only see 3 sides tops at a time. But I still don't know how to calculate it... Is it box->origin - intersection ? or box->origin - center_of_hit_side ?

Thanks a lot.

Tavian Barnes 2018-04-07


Fast Binary-Coded Decimal Addition and Subtraction

2011-04-05 Tavian Barnes

Binary-Coded Decimal, or BCD, is the most obvious way to encode a positionally-represented decimal number in binary: literally, 1234 becomes 0x1234. The some of the earliest of early computers used this representation, and the SMS protocol still uses it for some ungodly reason. It also has a more sane role as the expanded version of the DPD encoding for IEEE 754-2008 decimal numbers.

Working with BCD is unwieldy and slow, but at least for addition and subtraction there is a neat bit-trick that helps a lot. The carry-free case can be accomplished with a regular binary addition (0x1234 + 0x1234 = 0x2468), but to handle the case with carries, we need to force hex carries where there would have been decimal carries (since 0x1234 + 0x5678 = 0x68AC, which is no good).

The trick is to add 0x6666... to the hex sum, because 9 + 6 = 0xF, the highest value of a single hexit. So we've got 0x1234 + 0x5678 + 0x6666 = 0xCF12, which has the last two "digits" correct. However, now the digits without carries are wrong, but we can detect those bits with ~((0xCF12 ^ 0x1234 ^ 0x5678) & 0x11110), and use that to subtract the 6s where appropriate. The full calculation looks like this:

sum = a + b + 0x6666...;
noncarries = ~(sum ^ a ^ b) & 0x11111...;
sum -= (noncarries >> 2) | (noncarries >> 3);

For similar reasons, subtraction can be accomplished this way:

diff = a - b;
borrows = (diff ^ a ^ b) & 0x11111...;
diff -= (borrows >> 2) | (borrows >> 3);

My implementation of this trick (in x86-64 assembly), can be seen here.

Facebook Hacker Cup Qualification Round: Double Squares

2011-01-10 Tavian Barnes Comments

Facebook has decided to hold a programming competition, I guess. They should definitely hire the winner, and get her to rewrite their damn programming competition interface. But more on that later.

The first of the three qualification round problems was a simple little numerical problem: determine how many distinct ways a number can be written as the sum of two squares. Though the event's tagline was "too hard for brute force, switching to dp," brute force is perfectly capable of answering this question for the given upper bound of 231 - 1. Still, the Facebook event's comments are filled with people complaining that their code takes hours on large numbers, which baffles me.

My method was simple: for every $i$ from 0 to $\sqrt{x/2}$, check if $x - i^2$ is a perfect square. Some C99 code that does this pretty fast is this:

#include <stdbool.h>
#include <stdint.h>

is_square(uint32_t x)
  /* Tricky modular check for the possibility of squareness */
  if ((x & 0x7) == 1
      || (x & 0x1F) == 4
      || (x & 0x7F) == 16
      || (x & 0xBF) == 0)
    /* x is possibly square, do a full Newtonian method square
       root to check */
    uint32_t s = x;
    if (x > 0) {
      /* 18 is the fastest choice for loop iterations, and at
         least 16 must be used to avoid s*s overflowing later */
      for (int i = 0; i < 18; ++i) {
        s = (s + x/s)/2;
      while (s*s > x) {
      while (s*s < x) {

    return s*s == x;
  } else {
    return false;

ndoublesquares(uint32_t x)
  uint32_t count = 0;

  for (uint32_t i = 0; true; ++i) {
    uint32_t ii = i*i;
    uint32_t resid = x - ii;

    if (ii > resid) {

    if (is_square(resid)) {

  return count;

If you feel like even more speed and don't mind using floating-point, you can use s = sqrt(x); in place of Newton's method to calculate the approximate square root. For Facebook's test input on my machine, that drops the execution time from 0.012 seconds to 0.003 seconds. Proving the modular check is left as an exercise :).

I had this written in about 20 minutes after the contest started, but because the contest has a terrible interface and I was too lazy to read all the contest rules in detail, I didn't get to submit it. The contest was set to close a few days after it opened, so the first thing I did was download the input file for that question. But it turns out that you're only given 6 minutes to submit a solution after you download the input file. Furthermore, the countdown that would've indicated this to me doesn't work on Chrome, because apparently the 2nd most popular website in the world can't figure out cross-browser compatibility. So by the time I submitted the output, I was told that the time had expired. Apparently plenty of people did that very same thing, so Facebook later decided to lift the 6 minute restriction for this round.


Luis Argote 2011-01-10

Hello, nice answer, I went for something much more straightforward but did the same thing with the time... BTW, what are you using to display your source code in a box?

Tavian Barnes 2011-01-10

The plugin's called WP-Syntax.

Righteous hack: getting 263 - 1 points in a silly Facebook game

2010-12-13 Tavian Barnes Comments

Security through obscurity doesn't work. Every once in a while I like to prove this fact, by getting worldwide high scores on vulnerable leaderboards. The first time I did this was to Area Flat 3 (which was an awesome game when I was in elementary school, and is still pretty fun). Check the all-time top 100 scores for H4X0R (I know, not very original; I was 13, sue me). But that hack was child's play compared to this one.

Recently I noticed that a Facebook game called Fast Typer 2 was becoming popular with my Facebook friends. I played it a few times and got the third-best score of all my friends when I noticed that the global leaderboard had some unbelievable high scores. Like 9223372036854775807. I was struck by the fact that: a) they use at least signed 64-bit integers to hold people's scores; and b) the leaderboard is vulnerable. In fact, the top 27 scores were obvious hacks. So, I got to work.

The first step is to enable the Tamper Data Firefox extension. I usually run Chrome, but I keep Firefox around for extensions like this. After playing the game again, I'd logged all the necessary communications to submit a high score. The Flash game POSTs to a few servlets at http://minifb-typer.mindjolt.com/servlet/<servlet>, with the NextSubmitScore servlet looking particularly interesting. Examining the POST data from that request, I found that there's a "score=" parameter right there. So I played again, but tampered with that request. I changed the value of the "score=" parameter to something much higher, and the game told me I had a new personal best score -- but with my real score from the game, not the tampered one! Turns out that "score=" is a red herring.

Next, I had to decompile the SWF file itself, to figure out what was going on. I used swfdump from swftools, but it just dumps the AVM2 opcodes. There's no true SWF decompilers that work for ActionScript 3 on Linux, it seems. No matter, soon I had figured out that the game itself loads an internal API from another SWF file, "api_local_as3.swf". After reading through some of the functions I finally found the important one, "createSession". This function makes an associative array that holds your score, a "token" which is always 1, and "delta" and "deltaAbs", with unidentified purpose. This array is converted to a string, then RC4 encrypted, with an 8-byte key, which is -- well, I don't want to spoil the fun, but it starts with "537E". True enough, the NextSubmitScore requests have a "session=" parameter, with a giant hex value. Decrypted, it looks like this:


So the procedure to hack the game is this: play the game, tamper with the "NextSubmitScore" request, decrypt the "session=" parameter, change the score to a gigantic number, encrypt the new session string, replace the request parameter's value with your tampered one, and submit it. If you do it right, the game will greet you with "Congratulations! You've just set a record for all of Facebook!"


lloyd 2010-12-15

i just cant get the 8-byte key

Tavian Barnes 2010-12-15

Look for the RC4_KEY variable.

JC 2010-12-21

I'm currently trying to modify a different game but also from MindJolt.
It seems Googling for "servlet/NextSubmitScore" I was able to find your page.
I'm also having trouble finding the RC4 key.
Do we dump the api_as2_local.swf? ie. swfdump -D api_as2_local.swf?

I'm not able to find the RC4_KEY variable anywhere :S

Also is there an alternative to Tamper Data? That plugin no longer works for the newer version of firefox.

BTW I had initially wanted to get an app such as Temper Data to change SCORE=# in the HTTP/POST packet for NextSubmitScore but it seems it's only red herring :S


JC 2010-12-21

NM found the RC4_KEY... Now I gotta figure out how to decrypt/encrypt string given key. :D

Nick 2011-01-11

Very cool. After getting the #1 legit high score for the week on that typing game, I was curious about the hacked ones. Of course, typing your name into Facebook or Google didn't reveal much, but interestingly enough the ajax requests on Facebook for the high scorers includes their UIDs as well. I'm decent at binary disassembly but don't know too much at swf decompiling. Good job though!

You have a pretty neat website -- just thought I'd share that.

Tavian Barnes 2011-01-11

Thanks. Why do an AJAX request for the high scorers anyway? Not that I know too much about the Facebook platform, but it seems pretty roundabout.

If you're good with binary stuff then you can always mess with the address space of the Flash game too; I think something like that is how most of the hacked scores happened. But you've only got a signed 32-bit integer to work with when you do it that way, so the 4 or so 64-bit scores must've been done this way.

Nick 2011-01-12

Oh, I wasn't writing my own requests -- I was just reading Facebook's with Firebug. As far as I know, that's the only way to find the profile id's of the high scorers.

Yeah, I was wondering why some of the high scores were constrained to 32 bits and a few were 64 bits.

Solving Cubic Polynomials

2010-11-17 Tavian Barnes

Although a closed form solution exists for the roots of polynomials of degree ≤ 4, the general formulae for cubics (and quartics) is ugly. Various simplifications can be made; commonly, the cubic $a_3 x^3 + a_2 x^2 + a_1 x + a_0$ is transformed by substituting $x = t - a_2 / 3 a_3$, giving

t^3 + p t + q = 0,


p &= \frac{1}{a_3} \left( a_1 - \frac{a_2^2}{3 a_3} \right) \\
q &= \frac{1}{a_3} \left( a_0 - \frac{a_2 a_1}{3 a_3} - \frac{2 a_2^3}{27 a_3^2} \right).

Several strategies exist for solving this depressed cubic, but most involve dealing with complex numbers, even when we only want to find real roots. Fortunately, there are a couple ways to find solutions with only real arithmetic. First, we need to know the value of the discriminant $\Delta = 4 p^3 + 27 q^2$.

If $\Delta < 0$, then there are three real roots; this is the tough case. Luckily, we can use the uncommon trigonometric identity $4 \cos^3(\theta) - 3 \cos(\theta) - \cos(3 \theta) = 0$ to calculate the roots with trig functions. Making another substitution, $t = 2 \sqrt{-p/3} \cos(\theta)$, gives us

4 \cos^3(\theta) - 3 \cos(\theta) - \frac{3 q}{2 p} \sqrt{\frac{-3}{p}} = 0,


\cos(3 \theta) = \frac{3 q}{2 p} \sqrt{\frac{-3}{p}}.

Thus the three roots are:

t_n = 2 \sqrt{\frac{-p}{3}} \cos \left( \frac{1}{3} \cos^{-1} \left( \frac{3 q}{2 p} \sqrt{\frac{-3}{p}} \right) - \frac{2 \pi n}{3} \right),

for $n \in \{0, 1, 2\}$. We can save a cosine calculation by noting that $t_2 = -t_0 |_{q = -q}$ and $t_1 = -(t_0 + t_2)$. When generated in this way, the roots are ordered from smallest to largest ($t_0 < t_1 < t_2$).

For the $\Delta > 0$ case, there is a single real root. We can apply the general cubic formula and avoid dealing with complex numbers, provided we choose the signs right. In this case,

t = -\mathrm{sgn}(q) \left( C - \frac{p}{3 C} \right), \text{ where } C = \sqrt[3]{\sqrt{\frac{\Delta}{108}} + \frac{|q|}{2}}.

When $\Delta = 0$, there is a root of at least multiplicity 2, and we can avoid calculating any radicals. If $p = 0$, the only root is $t_0 = 0$; otherwise, the duplicate root is $-3 q /2 p$ and the simple root is $3 q / p$.

My C implementation of this solution method can be seen in the dmnsn_solve_cubic() function here.

Solving Polynomials

2010-10-28 Tavian Barnes

A well known (if not by name) theorem is the Abel–Ruffini theorem, which states that there is no algebraic expression for the roots of polynomials with degree higher than 4.

A not-so-well-known fact is that for any polynomial $P(x)$, it is possible to find (with exact arithmetic) a set of ranges each containing exactly one root of $P(x)$. One such algorithm is due to James Victor Uspensky in 1948.

Uspensky's algorithm requires all roots to have multiplicity 1; $P(x)$ can be trivially replaced with $P(x)/\gcd(P(x), P'(x))$ to eliminate duplicate roots, if necessary. The algorithm relies on Descartes' rule of signs, which states that the number of positive solutions (counting multiplicities) to a polynomial is equal to $\mathrm{var}(P(x)) - 2 k$, where $\mathrm{var}(P(x))$ is the number of sign changes between consecutive non-zero coefficients of $P(x)$, and $k$ is a non-negative integer.

This clearly implies that if $\mathrm{var}(P(x))$ is 0 or 1, then the polynomial must have exactly 0 or 1 root in $(0, \infty)$, respectively. Otherwise, we can transform the polynomial and try again. Expanding $A(x) = P(x + 1)$, we can test $\mathrm{var}(A(x))$ to learn about the roots in $(1, \infty)$. Expanding $B(x) = (x + 1)^n P(1/(x + 1))$, where $n$ is the degree of the polynomial $P(x)$, we can similarly test $\mathrm{var}(B(x))$ to learn about the roots in $(0, 1)$. If $A(0) = 0$, then $x = 1$ is a root of $P(x)$.

If we don't get conclusive information from one of the tests (i.e. $\mathrm{var}(A(x)) > 1$ or $\mathrm{var}(B(x)) > 1$), apply the same two transformations ($P(x) \mapsto P(x+1)$ and $P(x) \mapsto (x+1)^n P(1/(x+1))$ to the offending polynomial(s). Uspensky proves that this recursive procedure always terminates for square-free polynomials. A reasonable bound on the recursion depth may also be shown.

From a CS perspective, the two transformations are easy to implement. If $P(x) = \sum_{i=0}^n p_i x_i$, then

A(x) = \sum_{i=0}^n \left( \sum_{j=i}^n \binom{j}{i} p_j \right) x^i,
\quad B(x) = \sum_{i=0}^n \left( \sum_{j=i}{n} \binom{j}{i} p_{n-j} \right) x^i.

Another CS-related point is that one of the resultant intervals may have an infinite upper bound. Cauchy provides a general upper bound $1 + (\max_{i=0}^{n-1} |p_i|) / |p_n|$ which may be used to produce a finite upper bound.

A C implementation of this algorithm as a root-solver in my ray-tracer Dimension can be seen here.

2010-10-24 Tavian Barnes

Hell is freezing over. Or something. This month I've created a twitter and now a blog-type thing here. Welcome to 2006, I know.