# Fast, Branchless Ray/Bounding Box Intersections, Part 3: Boundaries

This post is the third in a series of posts I've written about ray/bounding box intersection tests. However, it supersedes the previous two, and should be self contained, so don't feel like you have to read the others unless you're interested in the historical context. On , , and , I substantially revised this post, improving the implementation and adding a much more thorough performance analysis.

Last time I wrote about ray/box intersections, we left off with a simple and fast implementation of the slab method that handles all the corner cases. To save you the trouble of (re-)reading the last two posts, I'll start with a quick recap. Feel free to skip to the next section if you're already familiar with it.

## Geometry

An axis-aligned bounding box $\square$ can be specified by the coordinates of two of its corners:

A point $(x, y, z)$ is inside the box if and only if

$\begin{alignedat}{3} & x_{\min{}} && < x && < x_{\max{}} \\ & y_{\min{}} && < y && < y_{\max{}} \\ & z_{\min{}} && < z && < z_{\max{}}. \\ \end{alignedat}$

Equivalently, the interior of the box is the intersection of many half-spaces:

The slab method tests for an intersection between a ray and an AABB (axis-aligned bounding box) by repeatedly clipping the line with these half-spaces. In the end, what's left of the ray is the segment that intersects the box, if any.

A ray is typically defined parametrically by

$\ell(t) = \begin{pmatrix} x_0 \\ y_0 \\ z_0 \\ \end{pmatrix} + t \begin{pmatrix} x_d \\ y_d \\ z_d \\ \end{pmatrix},$

where $(x_0, y_0, z_0)$ is the origin of the ray and $(x_d, y_d, z_d)$ is the ray's direction. $t > 0$ is considered “in front” of the origin, while $t < 0$ is “behind” it. We can find where the ray intersects an axis-aligned plane by solving (for example)

$\ell_x(t) = x_0 + t x_d = x$

for

$t = \frac{x - x_0}{x_d}.$

The segment of the ray that intersects the $x$ planes of the bounding box is then

$\begin{aligned} t_{\min{}} & = \min \{ t_1, t_2 \} \\ t_{\max{}} & = \max \{ t_1, t_2 \}. \\ \end{aligned}$

The intersection of two segments $(u_{\min{}}, u_{\max{}})$ and $(v_{\min{}}, v_{\max{}})$ is just

$\begin{aligned} t_{\min{}} & = \max \{ u_{\min{}}, v_{\min{}} \} \\ t_{\max{}} & = \min \{ u_{\max{}}, v_{\max{}} \}. \\ \end{aligned}$

If we end up with $t_{\min{}} > t_{\max{}}$, the intersection is empty.

Putting all this together leads to a naturally branch-free ray/AABB intersection test (assuming `min()`

and `max()`

are branch-free and loops are unrolled):

```
/// Parametric representation of a ray.
struct ray {
float origin[3];
float dir[3];
float dir_inv[3];
};
/// An axis-aligned bounding box.
struct box {
float min[3];
float max[3];
};
bool intersection(const struct ray *ray, const struct box *box) {
float tmin = 0.0, tmax = INFINITY;
for (int d = 0; d < 3; ++d) {
float t1 = (box->min[d] - ray->origin[d]) * ray->dir_inv[d];
float t2 = (box->max[d] - ray->origin[d]) * ray->dir_inv[d];
tmin = max(tmin, min(t1, t2));
tmax = min(tmax, max(t1, t2));
}
return tmin < tmax;
}
```

We precompute `ray->dir_inv[d] = 1.0 / ray->dir[d]`

to replace division (slow) with multiplication (fast).
You might worry about dividing by zero, but part 1 describes how floating point infinities lead to the right answer even when `ray->dir[d] == 0.0`

.

You might also worry about the multiplication, since in floating point, $0 * \infty \equiv \mathrm{NaN}$. Luckily, part 2 describes a simple fix:

```
- tmin = max(tmin, min(t1, t2));
- tmax = min(tmax, max(t1, t2));
+ tmin = max(tmin, min(min(t1, t2), tmax));
+ tmax = min(tmax, max(max(t1, t2), tmin));
```

```
bool intersection(const struct ray *ray, const struct box *box) {
float tmin = 0.0, tmax = INFINITY;
for (int d = 0; d < 3; ++d) {
float t1 = (box->min[d] - ray->origin[d]) * ray->dir_inv[d];
float t2 = (box->max[d] - ray->origin[d]) * ray->dir_inv[d];
tmin = max(tmin, min(min(t1, t2), tmax));
tmax = min(tmax, max(max(t1, t2), tmin));
}
return tmin < tmax;
}
```

## Boundaries

You may have noticed that all the inequalities above are strict ($<$, not $\le$). That means rays which just touch a corner, edge, or face of the bounding box will be considered non-intersecting.

This might not be too big a deal, since rays that exactly touch the boundary of the box are a somewhat degenerate case.
Of course, degenerate doesn't mean *impossible* or even *unlikely*, so it would be nice to include the boundary.
Several comments on the last post agree.

Sadly, relaxing the inequalities fixes the corner case, but not the edge case, since that's the one involving NaN:

```
- return tmin < tmax;
+ return tmin <= tmax;
```

I alluded to this in the previous post:

(Since this is an edge case, you might wonder why we don't choose to return

`true`

instead of`false`

for rays on the boundary. It turns out to be much harder to get this behaviour with efficient code.)

I'm happy to announce that after ~~seven years of grueling research~~ thinking about it for a few minutes yesterday, I finally found an efficient implementation that includes the boundary.
If `min()`

and `max()`

are implemented in the obvious way:

```
static inline float min(float x, float y) {
return x < y ? x : y;
}
static inline float max(float x, float y) {
return x > y ? x : y;
}
```

(which is also typically the most efficient way, since it matches the x86 SSE min/max instructions), then we get some curious semantics around NaN:

$\begin{aligned} \min\{x, \mathrm{NaN}\} & = \max\{x, \mathrm{NaN}\} = \mathrm{NaN} \\ \min\{\mathrm{NaN}, y\} & = \max\{\mathrm{NaN}, y\} = y. \\ \end{aligned}$

We'd like to keep `tmin`

and `tmax`

unchanged if either `t1`

or `t2`

is NaN, so swapping the arguments to keep `tmin`

/`tmax`

on the right seems like a good idea:

```
- tmin = max(tmin, min(t1, t2));
- tmax = min(tmax, max(t1, t2));
+ tmin = max(min(t1, t2), tmin);
+ tmax = min(max(t1, t2), tmax);
```

This works if `t2`

is NaN, but doesn't handle `t1`

.
The last piece of the puzzle is to push `tmin`

/`tmax`

into the inner `min()`

/`max()`

, so that at least one argument is always non-NaN:

```
- tmin = max(min(t1, t2), tmin);
- tmax = min(max(t1, t2), tmax);
+ tmin = min(max(t1, tmin), max(t2, tmin));
+ tmax = max(min(t1, tmax), min(t2, tmax));
```

```
bool intersection(const struct ray *ray, const struct box *box) {
float tmin = 0.0, tmax = INFINITY;
for (int d = 0; d < 3; ++d) {
float t1 = (box->min[d] - ray->origin[d]) * ray->dir_inv[d];
float t2 = (box->max[d] - ray->origin[d]) * ray->dir_inv[d];
tmin = min(max(t1, tmin), max(t2, tmin));
tmax = max(min(t1, tmax), min(t2, tmax));
}
return tmin <= tmax;
}
```

## Signs

Looking again at the formulas

$\begin{aligned} t_1 & = \frac{x_{\min{}} - x_0}{x_d}, & t_2 & = \frac{x_{\max{}} - x_0}{x_d}, \end{aligned}$

you might wonder if we have to calculate $\min \{ t_1, t_2 \}$ or $\max \{ t_1, t_2 \}$ at all. Since we already know $x_{\min{}} < x_{\max{}}$, the sign of $x_d$ is the only thing that affects the order of $t_1$ and $t_2$. If we use that sign to select which box coordinates to use:

$\begin{aligned} (b_{\min{}}, b_{\max{}}) & = (x_{\min{}}, x_{\max{}}) && \text{if $x_d > 0$,} \\ (b_{\min{}}, b_{\max{}}) & = (x_{\max{}}, x_{\min{}}) && \text{otherwise,} \\ \end{aligned}$

then we can directly calculate

$\begin{aligned} t_{\min{}} & = \frac{b_{\min{}} - x_0}{x_d}, & t_{\max{}} & = \frac{x_{\max{}} - x_0}{x_d}, \end{aligned}$

avoiding the `min()`

/`max()`

calls entirely.
In code, that looks like

```
struct box {
- float min[3];
- float max[3];
+ float corners[2][3];
};
bool intersection(const struct ray *ray, const struct box *box) {
float tmin = 0.0, tmax = INFINITY;
for (int d = 0; d < 3; ++d) {
- float t1 = (box->min[d] - ray->origin[d]) * ray->dir_inv[d];
- float t2 = (box->max[d] - ray->origin[d]) * ray->dir_inv[d];
+ bool sign = signbit(ray->dir_inv[d]);
+ float bmin = box->corners[sign][d];
+ float bmax = box->corners[!sign][d];
+
+ float dmin = (bmin - ray->origin[d]) * ray->dir_inv[d];
+ float dmax = (bmax - ray->origin[d]) * ray->dir_inv[d];
- tmin = min(max(t1, tmin), max(t2, tmin));
- tmax = max(min(t1, tmax), min(t2, tmax));
+ tmin = max(dmin, tmin);
+ tmax = min(dmax, tmax);
}
return tmin < tmax;
}
```

```
/// Parametric representation of a ray.
struct ray {
float origin[3];
float dir[3];
float dir_inv[3];
};
/// An axis-aligned bounding box.
struct box {
float corners[2][3];
};
bool intersection(const struct ray *ray, const struct box *box) {
float tmin = 0.0, tmax = INFINITY;
for (int d = 0; d < 3; ++d) {
bool sign = signbit(ray->dir_inv[d]);
float bmin = box->corners[sign][d];
float bmax = box->corners[!sign][d];
float dmin = (bmin - ray->origin[d]) * ray->dir_inv[d];
float dmax = (bmax - ray->origin[d]) * ray->dir_inv[d];
tmin = max(dmin, tmin);
tmax = min(dmax, tmax);
}
return tmin < tmax;
}
```

I've tried this optimization in the past with mixed results, but combined with the improvements in the next few sections, it gives a significant benefit.

## Batching

With the ray/box intersection API as it currently is, a simple ray tracer might call it like this:

```
struct object *object = NULL;
float t = INFINITY;
for (int i = 0; i < nobjects; ++i) {
if (intersection(ray, objects[i]->bounding_box)) {
if (objects[i]->intersection(ray, &t)) {
object = objects[i];
}
}
}
```

This is suboptimal in a few ways. For starters, we're keeping track of the $t$ value of the closest intersecting object, but the ray/box intersection test is ignoring it. We should pass it in to avoid looking inside bounding boxes that are too far away:

```
-bool intersection(const struct ray *ray, const struct box *box) {
- float tmin = 0.0, tmax = INFINITY;
+bool intersection(const struct ray *ray, const struct box *box, float tmax) {
+ float tmin = 0.0;
...
for (int i = 0; i < nobjects; ++i) {
- if (intersection(ray, objects[i]->bounding_box)) {
+ if (intersection(ray, objects[i]->bounding_box, t)) {
if (objects[i]->intersection(ray, &t)) {
```

Another problem is that we just spent all this effort coming up with a fast, branchless intersection test, and then immediately branched on the result.
We could probably get more throughput by testing a ray against a batch of bounding boxes all at once, before we move onto the more expensive ray/object intersections.
But if we do that, `bool`

(or even `bool[]`

) is probably the wrong return type.
Instead, we should save the $t$ values we calculated:

```
void intersections(
const struct ray *ray,
size_t nboxes,
const struct box boxes[nboxes],
float ts[nboxes])
{
bool signs[3];
for (int d = 0; d < 3; ++d) {
signs[d] = signbit(ray->dir_inv[d]);
}
for (size_t i = 0; i < nboxes; ++i) {
const struct box *box = &boxes[i];
float tmin = 0.0, tmax = ts[i];
for (int d = 0; d < 3; ++d) {
float bmin = box->corners[signs[d]][d];
float bmax = box->corners[!signs[d]][d];
float dmin = (bmin - ray->origin[d]) * ray->dir_inv[d];
float dmax = (bmax - ray->origin[d]) * ray->dir_inv[d];
tmin = max(dmin, tmin);
tmax = min(dmax, tmax);
}
ts[i] = tmin <= tmax ? tmin : ts[i];
}
}
```

and use them later to decide which boxes might still intersect the ray segment:

```
// Pseudo-code, don't really use VLAs :)
float ts[nobjects] = {INFINITY, INFINITY, ...};
intersections(ray, nobjects, bounding_boxes, ts);
for (size_t i = 0; i < nobjects; ++i) {
if (ts[i] < t) {
if (objects[i]->intersection(ray, &t)) {
object = objects[i];
}
}
}
```

This API also allows the implementation to be easily vectorized, parallelized, or even offloaded to a GPU.

## Performance

CPU | AMD Ryzen Threadripper 3960X |

Frequency | 3.8 GHz |

Cores | 24 (48 threads) |

Cache | 24 × L1, 24 × L2, 8 × L3 |

L1 | 32 KiB (inst), 32 KiB (data) |

L2 | 512 KiB |

L3 | 16 MiB |

μarch | Zen 2 |

Memory | Kingston KSM32ED8/32ME |

Size | 128 GiB (4 × 32 GiB) |

Type | DDR4 3200 MHz (ECC) |

Timings | 22-22-22 |

NUMA | 4 nodes |

Software | Arch Linux |

Kernel | 6.1.12-arch1-1 |

Compiler | Clang 15.0.7 |

Flags | `-O3 -flto -march=native` |

I evaluated the performance of this algorithm on my high-end desktop, `tachyon`

.
The full hardware and software configuration is listed in the table on the right.

I tried to extract as much performance as possible from the machine while keeping the measurements stable between runs. The CPU was pinned to its maximum sustainable frequency of 3.8 GHz, with Turbo Core (AMD's version of Turbo Boost) disabled. Threads were pinned to specific CPU cores based on the latency/throughput demands of each benchmark, and NUMA configuration (either pinned or interleaved) was tuned similarly.

The code was compiled with Clang 15. I also tried GCC 12.2, but it did not consistently eliminate branches or vectorize the loop, making it much slower.

Four variations of the algorithm were benchmarked:

**Baseline**(from part 1) makes no attempt to handle NaN consistently.**Exclusive**(from part 2) consistently treats the boundary of the box as non-intersecting.**Inclusive**(from this post) includes the boundary as part of the box.**Signs**(also from this post) uses the sign of the ray direction to avoid some`min()`

/`max()`

calls.

The hot loop of each variation looks like:

```
for (size_t i = 0; i < nboxes; ++i) {
const struct box *box = &boxes[i];
float tmin = 0.0, tmax = ts[i];
for (int d = 0; d < 3; ++d) {
float t1 = (box->min[d] - ray->origin[d]) * ray->dir_inv[d];
float t2 = (box->max[d] - ray->origin[d]) * ray->dir_inv[d];
tmin = max(tmin, min(t1, t2));
tmax = min(tmax, max(t1, t2));
}
ts[i] = tmin < tmax ? tmin : ts[i];
}
```

```
for (size_t i = 0; i < nboxes; ++i) {
const struct box *box = &boxes[i];
float tmin = 0.0, tmax = ts[i];
for (int d = 0; d < 3; ++d) {
float t1 = (box->min[d] - ray->origin[d]) * ray->dir_inv[d];
float t2 = (box->max[d] - ray->origin[d]) * ray->dir_inv[d];
tmin = max(tmin, min(min(t1, t2), tmax));
tmax = min(tmax, max(max(t1, t2), tmin));
}
ts[i] = tmin < tmax ? tmin : ts[i];
}
```

```
for (size_t i = 0; i < nboxes; ++i) {
const struct box *box = &boxes[i];
float tmin = 0.0, tmax = ts[i];
for (int d = 0; d < 3; ++d) {
float t1 = (box->min[d] - ray->origin[d]) * ray->dir_inv[d];
float t2 = (box->max[d] - ray->origin[d]) * ray->dir_inv[d];
tmin = min(max(t1, tmin), max(t2, tmin));
tmax = max(min(t1, tmax), min(t2, tmax));
}
ts[i] = tmin <= tmax ? tmin : ts[i];
}
```

```
for (size_t i = 0; i < nboxes; ++i) {
const struct box *box = &boxes[i];
float tmin = 0.0, tmax = ts[i];
for (int d = 0; d < 3; ++d) {
float bmin = box->corners[signs[d]][d];
float bmax = box->corners[!signs[d]][d];
float dmin = (bmin - ray->origin[d]) * ray->dir_inv[d];
float dmax = (bmax - ray->origin[d]) * ray->dir_inv[d];
tmin = max(dmin, tmin);
tmax = min(dmax, tmax);
}
ts[i] = tmin <= tmax ? tmin : ts[i];
}
```

The benchmarks tested intersections between a ray and a complete octree of various depths. The various sizes illustrate how each level of the cache hierarchy affects performance. Regardless of depth, approximately the same number of intersection tests (~10 billion) were performed. The observed throughput, measured in billions of intersections per second, is shown in the graph below.

## Vectorization

Clang managed to automatically vectorize the implementation and achieve impressive performance.
However, humans can often still beat the compiler at vectorization, so let's try it.
My CPU supports AVX2, which gives us 256-bit registers that hold 8 `float`

s:

```
#include <xmmintrin.h>
#include <immintrin.h>
/// A vector of 8 floats.
typedef __m256 vfloat;
/// Broadcast a float across all elements of a vector.
#define broadcast(x) _mm256_set1_ps(x)
/// Elementwise vector minimum.
#define min(x, y) _mm256_min_ps(x, y)
/// Elementwise vector maximum.
#define max(x, y) _mm256_max_ps(x, y)
/// A vectorized bounding box, holding 8 boxes.
struct vbox {
vfloat min[3];
vfloat max[3];
};
```

The implementation looks almost identical, mostly just replacing single `float`

s with vectors.
Each iteration of the outer loop is now intersecting a ray against 8 bounding boxes at once.

```
void intersections(
const struct ray *ray,
size_t nboxes,
const struct vbox boxes[nboxes],
vfloat ts[nboxes])
{
vfloat origin[3], dir_inv[3];
for (int d = 0; d < 3; ++d) {
origin[d] = broadcast(ray->origin[d]);
dir_inv[d] = broadcast(ray->dir_inv[d]);
}
for (size_t i = 0; i < nboxes; ++i) {
const struct vbox *box = &boxes[i];
vfloat tmin = broadcast(0.0);
vfloat tmax = ts[i];
for (int d = 0; d < 3; ++d) {
vfloat t1 = (box->min[d] - origin[d]) * dir_inv[d];
vfloat t2 = (box->max[d] - origin[d]) * dir_inv[d];
tmin = max(tmin, min(t1, t2));
tmax = min(tmax, max(t1, t2));
}
// Use a comparison mask and blend to vectorize
// ts[i] = tmin < tmax ? tmin : ts[i];
vfloat mask = _mm256_cmp_ps(tmin, tmax, _CMP_LT_OQ);
ts[i] = _mm256_blendv_ps(ts[i], tmin, mask);
}
}
```

```
void intersections(
const struct ray *ray,
size_t nboxes,
const struct vbox boxes[nboxes],
vfloat ts[nboxes])
{
vfloat origin[3], dir_inv[3];
for (int d = 0; d < 3; ++d) {
origin[d] = broadcast(ray->origin[d]);
dir_inv[d] = broadcast(ray->dir_inv[d]);
}
for (size_t i = 0; i < nboxes; ++i) {
const struct vbox *box = &boxes[i];
vfloat tmin = broadcast(0.0);
vfloat tmax = ts[i];
for (int d = 0; d < 3; ++d) {
vfloat t1 = (box->min[d] - origin[d]) * dir_inv[d];
vfloat t2 = (box->max[d] - origin[d]) * dir_inv[d];
tmin = max(tmin, min(min(t1, t2), tmax));
tmax = min(tmax, max(max(t1, t2), tmin));
}
// Use a comparison mask and blend to vectorize
// ts[i] = tmin < tmax ? tmin : ts[i];
vfloat mask = _mm256_cmp_ps(tmin, tmax, _CMP_LT_OQ);
ts[i] = _mm256_blendv_ps(ts[i], tmin, mask);
}
}
```

```
void intersections(
const struct ray *ray,
size_t nboxes,
const struct vbox boxes[nboxes],
vfloat ts[nboxes])
{
vfloat origin[3], dir_inv[3];
for (int d = 0; d < 3; ++d) {
origin[d] = broadcast(ray->origin[d]);
dir_inv[d] = broadcast(ray->dir_inv[d]);
}
for (size_t i = 0; i < nboxes; ++i) {
const struct vbox *box = &boxes[i];
vfloat tmin = broadcast(0.0);
vfloat tmax = ts[i];
for (int d = 0; d < 3; ++d) {
vfloat t1 = (box->min[d] - origin[d]) * dir_inv[d];
vfloat t2 = (box->max[d] - origin[d]) * dir_inv[d];
tmin = min(max(t1, tmin), max(t2, tmin));
tmax = max(min(t1, tmax), min(t2, tmax));
}
// Use a comparison mask and blend to vectorize
// ts[i] = tmin <= tmax ? tmin : ts[i];
vfloat mask = _mm256_cmp_ps(tmin, tmax, _CMP_LE_OQ);
ts[i] = _mm256_blendv_ps(ts[i], tmin, mask);
}
}
```

```
void intersections(
const struct ray *ray,
size_t nboxes,
const struct vbox boxes[nboxes],
vfloat ts[nboxes])
{
vfloat origin[3], dir_inv[3];
bool signs[3];
for (int d = 0; d < 3; ++d) {
origin[d] = broadcast(ray->origin[d]);
dir_inv[d] = broadcast(ray->dir_inv[d]);
signs[d] = signbit(ray->dir_inv[d]);
}
for (size_t i = 0; i < nboxes; ++i) {
const struct vbox *box = &boxes[i];
vfloat tmin = broadcast(0.0);
vfloat tmax = ts[i];
for (int d = 0; d < 3; ++d) {
vfloat bmin = box->corners[signs[d]][d];
vfloat bmax = box->corners[!signs[d]][d];
vfloat dmin = (bmin - origin[d]) * dir_inv[d];
vfloat dmax = (bmax - origin[d]) * dir_inv[d];
tmin = max(dmin, tmin);
tmax = min(dmax, tmax);
}
// Use a comparison mask and blend to vectorize
// ts[i] = tmin <= tmax ? tmin : ts[i];
vfloat mask = _mm256_cmp_ps(tmin, tmax, _CMP_LE_OQ);
ts[i] = _mm256_blendv_ps(ts[i], tmin, mask);
}
}
```

Throughput is significantly improved, at least until memory becomes the bottleneck. We're reaching between 2 and 3.5 billion intersections/s, around 3.3× faster than the un/auto-vectorized implementation.

## Parallelism

All the results so far were single-threaded, but I hope no one is writing single-threaded (CPU) ray tracers these days. Let's see what happens if we crank up the parallelism:

As long as we're not bottlenecked on memory, each added thread contributes significant additional throughput. The best configuration reaches over 88 billion intersections/second! It scales quite nicely too:

These benchmarks are on GitHub if you want to try them yourself. The code is MIT licensed, so feel free to adapt it for your own uses.